Plotting health data with python

In this notebook we use publicly available health care data to illustrate the use of some python libraries. We will plot health care expenditure per municipality and how health care varies with age. In the process you will see how libraries like pandas and matplotlib work.

1 Health care expenditures per municipality

We start with health care data on the level of municipalities. We will plot how health care expenditure varies with location.

Some institutional background:

  • in the Netherlands health insurance is mandatory
  • people aged 18 or older face a deductible of 365 euros in 2014; no deductible for people below 18
  • the deductible applies to a number of cost categories in the "basic packages"; see below
  • people can –voluntarily– increase their deductible with 100, 200, 300, 400 or 500 euro; but our data has no information on this

1.1 loading the data

The following code block loads the data. We use data from Vektis. We import the data, which is a 'csv' file with ";" as separator between columns. We also import the pandas library.

Then we inspect the first ten rows of our data set.

# First lets clear all previous python imports and variables by resetting the python kernel.
%reset -f


import pandas as pd

df_gemeente = pd.read_csv('Vektis_Open_Databestand_Zorgverzekeringswet_2014_-_gemeente.csv', sep = ';')

df_gemeente.head(10)

GESLACHT LEEFTIJDSKLASSE GEMEENTENAAM AANTAL_BSN AANTAL_VERZEKERDEJAREN \ 0 NaN NaN NaN 298383 185664.92 1 M 0 t/m 4 jaar AA EN HUNZE 507 468.83 2 M 0 t/m 4 jaar AALBURG 428 387.28 3 M 0 t/m 4 jaar AALSMEER 876 790.65 4 M 0 t/m 4 jaar AALTEN 667 600.00 5 M 0 t/m 4 jaar ACHTKARSPELEN 810 737.03 6 M 0 t/m 4 jaar ALBLASSERDAM 637 561.21 7 M 0 t/m 4 jaar ALBRANDSWAARD 735 663.22 8 M 0 t/m 4 jaar ALKMAAR 2967 2660.93 9 M 0 t/m 4 jaar ALMELO 1906 1714.63

KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG KOSTEN_FARMACIE \ 0 48661669.94 9219422.33 1 473370.77 43305.49 2 340442.37 23395.86 3 932841.57 54597.52 4 515879.78 43516.71 5 766456.80 88084.16 6 1252440.29 140225.60 7 927635.48 114174.07 8 4005510.06 258936.51 9 3518922.31 254976.19

KOSTEN_TWEEDELIJNS_GGZ KOSTEN_HUISARTS_INSCHRIJFTARIEF \ 0 7475481.90 4092492.35 1 14056.81 26239.56 2 5555.83 21673.15 3 34355.66 43971.27 4 2404.19 33749.10 5 6681.81 41477.79 6 9715.96 30705.39 7 2043.52 37028.38 8 22599.87 147227.49 9 87437.11 96757.71

KOSTEN_HUISARTS_CONSULT … KOSTEN_PARAMEDISCHE_ZORG_OVERIG \ 0 1388439.07 … 290539.66 1 12829.29 … 13783.67 2 9417.98 … 15297.60 3 22413.37 … 14105.95 4 18472.64 … 31816.27 5 19157.34 … 30454.49 6 15936.39 … 19202.75 7 21611.09 … 30449.94 8 71141.10 … 67205.26 9 51693.62 … 81595.14

KOSTEN_ZIEKENVERVOER_ZITTEND KOSTEN_ZIEKENVERVOER_LIGGEND \ 0 210313.33 1398151.05 1 638.08 5004.86 2 315.06 6155.09 3 0.00 20465.44 4 1173.82 6282.28 5 204.30 17011.84 6 0.00 11421.05 7 0.00 13027.62 8 4086.51 70806.48 9 7742.31 32328.75

KOSTEN_KRAAMZORG KOSTEN_VERLOSKUNDIGE_ZORG \ 0 1286545.5 1072906.37 1 0.0 0.00 2 0.0 0.00 3 0.0 0.00 4 0.0 0.00 5 0.0 0.00 6 0.0 0.00 7 0.0 0.00 8 0.0 0.00 9 0.0 0.00

KOSTEN_GENERALISTISCHE_BASIS_GGZ KOSTEN_GRENSOVERSCHRIJDENDE_ZORG \ 0 490222.49 21946526.34 1 0.00 2571.36 2 0.00 186.08 3 157.44 1743.72 4 1422.67 62.75 5 0.00 206.11 6 0.00 662.75 7 0.00 1422.76 8 168.55 7153.37 9 3777.82 2492.23

KOSTEN_EERSTELIJNS_ONDERSTEUNING KOSTEN_GERIATRISCHE_REVALIDATIEZORG \ 0 523.35 351533.29 1 31.05 0.00 2 5.40 0.00 3 12.60 0.00 4 3.60 0.00 5 955.80 0.00 6 11.25 0.00 7 7.20 0.00 8 36.90 0.00 9 27.45 0.00

KOSTEN_OVERIG 0 849751.44 1 186.03 2 135.30 3 13285.23 4 107.40 5 453.66 6 165.60 7 368.37 8 2707.21 9 682.51

[10 rows x 24 columns]

GESLACHT LEEFTIJDSKLASSE GEMEENTENAAM AANTAL_BSN AANTAL_VERZEKERDEJAREN KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG KOSTEN_FARMACIE KOSTEN_TWEEDELIJNS_GGZ KOSTEN_HUISARTS_INSCHRIJFTARIEF KOSTEN_HUISARTS_CONSULT ... KOSTEN_PARAMEDISCHE_ZORG_OVERIG KOSTEN_ZIEKENVERVOER_ZITTEND KOSTEN_ZIEKENVERVOER_LIGGEND KOSTEN_KRAAMZORG KOSTEN_VERLOSKUNDIGE_ZORG KOSTEN_GENERALISTISCHE_BASIS_GGZ KOSTEN_GRENSOVERSCHRIJDENDE_ZORG KOSTEN_EERSTELIJNS_ONDERSTEUNING KOSTEN_GERIATRISCHE_REVALIDATIEZORG KOSTEN_OVERIG
0 NaN NaN NaN 298383 185664.92 48661669.94 9219422.33 7475481.90 4092492.35 1388439.07 ... 290539.66 210313.33 1398151.05 1286545.5 1072906.37 490222.49 21946526.34 523.35 351533.29 849751.44
1 M 0 t/m 4 jaar AA EN HUNZE 507 468.83 473370.77 43305.49 14056.81 26239.56 12829.29 ... 13783.67 638.08 5004.86 0.0 0.00 0.00 2571.36 31.05 0.00 186.03
2 M 0 t/m 4 jaar AALBURG 428 387.28 340442.37 23395.86 5555.83 21673.15 9417.98 ... 15297.60 315.06 6155.09 0.0 0.00 0.00 186.08 5.40 0.00 135.30
3 M 0 t/m 4 jaar AALSMEER 876 790.65 932841.57 54597.52 34355.66 43971.27 22413.37 ... 14105.95 0.00 20465.44 0.0 0.00 157.44 1743.72 12.60 0.00 13285.23
4 M 0 t/m 4 jaar AALTEN 667 600.00 515879.78 43516.71 2404.19 33749.10 18472.64 ... 31816.27 1173.82 6282.28 0.0 0.00 1422.67 62.75 3.60 0.00 107.40
5 M 0 t/m 4 jaar ACHTKARSPELEN 810 737.03 766456.80 88084.16 6681.81 41477.79 19157.34 ... 30454.49 204.30 17011.84 0.0 0.00 0.00 206.11 955.80 0.00 453.66
6 M 0 t/m 4 jaar ALBLASSERDAM 637 561.21 1252440.29 140225.60 9715.96 30705.39 15936.39 ... 19202.75 0.00 11421.05 0.0 0.00 0.00 662.75 11.25 0.00 165.60
7 M 0 t/m 4 jaar ALBRANDSWAARD 735 663.22 927635.48 114174.07 2043.52 37028.38 21611.09 ... 30449.94 0.00 13027.62 0.0 0.00 0.00 1422.76 7.20 0.00 368.37
8 M 0 t/m 4 jaar ALKMAAR 2967 2660.93 4005510.06 258936.51 22599.87 147227.49 71141.10 ... 67205.26 4086.51 70806.48 0.0 0.00 168.55 7153.37 36.90 0.00 2707.21
9 M 0 t/m 4 jaar ALMELO 1906 1714.63 3518922.31 254976.19 87437.11 96757.71 51693.62 ... 81595.14 7742.31 32328.75 0.0 0.00 3777.82 2492.23 27.45 0.00 682.51

10 rows × 24 columns

We are going to be interested in total health care costs under the deductible. Hence, we specify the categories that fall under the deductible in the Netherlands in 2014. Further, we want to do this analysis in English. So we relabel the relevant column names into English.

We specify a list of columns (cost categories) across which we want to add costs. As the costs are total costs (not costs per head) we can indeed add these numbers. We leave out mental health care as it features its own financial incentives.

We specify the name of a new variable health_expenditure_under_deductible which is calculated on the list of columns specified and then the function that needs to be applied – sum in this case. Finally, we specify that the function needs to be applied row-wise: axis=1.

Then we give a dictionary with "old" variable names (in Dutch) and the "new" names in English for the variables where we want to use the English names.

cost_categories_under_deductible = ['KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG', 'KOSTEN_MONDZORG', 'KOSTEN_FARMACIE', 'KOSTEN_HULPMIDDELEN', 'KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE', 'KOSTEN_PARAMEDISCHE_ZORG_OVERIG', 'KOSTEN_ZIEKENVERVOER_ZITTEND', 'KOSTEN_ZIEKENVERVOER_LIGGEND', 'KOSTEN_GRENSOVERSCHRIJDENDE_ZORG', 'KOSTEN_GERIATRISCHE_REVALIDATIEZORG', 'KOSTEN_OVERIG']
df_gemeente['health_expenditure_under_deductible'] = df_gemeente[cost_categories_under_deductible].sum(axis=1)

df_gemeente = df_gemeente.rename_axis({
'GESLACHT':'sex',
'LEEFTIJDSKLASSE':'age',
'GEMEENTENAAM':'MUNICIPALITY',
'AANTAL_BSN':'number_citizens',
'KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG':'hospital_care',
'KOSTEN_FARMACIE':'pharmaceuticals',
'KOSTEN_TWEEDELIJNS_GGZ':'mental_care',
'KOSTEN_HUISARTS_INSCHRIJFTARIEF':'GP_capitation',
'KOSTEN_HUISARTS_CONSULT':'GP_fee_for_service',
'KOSTEN_HUISARTS_OVERIG':'GP_other',
'KOSTEN_MONDZORG':'dental care',
'KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE':'physiotherapy',
'KOSTEN_KRAAMZORG':'maternity_care',
'KOSTEN_VERLOSKUNDIGE_ZORG':'obstetrics'
}, axis='columns')

df_gemeente.dtypes

We drop the columns that we no longer need.

df_gemeente.drop(['AANTAL_VERZEKERDEJAREN',
'KOSTEN_HULPMIDDELEN',
'KOSTEN_PARAMEDISCHE_ZORG_OVERIG',
'KOSTEN_ZIEKENVERVOER_ZITTEND',
'KOSTEN_ZIEKENVERVOER_LIGGEND',
'KOSTEN_GRENSOVERSCHRIJDENDE_ZORG',
'KOSTEN_GERIATRISCHE_REVALIDATIEZORG',
'KOSTEN_OVERIG',
'KOSTEN_GENERALISTISCHE_BASIS_GGZ',
'KOSTEN_EERSTELIJNS_ONDERSTEUNING'],inplace=True,axis=1)
df_gemeente.columns

Let's look again at the first couple of lines of our data set. The head method presents the first 5 line as default.

df_gemeente.head()

We are not interested in the first line, so we drop it. Indeed, our data set now starts with the first municipality 'AA EN HUNZE'.

df_gemeente.drop(df_gemeente.index[[0]], inplace=True)
df_gemeente.head()

sex age MUNICIPALITY number_citizens hospital_care \ 1 M 0 t/m 4 jaar AA EN HUNZE 507 473370.77 2 M 0 t/m 4 jaar AALBURG 428 340442.37 3 M 0 t/m 4 jaar AALSMEER 876 932841.57 4 M 0 t/m 4 jaar AALTEN 667 515879.78 5 M 0 t/m 4 jaar ACHTKARSPELEN 810 766456.80

pharmaceuticals mental_care GP_capitation GP_fee_for_service GP_other \ 1 43305.49 14056.81 26239.56 12829.29 36034.65 2 23395.86 5555.83 21673.15 9417.98 20159.19 3 54597.52 34355.66 43971.27 22413.37 61629.32 4 43516.71 2404.19 33749.10 18472.64 46720.61 5 88084.16 6681.81 41477.79 19157.34 53633.01

dental care physiotherapy maternity_care obstetrics \ 1 9311.14 15968.80 0.0 0.0 2 7213.81 6135.05 0.0 0.0 3 19042.00 20086.43 0.0 0.0 4 12909.41 20762.75 0.0 0.0 5 16695.10 23423.96 0.0 0.0

health_expenditure_under_deductible 1 576750.31 2 406856.55 3 1093297.37 4 652523.86 5 954494.16

Now let's consider data types.

df_gemeente.dtypes

sex object age object MUNICIPALITY object number_citizens int64 hospital_care float64 pharmaceuticals float64 mental_care float64 GP_capitation float64 GP_fee_for_service float64 GP_other float64 dental care float64 physiotherapy float64 maternity_care float64 obstetrics float64 health_expenditure_under_deductible float64 dtype: object

The first three variables are seen as "object", that is, strings. This is fine for MUNICIPALITY but is not quite right for sex and age as these are categories. So let's relabel their types.

df_gemeente['sex'] = df_gemeente['sex'].astype('category')
df_gemeente['age'] = df_gemeente['age'].astype('category')
df_gemeente.info()

<class 'pandas.core.frame.DataFrame'> Int64Index: 14808 entries, 1 to 14808 Data columns (total 15 columns): sex 14808 non-null category age 14808 non-null category MUNICIPALITY 14808 non-null object number_citizens 14808 non-null int64 hospital_care 14808 non-null float64 pharmaceuticals 14808 non-null float64 mental_care 14808 non-null float64 GP_capitation 14808 non-null float64 GP_fee_for_service 14808 non-null float64 GP_other 14808 non-null float64 dental care 14808 non-null float64 physiotherapy 14808 non-null float64 maternity_care 14808 non-null float64 obstetrics 14808 non-null float64 health_expenditure_under_deductible 14808 non-null float64 dtypes: category(2), float64(11), int64(1), object(1) memory usage: 1.6+ MB

We can select rows from a dataframe using loc. Below, we consider the municipality Breda and people between 10 and 14 years old.

df_gemeente.loc[(df_gemeente.MUNICIPALITY == 'BREDA') & (df_gemeente['age'] == '10 t/m 14 jaar')]

sex age MUNICIPALITY number_citizens hospital_care \ 832 M 10 t/m 14 jaar BREDA 5206 2215947.11 8234 V 10 t/m 14 jaar BREDA 4915 1425550.97

pharmaceuticals mental_care GP_capitation GP_fee_for_service \ 832 381799.92 920439.00 301494.04 88705.2 8234 255232.99 564944.21 284269.76 91482.6

GP_other dental care physiotherapy maternity_care obstetrics \ 832 142402.72 677836.15 187429.87 0.0 0.0 8234 134486.26 620303.59 202916.68 0.0 0.0

health_expenditure_under_deductible 832 3756487.05 8234 2795294.25

sex age MUNICIPALITY number_citizens hospital_care pharmaceuticals mental_care GP_capitation GP_fee_for_service GP_other dental care physiotherapy maternity_care obstetrics health_expenditure_under_deductible
832 M 10 t/m 14 jaar BREDA 5206 2215947.11 381799.92 920439.00 301494.04 88705.2 142402.72 677836.15 187429.87 0.0 0.0 3756487.05
8234 V 10 t/m 14 jaar BREDA 4915 1425550.97 255232.99 564944.21 284269.76 91482.6 134486.26 620303.59 202916.68 0.0 0.0 2795294.25

If we are interested in one variable, e.g. number_citizens, we can select this as well.

df_gemeente.loc[(df_gemeente.MUNICIPALITY == 'BREDA') & (df_gemeente['age'] == '10 t/m 14 jaar')]['number_citizens']

Exercise

Calculate how many citizens live in Breda between the age of 10 and 14.


We can also give lists of row values that we are interested in:

df_gemeente.loc[(df_gemeente.MUNICIPALITY.isin(['BREDA', 'TILBURG'])) & (df_gemeente['age'] == '10 t/m 14 jaar')]['number_citizens']

832 5206 1098 5817 8234 4915 8500 5651 Name: number_citizens, dtype: int64


Exercise

Calculate total health care expenditures under the deductible for people living in Amsterdam between the ages of 10 and 19 years old.


Now that we have the data ready, we are going to plot health care expenditures on the map of the Netherlands.

1.2 geographical figures

We have map data that links the name of a municipality to coordinates on the map. In this map data, the names of municipalities are capitalized under standard Dutch capitalization like "Aa en Hunze". In our Vektis data, the names of municipalities are written in capitals. There are a number of ways to resolve this. To illustrate the merge command, we use a file with two columns: 1. the names of municipalities all capitalized and 2. normal capitalization. We drop the rows where there is no value for municipality (if such rows exist). We merge our data df_gemeente with the dataframe Gemeentes. We use a "left-merge", so all rows in the first dataframe df_gemeente are kept.

Gemeentes = pd.read_excel('Gemeentes.xlsx')
df_gemeente = df_gemeente.dropna(subset=['MUNICIPALITY'])
df_gem_merged = pd.merge(df_gemeente,Gemeentes,on=['MUNICIPALITY'],how='left')
df_gem_merged.head()
sex age MUNICIPALITY number_citizens hospital_care pharmaceuticals mental_care GP_capitation GP_fee_for_service GP_other dental care physiotherapy maternity_care obstetrics health_expenditure_under_deductible Municipality
0 M 0 t/m 4 jaar AA EN HUNZE 507 473370.77 43305.49 14056.81 26239.56 12829.29 36034.65 9311.14 15968.80 0.0 0.0 576750.31 Aa en Hunze
1 M 0 t/m 4 jaar AALBURG 428 340442.37 23395.86 5555.83 21673.15 9417.98 20159.19 7213.81 6135.05 0.0 0.0 406856.55 Aalburg
2 M 0 t/m 4 jaar AALSMEER 876 932841.57 54597.52 34355.66 43971.27 22413.37 61629.32 19042.00 20086.43 0.0 0.0 1093297.37 Aalsmeer
3 M 0 t/m 4 jaar AALTEN 667 515879.78 43516.71 2404.19 33749.10 18472.64 46720.61 12909.41 20762.75 0.0 0.0 652523.86 Aalten
4 M 0 t/m 4 jaar ACHTKARSPELEN 810 766456.80 88084.16 6681.81 41477.79 19157.34 53633.01 16695.10 23423.96 0.0 0.0 954494.16 Achtkarspelen

sex age MUNICIPALITY number_citizens hospital_care \ 0 M 0 t/m 4 jaar AA EN HUNZE 507 473370.77 1 M 0 t/m 4 jaar AALBURG 428 340442.37 2 M 0 t/m 4 jaar AALSMEER 876 932841.57 3 M 0 t/m 4 jaar AALTEN 667 515879.78 4 M 0 t/m 4 jaar ACHTKARSPELEN 810 766456.80

pharmaceuticals mental_care GP_capitation GP_fee_for_service GP_other \ 0 43305.49 14056.81 26239.56 12829.29 36034.65 1 23395.86 5555.83 21673.15 9417.98 20159.19 2 54597.52 34355.66 43971.27 22413.37 61629.32 3 43516.71 2404.19 33749.10 18472.64 46720.61 4 88084.16 6681.81 41477.79 19157.34 53633.01

dental care physiotherapy maternity_care obstetrics \ 0 9311.14 15968.80 0.0 0.0 1 7213.81 6135.05 0.0 0.0 2 19042.00 20086.43 0.0 0.0 3 12909.41 20762.75 0.0 0.0 4 16695.10 23423.96 0.0 0.0

health_expenditure_under_deductible Municipality 0 576750.31 Aa en Hunze 1 406856.55 Aalburg 2 1093297.37 Aalsmeer 3 652523.86 Aalten 4 954494.16 Achtkarspelen

We are going to plot expenditure under the deductible per head for each municipality. We need a couple of steps in order to do this:

  1. we add –for each municipality– the expenditures under the deductible across age groups;
  2. we add –for each municipality– the number of people across age groups;
  3. we divide –for each municipality– the expenditures by the number of people.

With pandas this is straightforward to do using groupby. We do the groupby on the municipality. For each municipality there are different age groups and we need to aggregate over these age groups. We specify the variables that we want to know at the municipality level. In this case health_expenditure_under_deductible and number_citizens. Finally, we specify the function with which to aggregate. Here we use the 'built-in' function sum(). Other functions we can use include mean, min, max etc. You can also specify your own function and apply this using agg().

Then hospital care per head can be defined as the total expenditure per municipality divided by the total number of citizens per municipality.

results = df_gem_merged.groupby('Municipality')[['health_expenditure_under_deductible','number_citizens']].sum()
results['expenditure_per_head'] = results['health_expenditure_under_deductible']/results['number_citizens']
results.head()
health_expenditure_under_deductible number_citizens expenditure_per_head
Municipality
's-Gravenhage 9.115807e+08 509206 1790.200148
's-Hertogenbosch 2.697598e+08 151934 1775.506394
Aa en Hunze 4.696741e+07 25415 1848.019292
Aalburg 2.189368e+07 12871 1701.008568
Aalsmeer 4.864200e+07 31365 1550.837018

health_expenditure_under_deductible number_citizens \ Municipality 's-Gravenhage 9.115807e+08 509206 's-Hertogenbosch 2.697598e+08 151934 Aa en Hunze 4.696741e+07 25415 Aalburg 2.189368e+07 12871 Aalsmeer 4.864200e+07 31365

expenditure_per_head Municipality 's-Gravenhage 1790.200148 's-Hertogenbosch 1775.506394 Aa en Hunze 1848.019292 Aalburg 1701.008568 Aalsmeer 1550.837018

# Create a dataframe
p_results = pd.DataFrame(dict(
    municipality = results.index,
    expenditure = results['expenditure_per_head']
    ))

p_results.head()
expenditure municipality
Municipality
's-Gravenhage 1790.200148 's-Gravenhage
's-Hertogenbosch 1775.506394 's-Hertogenbosch
Aa en Hunze 1848.019292 Aa en Hunze
Aalburg 1701.008568 Aalburg
Aalsmeer 1550.837018 Aalsmeer

expenditure municipality Municipality 's-Gravenhage 1790.200148 's-Gravenhage 's-Hertogenbosch 1775.506394 's-Hertogenbosch Aa en Hunze 1848.019292 Aa en Hunze Aalburg 1701.008568 Aalburg Aalsmeer 1550.837018 Aalsmeer

from IPython.display import display, IFrame
import folium

geo_path = r'Gemeentegrenzen_2016_zonder_water_simplified_wgs84.geojson'


ref_map = folium.Map(
    location=[52.139177, 5.327108], # This will center the view on the world map where the Netherlands is located
    tiles='Mapbox Bright',          # This creates a base map and in this case its the Mapbox Bright basemap
    zoom_start=8)                   # This will zoom in on the center of view to get the Netherlands in full frame

ref_map.choropleth(
    # This is the path to the geojson file that contains all the municipality shapes and locations
    geo_path=geo_path,
    # We will use the p_results dataframe for the choropleth mapping
    data=p_results,
    # municipality will be used for the mapping key and expenditure for its value
    columns=['municipality', 'expenditure'],
    # Use GM_NAAM (short for municipality name) as keys for colormapping
    key_on='feature.properties.GM_NAAM',
    # We are going to use a color map from yellow to green
    fill_color='YlGn',
    # This gives municipality shapes some opacity so that we can still see the background
    fill_opacity=0.7,
    # This gives the lines around the municipality shapes some opacity so that they don't stand out too much
    line_opacity=0.2,
    # The legend
    legend_name='health care expenditure per head')

ref_map.save('health_expenditure.html')              # This will save the map in a HTML format

display(IFrame('health_expenditure.html', 800,800))  # This displays the map in an Iframe

<IPython.lib.display.IFrame at 0x10ff2d160>


Exercise

Plot number of citizens per municipality using color scheme 'OrRd'.


2 Health care expenditure and age

The municipality data set above does not give the health care expenditure per age; only per age group (like 0-4 year olds). So we load another data set that does feature health care expenditure per age.

2.1 read in the data

Again, we use data from Vektis. We import the data, which is a 'csv' file with ";" as separator between columns. We also import some libraries.

Then we look at the columns (variables) in the data.

import numpy as np
import pandas as pd
import matplotlib as plt
df_postal_code = pd.read_csv('Vektis_Open_Databestand_Zorgverzekeringswet_2014_-_postcode3.csv', sep = ';')
df_postal_code.dtypes

GESLACHT object LEEFTIJDSKLASSE object POSTCODE_3 float64 AANTAL_BSN int64 AANTAL_VERZEKERDEJAREN float64 KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG float64 KOSTEN_FARMACIE float64 KOSTEN_TWEEDELIJNS_GGZ float64 KOSTEN_HUISARTS_INSCHRIJFTARIEF float64 KOSTEN_HUISARTS_CONSULT float64 KOSTEN_HUISARTS_OVERIG float64 KOSTEN_HULPMIDDELEN float64 KOSTEN_MONDZORG float64 KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE float64 KOSTEN_PARAMEDISCHE_ZORG_OVERIG float64 KOSTEN_ZIEKENVERVOER_ZITTEND float64 KOSTEN_ZIEKENVERVOER_LIGGEND float64 KOSTEN_KRAAMZORG float64 KOSTEN_VERLOSKUNDIGE_ZORG float64 KOSTEN_GENERALISTISCHE_BASIS_GGZ float64 KOSTEN_GRENSOVERSCHRIJDENDE_ZORG float64 KOSTEN_EERSTELIJNS_ONDERSTEUNING float64 KOSTEN_GERIATRISCHE_REVALIDATIEZORG float64 KOSTEN_OVERIG float64 dtype: object

This looks very much like the data set above, so we want to do the same steps to get the data into the shape we want. In fact, if you go to the website Vektis there are similar data sets for other years. Copy and paste the steps above and then apply these steps to the new data sets is asking for trouble:

  • you are likely to make mistakes with copy/paste
  • if you figure out that you want to change one of your commands, you have to change all the pasted versions as well

One solution in python is to define a function that does all these steps for you and apply this function to all the data sets that you want to work with.

cost_categories_under_deductible = ['KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG', 'KOSTEN_MONDZORG', 'KOSTEN_FARMACIE', 'KOSTEN_HULPMIDDELEN', 'KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE', 'KOSTEN_PARAMEDISCHE_ZORG_OVERIG', 'KOSTEN_ZIEKENVERVOER_ZITTEND', 'KOSTEN_ZIEKENVERVOER_LIGGEND', 'KOSTEN_GRENSOVERSCHRIJDENDE_ZORG', 'KOSTEN_GERIATRISCHE_REVALIDATIEZORG', 'KOSTEN_OVERIG']




def get_data_into_shape(df):
    df['health_expenditure_under_deductible'] = df[cost_categories_under_deductible].sum(axis=1)
    df = df.rename_axis({
        'GESLACHT':'sex',
        'LEEFTIJDSKLASSE':'age',
        'GEMEENTENAAM':'MUNICIPALITY',
        'AANTAL_BSN':'number_citizens',
        'KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG':'hospital_care',
        'KOSTEN_FARMACIE':'pharmaceuticals',
        'KOSTEN_TWEEDELIJNS_GGZ':'mental_care',
        'KOSTEN_HUISARTS_INSCHRIJFTARIEF':'GP_capitation',
        'KOSTEN_HUISARTS_CONSULT':'GP_fee_for_service',
        'KOSTEN_HUISARTS_OVERIG':'GP_other',
        'KOSTEN_MONDZORG':'dental care',
        'KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE':'physiotherapy',
        'KOSTEN_KRAAMZORG':'maternity_care',
        'KOSTEN_VERLOSKUNDIGE_ZORG':'obstetrics'
    }, axis='columns')
    df.drop(['AANTAL_VERZEKERDEJAREN',
             'KOSTEN_HULPMIDDELEN',
             'KOSTEN_PARAMEDISCHE_ZORG_OVERIG',
             'KOSTEN_ZIEKENVERVOER_ZITTEND',
             'KOSTEN_ZIEKENVERVOER_LIGGEND',
             'KOSTEN_GRENSOVERSCHRIJDENDE_ZORG',
             'KOSTEN_GERIATRISCHE_REVALIDATIEZORG',
             'KOSTEN_OVERIG',
             'KOSTEN_GENERALISTISCHE_BASIS_GGZ',
             'KOSTEN_EERSTELIJNS_ONDERSTEUNING'],inplace=True,axis=1)
    df.drop(df.index[[0]], inplace=True)
    df['sex'] = df['sex'].astype('category')
    df['age'] = df['age'].astype('category')
    return df

df_postal_code = get_data_into_shape(df_postal_code)

df_postal_code.head()

sex age POSTCODE_3 number_citizens hospital_care pharmaceuticals \ 1 M 0 0.0 366 1372209.26 31191.20 2 M 0 101.0 590 1682944.17 25898.73 3 M 0 102.0 295 1553933.53 29514.18 4 M 0 103.0 288 827427.31 19263.79 5 M 0 105.0 998 2965316.12 61610.42

mental_care GP_capitation GP_fee_for_service GP_other dental care \ 1 285.98 5548.60 5540.05 11525.93 681.02 2 20774.91 9816.63 10130.12 20532.03 0.00 3 7970.01 5317.49 6576.70 17426.30 21.29 4 941.40 5014.97 5708.41 14168.90 0.00 5 4780.48 16842.06 19676.01 43794.06 166.98

physiotherapy maternity_care obstetrics \ 1 12150.91 0.0 0.0 2 17777.00 0.0 0.0 3 20459.17 0.0 0.0 4 9098.71 0.0 0.0 5 42332.18 0.0 0.0

health_expenditure_under_deductible 1 1425823.15 2 1753560.87 3 1617184.58 4 865867.07 5 3118357.71

sex age POSTCODE_3 number_citizens hospital_care pharmaceuticals mental_care GP_capitation GP_fee_for_service GP_other dental care physiotherapy maternity_care obstetrics health_expenditure_under_deductible
1 M 0 0.0 366 1372209.26 31191.20 285.98 5548.60 5540.05 11525.93 681.02 12150.91 0.0 0.0 1425823.15
2 M 0 101.0 590 1682944.17 25898.73 20774.91 9816.63 10130.12 20532.03 0.00 17777.00 0.0 0.0 1753560.87
3 M 0 102.0 295 1553933.53 29514.18 7970.01 5317.49 6576.70 17426.30 21.29 20459.17 0.0 0.0 1617184.58
4 M 0 103.0 288 827427.31 19263.79 941.40 5014.97 5708.41 14168.90 0.00 9098.71 0.0 0.0 865867.07
5 M 0 105.0 998 2965316.12 61610.42 4780.48 16842.06 19676.01 43794.06 166.98 42332.18 0.0 0.0 3118357.71

The first three columns are 'sex', 'age' and 'postal code' (3 digit). These 3 variables combined determine a unique observation. We think of these observations as if they are from an individual (but an observation is an average, like the average over 18 year old males in postal code 102).

Note that the first postal code is '000' which python thinks of as '0.0'. The data description explains that postal code '000' is used to aggregate people who live in a postal code with so few people that the privacy of their data is no longer guaranteed. As we want to think of sex, age and postal code as an observation, we drop the first row (labelled as 0) of the dataframe.

df_postal_code.drop(df_postal_code.index[[0]], inplace=True)
df_postal_code.head()

sex age POSTCODE_3 number_citizens hospital_care pharmaceuticals \ 2 M 0 101.0 590 1682944.17 25898.73 3 M 0 102.0 295 1553933.53 29514.18 4 M 0 103.0 288 827427.31 19263.79 5 M 0 105.0 998 2965316.12 61610.42 6 M 0 106.0 1056 3716382.22 87140.60

mental_care GP_capitation GP_fee_for_service GP_other dental care \ 2 20774.91 9816.63 10130.12 20532.03 0.00 3 7970.01 5317.49 6576.70 17426.30 21.29 4 941.40 5014.97 5708.41 14168.90 0.00 5 4780.48 16842.06 19676.01 43794.06 166.98 6 25006.18 19517.84 24045.35 65572.64 114.05

physiotherapy maternity_care obstetrics \ 2 17777.00 0.0 0.0 3 20459.17 0.0 0.0 4 9098.71 0.0 0.0 5 42332.18 0.0 0.0 6 28299.76 0.0 0.0

health_expenditure_under_deductible 2 1753560.87 3 1617184.58 4 865867.07 5 3118357.71 6 3885368.16

sex age POSTCODE_3 number_citizens hospital_care pharmaceuticals mental_care GP_capitation GP_fee_for_service GP_other dental care physiotherapy maternity_care obstetrics health_expenditure_under_deductible
2 M 0 101.0 590 1682944.17 25898.73 20774.91 9816.63 10130.12 20532.03 0.00 17777.00 0.0 0.0 1753560.87
3 M 0 102.0 295 1553933.53 29514.18 7970.01 5317.49 6576.70 17426.30 21.29 20459.17 0.0 0.0 1617184.58
4 M 0 103.0 288 827427.31 19263.79 941.40 5014.97 5708.41 14168.90 0.00 9098.71 0.0 0.0 865867.07
5 M 0 105.0 998 2965316.12 61610.42 4780.48 16842.06 19676.01 43794.06 166.98 42332.18 0.0 0.0 3118357.71
6 M 0 106.0 1056 3716382.22 87140.60 25006.18 19517.84 24045.35 65572.64 114.05 28299.76 0.0 0.0 3885368.16

The end of the dataframe is given by the following.

df_postal_code.tail(10)

sex age POSTCODE_3 number_citizens hospital_care pharmaceuticals \ 136463 V 90+ 988.0 10 19698.83 4011.31 136464 V 90+ 990.0 151 257046.54 99187.66 136465 V 90+ 991.0 51 95990.43 52682.34 136466 V 90+ 993.0 170 278000.11 124809.41 136467 V 90+ 994.0 38 28454.41 36590.90 136468 V 90+ 995.0 88 200183.72 64315.53 136469 V 90+ 996.0 44 46723.13 39419.64 136470 V 90+ 997.0 38 98954.45 34308.68 136471 V 90+ 998.0 116 168802.54 116907.93 136472 V 90+ 999.0 38 109842.07 40607.06

mental_care GP_capitation GP_fee_for_service GP_other dental care \ 136463 0.00 894.25 567.45 1283.11 0.00 136464 37614.24 11880.75 20144.21 24522.70 1619.74 136465 1102.77 4854.50 14575.32 15550.40 801.65 136466 12652.77 12646.44 13305.75 19040.63 2495.54 136467 2251.30 3652.03 5742.81 16966.62 143.16 136468 3691.37 6438.60 11593.30 15929.91 2729.32 136469 2833.17 4011.35 5459.40 15185.71 979.06 136470 4480.09 3347.05 5395.18 7061.51 897.98 136471 13830.16 10424.40 13527.80 28548.80 1577.91 136472 3273.62 3704.75 4197.46 6763.69 60.97

physiotherapy maternity_care obstetrics \ 136463 0.0 0.0 0.0 136464 12000.6 0.0 0.0 136465 462.0 0.0 0.0 136466 1675.0 0.0 0.0 136467 1409.2 0.0 0.0 136468 4352.1 0.0 0.0 136469 6537.2 0.0 0.0 136470 9201.0 0.0 0.0 136471 1875.6 0.0 0.0 136472 290.0 0.0 0.0

health_expenditure_under_deductible 136463 33079.01 136464 790837.02 136465 207319.25 136466 535215.22 136467 106569.94 136468 378170.95 136469 156304.71 136470 193232.92 136471 455608.75 136472 198874.20

As we saw above, the datatype of age was object, although we would expect integer. Now we see that there is this category 90+, which is not an integer. We will drop this age category as it is quite special. Before we do this, let's count how many people we have in our dataset.

df_postal_code['number_citizens'].sum()

16885677

That is, almost 17 million people, which is about right.

Let's drop the '90+' category and turn age into an integer variable.

df_postal_code = df_postal_code[(df_postal_code['age'] != '90+')]
df_postal_code['age'] = df_postal_code['age'].astype(int)

Let's summarize the variables that are numeric (integer or float). For each of these variables we have 135,063 observations (that is, combinations of sex, age and postal code). The mean for number of citizens is 124. With \(135,063*124\) we are close to 17 million again.

df_postal_code.describe()

age POSTCODE_3 number_citizens hospital_care \ count 135063.000000 135063.000000 135063.000000 1.350630e+05 mean 43.753959 541.250002 123.950327 1.534887e+05 std 25.535582 258.016742 129.491359 1.965840e+05 min 0.000000 0.000000 10.000000 -2.300980e+04 25% 22.000000 318.000000 40.000000 3.055294e+04 50% 44.000000 539.000000 82.000000 8.262627e+04 75% 66.000000 763.000000 161.000000 2.006986e+05 max 89.000000 999.000000 2228.000000 5.263426e+06

pharmaceuticals mental_care GP_capitation GP_fee_for_service \ count 1.350630e+05 135063.000000 135063.000000 135063.000000 mean 3.146225e+04 23296.150212 7693.165768 4528.021995 std 4.483099e+04 45078.454602 7676.781993 5060.276133 min -2.857890e+03 -29164.050000 0.000000 0.000000 25% 5.034130e+03 299.260000 2585.970000 1351.340000 50% 1.547225e+04 5702.020000 5244.990000 2900.080000 75% 4.037718e+04 24546.195000 10140.510000 5899.600000 max 1.546412e+06 885045.050000 155453.330000 194903.830000

GP_other dental care physiotherapy maternity_care \ count 135063.000000 135063.000000 135063.000000 135063.000000 mean 5938.297319 5438.213219 3287.127334 2201.230080 std 6329.471736 10190.600213 4659.495994 10426.252898 min 0.000000 -458.190000 -106.800000 -1898.420000 25% 1934.255000 150.075000 143.745000 0.000000 50% 4002.870000 1746.990000 1617.300000 0.000000 75% 7691.255000 6293.135000 4463.975000 0.000000 max 276119.620000 254585.130000 106169.130000 399960.460000

obstetrics health_expenditure_under_deductible count 135063.000000 1.350630e+05 mean 1612.545851 2.192913e+05 std 7853.598487 2.719250e+05 min 0.000000 0.000000e+00 25% 0.000000 4.791477e+04 50% 0.000000 1.234374e+05 75% 0.000000 2.872767e+05 max 321751.460000 9.012553e+06

age POSTCODE_3 number_citizens hospital_care pharmaceuticals mental_care GP_capitation GP_fee_for_service GP_other dental care physiotherapy maternity_care obstetrics health_expenditure_under_deductible
count 135063.000000 135063.000000 135063.000000 1.350630e+05 1.350630e+05 135063.000000 135063.000000 135063.000000 135063.000000 135063.000000 135063.000000 135063.000000 135063.000000 1.350630e+05
mean 43.753959 541.250002 123.950327 1.534887e+05 3.146225e+04 23296.150212 7693.165768 4528.021995 5938.297319 5438.213219 3287.127334 2201.230080 1612.545851 2.192913e+05
std 25.535582 258.016742 129.491359 1.965840e+05 4.483099e+04 45078.454602 7676.781993 5060.276133 6329.471736 10190.600213 4659.495994 10426.252898 7853.598487 2.719250e+05
min 0.000000 0.000000 10.000000 -2.300980e+04 -2.857890e+03 -29164.050000 0.000000 0.000000 0.000000 -458.190000 -106.800000 -1898.420000 0.000000 0.000000e+00
25% 22.000000 318.000000 40.000000 3.055294e+04 5.034130e+03 299.260000 2585.970000 1351.340000 1934.255000 150.075000 143.745000 0.000000 0.000000 4.791477e+04
50% 44.000000 539.000000 82.000000 8.262627e+04 1.547225e+04 5702.020000 5244.990000 2900.080000 4002.870000 1746.990000 1617.300000 0.000000 0.000000 1.234374e+05
75% 66.000000 763.000000 161.000000 2.006986e+05 4.037718e+04 24546.195000 10140.510000 5899.600000 7691.255000 6293.135000 4463.975000 0.000000 0.000000 2.872767e+05
max 89.000000 999.000000 2228.000000 5.263426e+06 1.546412e+06 885045.050000 155453.330000 194903.830000 276119.620000 254585.130000 106169.130000 399960.460000 321751.460000 9.012553e+06
df_postal_code.info()

<class 'pandas.core.frame.DataFrame'> Int64Index: 135063 entries, 2 to 135741 Data columns (total 15 columns): sex 135063 non-null category age 135063 non-null int64 POSTCODE_3 135063 non-null float64 number_citizens 135063 non-null int64 hospital_care 135063 non-null float64 pharmaceuticals 135063 non-null float64 mental_care 135063 non-null float64 GP_capitation 135063 non-null float64 GP_fee_for_service 135063 non-null float64 GP_other 135063 non-null float64 dental care 135063 non-null float64 physiotherapy 135063 non-null float64 maternity_care 135063 non-null float64 obstetrics 135063 non-null float64 health_expenditure_under_deductible 135063 non-null float64 dtypes: category(1), float64(12), int64(2) memory usage: 15.6 MB

Now let's define the costs per head. For each observation, we divide the total health care costs (under the deductible) for a combination of sex, age and postal code by the number of people in this combination of sex, age and postal code. This gives the health costs per head.

df_postal_code['health_costs_per_head'] = df_postal_code['health_expenditure_under_deductible']/df_postal_code['number_citizens']

So for, say, 18 year old males, we have a distribution of costs per head over the different postal codes. For each combination of age and sex, we can look at the average expenditure. With pandas this is easy to do. We use groupby, specify the dimensions over which we want to group, the variable we are interested in and give the function to aggregate (mean, in this case).

costs_per_sex_age = df_postal_code.groupby(['sex','age'])['health_costs_per_head'].mean()

2.2 matplotlib

Then we can plot this distribution of health care expenditure per head with age for males and females.

import matplotlib.pyplot as plt
plt.style.use('seaborn')
fig = plt.figure()
ax = costs_per_sex_age['M'].plot()
ax = costs_per_sex_age['V'].plot()
ax.set_xlabel('age')
ax.set_ylabel('costs per head')
ax.set_title('average costs per age and sex')
ax.legend(['male','female'])
fig.savefig("males.png")

males.png


Exercise

Finish the following code block to show how total obstetrics vary with 'age' and 'sex' (what would you guess…).

obstetrics_per_sex_age = df_postal_code.groupby(['sex','age'])['obstetrics'].sum()

....

fig.savefig("obstetrics.png")

We can plot a histogram of the distribution of hospital care expenditure across postal code areas.

hospital_care_expenditure = df_postal_code.groupby(['age','POSTCODE_3'])['hospital_care'].sum()

plt.clf()
plt.hist(hospital_care_expenditure[7],normed = True, bins = 100)
plt.show()

Exercise

In which fraction of postal code areas does hospital expenditures on 50 year olds exceed 50000 euro? Finish the following code block to find out.

sum()/len()

Suppose you are interested in the effect of the deductible on health care expenditure. Why would the following graph help for this?

plt.style.use('seaborn')
plt.clf()
age_range = [14,15,16,17,19,20,21,22]

plt.plot(age_range,costs_per_sex_age['M'][age_range], marker='.', label = 'male')
plt.plot(age_range,costs_per_sex_age['V'][age_range], marker='.', label = 'female')
plt.xlabel('age')
plt.ylabel('health care costs')
plt.legend()
fig.savefig('fig14to22.png')

fig14to22.png

2.3 plotly

Instead of matplotlib to plot, we can also use plotly. With plotly you can make interactive graphs. The graph runs on plotly's servers and can for instance be included in presentations.

We are going to plot the cumulative distribution functions of health care expenditure for different age groups. We first define the cumulative distribution function ecdf.

def ecdf(data):
    x = np.sort(data)
    y = np.arange(1.0, len(x)+1.0) / len(x)
    return x, y

Then we define the \(x\) and \(y\) coordinates of the functions we want to plot: the ecdf of health care expenditures for ages 16, 17, 19 and 20.

x_16, y_16 = ecdf(df_postal_code.health_costs_per_head[(df_postal_code['age'] == 16)])
x_17, y_17 = ecdf(df_postal_code.health_costs_per_head[(df_postal_code['age'] == 17)])
x_19, y_19 = ecdf(df_postal_code.health_costs_per_head[(df_postal_code['age'] == 19)])
x_20, y_20 = ecdf(df_postal_code.health_costs_per_head[(df_postal_code['age'] == 20)])


We import plotly.

import plotly.plotly as py
from plotly.graph_objs import *
import plotly.tools as tls

Finally, we define the graph itself. We specify the "Scatter's" and the layout. The web address can be used if you want to include this graph in a presentation.

age16 = Scatter(
    x=x_16,
    y=y_16,
    mode='markers+lines',
    name = 'age 16'
)
age17 = Scatter(
    x=x_17,
    y=y_17,
    mode='markers+lines',
    name = 'age 17'
)
age19 = Scatter(
    x=x_19,
    y=y_19,
    mode='markers+lines',
    name = 'age 19'
)
age20 = Scatter(
    x=x_20,
    y=y_20,
    mode='markers+lines',
    name = 'age 20'
)

layout = Layout(
    title='Health care expend. distribution functions',
    xaxis=XAxis(
        range=[0,3000],
        title='expenditure per head',
        titlefont=Font(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    yaxis=YAxis(
        title='cum. distribution function',
        titlefont=Font(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    )
)

data = Data([age16,age17,age19,age20])
fig = Figure(data=data, layout=layout)
py.plot(fig, filename='Distribution functions of health care expenditure per head')
tls.embed("https://plot.ly/~janboone/301")

<plotly.tools.PlotlyDisplay object>

3 A first look at machine learning

Python is used a lot in data science. If you are interested you can check out libraries like TensorFlow and keras. We will have a (brief) look at scikit-learn. If you want to know more, go to Datacamp and follow the course before your subscription runs out…

We will use the data set above and see whether we can distinguish different age-categories based on their health care expenditure. From the dataframe we select the age categories 25 and 70. Then we use the expenditure per head in each category to predict the age-sex category of this postal code area.

As above we import the relevant libraries and data.

import numpy as np
import pandas as pd
import matplotlib as plt
df_postal_code = pd.read_csv('Vektis_Open_Databestand_Zorgverzekeringswet_2014_-_postcode3.csv', sep = ';')
df_postal_code.dtypes

GESLACHT object LEEFTIJDSKLASSE object POSTCODE_3 float64 AANTAL_BSN int64 AANTAL_VERZEKERDEJAREN float64 KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG float64 KOSTEN_FARMACIE float64 KOSTEN_TWEEDELIJNS_GGZ float64 KOSTEN_HUISARTS_INSCHRIJFTARIEF float64 KOSTEN_HUISARTS_CONSULT float64 KOSTEN_HUISARTS_OVERIG float64 KOSTEN_HULPMIDDELEN float64 KOSTEN_MONDZORG float64 KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE float64 KOSTEN_PARAMEDISCHE_ZORG_OVERIG float64 KOSTEN_ZIEKENVERVOER_ZITTEND float64 KOSTEN_ZIEKENVERVOER_LIGGEND float64 KOSTEN_KRAAMZORG float64 KOSTEN_VERLOSKUNDIGE_ZORG float64 KOSTEN_GENERALISTISCHE_BASIS_GGZ float64 KOSTEN_GRENSOVERSCHRIJDENDE_ZORG float64 KOSTEN_EERSTELIJNS_ONDERSTEUNING float64 KOSTEN_GERIATRISCHE_REVALIDATIEZORG float64 KOSTEN_OVERIG float64 dtype: object

We now use a slightly different way to get the data into shape and hence we rename the function.

def get_data_into_shape_2(df):
    df = df.rename_axis({
        'GESLACHT':'sex',
        'LEEFTIJDSKLASSE':'age',
        'GEMEENTENAAM':'MUNICIPALITY',
        'AANTAL_BSN':'number_citizens',
        'KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG':'hospital_care',
        'KOSTEN_FARMACIE':'pharmaceuticals',
        'KOSTEN_TWEEDELIJNS_GGZ':'mental_care',
        'KOSTEN_HUISARTS_INSCHRIJFTARIEF':'GP_capitation',
        'KOSTEN_HUISARTS_CONSULT':'GP_fee_for_service',
        'KOSTEN_HUISARTS_OVERIG':'GP_other',
        'KOSTEN_MONDZORG':'dental_care',
        'KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE':'physiotherapy',
        'KOSTEN_KRAAMZORG':'maternity_care',
        'KOSTEN_VERLOSKUNDIGE_ZORG':'obstetrics',
        'KOSTEN_GERIATRISCHE_REVALIDATIEZORG':'geriatrics'
                        }, axis='columns')
    df.drop(['AANTAL_VERZEKERDEJAREN',
             'KOSTEN_HULPMIDDELEN',
             'KOSTEN_PARAMEDISCHE_ZORG_OVERIG',
             'KOSTEN_ZIEKENVERVOER_ZITTEND',
             'KOSTEN_ZIEKENVERVOER_LIGGEND',
             'KOSTEN_GRENSOVERSCHRIJDENDE_ZORG',
             'KOSTEN_OVERIG',
             'KOSTEN_GENERALISTISCHE_BASIS_GGZ',
             'KOSTEN_EERSTELIJNS_ONDERSTEUNING'],inplace=True,axis=1)
    df.drop(df.index[[0]], inplace=True)
    return df

With the function above, we first get the data into the shape that we want.

df_postal_code = get_data_into_shape_2(df_postal_code)

We are going to consider the ages 25 and 70. As age can be integer or string, we include both in the list of age-values that we wish to select.

df_25_70 = df_postal_code[df_postal_code['age'].isin(['25','70', 25, 70])]

We define the groups as "25M" for 25 year old males. In order to add the columns "sex" and "age", they need to be strings as in python adding the strings `"abc"+"def"` yields `"abcdef"`; which is exactly what we want.

df_25_70.age = df_25_70.age.astype('str')
df_25_70.sex = df_25_70.sex.astype('str')
df_25_70['target'] = df_25_70.age+df_25_70.sex

This combination of age and sex (4 categories) is the variable that we want to predict. Hence, we call the variable `target`. We turn `target` into a category and find that there are indeed 4 of these categories.

df_25_70['target'] = df_25_70['target'].astype('category')
df_25_70['target'].cat.categories

Index(['25.0M', '25V', '70M', '70V'], dtype='object')

With `.cat.codes` we turn our categories into integers 0, 1, 2, 3.

df_25_70['target'].cat.codes

We redefine expenditures in each care category as per head expenditure in the 3 digit postal code area. We then think of each area as being an "individual". Based on the individuals expenditure per care category, we predict age and sex. The following code redefines for each care category the total expenditure into an expenditure per head.

care_categories = ['hospital_care', 'pharmaceuticals', 'mental_care', 'GP_capitation', 'GP_fee_for_service', 'GP_other', 'dental_care', 'physiotherapy', 'maternity_care', 'obstetrics', 'geriatrics']

for variable in care_categories:
    df_25_70[variable] = df_25_70[variable]/df_25_70['number_citizens']

df_25_70.head()

sex age POSTCODE_3 number_citizens hospital_care pharmaceuticals \ 13802 M 25.0 0.0 235 421.241149 48.540128 13803 M 25.0 101.0 1042 280.514837 54.957889 13804 M 25.0 102.0 302 280.264238 56.674834 13805 M 25.0 103.0 325 527.719046 51.940738 13806 M 25.0 105.0 1318 383.394932 69.232117

mental_care GP_capitation GP_fee_for_service GP_other dental_care \ 13802 565.833574 53.478723 15.990043 42.183702 1.607149 13803 131.044347 51.035441 16.187322 33.947015 3.483599 13804 54.191556 53.066325 15.761258 29.762450 0.000000 13805 994.862554 53.742369 18.363385 34.537046 1.208769 13806 210.617398 52.937853 17.693247 32.583392 2.972602

physiotherapy maternity_care obstetrics geriatrics target 13802 1.490255 0.0 0.0 0.0 25.0M 13803 8.594050 0.0 0.0 0.0 25.0M 13804 22.604437 0.0 0.0 0.0 25.0M 13805 10.232308 0.0 0.0 0.0 25.0M 13806 8.429659 0.0 0.0 0.0 25.0M

sex age POSTCODE_3 number_citizens hospital_care pharmaceuticals mental_care GP_capitation GP_fee_for_service GP_other dental_care physiotherapy maternity_care obstetrics geriatrics target
13802 M 25.0 0.0 235 421.241149 48.540128 565.833574 53.478723 15.990043 42.183702 1.607149 1.490255 0.0 0.0 0.0 25.0M
13803 M 25.0 101.0 1042 280.514837 54.957889 131.044347 51.035441 16.187322 33.947015 3.483599 8.594050 0.0 0.0 0.0 25.0M
13804 M 25.0 102.0 302 280.264238 56.674834 54.191556 53.066325 15.761258 29.762450 0.000000 22.604437 0.0 0.0 0.0 25.0M
13805 M 25.0 103.0 325 527.719046 51.940738 994.862554 53.742369 18.363385 34.537046 1.208769 10.232308 0.0 0.0 0.0 25.0M
13806 M 25.0 105.0 1318 383.394932 69.232117 210.617398 52.937853 17.693247 32.583392 2.972602 8.429659 0.0 0.0 0.0 25.0M

The variable `target` is the variable we would like to predict. Hence, we call it \(y\). We choose a subset of health care expenditure categories (you can experiment with this yourself) as predictors (independent variables) and denote the variables in this subset by \(X\).

y = df_25_70.target

subset_care_categories = ['physiotherapy', 'obstetrics', 'geriatrics', 'pharmaceuticals']
X = df_25_70[subset_care_categories]


In this version of the notebook we use four cost categories to separate the different age-sex types. The algorithm below makes this separation in four dimensional space. To get a first intuition, we can consider the data points in two dimensional space, using pairwise combinations of the cost categories. The function `scatter_matrix` does this for each pairwise combination of the cost categories. By turning the variable `target` into integers 0,1,2,3 we can use this variable to color the points. Each category has its own color. For this translation to integers, we use the `.cat.codes` attribute that we saw above.

from pandas.tools.plotting import scatter_matrix # Import the function to plot a scatterplot
    
%matplotlib inline

scatter_matrix(
    X,  # drop the none feature columns
    figsize=(12, 12),                                             # square figuresize for the matrix
    alpha=0.5,                                                    # alpha of 0.5 to see overlapping dots
    s=50,                                                         # fixed size of 50
    c=[plt.cm.get_cmap('rainbow', 4)(idx) for idx in df_25_70['target'].cat.codes],
    diagonal='kde');                                              # diagonal line are feature distributions

<matplotlib.figure.Figure at 0x10cfe5c88> ob-ipython-839aaf4e559c09388a0b7de83fa61130.png

In order to classify points, we use the k-neighbours method. The idea is as follows. To classify a point, we consider its \(k\) closest neighbours. If the majority of these neighbours have label, say '70M', then we classify this point also as '70M'.

As the goal here is to predict, we do not worry about concepts like normal distribution, p-values, co-linearity. We simply split the data set into two subsets. We estimate (train) the model on the first data set. Then we apply the estimated model on the other (test) data. For the test data we calculate how often we get it right.

From scikit-learn we import the function `train_test_split`. This function splits our data \(X,y\) into a training and a testing data set. The size of the test data set is set at 30% here. We can set the seed (21) for the random number generator –don't worry if this does not mean anything to you. Finally, we stratify the data such that the distribution of labels is the same in the training and testing data.

Here we set the number of neighbours equal to \(k=8\). Higher values of \(k\) give smoother results and lead to a "simpler" model but misses local subtleties. The extreme is where \(k=n\) (the number of observations). Then all observations get the same label (the mode of the distribution).

Then we fit this model `knn` to our training data. After we fitted the model, we can predict labels in the test data set. The score indicates the percentage of labels we got right in the test data.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(knn.score(X_test, y_test))

0.77032967033

Now let us focus on women with age 25 and 70. Intuitively, with categories like 'obstetrics' and 'geriatrics' we should be able to separate these categories perfectly. And indeed we are.

df_female = df_25_70[df_25_70['target'].isin(['25V','70V'])]

y_f = df_female.target

subset_care_categories = ['physiotherapy', 'obstetrics', 'geriatrics', 'pharmaceuticals']
X_f = df_female[subset_care_categories]

scatter_matrix(
    X_f,  # drop the none feature columns
    figsize=(12, 12),                                             # square figuresize for the matrix
    alpha=0.5,                                                    # alpha of 0.5 to see overlapping dots
    s=50,                                                         # fixed size of 50
    c=[plt.cm.get_cmap('rainbow', 4)(idx) for idx in df_female['target'].cat.codes],
    diagonal='kde');                                              # diagonal line are feature distributions


<matplotlib.figure.Figure at 0x120f415c0> ob-ipython-59e2750aa18291b80acb71d3a3dfbf1a.png

X_train, X_test, y_train, y_test = train_test_split(X_f, y_f, test_size=0.3, random_state=21, stratify=y_f)
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(knn.score(X_test, y_test))

0.991169977925