Plotting health data with python

1 Health care expenditures per municipality

We start with health care data on the level of municipalities. We will plot how health care expenditure varies with location.

Some institutional background:

in the Netherlands health insurance is mandatory
people aged 18 or older face a deductible of 365 euros in 2014; no deductible for people below 18
the deductible applies to a number of cost categories in the "basic packages"; see below
people can –voluntarily– increase their deductible with 100, 200, 300, 400 or 500 euro; but our data has no information on this

1.1 loading the data

The following code block loads the data. We use data from Vektis. We import the data, which is a 'csv' file with ";" as separator between columns. We also import the pandas library.

Then we inspect the first ten rows of our data set.

# First lets clear all previous python imports and variables by resetting the python kernel.
%reset -f


import pandas as pd

df_gemeente = pd.read_csv('Vektis_Open_Databestand_Zorgverzekeringswet_2014_-_gemeente.csv', sep = ';')

df_gemeente.head(10)

GESLACHT LEEFTIJDSKLASSE GEMEENTENAAM AANTAL_BSN AANTAL_VERZEKERDEJAREN \ 0 NaN NaN NaN 298383 185664.92 1 M 0 t/m 4 jaar AA EN HUNZE 507 468.83 2 M 0 t/m 4 jaar AALBURG 428 387.28 3 M 0 t/m 4 jaar AALSMEER 876 790.65 4 M 0 t/m 4 jaar AALTEN 667 600.00 5 M 0 t/m 4 jaar ACHTKARSPELEN 810 737.03 6 M 0 t/m 4 jaar ALBLASSERDAM 637 561.21 7 M 0 t/m 4 jaar ALBRANDSWAARD 735 663.22 8 M 0 t/m 4 jaar ALKMAAR 2967 2660.93 9 M 0 t/m 4 jaar ALMELO 1906 1714.63

KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG KOSTEN_FARMACIE \ 0 48661669.94 9219422.33 1 473370.77 43305.49 2 340442.37 23395.86 3 932841.57 54597.52 4 515879.78 43516.71 5 766456.80 88084.16 6 1252440.29 140225.60 7 927635.48 114174.07 8 4005510.06 258936.51 9 3518922.31 254976.19

KOSTEN_TWEEDELIJNS_GGZ KOSTEN_HUISARTS_INSCHRIJFTARIEF \ 0 7475481.90 4092492.35 1 14056.81 26239.56 2 5555.83 21673.15 3 34355.66 43971.27 4 2404.19 33749.10 5 6681.81 41477.79 6 9715.96 30705.39 7 2043.52 37028.38 8 22599.87 147227.49 9 87437.11 96757.71

KOSTEN_HUISARTS_CONSULT … KOSTEN_PARAMEDISCHE_ZORG_OVERIG \ 0 1388439.07 … 290539.66 1 12829.29 … 13783.67 2 9417.98 … 15297.60 3 22413.37 … 14105.95 4 18472.64 … 31816.27 5 19157.34 … 30454.49 6 15936.39 … 19202.75 7 21611.09 … 30449.94 8 71141.10 … 67205.26 9 51693.62 … 81595.14

KOSTEN_ZIEKENVERVOER_ZITTEND KOSTEN_ZIEKENVERVOER_LIGGEND \ 0 210313.33 1398151.05 1 638.08 5004.86 2 315.06 6155.09 3 0.00 20465.44 4 1173.82 6282.28 5 204.30 17011.84 6 0.00 11421.05 7 0.00 13027.62 8 4086.51 70806.48 9 7742.31 32328.75

KOSTEN_KRAAMZORG KOSTEN_VERLOSKUNDIGE_ZORG \ 0 1286545.5 1072906.37 1 0.0 0.00 2 0.0 0.00 3 0.0 0.00 4 0.0 0.00 5 0.0 0.00 6 0.0 0.00 7 0.0 0.00 8 0.0 0.00 9 0.0 0.00

KOSTEN_GENERALISTISCHE_BASIS_GGZ KOSTEN_GRENSOVERSCHRIJDENDE_ZORG \ 0 490222.49 21946526.34 1 0.00 2571.36 2 0.00 186.08 3 157.44 1743.72 4 1422.67 62.75 5 0.00 206.11 6 0.00 662.75 7 0.00 1422.76 8 168.55 7153.37 9 3777.82 2492.23

KOSTEN_EERSTELIJNS_ONDERSTEUNING KOSTEN_GERIATRISCHE_REVALIDATIEZORG \ 0 523.35 351533.29 1 31.05 0.00 2 5.40 0.00 3 12.60 0.00 4 3.60 0.00 5 955.80 0.00 6 11.25 0.00 7 7.20 0.00 8 36.90 0.00 9 27.45 0.00

KOSTEN_OVERIG 0 849751.44 1 186.03 2 135.30 3 13285.23 4 107.40 5 453.66 6 165.60 7 368.37 8 2707.21 9 682.51

[10 rows x 24 columns]

	GESLACHT	LEEFTIJDSKLASSE	GEMEENTENAAM	AANTAL_BSN	AANTAL_VERZEKERDEJAREN	KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG	KOSTEN_FARMACIE	KOSTEN_TWEEDELIJNS_GGZ	KOSTEN_HUISARTS_INSCHRIJFTARIEF	KOSTEN_HUISARTS_CONSULT	...	KOSTEN_PARAMEDISCHE_ZORG_OVERIG	KOSTEN_ZIEKENVERVOER_ZITTEND	KOSTEN_ZIEKENVERVOER_LIGGEND	KOSTEN_KRAAMZORG	KOSTEN_VERLOSKUNDIGE_ZORG	KOSTEN_GENERALISTISCHE_BASIS_GGZ	KOSTEN_GRENSOVERSCHRIJDENDE_ZORG	KOSTEN_EERSTELIJNS_ONDERSTEUNING	KOSTEN_GERIATRISCHE_REVALIDATIEZORG	KOSTEN_OVERIG
0	NaN	NaN	NaN	298383	185664.92	48661669.94	9219422.33	7475481.90	4092492.35	1388439.07	...	290539.66	210313.33	1398151.05	1286545.5	1072906.37	490222.49	21946526.34	523.35	351533.29	849751.44
1	M	0 t/m 4 jaar	AA EN HUNZE	507	468.83	473370.77	43305.49	14056.81	26239.56	12829.29	...	13783.67	638.08	5004.86	0.0	0.00	0.00	2571.36	31.05	0.00	186.03
2	M	0 t/m 4 jaar	AALBURG	428	387.28	340442.37	23395.86	5555.83	21673.15	9417.98	...	15297.60	315.06	6155.09	0.0	0.00	0.00	186.08	5.40	0.00	135.30
3	M	0 t/m 4 jaar	AALSMEER	876	790.65	932841.57	54597.52	34355.66	43971.27	22413.37	...	14105.95	0.00	20465.44	0.0	0.00	157.44	1743.72	12.60	0.00	13285.23
4	M	0 t/m 4 jaar	AALTEN	667	600.00	515879.78	43516.71	2404.19	33749.10	18472.64	...	31816.27	1173.82	6282.28	0.0	0.00	1422.67	62.75	3.60	0.00	107.40
5	M	0 t/m 4 jaar	ACHTKARSPELEN	810	737.03	766456.80	88084.16	6681.81	41477.79	19157.34	...	30454.49	204.30	17011.84	0.0	0.00	0.00	206.11	955.80	0.00	453.66
6	M	0 t/m 4 jaar	ALBLASSERDAM	637	561.21	1252440.29	140225.60	9715.96	30705.39	15936.39	...	19202.75	0.00	11421.05	0.0	0.00	0.00	662.75	11.25	0.00	165.60
7	M	0 t/m 4 jaar	ALBRANDSWAARD	735	663.22	927635.48	114174.07	2043.52	37028.38	21611.09	...	30449.94	0.00	13027.62	0.0	0.00	0.00	1422.76	7.20	0.00	368.37
8	M	0 t/m 4 jaar	ALKMAAR	2967	2660.93	4005510.06	258936.51	22599.87	147227.49	71141.10	...	67205.26	4086.51	70806.48	0.0	0.00	168.55	7153.37	36.90	0.00	2707.21
9	M	0 t/m 4 jaar	ALMELO	1906	1714.63	3518922.31	254976.19	87437.11	96757.71	51693.62	...	81595.14	7742.31	32328.75	0.0	0.00	3777.82	2492.23	27.45	0.00	682.51

10 rows × 24 columns

We are going to be interested in total health care costs under the deductible. Hence, we specify the categories that fall under the deductible in the Netherlands in 2014. Further, we want to do this analysis in English. So we relabel the relevant column names into English.

We specify a list of columns (cost categories) across which we want to add costs. As the costs are total costs (not costs per head) we can indeed add these numbers. We leave out mental health care as it features its own financial incentives.

We specify the name of a new variable health_expenditure_under_deductible which is calculated on the list of columns specified and then the function that needs to be applied – sum in this case. Finally, we specify that the function needs to be applied row-wise: axis=1.

Then we give a dictionary with "old" variable names (in Dutch) and the "new" names in English for the variables where we want to use the English names.

cost_categories_under_deductible = ['KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG', 'KOSTEN_MONDZORG', 'KOSTEN_FARMACIE', 'KOSTEN_HULPMIDDELEN', 'KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE', 'KOSTEN_PARAMEDISCHE_ZORG_OVERIG', 'KOSTEN_ZIEKENVERVOER_ZITTEND', 'KOSTEN_ZIEKENVERVOER_LIGGEND', 'KOSTEN_GRENSOVERSCHRIJDENDE_ZORG', 'KOSTEN_GERIATRISCHE_REVALIDATIEZORG', 'KOSTEN_OVERIG']
df_gemeente['health_expenditure_under_deductible'] = df_gemeente[cost_categories_under_deductible].sum(axis=1)

df_gemeente = df_gemeente.rename_axis({
'GESLACHT':'sex',
'LEEFTIJDSKLASSE':'age',
'GEMEENTENAAM':'MUNICIPALITY',
'AANTAL_BSN':'number_citizens',
'KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG':'hospital_care',
'KOSTEN_FARMACIE':'pharmaceuticals',
'KOSTEN_TWEEDELIJNS_GGZ':'mental_care',
'KOSTEN_HUISARTS_INSCHRIJFTARIEF':'GP_capitation',
'KOSTEN_HUISARTS_CONSULT':'GP_fee_for_service',
'KOSTEN_HUISARTS_OVERIG':'GP_other',
'KOSTEN_MONDZORG':'dental care',
'KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE':'physiotherapy',
'KOSTEN_KRAAMZORG':'maternity_care',
'KOSTEN_VERLOSKUNDIGE_ZORG':'obstetrics'
}, axis='columns')

df_gemeente.dtypes

We drop the columns that we no longer need.

df_gemeente.drop(['AANTAL_VERZEKERDEJAREN',
'KOSTEN_HULPMIDDELEN',
'KOSTEN_PARAMEDISCHE_ZORG_OVERIG',
'KOSTEN_ZIEKENVERVOER_ZITTEND',
'KOSTEN_ZIEKENVERVOER_LIGGEND',
'KOSTEN_GRENSOVERSCHRIJDENDE_ZORG',
'KOSTEN_GERIATRISCHE_REVALIDATIEZORG',
'KOSTEN_OVERIG',
'KOSTEN_GENERALISTISCHE_BASIS_GGZ',
'KOSTEN_EERSTELIJNS_ONDERSTEUNING'],inplace=True,axis=1)
df_gemeente.columns

Let's look again at the first couple of lines of our data set. The head method presents the first 5 line as default.

df_gemeente.head()

We are not interested in the first line, so we drop it. Indeed, our data set now starts with the first municipality 'AA EN HUNZE'.

df_gemeente.drop(df_gemeente.index[[0]], inplace=True)
df_gemeente.head()

sex age MUNICIPALITY number_citizens hospital_care \ 1 M 0 t/m 4 jaar AA EN HUNZE 507 473370.77 2 M 0 t/m 4 jaar AALBURG 428 340442.37 3 M 0 t/m 4 jaar AALSMEER 876 932841.57 4 M 0 t/m 4 jaar AALTEN 667 515879.78 5 M 0 t/m 4 jaar ACHTKARSPELEN 810 766456.80

pharmaceuticals mental_care GP_capitation GP_fee_for_service GP_other \ 1 43305.49 14056.81 26239.56 12829.29 36034.65 2 23395.86 5555.83 21673.15 9417.98 20159.19 3 54597.52 34355.66 43971.27 22413.37 61629.32 4 43516.71 2404.19 33749.10 18472.64 46720.61 5 88084.16 6681.81 41477.79 19157.34 53633.01

dental care physiotherapy maternity_care obstetrics \ 1 9311.14 15968.80 0.0 0.0 2 7213.81 6135.05 0.0 0.0 3 19042.00 20086.43 0.0 0.0 4 12909.41 20762.75 0.0 0.0 5 16695.10 23423.96 0.0 0.0

health_expenditure_under_deductible 1 576750.31 2 406856.55 3 1093297.37 4 652523.86 5 954494.16

Now let's consider data types.

df_gemeente.dtypes

sex object age object MUNICIPALITY object number_citizens int64 hospital_care float64 pharmaceuticals float64 mental_care float64 GP_capitation float64 GP_fee_for_service float64 GP_other float64 dental care float64 physiotherapy float64 maternity_care float64 obstetrics float64 health_expenditure_under_deductible float64 dtype: object

The first three variables are seen as "object", that is, strings. This is fine for MUNICIPALITY but is not quite right for sex and age as these are categories. So let's relabel their types.

df_gemeente['sex'] = df_gemeente['sex'].astype('category')
df_gemeente['age'] = df_gemeente['age'].astype('category')
df_gemeente.info()

<class 'pandas.core.frame.DataFrame'> Int64Index: 14808 entries, 1 to 14808 Data columns (total 15 columns): sex 14808 non-null category age 14808 non-null category MUNICIPALITY 14808 non-null object number_citizens 14808 non-null int64 hospital_care 14808 non-null float64 pharmaceuticals 14808 non-null float64 mental_care 14808 non-null float64 GP_capitation 14808 non-null float64 GP_fee_for_service 14808 non-null float64 GP_other 14808 non-null float64 dental care 14808 non-null float64 physiotherapy 14808 non-null float64 maternity_care 14808 non-null float64 obstetrics 14808 non-null float64 health_expenditure_under_deductible 14808 non-null float64 dtypes: category(2), float64(11), int64(1), object(1) memory usage: 1.6+ MB

We can select rows from a dataframe using loc. Below, we consider the municipality Breda and people between 10 and 14 years old.

df_gemeente.loc[(df_gemeente.MUNICIPALITY == 'BREDA') & (df_gemeente['age'] == '10 t/m 14 jaar')]

sex age MUNICIPALITY number_citizens hospital_care \ 832 M 10 t/m 14 jaar BREDA 5206 2215947.11 8234 V 10 t/m 14 jaar BREDA 4915 1425550.97

pharmaceuticals mental_care GP_capitation GP_fee_for_service \ 832 381799.92 920439.00 301494.04 88705.2 8234 255232.99 564944.21 284269.76 91482.6

GP_other dental care physiotherapy maternity_care obstetrics \ 832 142402.72 677836.15 187429.87 0.0 0.0 8234 134486.26 620303.59 202916.68 0.0 0.0

health_expenditure_under_deductible 832 3756487.05 8234 2795294.25

	sex	age	MUNICIPALITY	number_citizens	hospital_care	pharmaceuticals	mental_care	GP_capitation	GP_fee_for_service	GP_other	dental care	physiotherapy	maternity_care	obstetrics	health_expenditure_under_deductible
832	M	10 t/m 14 jaar	BREDA	5206	2215947.11	381799.92	920439.00	301494.04	88705.2	142402.72	677836.15	187429.87	0.0	0.0	3756487.05
8234	V	10 t/m 14 jaar	BREDA	4915	1425550.97	255232.99	564944.21	284269.76	91482.6	134486.26	620303.59	202916.68	0.0	0.0	2795294.25

If we are interested in one variable, e.g. number_citizens, we can select this as well.

df_gemeente.loc[(df_gemeente.MUNICIPALITY == 'BREDA') & (df_gemeente['age'] == '10 t/m 14 jaar')]['number_citizens']

Exercise

Calculate how many citizens live in Breda between the age of 10 and 14.

We can also give lists of row values that we are interested in:

df_gemeente.loc[(df_gemeente.MUNICIPALITY.isin(['BREDA', 'TILBURG'])) & (df_gemeente['age'] == '10 t/m 14 jaar')]['number_citizens']

832 5206 1098 5817 8234 4915 8500 5651 Name: number_citizens, dtype: int64

Exercise

Calculate total health care expenditures under the deductible for people living in Amsterdam between the ages of 10 and 19 years old.

Now that we have the data ready, we are going to plot health care expenditures on the map of the Netherlands.

1.2 geographical figures

We have map data that links the name of a municipality to coordinates on the map. In this map data, the names of municipalities are capitalized under standard Dutch capitalization like "Aa en Hunze". In our Vektis data, the names of municipalities are written in capitals. There are a number of ways to resolve this. To illustrate the merge command, we use a file with two columns: 1. the names of municipalities all capitalized and 2. normal capitalization. We drop the rows where there is no value for municipality (if such rows exist). We merge our data df_gemeente with the dataframe Gemeentes. We use a "left-merge", so all rows in the first dataframe df_gemeente are kept.

Gemeentes = pd.read_excel('Gemeentes.xlsx')
df_gemeente = df_gemeente.dropna(subset=['MUNICIPALITY'])
df_gem_merged = pd.merge(df_gemeente,Gemeentes,on=['MUNICIPALITY'],how='left')
df_gem_merged.head()

	sex	age	MUNICIPALITY	number_citizens	hospital_care	pharmaceuticals	mental_care	GP_capitation	GP_fee_for_service	GP_other	dental care	physiotherapy	health_expenditure_under_deductible	Municipality
0	M	0 t/m 4 jaar	AA EN HUNZE	507	473370.77	43305.49	14056.81	26239.56	12829.29	36034.65	9311.14	15968.80	576750.31	Aa en Hunze
1	M	0 t/m 4 jaar	AALBURG	428	340442.37	23395.86	5555.83	21673.15	9417.98	20159.19	7213.81	6135.05	406856.55	Aalburg
2	M	0 t/m 4 jaar	AALSMEER	876	932841.57	54597.52	34355.66	43971.27	22413.37	61629.32	19042.00	20086.43	1093297.37	Aalsmeer
3	M	0 t/m 4 jaar	AALTEN	667	515879.78	43516.71	2404.19	33749.10	18472.64	46720.61	12909.41	20762.75	652523.86	Aalten
4	M	0 t/m 4 jaar	ACHTKARSPELEN	810	766456.80	88084.16	6681.81	41477.79	19157.34	53633.01	16695.10	23423.96	954494.16	Achtkarspelen

sex age MUNICIPALITY number_citizens hospital_care \ 0 M 0 t/m 4 jaar AA EN HUNZE 507 473370.77 1 M 0 t/m 4 jaar AALBURG 428 340442.37 2 M 0 t/m 4 jaar AALSMEER 876 932841.57 3 M 0 t/m 4 jaar AALTEN 667 515879.78 4 M 0 t/m 4 jaar ACHTKARSPELEN 810 766456.80

pharmaceuticals mental_care GP_capitation GP_fee_for_service GP_other \ 0 43305.49 14056.81 26239.56 12829.29 36034.65 1 23395.86 5555.83 21673.15 9417.98 20159.19 2 54597.52 34355.66 43971.27 22413.37 61629.32 3 43516.71 2404.19 33749.10 18472.64 46720.61 4 88084.16 6681.81 41477.79 19157.34 53633.01

dental care physiotherapy maternity_care obstetrics \ 0 9311.14 15968.80 0.0 0.0 1 7213.81 6135.05 0.0 0.0 2 19042.00 20086.43 0.0 0.0 3 12909.41 20762.75 0.0 0.0 4 16695.10 23423.96 0.0 0.0

health_expenditure_under_deductible Municipality 0 576750.31 Aa en Hunze 1 406856.55 Aalburg 2 1093297.37 Aalsmeer 3 652523.86 Aalten 4 954494.16 Achtkarspelen

We are going to plot expenditure under the deductible per head for each municipality. We need a couple of steps in order to do this:

we add –for each municipality– the expenditures under the deductible across age groups;
we add –for each municipality– the number of people across age groups;
we divide –for each municipality– the expenditures by the number of people.

With pandas this is straightforward to do using groupby. We do the groupby on the municipality. For each municipality there are different age groups and we need to aggregate over these age groups. We specify the variables that we want to know at the municipality level. In this case health_expenditure_under_deductible and number_citizens. Finally, we specify the function with which to aggregate. Here we use the 'built-in' function sum(). Other functions we can use include mean, min, max etc. You can also specify your own function and apply this using agg().

Then hospital care per head can be defined as the total expenditure per municipality divided by the total number of citizens per municipality.

results = df_gem_merged.groupby('Municipality')[['health_expenditure_under_deductible','number_citizens']].sum()
results['expenditure_per_head'] = results['health_expenditure_under_deductible']/results['number_citizens']
results.head()

	health_expenditure_under_deductible	number_citizens	expenditure_per_head
Municipality
's-Gravenhage	9.115807e+08	509206	1790.200148
's-Hertogenbosch	2.697598e+08	151934	1775.506394
Aa en Hunze	4.696741e+07	25415	1848.019292
Aalburg	2.189368e+07	12871	1701.008568
Aalsmeer	4.864200e+07	31365	1550.837018

health_expenditure_under_deductible number_citizens \ Municipality 's-Gravenhage 9.115807e+08 509206 's-Hertogenbosch 2.697598e+08 151934 Aa en Hunze 4.696741e+07 25415 Aalburg 2.189368e+07 12871 Aalsmeer 4.864200e+07 31365

expenditure_per_head Municipality 's-Gravenhage 1790.200148 's-Hertogenbosch 1775.506394 Aa en Hunze 1848.019292 Aalburg 1701.008568 Aalsmeer 1550.837018

# Create a dataframe
p_results = pd.DataFrame(dict(
    municipality = results.index,
    expenditure = results['expenditure_per_head']
    ))

p_results.head()

	expenditure	municipality
Municipality
's-Gravenhage	1790.200148	's-Gravenhage
's-Hertogenbosch	1775.506394	's-Hertogenbosch
Aa en Hunze	1848.019292	Aa en Hunze
Aalburg	1701.008568	Aalburg
Aalsmeer	1550.837018	Aalsmeer

expenditure municipality Municipality 's-Gravenhage 1790.200148 's-Gravenhage 's-Hertogenbosch 1775.506394 's-Hertogenbosch Aa en Hunze 1848.019292 Aa en Hunze Aalburg 1701.008568 Aalburg Aalsmeer 1550.837018 Aalsmeer

from IPython.display import display, IFrame
import folium

geo_path = r'Gemeentegrenzen_2016_zonder_water_simplified_wgs84.geojson'


ref_map = folium.Map(
    location=[52.139177, 5.327108], # This will center the view on the world map where the Netherlands is located
    tiles='Mapbox Bright',          # This creates a base map and in this case its the Mapbox Bright basemap
    zoom_start=8)                   # This will zoom in on the center of view to get the Netherlands in full frame

ref_map.choropleth(
    # This is the path to the geojson file that contains all the municipality shapes and locations
    geo_path=geo_path,
    # We will use the p_results dataframe for the choropleth mapping
    data=p_results,
    # municipality will be used for the mapping key and expenditure for its value
    columns=['municipality', 'expenditure'],
    # Use GM_NAAM (short for municipality name) as keys for colormapping
    key_on='feature.properties.GM_NAAM',
    # We are going to use a color map from yellow to green
    fill_color='YlGn',
    # This gives municipality shapes some opacity so that we can still see the background
    fill_opacity=0.7,
    # This gives the lines around the municipality shapes some opacity so that they don't stand out too much
    line_opacity=0.2,
    # The legend
    legend_name='health care expenditure per head')

ref_map.save('health_expenditure.html')              # This will save the map in a HTML format

display(IFrame('health_expenditure.html', 800,800))  # This displays the map in an Iframe

<IPython.lib.display.IFrame at 0x10ff2d160>

Exercise

Plot number of citizens per municipality using color scheme 'OrRd'.

2 Health care expenditure and age

The municipality data set above does not give the health care expenditure per age; only per age group (like 0-4 year olds). So we load another data set that does feature health care expenditure per age.

2.1 read in the data

Again, we use data from Vektis. We import the data, which is a 'csv' file with ";" as separator between columns. We also import some libraries.

Then we look at the columns (variables) in the data.

import numpy as np
import pandas as pd
import matplotlib as plt
df_postal_code = pd.read_csv('Vektis_Open_Databestand_Zorgverzekeringswet_2014_-_postcode3.csv', sep = ';')
df_postal_code.dtypes

GESLACHT object LEEFTIJDSKLASSE object POSTCODE_3 float64 AANTAL_BSN int64 AANTAL_VERZEKERDEJAREN float64 KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG float64 KOSTEN_FARMACIE float64 KOSTEN_TWEEDELIJNS_GGZ float64 KOSTEN_HUISARTS_INSCHRIJFTARIEF float64 KOSTEN_HUISARTS_CONSULT float64 KOSTEN_HUISARTS_OVERIG float64 KOSTEN_HULPMIDDELEN float64 KOSTEN_MONDZORG float64 KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE float64 KOSTEN_PARAMEDISCHE_ZORG_OVERIG float64 KOSTEN_ZIEKENVERVOER_ZITTEND float64 KOSTEN_ZIEKENVERVOER_LIGGEND float64 KOSTEN_KRAAMZORG float64 KOSTEN_VERLOSKUNDIGE_ZORG float64 KOSTEN_GENERALISTISCHE_BASIS_GGZ float64 KOSTEN_GRENSOVERSCHRIJDENDE_ZORG float64 KOSTEN_EERSTELIJNS_ONDERSTEUNING float64 KOSTEN_GERIATRISCHE_REVALIDATIEZORG float64 KOSTEN_OVERIG float64 dtype: object

This looks very much like the data set above, so we want to do the same steps to get the data into the shape we want. In fact, if you go to the website Vektis there are similar data sets for other years. Copy and paste the steps above and then apply these steps to the new data sets is asking for trouble:

you are likely to make mistakes with copy/paste
if you figure out that you want to change one of your commands, you have to change all the pasted versions as well

One solution in python is to define a function that does all these steps for you and apply this function to all the data sets that you want to work with.

cost_categories_under_deductible = ['KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG', 'KOSTEN_MONDZORG', 'KOSTEN_FARMACIE', 'KOSTEN_HULPMIDDELEN', 'KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE', 'KOSTEN_PARAMEDISCHE_ZORG_OVERIG', 'KOSTEN_ZIEKENVERVOER_ZITTEND', 'KOSTEN_ZIEKENVERVOER_LIGGEND', 'KOSTEN_GRENSOVERSCHRIJDENDE_ZORG', 'KOSTEN_GERIATRISCHE_REVALIDATIEZORG', 'KOSTEN_OVERIG']




def get_data_into_shape(df):
    df['health_expenditure_under_deductible'] = df[cost_categories_under_deductible].sum(axis=1)
    df = df.rename_axis({
        'GESLACHT':'sex',
        'LEEFTIJDSKLASSE':'age',
        'GEMEENTENAAM':'MUNICIPALITY',
        'AANTAL_BSN':'number_citizens',
        'KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG':'hospital_care',
        'KOSTEN_FARMACIE':'pharmaceuticals',
        'KOSTEN_TWEEDELIJNS_GGZ':'mental_care',
        'KOSTEN_HUISARTS_INSCHRIJFTARIEF':'GP_capitation',
        'KOSTEN_HUISARTS_CONSULT':'GP_fee_for_service',
        'KOSTEN_HUISARTS_OVERIG':'GP_other',
        'KOSTEN_MONDZORG':'dental care',
        'KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE':'physiotherapy',
        'KOSTEN_KRAAMZORG':'maternity_care',
        'KOSTEN_VERLOSKUNDIGE_ZORG':'obstetrics'
    }, axis='columns')
    df.drop(['AANTAL_VERZEKERDEJAREN',
             'KOSTEN_HULPMIDDELEN',
             'KOSTEN_PARAMEDISCHE_ZORG_OVERIG',
             'KOSTEN_ZIEKENVERVOER_ZITTEND',
             'KOSTEN_ZIEKENVERVOER_LIGGEND',
             'KOSTEN_GRENSOVERSCHRIJDENDE_ZORG',
             'KOSTEN_GERIATRISCHE_REVALIDATIEZORG',
             'KOSTEN_OVERIG',
             'KOSTEN_GENERALISTISCHE_BASIS_GGZ',
             'KOSTEN_EERSTELIJNS_ONDERSTEUNING'],inplace=True,axis=1)
    df.drop(df.index[[0]], inplace=True)
    df['sex'] = df['sex'].astype('category')
    df['age'] = df['age'].astype('category')
    return df

df_postal_code = get_data_into_shape(df_postal_code)

df_postal_code.head()

sex age POSTCODE_3 number_citizens hospital_care pharmaceuticals \ 1 M 0 0.0 366 1372209.26 31191.20 2 M 0 101.0 590 1682944.17 25898.73 3 M 0 102.0 295 1553933.53 29514.18 4 M 0 103.0 288 827427.31 19263.79 5 M 0 105.0 998 2965316.12 61610.42

mental_care GP_capitation GP_fee_for_service GP_other dental care \ 1 285.98 5548.60 5540.05 11525.93 681.02 2 20774.91 9816.63 10130.12 20532.03 0.00 3 7970.01 5317.49 6576.70 17426.30 21.29 4 941.40 5014.97 5708.41 14168.90 0.00 5 4780.48 16842.06 19676.01 43794.06 166.98

physiotherapy maternity_care obstetrics \ 1 12150.91 0.0 0.0 2 17777.00 0.0 0.0 3 20459.17 0.0 0.0 4 9098.71 0.0 0.0 5 42332.18 0.0 0.0

health_expenditure_under_deductible 1 1425823.15 2 1753560.87 3 1617184.58 4 865867.07 5 3118357.71

	sex	POSTCODE_3	number_citizens	hospital_care	pharmaceuticals	mental_care	GP_capitation	GP_fee_for_service	GP_other	dental care	physiotherapy	health_expenditure_under_deductible
1	M	0.0	366	1372209.26	31191.20	285.98	5548.60	5540.05	11525.93	681.02	12150.91	1425823.15
2	M	101.0	590	1682944.17	25898.73	20774.91	9816.63	10130.12	20532.03	0.00	17777.00	1753560.87
3	M	102.0	295	1553933.53	29514.18	7970.01	5317.49	6576.70	17426.30	21.29	20459.17	1617184.58
4	M	103.0	288	827427.31	19263.79	941.40	5014.97	5708.41	14168.90	0.00	9098.71	865867.07
5	M	105.0	998	2965316.12	61610.42	4780.48	16842.06	19676.01	43794.06	166.98	42332.18	3118357.71

The first three columns are 'sex', 'age' and 'postal code' (3 digit). These 3 variables combined determine a unique observation. We think of these observations as if they are from an individual (but an observation is an average, like the average over 18 year old males in postal code 102).

Note that the first postal code is '000' which python thinks of as '0.0'. The data description explains that postal code '000' is used to aggregate people who live in a postal code with so few people that the privacy of their data is no longer guaranteed. As we want to think of sex, age and postal code as an observation, we drop the first row (labelled as 0) of the dataframe.

df_postal_code.drop(df_postal_code.index[[0]], inplace=True)
df_postal_code.head()

sex age POSTCODE_3 number_citizens hospital_care pharmaceuticals \ 2 M 0 101.0 590 1682944.17 25898.73 3 M 0 102.0 295 1553933.53 29514.18 4 M 0 103.0 288 827427.31 19263.79 5 M 0 105.0 998 2965316.12 61610.42 6 M 0 106.0 1056 3716382.22 87140.60

mental_care GP_capitation GP_fee_for_service GP_other dental care \ 2 20774.91 9816.63 10130.12 20532.03 0.00 3 7970.01 5317.49 6576.70 17426.30 21.29 4 941.40 5014.97 5708.41 14168.90 0.00 5 4780.48 16842.06 19676.01 43794.06 166.98 6 25006.18 19517.84 24045.35 65572.64 114.05

physiotherapy maternity_care obstetrics \ 2 17777.00 0.0 0.0 3 20459.17 0.0 0.0 4 9098.71 0.0 0.0 5 42332.18 0.0 0.0 6 28299.76 0.0 0.0

health_expenditure_under_deductible 2 1753560.87 3 1617184.58 4 865867.07 5 3118357.71 6 3885368.16

	sex	POSTCODE_3	number_citizens	hospital_care	pharmaceuticals	mental_care	GP_capitation	GP_fee_for_service	GP_other	dental care	physiotherapy	health_expenditure_under_deductible
2	M	101.0	590	1682944.17	25898.73	20774.91	9816.63	10130.12	20532.03	0.00	17777.00	1753560.87
3	M	102.0	295	1553933.53	29514.18	7970.01	5317.49	6576.70	17426.30	21.29	20459.17	1617184.58
4	M	103.0	288	827427.31	19263.79	941.40	5014.97	5708.41	14168.90	0.00	9098.71	865867.07
5	M	105.0	998	2965316.12	61610.42	4780.48	16842.06	19676.01	43794.06	166.98	42332.18	3118357.71
6	M	106.0	1056	3716382.22	87140.60	25006.18	19517.84	24045.35	65572.64	114.05	28299.76	3885368.16

The end of the dataframe is given by the following.

df_postal_code.tail(10)

sex age POSTCODE_3 number_citizens hospital_care pharmaceuticals \ 136463 V 90+ 988.0 10 19698.83 4011.31 136464 V 90+ 990.0 151 257046.54 99187.66 136465 V 90+ 991.0 51 95990.43 52682.34 136466 V 90+ 993.0 170 278000.11 124809.41 136467 V 90+ 994.0 38 28454.41 36590.90 136468 V 90+ 995.0 88 200183.72 64315.53 136469 V 90+ 996.0 44 46723.13 39419.64 136470 V 90+ 997.0 38 98954.45 34308.68 136471 V 90+ 998.0 116 168802.54 116907.93 136472 V 90+ 999.0 38 109842.07 40607.06

mental_care GP_capitation GP_fee_for_service GP_other dental care \ 136463 0.00 894.25 567.45 1283.11 0.00 136464 37614.24 11880.75 20144.21 24522.70 1619.74 136465 1102.77 4854.50 14575.32 15550.40 801.65 136466 12652.77 12646.44 13305.75 19040.63 2495.54 136467 2251.30 3652.03 5742.81 16966.62 143.16 136468 3691.37 6438.60 11593.30 15929.91 2729.32 136469 2833.17 4011.35 5459.40 15185.71 979.06 136470 4480.09 3347.05 5395.18 7061.51 897.98 136471 13830.16 10424.40 13527.80 28548.80 1577.91 136472 3273.62 3704.75 4197.46 6763.69 60.97

physiotherapy maternity_care obstetrics \ 136463 0.0 0.0 0.0 136464 12000.6 0.0 0.0 136465 462.0 0.0 0.0 136466 1675.0 0.0 0.0 136467 1409.2 0.0 0.0 136468 4352.1 0.0 0.0 136469 6537.2 0.0 0.0 136470 9201.0 0.0 0.0 136471 1875.6 0.0 0.0 136472 290.0 0.0 0.0

health_expenditure_under_deductible 136463 33079.01 136464 790837.02 136465 207319.25 136466 535215.22 136467 106569.94 136468 378170.95 136469 156304.71 136470 193232.92 136471 455608.75 136472 198874.20

As we saw above, the datatype of age was object, although we would expect integer. Now we see that there is this category 90+, which is not an integer. We will drop this age category as it is quite special. Before we do this, let's count how many people we have in our dataset.

df_postal_code['number_citizens'].sum()

16885677

That is, almost 17 million people, which is about right.

Let's drop the '90+' category and turn age into an integer variable.

df_postal_code = df_postal_code[(df_postal_code['age'] != '90+')]
df_postal_code['age'] = df_postal_code['age'].astype(int)

Let's summarize the variables that are numeric (integer or float). For each of these variables we have 135,063 observations (that is, combinations of sex, age and postal code). The mean for number of citizens is 124. With \(135,063*124\) we are close to 17 million again.

df_postal_code.describe()

age POSTCODE_3 number_citizens hospital_care \ count 135063.000000 135063.000000 135063.000000 1.350630e+05 mean 43.753959 541.250002 123.950327 1.534887e+05 std 25.535582 258.016742 129.491359 1.965840e+05 min 0.000000 0.000000 10.000000 -2.300980e+04 25% 22.000000 318.000000 40.000000 3.055294e+04 50% 44.000000 539.000000 82.000000 8.262627e+04 75% 66.000000 763.000000 161.000000 2.006986e+05 max 89.000000 999.000000 2228.000000 5.263426e+06

pharmaceuticals mental_care GP_capitation GP_fee_for_service \ count 1.350630e+05 135063.000000 135063.000000 135063.000000 mean 3.146225e+04 23296.150212 7693.165768 4528.021995 std 4.483099e+04 45078.454602 7676.781993 5060.276133 min -2.857890e+03 -29164.050000 0.000000 0.000000 25% 5.034130e+03 299.260000 2585.970000 1351.340000 50% 1.547225e+04 5702.020000 5244.990000 2900.080000 75% 4.037718e+04 24546.195000 10140.510000 5899.600000 max 1.546412e+06 885045.050000 155453.330000 194903.830000

GP_other dental care physiotherapy maternity_care \ count 135063.000000 135063.000000 135063.000000 135063.000000 mean 5938.297319 5438.213219 3287.127334 2201.230080 std 6329.471736 10190.600213 4659.495994 10426.252898 min 0.000000 -458.190000 -106.800000 -1898.420000 25% 1934.255000 150.075000 143.745000 0.000000 50% 4002.870000 1746.990000 1617.300000 0.000000 75% 7691.255000 6293.135000 4463.975000 0.000000 max 276119.620000 254585.130000 106169.130000 399960.460000

obstetrics health_expenditure_under_deductible count 135063.000000 1.350630e+05 mean 1612.545851 2.192913e+05 std 7853.598487 2.719250e+05 min 0.000000 0.000000e+00 25% 0.000000 4.791477e+04 50% 0.000000 1.234374e+05 75% 0.000000 2.872767e+05 max 321751.460000 9.012553e+06

	age	POSTCODE_3	number_citizens	hospital_care	pharmaceuticals	mental_care	GP_capitation	GP_fee_for_service	GP_other	dental care	physiotherapy	maternity_care	obstetrics	health_expenditure_under_deductible
count	135063.000000	135063.000000	135063.000000	1.350630e+05	1.350630e+05	135063.000000	135063.000000	135063.000000	135063.000000	135063.000000	135063.000000	135063.000000	135063.000000	1.350630e+05
mean	43.753959	541.250002	123.950327	1.534887e+05	3.146225e+04	23296.150212	7693.165768	4528.021995	5938.297319	5438.213219	3287.127334	2201.230080	1612.545851	2.192913e+05
std	25.535582	258.016742	129.491359	1.965840e+05	4.483099e+04	45078.454602	7676.781993	5060.276133	6329.471736	10190.600213	4659.495994	10426.252898	7853.598487	2.719250e+05
min	0.000000	0.000000	10.000000	-2.300980e+04	-2.857890e+03	-29164.050000	0.000000	0.000000	0.000000	-458.190000	-106.800000	-1898.420000	0.000000	0.000000e+00
25%	22.000000	318.000000	40.000000	3.055294e+04	5.034130e+03	299.260000	2585.970000	1351.340000	1934.255000	150.075000	143.745000	0.000000	0.000000	4.791477e+04
50%	44.000000	539.000000	82.000000	8.262627e+04	1.547225e+04	5702.020000	5244.990000	2900.080000	4002.870000	1746.990000	1617.300000	0.000000	0.000000	1.234374e+05
75%	66.000000	763.000000	161.000000	2.006986e+05	4.037718e+04	24546.195000	10140.510000	5899.600000	7691.255000	6293.135000	4463.975000	0.000000	0.000000	2.872767e+05
max	89.000000	999.000000	2228.000000	5.263426e+06	1.546412e+06	885045.050000	155453.330000	194903.830000	276119.620000	254585.130000	106169.130000	399960.460000	321751.460000	9.012553e+06

df_postal_code.info()

<class 'pandas.core.frame.DataFrame'> Int64Index: 135063 entries, 2 to 135741 Data columns (total 15 columns): sex 135063 non-null category age 135063 non-null int64 POSTCODE_3 135063 non-null float64 number_citizens 135063 non-null int64 hospital_care 135063 non-null float64 pharmaceuticals 135063 non-null float64 mental_care 135063 non-null float64 GP_capitation 135063 non-null float64 GP_fee_for_service 135063 non-null float64 GP_other 135063 non-null float64 dental care 135063 non-null float64 physiotherapy 135063 non-null float64 maternity_care 135063 non-null float64 obstetrics 135063 non-null float64 health_expenditure_under_deductible 135063 non-null float64 dtypes: category(1), float64(12), int64(2) memory usage: 15.6 MB

Now let's define the costs per head. For each observation, we divide the total health care costs (under the deductible) for a combination of sex, age and postal code by the number of people in this combination of sex, age and postal code. This gives the health costs per head.

df_postal_code['health_costs_per_head'] = df_postal_code['health_expenditure_under_deductible']/df_postal_code['number_citizens']

So for, say, 18 year old males, we have a distribution of costs per head over the different postal codes. For each combination of age and sex, we can look at the average expenditure. With pandas this is easy to do. We use groupby, specify the dimensions over which we want to group, the variable we are interested in and give the function to aggregate (mean, in this case).

costs_per_sex_age = df_postal_code.groupby(['sex','age'])['health_costs_per_head'].mean()

2.2 matplotlib

Then we can plot this distribution of health care expenditure per head with age for males and females.

import matplotlib.pyplot as plt
plt.style.use('seaborn')
fig = plt.figure()
ax = costs_per_sex_age['M'].plot()
ax = costs_per_sex_age['V'].plot()
ax.set_xlabel('age')
ax.set_ylabel('costs per head')
ax.set_title('average costs per age and sex')
ax.legend(['male','female'])
fig.savefig("males.png")

Exercise

Finish the following code block to show how total obstetrics vary with 'age' and 'sex' (what would you guess…).

obstetrics_per_sex_age = df_postal_code.groupby(['sex','age'])['obstetrics'].sum()

....

fig.savefig("obstetrics.png")

We can plot a histogram of the distribution of hospital care expenditure across postal code areas.

hospital_care_expenditure = df_postal_code.groupby(['age','POSTCODE_3'])['hospital_care'].sum()

plt.clf()
plt.hist(hospital_care_expenditure[7],normed = True, bins = 100)
plt.show()

Exercise

In which fraction of postal code areas does hospital expenditures on 50 year olds exceed 50000 euro? Finish the following code block to find out.

sum()/len()

Suppose you are interested in the effect of the deductible on health care expenditure. Why would the following graph help for this?

plt.style.use('seaborn')
plt.clf()
age_range = [14,15,16,17,19,20,21,22]

plt.plot(age_range,costs_per_sex_age['M'][age_range], marker='.', label = 'male')
plt.plot(age_range,costs_per_sex_age['V'][age_range], marker='.', label = 'female')
plt.xlabel('age')
plt.ylabel('health care costs')
plt.legend()
fig.savefig('fig14to22.png')

2.3 plotly

Instead of matplotlib to plot, we can also use plotly. With plotly you can make interactive graphs. The graph runs on plotly's servers and can for instance be included in presentations.

We are going to plot the cumulative distribution functions of health care expenditure for different age groups. We first define the cumulative distribution function ecdf.

def ecdf(data):
    x = np.sort(data)
    y = np.arange(1.0, len(x)+1.0) / len(x)
    return x, y

Then we define the \(x\) and \(y\) coordinates of the functions we want to plot: the ecdf of health care expenditures for ages 16, 17, 19 and 20.

x_16, y_16 = ecdf(df_postal_code.health_costs_per_head[(df_postal_code['age'] == 16)])
x_17, y_17 = ecdf(df_postal_code.health_costs_per_head[(df_postal_code['age'] == 17)])
x_19, y_19 = ecdf(df_postal_code.health_costs_per_head[(df_postal_code['age'] == 19)])
x_20, y_20 = ecdf(df_postal_code.health_costs_per_head[(df_postal_code['age'] == 20)])

We import plotly.

import plotly.plotly as py
from plotly.graph_objs import *
import plotly.tools as tls

Finally, we define the graph itself. We specify the "Scatter's" and the layout. The web address can be used if you want to include this graph in a presentation.

age16 = Scatter(
    x=x_16,
    y=y_16,
    mode='markers+lines',
    name = 'age 16'
)
age17 = Scatter(
    x=x_17,
    y=y_17,
    mode='markers+lines',
    name = 'age 17'
)
age19 = Scatter(
    x=x_19,
    y=y_19,
    mode='markers+lines',
    name = 'age 19'
)
age20 = Scatter(
    x=x_20,
    y=y_20,
    mode='markers+lines',
    name = 'age 20'
)

layout = Layout(
    title='Health care expend. distribution functions',
    xaxis=XAxis(
        range=[0,3000],
        title='expenditure per head',
        titlefont=Font(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    yaxis=YAxis(
        title='cum. distribution function',
        titlefont=Font(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    )
)

data = Data([age16,age17,age19,age20])
fig = Figure(data=data, layout=layout)
py.plot(fig, filename='Distribution functions of health care expenditure per head')
tls.embed("https://plot.ly/~janboone/301")

<plotly.tools.PlotlyDisplay object>

3 A first look at machine learning

Python is used a lot in data science. If you are interested you can check out libraries like TensorFlow and keras. We will have a (brief) look at scikit-learn. If you want to know more, go to Datacamp and follow the course before your subscription runs out…

We will use the data set above and see whether we can distinguish different age-categories based on their health care expenditure. From the dataframe we select the age categories 25 and 70. Then we use the expenditure per head in each category to predict the age-sex category of this postal code area.

As above we import the relevant libraries and data.

import numpy as np
import pandas as pd
import matplotlib as plt
df_postal_code = pd.read_csv('Vektis_Open_Databestand_Zorgverzekeringswet_2014_-_postcode3.csv', sep = ';')
df_postal_code.dtypes

GESLACHT object LEEFTIJDSKLASSE object POSTCODE_3 float64 AANTAL_BSN int64 AANTAL_VERZEKERDEJAREN float64 KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG float64 KOSTEN_FARMACIE float64 KOSTEN_TWEEDELIJNS_GGZ float64 KOSTEN_HUISARTS_INSCHRIJFTARIEF float64 KOSTEN_HUISARTS_CONSULT float64 KOSTEN_HUISARTS_OVERIG float64 KOSTEN_HULPMIDDELEN float64 KOSTEN_MONDZORG float64 KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE float64 KOSTEN_PARAMEDISCHE_ZORG_OVERIG float64 KOSTEN_ZIEKENVERVOER_ZITTEND float64 KOSTEN_ZIEKENVERVOER_LIGGEND float64 KOSTEN_KRAAMZORG float64 KOSTEN_VERLOSKUNDIGE_ZORG float64 KOSTEN_GENERALISTISCHE_BASIS_GGZ float64 KOSTEN_GRENSOVERSCHRIJDENDE_ZORG float64 KOSTEN_EERSTELIJNS_ONDERSTEUNING float64 KOSTEN_GERIATRISCHE_REVALIDATIEZORG float64 KOSTEN_OVERIG float64 dtype: object

We now use a slightly different way to get the data into shape and hence we rename the function.

def get_data_into_shape_2(df):
    df = df.rename_axis({
        'GESLACHT':'sex',
        'LEEFTIJDSKLASSE':'age',
        'GEMEENTENAAM':'MUNICIPALITY',
        'AANTAL_BSN':'number_citizens',
        'KOSTEN_MEDISCH_SPECIALISTISCHE_ZORG':'hospital_care',
        'KOSTEN_FARMACIE':'pharmaceuticals',
        'KOSTEN_TWEEDELIJNS_GGZ':'mental_care',
        'KOSTEN_HUISARTS_INSCHRIJFTARIEF':'GP_capitation',
        'KOSTEN_HUISARTS_CONSULT':'GP_fee_for_service',
        'KOSTEN_HUISARTS_OVERIG':'GP_other',
        'KOSTEN_MONDZORG':'dental_care',
        'KOSTEN_PARAMEDISCHE_ZORG_FYSIOTHERAPIE':'physiotherapy',
        'KOSTEN_KRAAMZORG':'maternity_care',
        'KOSTEN_VERLOSKUNDIGE_ZORG':'obstetrics',
        'KOSTEN_GERIATRISCHE_REVALIDATIEZORG':'geriatrics'
                        }, axis='columns')
    df.drop(['AANTAL_VERZEKERDEJAREN',
             'KOSTEN_HULPMIDDELEN',
             'KOSTEN_PARAMEDISCHE_ZORG_OVERIG',
             'KOSTEN_ZIEKENVERVOER_ZITTEND',
             'KOSTEN_ZIEKENVERVOER_LIGGEND',
             'KOSTEN_GRENSOVERSCHRIJDENDE_ZORG',
             'KOSTEN_OVERIG',
             'KOSTEN_GENERALISTISCHE_BASIS_GGZ',
             'KOSTEN_EERSTELIJNS_ONDERSTEUNING'],inplace=True,axis=1)
    df.drop(df.index[[0]], inplace=True)
    return df

With the function above, we first get the data into the shape that we want.

df_postal_code = get_data_into_shape_2(df_postal_code)

We are going to consider the ages 25 and 70. As age can be integer or string, we include both in the list of age-values that we wish to select.

df_25_70 = df_postal_code[df_postal_code['age'].isin(['25','70', 25, 70])]

We define the groups as "25M" for 25 year old males. In order to add the columns "sex" and "age", they need to be strings as in python adding the strings `"abc"+"def"` yields `"abcdef"`; which is exactly what we want.

df_25_70.age = df_25_70.age.astype('str')
df_25_70.sex = df_25_70.sex.astype('str')
df_25_70['target'] = df_25_70.age+df_25_70.sex

This combination of age and sex (4 categories) is the variable that we want to predict. Hence, we call the variable `target`. We turn `target` into a category and find that there are indeed 4 of these categories.

df_25_70['target'] = df_25_70['target'].astype('category')
df_25_70['target'].cat.categories

Index(['25.0M', '25V', '70M', '70V'], dtype='object')

With `.cat.codes` we turn our categories into integers 0, 1, 2, 3.

df_25_70['target'].cat.codes

We redefine expenditures in each care category as per head expenditure in the 3 digit postal code area. We then think of each area as being an "individual". Based on the individuals expenditure per care category, we predict age and sex. The following code redefines for each care category the total expenditure into an expenditure per head.

care_categories = ['hospital_care', 'pharmaceuticals', 'mental_care', 'GP_capitation', 'GP_fee_for_service', 'GP_other', 'dental_care', 'physiotherapy', 'maternity_care', 'obstetrics', 'geriatrics']

for variable in care_categories:
    df_25_70[variable] = df_25_70[variable]/df_25_70['number_citizens']

df_25_70.head()

sex age POSTCODE_3 number_citizens hospital_care pharmaceuticals \ 13802 M 25.0 0.0 235 421.241149 48.540128 13803 M 25.0 101.0 1042 280.514837 54.957889 13804 M 25.0 102.0 302 280.264238 56.674834 13805 M 25.0 103.0 325 527.719046 51.940738 13806 M 25.0 105.0 1318 383.394932 69.232117

mental_care GP_capitation GP_fee_for_service GP_other dental_care \ 13802 565.833574 53.478723 15.990043 42.183702 1.607149 13803 131.044347 51.035441 16.187322 33.947015 3.483599 13804 54.191556 53.066325 15.761258 29.762450 0.000000 13805 994.862554 53.742369 18.363385 34.537046 1.208769 13806 210.617398 52.937853 17.693247 32.583392 2.972602

physiotherapy maternity_care obstetrics geriatrics target 13802 1.490255 0.0 0.0 0.0 25.0M 13803 8.594050 0.0 0.0 0.0 25.0M 13804 22.604437 0.0 0.0 0.0 25.0M 13805 10.232308 0.0 0.0 0.0 25.0M 13806 8.429659 0.0 0.0 0.0 25.0M

	sex	age	POSTCODE_3	number_citizens	hospital_care	pharmaceuticals	mental_care	GP_capitation	GP_fee_for_service	GP_other	dental_care	physiotherapy	target
13802	M	25.0	0.0	235	421.241149	48.540128	565.833574	53.478723	15.990043	42.183702	1.607149	1.490255	25.0M
13803	M	25.0	101.0	1042	280.514837	54.957889	131.044347	51.035441	16.187322	33.947015	3.483599	8.594050	25.0M
13804	M	25.0	102.0	302	280.264238	56.674834	54.191556	53.066325	15.761258	29.762450	0.000000	22.604437	25.0M
13805	M	25.0	103.0	325	527.719046	51.940738	994.862554	53.742369	18.363385	34.537046	1.208769	10.232308	25.0M
13806	M	25.0	105.0	1318	383.394932	69.232117	210.617398	52.937853	17.693247	32.583392	2.972602	8.429659	25.0M

The variable `target` is the variable we would like to predict. Hence, we call it \(y\). We choose a subset of health care expenditure categories (you can experiment with this yourself) as predictors (independent variables) and denote the variables in this subset by \(X\).

y = df_25_70.target

subset_care_categories = ['physiotherapy', 'obstetrics', 'geriatrics', 'pharmaceuticals']
X = df_25_70[subset_care_categories]

In this version of the notebook we use four cost categories to separate the different age-sex types. The algorithm below makes this separation in four dimensional space. To get a first intuition, we can consider the data points in two dimensional space, using pairwise combinations of the cost categories. The function `scatter_matrix` does this for each pairwise combination of the cost categories. By turning the variable `target` into integers 0,1,2,3 we can use this variable to color the points. Each category has its own color. For this translation to integers, we use the `.cat.codes` attribute that we saw above.

from pandas.tools.plotting import scatter_matrix # Import the function to plot a scatterplot
    
%matplotlib inline

scatter_matrix(
    X,  # drop the none feature columns
    figsize=(12, 12),                                             # square figuresize for the matrix
    alpha=0.5,                                                    # alpha of 0.5 to see overlapping dots
    s=50,                                                         # fixed size of 50
    c=[plt.cm.get_cmap('rainbow', 4)(idx) for idx in df_25_70['target'].cat.codes],
    diagonal='kde');                                              # diagonal line are feature distributions

<matplotlib.figure.Figure at 0x10cfe5c88>

In order to classify points, we use the k-neighbours method. The idea is as follows. To classify a point, we consider its \(k\) closest neighbours. If the majority of these neighbours have label, say '70M', then we classify this point also as '70M'.

As the goal here is to predict, we do not worry about concepts like normal distribution, p-values, co-linearity. We simply split the data set into two subsets. We estimate (train) the model on the first data set. Then we apply the estimated model on the other (test) data. For the test data we calculate how often we get it right.

From scikit-learn we import the function `train_test_split`. This function splits our data \(X,y\) into a training and a testing data set. The size of the test data set is set at 30% here. We can set the seed (21) for the random number generator –don't worry if this does not mean anything to you. Finally, we stratify the data such that the distribution of labels is the same in the training and testing data.

Here we set the number of neighbours equal to \(k=8\). Higher values of \(k\) give smoother results and lead to a "simpler" model but misses local subtleties. The extreme is where \(k=n\) (the number of observations). Then all observations get the same label (the mode of the distribution).

Then we fit this model `knn` to our training data. After we fitted the model, we can predict labels in the test data set. The score indicates the percentage of labels we got right in the test data.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21, stratify=y)
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(knn.score(X_test, y_test))

0.77032967033

Now let us focus on women with age 25 and 70. Intuitively, with categories like 'obstetrics' and 'geriatrics' we should be able to separate these categories perfectly. And indeed we are.

df_female = df_25_70[df_25_70['target'].isin(['25V','70V'])]

y_f = df_female.target

subset_care_categories = ['physiotherapy', 'obstetrics', 'geriatrics', 'pharmaceuticals']
X_f = df_female[subset_care_categories]

scatter_matrix(
    X_f,  # drop the none feature columns
    figsize=(12, 12),                                             # square figuresize for the matrix
    alpha=0.5,                                                    # alpha of 0.5 to see overlapping dots
    s=50,                                                         # fixed size of 50
    c=[plt.cm.get_cmap('rainbow', 4)(idx) for idx in df_female['target'].cat.codes],
    diagonal='kde');                                              # diagonal line are feature distributions

<matplotlib.figure.Figure at 0x120f415c0>

X_train, X_test, y_train, y_test = train_test_split(X_f, y_f, test_size=0.3, random_state=21, stratify=y_f)
knn = KNeighborsClassifier(n_neighbors=8)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(knn.score(X_test, y_test))

0.991169977925