Getting Eurostat data

1. Introduction
2. Preamble code
3. Data
4. Merging the data
5. Data extensions

1. Introduction

This file describes how we get the data for our project on Dynamic effects of health insurance reform.

In this file we explain how we get the data and from where and at the same time we generate the data. By generating the data in this way, a change in the data by Eurostat will lead to (slightly) different results.

2. Preamble code

Run this code first before running the code of particular data sets. It imports the relevant libraries.

import numpy as np
import pandas as pd
from country_codes import oecd_dictionary, eurostat_dictionary
import eurostat
import xarray as xr
import matplotlib.pyplot as plt
start_year = 2008
end_year = 2018
year_range = np.arange(start_year,end_year+1)

EU_countries = ['Belgium', 'Bulgaria', 'Czechia', 'Denmark',
'Germany (until 1990 former territory of the FRG)', 'Germany','Estonia',
'Ireland', 'Greece', 'Spain', 'France', 'Croatia', 'Italy',
'Cyprus', 'Latvia', 'Lithuania', 'Luxembourg', 'Hungary', 'Malta',
'Netherlands', 'Austria', 'Poland', 'Portugal', 'Romania',
'Slovenia', 'Slovakia', 'Finland', 'Sweden', 'United Kingdom',
'Iceland', 'Liechtenstein', 'Norway', 'Switzerland',
'Bosnia and Herzegovina']

We first describe how we get the data for each variable. Each of these variables is saved in a separate dataset. The final section contains the code that merges the data for these variables into one (big) dataset.

3. Data

3.1. population

population: https://ec.europa.eu/eurostat/databrowser/view/demo_r_d2jan/default/table?lang=en

df = eurostat.get_data_df('demo_r_d2jan')
df.rename({'geo\\TIME_PERIOD':'geo'},inplace=True,axis=1)
age_range = ['Y1', 'Y10', 'Y11', 'Y12', 'Y13', 'Y14', 'Y15',
       'Y16', 'Y17', 'Y18', 'Y19', 'Y2', 'Y20', 'Y21', 'Y22', 'Y23',
       'Y24', 'Y25', 'Y26', 'Y27', 'Y28', 'Y29', 'Y3', 'Y30', 'Y31',
       'Y32', 'Y33', 'Y34', 'Y35', 'Y36', 'Y37', 'Y38', 'Y39', 'Y4',
       'Y40', 'Y41', 'Y42', 'Y43', 'Y44', 'Y45', 'Y46', 'Y47', 'Y48',
       'Y49', 'Y5', 'Y50', 'Y51', 'Y52', 'Y53', 'Y54', 'Y55', 'Y56',
       'Y57', 'Y58', 'Y59', 'Y6', 'Y60', 'Y61', 'Y62', 'Y63', 'Y64',
       'Y65', 'Y66', 'Y67', 'Y68', 'Y69', 'Y7', 'Y70', 'Y71', 'Y72',
       'Y73', 'Y74', 'Y75', 'Y76', 'Y77', 'Y78', 'Y79', 'Y8', 'Y80',
       'Y81', 'Y82', 'Y83', 'Y84', 'Y85', 'Y86', 'Y87', 'Y88', 'Y89',
       'Y9', 'Y90', 'Y91', 'Y92', 'Y93', 'Y94', 'Y95', 'Y96', 'Y97',
       'Y98', 'Y99']
df['country_code'] = [x[:2] for x in df['geo']]
df['country'] = df['country_code'].replace(eurostat_dictionary)
df = df[df.country.isin(EU_countries) & \
        (df.sex.isin(["M","F"])) & (df.age.isin(age_range))]
df.drop(['country_code'],axis=1,inplace=True)
df = pd.melt(df,id_vars=['geo','country','sex','age'],
                        value_vars=[str(y) for y in year_range],
                        var_name='year',value_name='population')
df['nuts'] = [len(x)-2 for x in df.geo]
df = df.query('nuts == 2')
df.drop('nuts',inplace=True, axis=1)
df.rename({'geo':'nuts2', 'age':'age string'},\
          axis=1,inplace=True)
df['age'] = [float(x.replace('Y','')) for x in df['age string']]
df.drop('age string',inplace=True, axis=1)

#df.set_index(['nuts2','year','sex','age'],inplace=True)

df.head()

  nuts2  country sex  year  population  age
2  AT11  Austria   F  2008      1127.0  1.0
3  AT12  Austria   F  2008      7280.0  1.0
4  AT13  Austria   F  2008      8398.0  1.0
6  AT21  Austria   F  2008      2397.0  1.0
7  AT22  Austria   F  2008      5167.0  1.0

3.2. OOP per country

We want a measure of “how generous” a healthcare system is in terms of out-of-pocket (oop) payments. This feature of the healthcare system tends to be defined at the country level. We use household oop payment, HF3, and PC_CHE as unit of measurement.

link to data: https://ec.europa.eu/eurostat/databrowser/view/hlth_sha11_hf/default/table?lang=en

abbreviation	name
HF2	Voluntary health care payment schemes
HF3	Household out-of-pocket payment
PC_CHE	Percentual share of total current health expenditure (CHE)
PPS_HAB	Purchasing power standard (PPS) per inhabitant

df_o = eurostat.get_data_df('hlth_sha11_hf')
df_o.rename({'geo\\TIME_PERIOD':'geo'},inplace=True,axis=1)
df_o['country_code'] = [x[:2] for x in df_o['geo']]
df_o['country'] = df_o['country_code'].replace(eurostat_dictionary)
df_o = df_o[df_o.country.isin(EU_countries)  & \
        (df_o.icha11_hf.isin(["HF2", "HF3"])) &\
        (df_o.unit.isin([ 'PC_CHE','PPS_HAB']))]
df_o.drop(["geo"],axis=1,inplace=True)

df_o = pd.melt(df_o,id_vars=['country','country_code',\
                         'unit','icha11_hf'],
                        value_vars=[str(y) for y in year_range],
                        var_name='year',\
             value_name='health_expenditure')
df_o.set_index(['country','country_code',\
              'year','unit','icha11_hf'],inplace=True)
df_o = df_o.unstack()
df_o.columns = df_o.columns.droplevel()
df_o = df_o.unstack()
df_o.columns = df_o.columns.droplevel()
df_o.columns = ['HF2_PC_CHE','HF2_PPS_HAB',\
              'HF3_PC_CHE','HF3_PPS_HAB']
df_o.reset_index(inplace=True)
df_o.set_index(['country','year'],inplace=True)
df_o.tail()

                    country_code  HF2_PC_CHE  HF2_PPS_HAB  HF3_PC_CHE  \
country        year                                                     
United Kingdom 2014           UK        5.40       149.18       15.16   
               2015           UK        5.40       151.39       15.23   
               2016           UK        5.04       142.30       15.37   
               2017           UK        5.41       155.88       15.72   
               2018           UK        5.62       165.77       15.88   

                     HF3_PPS_HAB  
country        year               
United Kingdom 2014       419.03  
               2015       427.46  
               2016       434.36  
               2017       453.26  
               2018       468.46

# df = df.merge(df_c, how = 'left',\
#               on = ['country','year','sex','age_category'])
df = df.merge(df_o, how = 'left',\
              on = ['country','year'])
df.set_index(['nuts2','year','sex','age'],inplace=True)
df.tail()

                            country  population country_code  HF2_PC_CHE  \
nuts2 year sex age                                                         
UKM6  2018 M   99.0  United Kingdom        11.0           UK        5.62   
UKM7  2018 M   99.0  United Kingdom        40.0           UK        5.62   
UKM8  2018 M   99.0  United Kingdom        21.0           UK        5.62   
UKM9  2018 M   99.0  United Kingdom        17.0           UK        5.62   
UKN0  2018 M   99.0  United Kingdom        33.0           UK        5.62   

                     HF2_PPS_HAB  HF3_PC_CHE  HF3_PPS_HAB  
nuts2 year sex age                                         
UKM6  2018 M   99.0       165.77       15.88       468.46  
UKM7  2018 M   99.0       165.77       15.88       468.46  
UKM8  2018 M   99.0       165.77       15.88       468.46  
UKM9  2018 M   99.0       165.77       15.88       468.46  
UKN0  2018 M   99.0       165.77       15.88       468.46

ds = df.to_xarray()
ds.to_netcdf("./data/population_nuts_2.nc")
ds

<xarray.Dataset> Size: 38MB
Dimensions:       (nuts2: 313, year: 11, sex: 2, age: 99)
Coordinates:
  * nuts2         (nuts2) object 3kB 'AT11' 'AT12' 'AT13' ... 'UKM9' 'UKN0'
  * year          (year) object 88B '2008' '2009' '2010' ... '2017' '2018'
  * sex           (sex) object 16B 'F' 'M'
  * age           (age) float64 792B 1.0 2.0 3.0 4.0 5.0 ... 96.0 97.0 98.0 99.0
Data variables:
    country       (nuts2, year, sex, age) object 5MB 'Austria' ... 'United Ki...
    population    (nuts2, year, sex, age) float64 5MB 1.127e+03 ... 33.0
    country_code  (nuts2, year, sex, age) object 5MB 'AT' 'AT' ... 'UK' 'UK'
    HF2_PC_CHE    (nuts2, year, sex, age) float64 5MB 6.8 6.8 6.8 ... 5.62 5.62
    HF2_PPS_HAB   (nuts2, year, sex, age) float64 5MB 212.5 212.5 ... 165.8
    HF3_PC_CHE    (nuts2, year, sex, age) float64 5MB 18.24 18.24 ... 15.88
    HF3_PPS_HAB   (nuts2, year, sex, age) float64 5MB 570.5 570.5 ... 468.5

3.3. deaths

deaths: https://ec.europa.eu/eurostat/databrowser/view/demo_r_magec/default/table?lang=en

df = eurostat.get_data_df('demo_r_magec')
df.rename({'geo\\TIME_PERIOD':'geo'},inplace=True,axis=1)
age_range = ['Y1', 'Y10', 'Y11', 'Y12', 'Y13', 'Y14', 'Y15',
       'Y16', 'Y17', 'Y18', 'Y19', 'Y2', 'Y20', 'Y21', 'Y22', 'Y23',
       'Y24', 'Y25', 'Y26', 'Y27', 'Y28', 'Y29', 'Y3', 'Y30', 'Y31',
       'Y32', 'Y33', 'Y34', 'Y35', 'Y36', 'Y37', 'Y38', 'Y39', 'Y4',
       'Y40', 'Y41', 'Y42', 'Y43', 'Y44', 'Y45', 'Y46', 'Y47', 'Y48',
       'Y49', 'Y5', 'Y50', 'Y51', 'Y52', 'Y53', 'Y54', 'Y55', 'Y56',
       'Y57', 'Y58', 'Y59', 'Y6', 'Y60', 'Y61', 'Y62', 'Y63', 'Y64',
       'Y65', 'Y66', 'Y67', 'Y68', 'Y69', 'Y7', 'Y70', 'Y71', 'Y72',
       'Y73', 'Y74', 'Y75', 'Y76', 'Y77', 'Y78', 'Y79', 'Y8', 'Y80',
       'Y81', 'Y82', 'Y83', 'Y84', 'Y85', 'Y86', 'Y87', 'Y88', 'Y89',
       'Y9', 'Y90', 'Y91', 'Y92', 'Y93', 'Y94', 'Y95', 'Y96', 'Y97',
       'Y98', 'Y99']
df['country_code'] = [x[:2] for x in df['geo']]
df['country'] = df['country_code'].replace(eurostat_dictionary)
df = df[df.country.isin(EU_countries) & \
        (df.sex.isin(["M","F"])) & (df.age.isin(age_range))]
df.drop(['country','country_code'],axis=1,inplace=True)
df = pd.melt(df,id_vars=['geo','sex','age'],
                        value_vars=[str(y) for y in year_range],
                        var_name='year',value_name='deaths')
df['nuts'] = [len(x)-2 for x in df.geo]
df = df.query('nuts == 2')
df.drop('nuts',inplace=True, axis=1)
df.rename({'geo':'nuts2', 'age':'age string'},\
          axis=1,inplace=True)
df['age'] = [float(x.replace('Y','')) for x in df['age string']]
df.drop('age string',inplace=True, axis=1)

df.set_index(['nuts2','year','sex','age'],inplace=True)

df.head()

                    deaths
nuts2 year sex age        
AT11  2008 F   1.0     1.0
AT12  2008 F   1.0     3.0
AT13  2008 F   1.0     3.0
AT21  2008 F   1.0     0.0
AT22  2008 F   1.0     2.0

ds = df.to_xarray()
ds.to_netcdf("./data/deaths_nuts_2.nc")
ds

<xarray.Dataset> Size: 5MB
Dimensions:  (nuts2: 313, year: 11, sex: 2, age: 99)
Coordinates:
  * nuts2    (nuts2) object 3kB 'AT11' 'AT12' 'AT13' ... 'UKM8' 'UKM9' 'UKN0'
  * year     (year) object 88B '2008' '2009' '2010' ... '2016' '2017' '2018'
  * sex      (sex) object 16B 'F' 'M'
  * age      (age) float64 792B 1.0 2.0 3.0 4.0 5.0 ... 95.0 96.0 97.0 98.0 99.0
Data variables:
    deaths   (nuts2, year, sex, age) float64 5MB 1.0 0.0 0.0 ... 48.0 26.0 18.0

3.5. unmet needs

Self-reported unmet needs for medical examination by main reason declared and NUTS 2 regions

value
TOOEXP	Too expensive
TOOFAR	Too far to travel
NOTIME	No time
NOKNOW	Didn’t know any good doctor or specialist
WAITING	Waiting list
FEAR	Fear of doctor, hospital, examination or treatment
HOPING	Wanted to wait and see if problem got better on its own
OTH	Other reason

df = eurostat.get_data_df('hlth_silc_08_r')
df.rename({'geo\\TIME_PERIOD':'nuts2'},inplace=True,axis=1)
df.drop(["unit"],axis=1,inplace=True)
df['nuts'] = [len(x)-2 for x in df.nuts2]
df = df[df.nuts == 2]
df.drop('nuts',axis=1,inplace=True)
columns = ['FEAR','TOOEXP','TOOFAR','WAITING',\
           'HOPING', 'NOTIME', 'NOKNOW','OTH']
df=df[df.reason.isin(columns)]
df = pd.melt(df,id_vars=['nuts2','reason'],
                        value_vars=[str(y) for y in year_range],
                        var_name='year',\
             value_name='unmet needs')
df.set_index(['nuts2','year','reason'],inplace=True)

df = df.unstack()
df.columns = df.columns.droplevel()
df['UNMET'] = df[columns].sum(axis=1)
df.drop(['FEAR','TOOFAR','WAITING',\
           'HOPING', 'NOTIME', 'NOKNOW','OTH'],axis=1,inplace=True)
df.tail()

reason      TOOEXP  UNMET
nuts2 year               
SK04  2014     1.1    6.7
      2015     0.4    6.6
      2016     0.5    8.1
      2017     0.2    5.9
      2018     0.4    6.5

ds_unmet_needs = df.to_xarray()
ds_unmet_needs.to_netcdf("./data/unmet_needs_nuts_2.nc")
ds_unmet_needs

<xarray.Dataset> Size: 31kB
Dimensions:  (nuts2: 168, year: 11)
Coordinates:
  * nuts2    (nuts2) object 1kB 'AL01' 'AL02' 'AL03' ... 'SK02' 'SK03' 'SK04'
  * year     (year) object 88B '2008' '2009' '2010' ... '2016' '2017' '2018'
Data variables:
    TOOEXP   (nuts2, year) float64 15kB nan nan nan nan nan ... 0.4 0.5 0.2 0.4
    UNMET    (nuts2, year) float64 15kB 0.0 0.0 0.0 0.0 0.0 ... 6.6 8.1 5.9 6.5

4. Merging the data

ds_P = xr.open_dataset('./data/population_nuts_2.nc')
ds_P['year'] = (ds_P['year'].values).astype('int')
ds_M = xr.open_dataset('./data/deaths_nuts_2.nc')
ds_M['year'] = (ds_M['year'].values).astype('int')

ds_infant = xr.open_dataset("./data/infant_mortality_nuts_2.nc")
ds_infant['year'] = (ds_infant['year'].values).astype('int')
ds_physician = xr.open_dataset("./data/number_physicians_nuts_2.nc")
ds_physician['year'] = (ds_physician['year'].values).astype('int')
ds_deprivation = xr.open_dataset("./data/material_deprivation_nuts_2.nc")
ds_deprivation['year'] = (ds_deprivation['year'].values).astype('int')
ds_unmet = xr.open_dataset("./data/unmet_needs_nuts_2.nc")
ds_unmet['year'] = (ds_unmet['year'].values).astype('int')
ds = xr.merge([ds_P.drop_vars(["HF2_PC_CHE","HF2_PPS_HAB","HF3_PPS_HAB"]),ds_M,ds_deprivation,ds_unmet,ds_infant,ds_physician],join='left')
ds

<xarray.Dataset> Size: 60MB
Dimensions:                           (nuts2: 313, year: 11, sex: 2, age: 99)
Coordinates:
  * nuts2                             (nuts2) <U4 5kB 'AT11' 'AT12' ... 'UKN0'
  * year                              (year) int64 88B 2008 2009 ... 2017 2018
  * sex                               (sex) <U1 8B 'F' 'M'
  * age                               (age) float64 792B 1.0 2.0 ... 98.0 99.0
Data variables:
    country                           (nuts2, year, sex, age) <U14 38MB ...
    population                        (nuts2, year, sex, age) float64 5MB ...
    country_code                      (nuts2, year, sex, age) <U2 5MB ...
    HF3_PC_CHE                        (nuts2, year, sex, age) float64 5MB ...
    deaths                            (nuts2, year, sex, age) float64 5MB ...
    percentage_material_deprivation   (nuts2, year) float64 28kB nan nan ... nan
    TOOEXP                            (nuts2, year) float64 28kB nan nan ... nan
    UNMET                             (nuts2, year) float64 28kB nan nan ... nan
    infant mortality                  (nuts2, year) float64 28kB 3.2 0.9 ... 4.2
    number_physicians per inhabitant  (nuts2, year) float64 28kB 328.2 ... nan

ds["country"] = ds["country"].isel(year=0,sex=0,age=0)
ds = ds.assign_coords(country=("nuts2", ds.country.values))
ds = ds.transpose("age","sex","year","nuts2")
ds

<xarray.Dataset> Size: 22MB
Dimensions:                           (nuts2: 313, year: 11, sex: 2, age: 99)
Coordinates:
  * nuts2                             (nuts2) <U4 5kB 'AT11' 'AT12' ... 'UKN0'
  * year                              (year) int64 88B 2008 2009 ... 2017 2018
  * sex                               (sex) <U1 8B 'F' 'M'
  * age                               (age) float64 792B 1.0 2.0 ... 98.0 99.0
    country                           (nuts2) <U14 18kB 'Austria' ... 'United...
Data variables:
    population                        (age, sex, year, nuts2) float64 5MB ...
    country_code                      (age, sex, year, nuts2) <U2 5MB ...
    HF3_PC_CHE                        (age, sex, year, nuts2) float64 5MB ...
    deaths                            (age, sex, year, nuts2) float64 5MB ...
    percentage_material_deprivation   (year, nuts2) float64 28kB nan nan ... nan
    TOOEXP                            (year, nuts2) float64 28kB nan nan ... nan
    UNMET                             (year, nuts2) float64 28kB nan nan ... nan
    infant mortality                  (year, nuts2) float64 28kB 3.2 4.7 ... 4.2
    number_physicians per inhabitant  (year, nuts2) float64 28kB 328.2 ... nan

ds.to_netcdf("./data/oop_health_data.nc")

5. Data extensions

5.1. transport accidents

https://ec.europa.eu/eurostat/databrowser/view/tran_r_acci/default/table?lang=en

import pandas as pd
import eurostat

df = eurostat.get_data_df("tran_r_acci")

df = df.rename(
    columns={
        "geo\\TIME_PERIOD": "geo",
    }
)

df['country_code'] = [x[:2] for x in df['geo']]
df['country'] = df['country_code'].replace(eurostat_dictionary)
df = df[df.country.isin(EU_countries) & \
        (df.victim.isin(["KIL"])) & (df.unit == 'P_MHAB')]
# df.drop(['country_code'],axis=1,inplace=True)
df = pd.melt(df,id_vars=['geo','country','country_code'],
                        value_vars=[str(y) for y in year_range],
                        var_name='year',value_name='killed_per_million_inhabitants')
df['nuts'] = [len(x)-2 for x in df.geo]
df = df.query('nuts == 2')
df.drop(['nuts', 'country'],inplace=True, axis=1)
df.rename({'geo':'nuts2'},inplace=True,axis=1)
# df.set_index(['nuts2','year'],inplace=True)

print(df.head())

# ds_traffic_mortality = df.to_xarray()
# ds_traffic_mortality.to_netcdf("./data/extensions/traffic_mortality_nuts_2.nc")
# ds_traffic_mortality

  nuts2 country_code  year  killed_per_million_inhabitants
2  AT11           AT  2008                           103.0
3  AT12           AT  2008                           109.0
4  AT13           AT  2008                            16.0
6  AT21           AT  2008                           100.0
7  AT22           AT  2008                            91.0

df[df.nuts2=="NO04"]

     nuts2 country_code  year  killed_per_million_inhabitants
311   NO04           NO  2008                            57.0
751   NO04           NO  2009                            42.0
1191  NO04           NO  2010                            38.0
1631  NO04           NO  2011                            26.0
2071  NO04           NO  2012                            36.0
2511  NO04           NO  2013                            39.0
2951  NO04           NO  2014                            24.0
3391  NO04           NO  2015                            26.0
3831  NO04           NO  2016                            34.0
4271  NO04           NO  2017                            17.0
4711  NO04           NO  2018                            24.0

5.2. country wide devices for medical imaging

https://ec.europa.eu/eurostat/databrowser/view/hlth_rs_medim/default/table?lang=en
at country level, not NUTS 2 level
unit: per 100k inhabitants

import pandas as pd
import eurostat

df_2 = eurostat.get_data_df("hlth_rs_medim")

df_2 = df_2.rename(
    columns={
        "geo\\TIME_PERIOD": "geo",
    }
)

df_2['country_code'] = [x[:2] for x in df_2['geo']]
df_2['country'] = df_2['country_code'].replace(eurostat_dictionary)
df_2 = df_2[df_2.country.isin(EU_countries) &\
        (df_2.unit == 'P_HTHAB')  & (df_2.icha_hp=='HP1_3')] 
df_2.drop(['geo'],axis=1,inplace=True)
df_2 = pd.melt(df_2,id_vars=['country_code','country','facility'],
                        value_vars=[str(y) for y in year_range],
                        var_name='year',value_name='number_devices')

# df_2.drop(['country'],inplace=True, axis=1)
df_2.set_index(['country_code','country','year','facility'],inplace=True)
df_2 = df_2.unstack()
df_2.reset_index(inplace=True)
df_2.columns = [' '.join(col).strip() for col in df_2.columns.values]
col = ['number_devices CT_SCAN', 'number_devices GAMMA',
       'number_devices MAMMO', 'number_devices MRI', 'number_devices PET_SCAN',
       'number_devices RAD']
df_2['number_devices'] = df_2[col].sum(axis=1)
df_2.drop(col,inplace=True,axis=1)
# df_2.set_index(['country','year'],inplace=True)

print(df_2.head())

# ds_devices = df.to_xarray()
# ds_devices.to_netcdf("./data/extensions/devices_country.nc")
# ds_devices

  country_code  country  year  number_devices
0           AT  Austria  2008            6.68
1           AT  Austria  2009            6.70
2           AT  Austria  2010            6.78
3           AT  Austria  2011            6.70
4           AT  Austria  2012            6.79

df = df.merge(df_2, how='left',on=['country_code','year'])

df.drop(["country_code","country"],axis=1,inplace=True)
df.set_index(['nuts2','year'],inplace=True)
df.head()

            killed_per_million_inhabitants  number_devices
nuts2 year                                                
AT11  2008                           103.0            6.68
AT12  2008                           109.0            6.68
AT13  2008                            16.0            6.68
AT21  2008                           100.0            6.68
AT22  2008                            91.0            6.68

df.loc[["NO01","NO02","NO04","NO05"]]

            killed_per_million_inhabitants  number_devices
nuts2 year                                                
NO01  2008                            22.0            0.00
      2009                            26.0            0.00
      2010                            20.0            0.00
      2011                            11.0            0.00
      2012                            17.0            0.00
      2013                            14.0            0.00
      2014                            17.0            0.00
      2015                            11.0            0.00
      2016                            13.0            4.11
      2017                            10.0            4.53
      2018                             9.0            5.44
NO02  2008                            91.0            0.00
      2009                            64.0            0.00
      2010                            80.0            0.00
      2011                            77.0            0.00
      2012                            53.0            0.00
      2013                            47.0            0.00
      2014                            42.0            0.00
      2015                            44.0            0.00
      2016                            42.0            4.11
      2017                            26.0            4.53
      2018                            31.0            5.44
NO04  2008                            57.0            0.00
      2009                            42.0            0.00
      2010                            38.0            0.00
      2011                            26.0            0.00
      2012                            36.0            0.00
      2013                            39.0            0.00
      2014                            24.0            0.00
      2015                            26.0            0.00
      2016                            34.0            4.11
      2017                            17.0            4.53
      2018                            24.0            5.44
NO05  2008                            69.0            0.00
      2009                            56.0            0.00
      2010                            39.0            0.00
      2011                            37.0            0.00
      2012                            29.0            0.00
      2013                            46.0            0.00
      2014                            22.0            0.00
      2015                            16.0            0.00
      2016                            18.0            4.11
      2017                            26.0            4.53
      2018                            23.0            5.44

print(df.head())
ds_accidents_devices = df.to_xarray()
ds_accidents_devices.to_netcdf("./data/extensions/traffic_mortality_nuts_2.nc")
# ds_accidents_devices

            killed_per_million_inhabitants  number_devices
nuts2 year                                                
AT11  2008                           103.0            6.68
AT12  2008                           109.0            6.68
AT13  2008                            16.0            6.68
AT21  2008                           100.0            6.68
AT22  2008                            91.0            6.68

5.3. GDP per capita

https://ec.europa.eu/eurostat/databrowser/view/nama_10r_2gdp/default/table?lang=en

import pandas as pd
import eurostat

df = eurostat.get_data_df("nama_10r_2gdp")

df = df.rename(
    columns={
        "geo\\TIME_PERIOD": "geo",
    }
)

df['country_code'] = [x[:2] for x in df['geo']]
df['country'] = df['country_code'].replace(eurostat_dictionary)
df = df[df.country.isin(EU_countries) & \
        (df.unit == 'PPS_EU27_2020_HAB')] # Purchasing power standard (PPS, EU27 from 2020), per inhabitant
df.drop(['country_code'],axis=1,inplace=True)
df = pd.melt(df,id_vars=['geo','country'],
                        value_vars=[str(y) for y in year_range],
                        var_name='year',value_name='GDP_per_capita_PPS')
df['nuts'] = [len(x)-2 for x in df.geo]
df = df.query('nuts == 2')
df.drop(['nuts', 'country'],inplace=True, axis=1)
df.rename({'geo':'nuts2'},inplace=True,axis=1)
df.set_index(['nuts2','year'],inplace=True)

print(df.head())

ds_GDP_per_capita = df.to_xarray()
ds_GDP_per_capita.to_netcdf("./data/extensions/GDP_per_capita_nuts_2.nc")
ds_GDP_per_capita

            GDP_per_capita_PPS
nuts2 year                    
AT11  2008             20100.0
AT12  2008             26100.0
AT13  2008             41000.0
AT21  2008             27700.0
AT22  2008             28600.0
<xarray.Dataset> Size: 25kB
Dimensions:             (nuts2: 257, year: 11)
Coordinates:
  * nuts2               (nuts2) object 2kB 'AT11' 'AT12' ... 'SK03' 'SK04'
  * year                (year) object 88B '2008' '2009' '2010' ... '2017' '2018'
Data variables:
    GDP_per_capita_PPS  (nuts2, year) float64 23kB 2.01e+04 ... 1.54e+04

5.4. unemployment

https://ec.europa.eu/eurostat/databrowser/view/lfst_r_lfur2gan/default/table

import pandas as pd
import eurostat

df = eurostat.get_data_df("lfst_r_lfur2gan")

df = df.rename(
    columns={
        "geo\\TIME_PERIOD": "geo",
    }
)

df['country_code'] = [x[:2] for x in df['geo']]
df['country'] = df['country_code'].replace(eurostat_dictionary)
df = df[df.country.isin(EU_countries) & (df.citizen=='TOTAL') &\
        (df.unit == 'PC')  & (df.sex=='T') & (df.age=='Y15-74')] 
df.drop(['country_code'],axis=1,inplace=True)
df = pd.melt(df,id_vars=['geo','country'],
                        value_vars=[str(y) for y in year_range],
                        var_name='year',value_name='unemployment')
df['nuts'] = [len(x)-2 for x in df.geo]
df = df.query('nuts == 2')
df.drop(['nuts', 'country'],inplace=True, axis=1)
df.rename({'geo':'nuts2'},inplace=True,axis=1)
df.set_index(['nuts2','year'],inplace=True)

print(df.head())

ds_unemployment = df.to_xarray()
ds_unemployment.to_netcdf("./data/extensions/unemployment_nuts_2.nc")
ds_unemployment

            unemployment
nuts2 year              
AT11  2008           4.2
AT12  2008           3.7
AT13  2008           7.3
AT21  2008           3.6
AT22  2008           3.8
<xarray.Dataset> Size: 31kB
Dimensions:       (nuts2: 319, year: 11)
Coordinates:
  * nuts2         (nuts2) object 3kB 'AT11' 'AT12' 'AT13' ... 'UKM9' 'UKN0'
  * year          (year) object 88B '2008' '2009' '2010' ... '2017' '2018'
Data variables:
    unemployment  (nuts2, year) float64 28kB 4.2 5.0 4.0 3.8 ... 6.1 5.7 4.6 3.6

5.5. education

https://ec.europa.eu/eurostat/databrowser/view/tgs00109/default/table?lang=en&category=t_reg.t_reg_educ
data starts from 2013 onward (not 2008)

import pandas as pd
import eurostat

df = eurostat.get_data_df("tgs00109")

df = df.rename(
    columns={
        "geo\\TIME_PERIOD": "geo",
    }
)

df['country_code'] = [x[:2] for x in df['geo']]
df['country'] = df['country_code'].replace(eurostat_dictionary)
df = df[df.country.isin(EU_countries) &\
        (df.unit == 'PC')  & (df.sex=='T')] 
df.drop(['country_code'],axis=1,inplace=True)
df = pd.melt(df,id_vars=['geo','country'],
                        value_vars=[str(y) for y in np.arange(2013,2019,1)],
                        var_name='year',value_name='tertiary_education')
df['nuts'] = [len(x)-2 for x in df.geo]
df = df.query('nuts == 2')
df.drop(['nuts', 'country'],inplace=True, axis=1)
df.rename({'geo':'nuts2'},inplace=True,axis=1)
df.set_index(['nuts2','year'],inplace=True)

print(df.head())

ds_education = df.to_xarray()
ds_education.to_netcdf("./data/extensions/education_nuts_2.nc")
ds_education

            tertiary_education
nuts2 year                    
AT11  2013                15.8
AT12  2013                17.6
AT13  2013                30.6
AT21  2013                17.3
AT22  2013                17.5
<xarray.Dataset> Size: 17kB
Dimensions:             (nuts2: 308, year: 6)
Coordinates:
  * nuts2               (nuts2) object 2kB 'AT11' 'AT12' ... 'UKM9' 'UKN0'
  * year                (year) object 48B '2013' '2014' '2015' ... '2017' '2018'
Data variables:
    tertiary_education  (nuts2, year) float64 15kB 15.8 25.7 28.5 ... 35.6 37.2

5.6. merging into extended dataset

ds_G = xr.open_dataset('./data/extensions/GDP_per_capita_nuts_2.nc')
ds_E = xr.open_dataset('./data/extensions/education_nuts_2.nc')
ds_T = xr.open_dataset('./data/extensions/traffic_mortality_nuts_2.nc')
ds_U = xr.open_dataset('./data/extensions/unemployment_nuts_2.nc')

ds_extended = xr.merge([ds_T,ds_G,ds_E,ds_U],join='left')
ds_extended['year'] = (ds_extended['year'].values).astype('int')
ds_extended = ds_extended.transpose("year","nuts2")
# ds_extended['number_devices'] = ds_extended['number_devices'].isel(country=0)
# ds_extended['country_code'] = ds_extended['country_code'].isel(nuts2=0)
# ds_extended['killed_per_million_inhabitants'] = ds_extended['killed_per_million_inhabitants'].isel(country=0)

ds_extended

<xarray.Dataset> Size: 137kB
Dimensions:                         (nuts2: 300, year: 11)
Coordinates:
  * nuts2                           (nuts2) <U4 5kB 'AT11' 'AT12' ... 'UKN0'
  * year                            (year) int64 88B 2008 2009 ... 2017 2018
Data variables:
    killed_per_million_inhabitants  (year, nuts2) float64 26kB ...
    number_devices                  (year, nuts2) float64 26kB ...
    GDP_per_capita_PPS              (year, nuts2) float64 26kB 2.01e+04 ... nan
    tertiary_education              (year, nuts2) float64 26kB nan nan ... 37.2
    unemployment                    (year, nuts2) float64 26kB 4.2 3.7 ... 3.6

ds_extended.to_netcdf("./data/extensions/extended_data.nc")