Getting Eurostat data

Table of Contents

1. Introduction

This file describes how we get the data for our project on the health effects of demand side cost sharing.

In this file we explain how we get the data and from where and at the same time we generate the data. By generating the data in this way, a change in the data by Eurostat will lead to (slightly) different results.

The dataset itself is too big for github and can be found on DataverseNL.

2. Preamble code

Run this code first before running the code of particular data sets. It imports the relevant libraries.

import numpy as np
import pandas as pd
from country_codes import oecd_dictionary, eurostat_dictionary
import eurostat
import xarray as xr
import matplotlib.pyplot as plt

EU_countries = ['Belgium', 'Bulgaria', 'Czechia', 'Denmark',
'Germany (until 1990 former territory of the FRG)', 'Germany','Estonia',
'Ireland', 'Greece', 'Spain', 'France', 'Croatia', 'Italy',
'Cyprus', 'Latvia', 'Lithuania', 'Luxembourg', 'Hungary', 'Malta',
'Netherlands', 'Austria', 'Poland', 'Portugal', 'Romania',
'Slovenia', 'Slovakia', 'Finland', 'Sweden', 'United Kingdom',
'Iceland', 'Liechtenstein', 'Norway', 'Switzerland',
'Bosnia and Herzegovina']

We first describe how we get the data for each variable. Each of these variables is saved in a separate dataset. The final section contains the code that merges the data for these variables into one (big) dataset.

3. Data

First, we fetch the variables at the country level from the Eurostat website. Then we move to the NUTS 2 regional data.

3.1. country data

3.1.1. OOP per country

We want a measure of "how generous" a healthcare system is in terms of oop-payments. This feature of the healthcare system tends to be defined at the country level. We use household oop payment and expenditure on voluntary payment schemes (for things that are not covered by basic or mandatory insurance). The paper uses mainly PC_CHE as unit of measurement.

abbreviation name
HF2 Voluntary health care payment schemes
HF3 Household out-of-pocket payment
PC_CHE Percentual share of total current health expenditure (CHE)
PPS_HAB Purchasing power standard (PPS) per inhabitant
df = eurostat.get_data_df('hlth_sha11_hf')
df.rename({'geo\\time':'geo'},inplace=True,axis=1)
df['country_code'] = [x[:2] for x in df['geo']]
df['country'] = df['country_code'].replace(eurostat_dictionary)
df = df[df.country.isin(EU_countries)  & \
        (df.icha11_hf.isin(["HF2", "HF3"])) &\
        (df.unit.isin([ 'PC_CHE','PPS_HAB']))]
df.drop(["geo"],axis=1,inplace=True)

df = pd.melt(df,id_vars=['country','country_code',\
                         'unit','icha11_hf'],
                        value_vars=np.arange(1992,2019),
                        var_name='year',\
             value_name='health_expenditure')
df.set_index(['country','country_code',\
              'year','unit','icha11_hf'],inplace=True)
df = df.unstack()
df.columns = df.columns.droplevel()
df = df.unstack()
df.columns = df.columns.droplevel()
df.columns = ['HF2_PC_CHE','HF2_PPS_HAB',\
              'HF3_PC_CHE','HF3_PPS_HAB']
df.reset_index(inplace=True)
df.set_index(['country','country_code','year'],inplace=True)
df.tail()
HF2_PC_CHE HF2_PPS_HAB HF3_PC_CHE HF3_PPS_HAB
country country_code year
United Kingdom UK 2014 5.40 155.48 15.16 436.73
2015 5.40 161.07 15.23 454.78
2016 5.04 148.46 15.37 453.16
2017 5.41 160.78 15.72 467.53
2018 5.62 172.06 15.88 486.25
ds = df.to_xarray()
ds.to_netcdf("./data/country/out_of_pocket.nc")
ds
<xarray.Dataset>
Dimensions:       (country: 33, country_code: 33, year: 27)
Coordinates:
  * country       (country) object 'Austria' 'Belgium' ... 'United Kingdom'
  * country_code  (country_code) object 'AT' 'BA' 'BE' 'BG' ... 'SI' 'SK' 'UK'
  * year          (year) int64 1992 1993 1994 1995 1996 ... 2015 2016 2017 2018
Data variables:
    HF2_PC_CHE    (country, country_code, year) float64 nan nan ... 5.41 5.62
    HF2_PPS_HAB   (country, country_code, year) float64 nan nan ... 160.8 172.1
    HF3_PC_CHE    (country, country_code, year) float64 nan nan ... 15.72 15.88
    HF3_PPS_HAB   (country, country_code, year) float64 nan nan ... 467.5 486.2

3.1.2. healthcare expenditure

abbreviation variable name
TOT_HC Current health care expenditure (CHE)
EUR_HAB Euro per inhabitant
df = eurostat.get_data_df('hlth_sha11_hc')
df.rename({'geo\\time':'geo'},inplace=True,axis=1)
df['country_code'] = [x[:2] for x in df['geo']]
df['country'] = df['country_code'].replace(eurostat_dictionary)
df = df[df.country.isin(EU_countries)  & \
        (df.icha11_hc.isin(["TOT_HC"])) & \
        (df.unit.isin(['EUR_HAB']))]
df.drop(["geo"],axis=1,inplace=True)

df = pd.melt(df,id_vars=['country','country_code',\
                         'unit','icha11_hc'],
                        value_vars=np.arange(1992,2020),
                        var_name='year',\
             value_name='health_expenditure')
df.set_index(['country','country_code','year',\
              'unit','icha11_hc'],inplace=True)
df = df.unstack()
df.columns = df.columns.droplevel()
df = df.unstack()
df.columns = df.columns.droplevel()
df.reset_index(inplace=True)
df.rename({'EUR_HAB':'health expenditure per capita'},\
          axis=1,inplace=True)
df.set_index(['country','country_code','year'],inplace=True)
df.head()
unit health expenditure per capita
country country_code year
Austria AT 1992 NaN
1993 NaN
1994 NaN
1995 NaN
1996 NaN
ds = df.to_xarray()
ds.to_netcdf("./data/country/healthcare_expenditure.nc")
ds
<xarray.Dataset>
Dimensions:                        (country: 33, country_code: 33, year: 28)
Coordinates:
  * country                        (country) object 'Austria' ... 'United Kin...
  * country_code                   (country_code) object 'AT' 'BA' ... 'SK' 'UK'
  * year                           (year) int64 1992 1993 1994 ... 2018 2019
Data variables:
    health expenditure per capita  (country, country_code, year) float64 nan ...

3.1.3. self-perceived health

See section 3.2.1.1 below where we merge these data on age with mortality.

3.2. nuts 2

3.2.1. crude number of deaths and population by age and sex and self-perceived health

df = eurostat.get_data_df('demo_r_d2jan')
df.rename({'geo\\time':'geo'},inplace=True,axis=1)
age_range = ['Y1', 'Y10', 'Y11', 'Y12', 'Y13', 'Y14', 'Y15',
       'Y16', 'Y17', 'Y18', 'Y19', 'Y2', 'Y20', 'Y21', 'Y22', 'Y23',
       'Y24', 'Y25', 'Y26', 'Y27', 'Y28', 'Y29', 'Y3', 'Y30', 'Y31',
       'Y32', 'Y33', 'Y34', 'Y35', 'Y36', 'Y37', 'Y38', 'Y39', 'Y4',
       'Y40', 'Y41', 'Y42', 'Y43', 'Y44', 'Y45', 'Y46', 'Y47', 'Y48',
       'Y49', 'Y5', 'Y50', 'Y51', 'Y52', 'Y53', 'Y54', 'Y55', 'Y56',
       'Y57', 'Y58', 'Y59', 'Y6', 'Y60', 'Y61', 'Y62', 'Y63', 'Y64',
       'Y65', 'Y66', 'Y67', 'Y68', 'Y69', 'Y7', 'Y70', 'Y71', 'Y72',
       'Y73', 'Y74', 'Y75', 'Y76', 'Y77', 'Y78', 'Y79', 'Y8', 'Y80',
       'Y81', 'Y82', 'Y83', 'Y84', 'Y85', 'Y86', 'Y87', 'Y88', 'Y89',
       'Y9', 'Y90', 'Y91', 'Y92', 'Y93', 'Y94', 'Y95', 'Y96', 'Y97',
       'Y98', 'Y99']
df['country_code'] = [x[:2] for x in df['geo']]
df['country'] = df['country_code'].replace(eurostat_dictionary)
df = df[df.country.isin(EU_countries) & \
        (df.sex.isin(["M","F"])) & (df.age.isin(age_range))]
df.drop(['country_code'],axis=1,inplace=True)
df = pd.melt(df,id_vars=['geo','country','sex','age'],
                        value_vars=np.arange(1990,2021),
                        var_name='year',value_name='population')
df['nuts'] = [len(x)-2 for x in df.geo]
df = df.query('nuts == 2')
df.drop('nuts',inplace=True, axis=1)
df.rename({'geo':'nuts2', 'age':'age string'},\
          axis=1,inplace=True)
df['age'] = [float(x.replace('Y','')) for x in df['age string']]
df.drop('age string',inplace=True, axis=1)

#df.set_index(['nuts2','year','sex','age'],inplace=True)

df.head()
nuts2 country sex year population age
2 AT11 Austria F 1990 1345.0 1.0
3 AT12 Austria F 1990 8038.0 1.0
4 AT13 Austria F 1990 7240.0 1.0
6 AT21 Austria F 1990 3042.0 1.0
7 AT22 Austria F 1990 6491.0 1.0
@np.vectorize
def age_to_age_category(age):
    if age < 16:
        cat = 'Y0-15'
    elif age < 25:
        cat = 'Y16-24'
    elif age < 35:
        cat = 'Y25-34'
    elif age < 45:
        cat = 'Y35-44'
    elif age < 55:
        cat = 'Y45-54'
    elif age < 65:
        cat = 'Y55-64'
    elif age < 75:
        cat = 'Y65-74'
    elif age < 85:
        cat = 'Y75-84'
    else:
        cat = 'Y_GE85'
    return cat
df['age_category'] = age_to_age_category(df.age)
df.head()
  nuts2  country sex  year  population  age age_category
2  AT11  Austria   F  1990      1345.0  1.0        Y0-15
3  AT12  Austria   F  1990      8038.0  1.0        Y0-15
4  AT13  Austria   F  1990      7240.0  1.0        Y0-15
6  AT21  Austria   F  1990      3042.0  1.0        Y0-15
7  AT22  Austria   F  1990      6491.0  1.0        Y0-15
  1. self-perceived health
    ages = ['Y16-24', 'Y25-34', 'Y35-44', 'Y45-54', 'Y55-64',
            'Y65-74', 'Y75-84', 'Y_GE85']
    df_c = eurostat.get_data_df('hlth_silc_02')
    df_c.rename({'geo\\time':'geo'},inplace=True,axis=1)
    df_c['country'] = df_c['geo'].replace(eurostat_dictionary)
    df_c = df_c[df_c.country.isin(EU_countries) & (df_c.sex.isin(["M","F"]) ) & (df_c.levels.isin(["B_VB"])) & (df_c.isced11 == "TOTAL") & df_c.age.isin(ages)]
    df_c.drop(["unit","isced11","geo","levels"],axis=1,inplace=True)
    df_c = pd.melt(df_c,id_vars=['country','sex','age'],
                            value_vars=np.arange(2008,2023),
                            var_name='year',value_name='bad_self_perceived_health')
    # df_c.set_index(['country','year','sex','mortalit'],inplace=True)
    # df_c = df_c.unstack()
    # df_c.reset_index(inplace=True)
    # df_c.columns = [' '.join(col).strip() for col in df_c.columns.values]
    df_c.rename({'age':'age_category'},inplace=True,axis=1)
    # df_c.to_csv('./data/preventable_mortality.csv')
    df_c.head()
    
           country sex age_category  year  bad_self_perceived_health
    0      Austria   F       Y16-24  2008                        1.0
    1      Belgium   F       Y16-24  2008                        1.8
    2     Bulgaria   F       Y16-24  2008                        0.9
    3  Switzerland   F       Y16-24  2008                        0.9
    4       Cyprus   F       Y16-24  2008                        0.2
    
    df = df.merge(df_c, how = 'left',\
                  on = ['country','year','sex','age_category'])
    df.set_index(['nuts2','year','sex','age'],inplace=True)
    df.tail()
    
                                country  population age_category  \
    nuts2 year sex age                                             
    UKM6  2020 M   99.0  United Kingdom         NaN       Y_GE85   
    UKM7  2020 M   99.0  United Kingdom         NaN       Y_GE85   
    UKM8  2020 M   99.0  United Kingdom         NaN       Y_GE85   
    UKM9  2020 M   99.0  United Kingdom         NaN       Y_GE85   
    UKN0  2020 M   99.0  United Kingdom         NaN       Y_GE85   
    
                         bad_self_perceived_health  
    nuts2 year sex age                              
    UKM6  2020 M   99.0                        NaN  
    UKM7  2020 M   99.0                        NaN  
    UKM8  2020 M   99.0                        NaN  
    UKM9  2020 M   99.0                        NaN  
    UKN0  2020 M   99.0                        NaN  
    
    ds = df.to_xarray()
    ds.to_netcdf("./data/population_nuts_2.nc")
    ds
    
    <xarray.Dataset>
    Dimensions:                    (age: 99, nuts2: 306, sex: 2, year: 31)
    Coordinates:
      * nuts2                      (nuts2) object 'AT11' 'AT12' ... 'UKM9' 'UKN0'
      * year                       (year) int64 1990 1991 1992 ... 2018 2019 2020
      * sex                        (sex) object 'F' 'M'
      * age                        (age) float64 1.0 2.0 3.0 4.0 ... 97.0 98.0 99.0
    Data variables:
        country                    (nuts2, year, sex, age) object 'Austria' ... '...
        population                 (nuts2, year, sex, age) float64 1.345e+03 ... nan
        age_category               (nuts2, year, sex, age) object 'Y0-15' ... 'Y_...
        bad_self_perceived_health  (nuts2, year, sex, age) float64 nan nan ... nan
    df = eurostat.get_data_df('demo_r_magec')
    df.rename({'geo\\time':'geo'},inplace=True,axis=1)
    age_range = ['Y1', 'Y10', 'Y11', 'Y12', 'Y13', 'Y14', 'Y15',
           'Y16', 'Y17', 'Y18', 'Y19', 'Y2', 'Y20', 'Y21', 'Y22', 'Y23',
           'Y24', 'Y25', 'Y26', 'Y27', 'Y28', 'Y29', 'Y3', 'Y30', 'Y31',
           'Y32', 'Y33', 'Y34', 'Y35', 'Y36', 'Y37', 'Y38', 'Y39', 'Y4',
           'Y40', 'Y41', 'Y42', 'Y43', 'Y44', 'Y45', 'Y46', 'Y47', 'Y48',
           'Y49', 'Y5', 'Y50', 'Y51', 'Y52', 'Y53', 'Y54', 'Y55', 'Y56',
           'Y57', 'Y58', 'Y59', 'Y6', 'Y60', 'Y61', 'Y62', 'Y63', 'Y64',
           'Y65', 'Y66', 'Y67', 'Y68', 'Y69', 'Y7', 'Y70', 'Y71', 'Y72',
           'Y73', 'Y74', 'Y75', 'Y76', 'Y77', 'Y78', 'Y79', 'Y8', 'Y80',
           'Y81', 'Y82', 'Y83', 'Y84', 'Y85', 'Y86', 'Y87', 'Y88', 'Y89',
           'Y9', 'Y90', 'Y91', 'Y92', 'Y93', 'Y94', 'Y95', 'Y96', 'Y97',
           'Y98', 'Y99']
    df['country_code'] = [x[:2] for x in df['geo']]
    df['country'] = df['country_code'].replace(eurostat_dictionary)
    df = df[df.country.isin(EU_countries) & \
            (df.sex.isin(["M","F"])) & (df.age.isin(age_range))]
    df.drop(['country','country_code'],axis=1,inplace=True)
    df = pd.melt(df,id_vars=['geo','sex','age'],
                            value_vars=np.arange(1990,2020),
                            var_name='year',value_name='deaths')
    df['nuts'] = [len(x)-2 for x in df.geo]
    df = df.query('nuts == 2')
    df.drop('nuts',inplace=True, axis=1)
    df.rename({'geo':'nuts2', 'age':'age string'},\
              axis=1,inplace=True)
    df['age'] = [float(x.replace('Y','')) for x in df['age string']]
    df.drop('age string',inplace=True, axis=1)
    
    df.set_index(['nuts2','year','sex','age'],inplace=True)
    
    df.head()
    
    deaths
    nuts2 year sex age
    AT11 1990 F 1.0 0.0
    AT12 1990 F 1.0 7.0
    AT13 1990 F 1.0 4.0
    AT21 1990 F 1.0 1.0
    AT22 1990 F 1.0 5.0
    ds = df.to_xarray()
    ds.to_netcdf("./data/deaths_nuts_2.nc")
    ds
    
    <xarray.Dataset>
    Dimensions:  (age: 99, nuts2: 305, sex: 2, year: 30)
    Coordinates:
      * nuts2    (nuts2) object 'AT11' 'AT12' 'AT13' 'AT21' ... 'UKM8' 'UKM9' 'UKN0'
      * year     (year) int64 1990 1991 1992 1993 1994 ... 2015 2016 2017 2018 2019
      * sex      (sex) object 'F' 'M'
      * age      (age) float64 1.0 2.0 3.0 4.0 5.0 6.0 ... 95.0 96.0 97.0 98.0 99.0
    Data variables:
        deaths   (nuts2, year, sex, age) float64 0.0 0.0 1.0 1.0 ... nan nan nan nan

3.2.2. mortality lagged per age

ds_population = xr.open_dataset('./data/population_nuts_2.nc')
ds_deaths =  xr.open_dataset('./data/deaths_nuts_2.nc')

ds_mortality = xr.merge([ds_population, ds_deaths])
ds_mortality
<xarray.Dataset>
Dimensions:     (age: 99, nuts2: 305, sex: 2, year: 31)
Coordinates:
  * year        (year) int64 1990 1991 1992 1993 1994 ... 2017 2018 2019 2020
  * nuts2       (nuts2) object 'AT11' 'AT12' 'AT13' ... 'UKM8' 'UKM9' 'UKN0'
  * sex         (sex) object 'F' 'M'
  * age         (age) float64 1.0 2.0 3.0 4.0 5.0 ... 95.0 96.0 97.0 98.0 99.0
Data variables:
    population  (nuts2, year, sex, age) float64 ...
    deaths      (nuts2, year, sex, age) float64 0.0 0.0 1.0 1.0 ... nan nan nan
df_mortality = ds_mortality.to_dataframe()
# drop rows where deaths > population at Jan. 1st
# (e.g. because people moved in the year)
df_mortality = df_mortality[(df_mortality.population >= \
                             df_mortality.deaths) &\
                            (df_mortality.population > 0)]
df_mortality['mortality'] = df_mortality['deaths']/\
    df_mortality['population']
df_mortality.reset_index(inplace=True)
df_mortality.rename({'level_3':'year'},\
                    inplace=True,axis=1)
df_mortality['year'] = pd.to_datetime(df_mortality['year'],\
                                      format='%Y')
df_mortality.head()
age nuts2 sex year population deaths mortality
0 1.0 AT11 F 1990-01-01 1345.0 0.0 0.000000
1 1.0 AT11 F 1991-01-01 1371.0 0.0 0.000000
2 1.0 AT11 F 1992-01-01 1436.0 1.0 0.000696
3 1.0 AT11 F 1993-01-01 1372.0 0.0 0.000000
4 1.0 AT11 F 1994-01-01 1349.0 1.0 0.000741
new_column = (df_mortality.set_index(['year','age','nuts2','sex']).\
              groupby(['age','nuts2','sex'])['mortality']\
              .shift(1)).to_frame()
new_column.reset_index(inplace=True)
new_column.age += 1
new_column['year'] = new_column['year'].dt.year
new_column.rename({'mortality':'lagged_mortality'},\
                  inplace=True,axis=1)
new_column.set_index(['year','age','nuts2','sex'],\
                     inplace=True)
new_column.head()
lagged_mortality
year age nuts2 sex
1990 2.0 AT11 F NaN
1991 2.0 AT11 F 0.000000
1992 2.0 AT11 F 0.000000
1993 2.0 AT11 F 0.000696
1994 2.0 AT11 F 0.000000
ds = new_column.to_xarray()
ds.to_netcdf("./data/lagged_mortality_nuts_2.nc")
ds
<xarray.Dataset>
Dimensions:           (age: 99, nuts2: 297, sex: 2, year: 30)
Coordinates:
  * year              (year) int64 1990 1991 1992 1993 ... 2016 2017 2018 2019
  * age               (age) float64 2.0 3.0 4.0 5.0 6.0 ... 97.0 98.0 99.0 100.0
  * nuts2             (nuts2) object 'AT11' 'AT12' 'AT13' ... 'UKM9' 'UKN0'
  * sex               (sex) object 'F' 'M'
Data variables:
    lagged_mortality  (year, age, nuts2, sex) float64 nan nan nan ... nan nan

3.2.3. Poverty measures

  1. Severe material deprivation rate by NUTS 2 regions [TGS00104]
    df = eurostat.get_data_df('tgs00104')
    df.rename({'geo\\time':'nuts2'},inplace=True,axis=1)
    df = df[(df.unit == "PC")]
    df.drop(["unit"],axis=1,inplace=True)
    df['nuts'] = [len(x)-2 for x in df.nuts2]
    df = df[df.nuts == 2]
    df.drop('nuts',axis=1,inplace=True)
    
    df = pd.melt(df,id_vars=['nuts2'],
                            value_vars=np.arange(2009,2021),
                            var_name='year',\
                 value_name='percentage_material_deprivation')
    df.set_index(['nuts2','year'],inplace=True)
    df.head()
    
    percentage_material_deprivation
    nuts2 year
    AL01 2009 NaN
    AL02 2009 NaN
    AL03 2009 NaN
    AT11 2009 NaN
    AT12 2009 NaN
    ds_material_deprivation = df.to_xarray()
    ds_material_deprivation.to_netcdf("./data/material_deprivation_nuts_2.nc")
    ds_material_deprivation
    
    <xarray.Dataset>
    Dimensions:                          (nuts2: 179, year: 12)
    Coordinates:
      * nuts2                            (nuts2) object 'AL01' 'AL02' ... 'SK04'
      * year                             (year) int64 2009 2010 2011 ... 2019 2020
    Data variables:
        percentage_material_deprivation  (nuts2, year) float64 nan nan ... 9.6 8.3
  2. At-risk-of-poverty rate by NUTS 2 regions [TGS00103]
    df = eurostat.get_data_df('tgs00103')
    df.rename({'geo\\time':'nuts2'},inplace=True,axis=1)
    df = df[(df.unit == "PC_POP")]
    df.drop(["unit"],axis=1,inplace=True)
    df['nuts'] = [len(x)-2 for x in df.nuts2]
    df = df[df.nuts == 2]
    df.drop('nuts',axis=1,inplace=True)
    
    df = pd.melt(df,id_vars=['nuts2'],
                            value_vars=np.arange(2010,2022),
                            var_name='year',
                 value_name='at risk of poverty')
    df.set_index(['nuts2','year'],inplace=True)
    df.head()
    
    at risk of poverty
    nuts2 year
    AL01 2010 NaN
    AL02 2010 NaN
    AL03 2010 NaN
    BG31 2010 30.5
    BG32 2010 28.8
    ds_at_risk_of_poverty = df.to_xarray()
    ds_at_risk_of_poverty.to_netcdf("./data/at_risk_of_poverty_nuts_2.nc")
    ds_at_risk_of_poverty
    
    <xarray.Dataset>
    Dimensions:             (nuts2: 173, year: 12)
    Coordinates:
      * nuts2               (nuts2) object 'AL01' 'AL02' 'AL03' ... 'SK03' 'SK04'
      * year                (year) int64 2010 2011 2012 2013 ... 2018 2019 2020 2021
    Data variables:
        at risk of poverty  (nuts2, year) float64 nan nan nan nan ... 17.0 16.5 nan

3.2.4. unmet needs

Self-reported unmet needs for medical examination by main reason declared and NUTS 2 regions

value  
TOOEXP Too expensive
TOOFAR Too far to travel
TOOEFW Too expensive or too far to travel or waiting list
NOTIME No time
NO_UNMET No unmet needs to declare
NOKNOW Didn't know any good doctor or specialist
WAITING Waiting list
FEAR Fear of doctor, hospital, examination or treatment
HOPING Wanted to wait and see if problem got better on its own
OTH Other

We drop the values NO_UNMET and OTH. The former is dropped because we are interested in unmet medical needs. The latter because the reason is unclear.

df = eurostat.get_data_df('hlth_silc_08_r')
df.rename({'geo\\time':'nuts2'},inplace=True,axis=1)
df.drop(["unit"],axis=1,inplace=True)
df['nuts'] = [len(x)-2 for x in df.nuts2]
df = df[df.nuts == 2]
df.drop('nuts',axis=1,inplace=True)
columns = ['FEAR','TOOEXP','TOOFAR','WAITING',\
           'HOPING', 'NOTIME', 'NOKNOW']
df=df[df.reason.isin(columns)]
df = pd.melt(df,id_vars=['nuts2','reason'],
                        value_vars=np.arange(2009,2021),
                        var_name='year',\
             value_name='unmet needs')
df.set_index(['nuts2','year','reason'],inplace=True)

df = df.unstack()
df.columns = df.columns.droplevel()
df['UNMET'] = df[columns].sum(axis=1)
df.drop(['FEAR','TOOFAR','WAITING',\
           'HOPING', 'NOTIME', 'NOKNOW'],axis=1,inplace=True)
df.tail()
reason TOOEXP UNMET
nuts2 year
SK04 2016 0.5 7.4
2017 0.2 5.5
2018 0.4 6.2
2019 0.4 6.6
2020 0.2 6.0
ds_unmet_needs = df.to_xarray()
ds_unmet_needs.to_netcdf("./data/unmet_needs_nuts_2.nc")
ds_unmet_needs
<xarray.Dataset>
Dimensions:  (nuts2: 106, year: 12)
Coordinates:
  * nuts2    (nuts2) object 'AL01' 'AL02' 'AL03' 'BG31' ... 'SK02' 'SK03' 'SK04'
  * year     (year) int64 2009 2010 2011 2012 2013 ... 2016 2017 2018 2019 2020
Data variables:
    TOOEXP   (nuts2, year) float64 nan nan nan nan nan ... 0.5 0.2 0.4 0.4 0.2
    UNMET    (nuts2, year) float64 0.0 0.0 0.0 0.0 0.0 ... 7.4 5.5 6.2 6.6 6.0

3.2.5. infant mortality

df = eurostat.get_data_df('demo_r_minfind')
df.rename({'geo\\time':'nuts2'},inplace=True,axis=1)
df.drop(["unit"],axis=1,inplace=True)
df['nuts'] = [len(x)-2 for x in df.nuts2]
df = df[df.nuts == 2]
df.drop('nuts',axis=1,inplace=True)
df = pd.melt(df,id_vars=['nuts2'],
                        value_vars=np.arange(2009,2022),
                        var_name='year',\
             value_name='infant mortality')
df.set_index(['nuts2','year'],inplace=True)

# df = df.unstack()
# df.columns = df.columns.droplevel()
# df['UNMET'] = df[columns].sum(axis=1)
# df.drop(['FEAR','TOOFAR','WAITING',\
#            'HOPING', 'NOTIME', 'NOKNOW'],axis=1,inplace=True)
df.head()
            infant mortality
nuts2 year                  
AL01  2009               NaN
AL02  2009               NaN
AL03  2009               NaN
AT11  2009               0.9
AT12  2009               4.2
ds_infant_mortality = df.to_xarray()
ds_infant_mortality.to_netcdf("./data/infant_mortality_nuts_2.nc")
ds_infant_mortality
<xarray.Dataset>
Dimensions:           (nuts2: 340, year: 13)
Coordinates:
  * nuts2             (nuts2) object 'AL01' 'AL02' 'AL03' ... 'UKM9' 'UKN0'
  * year              (year) int64 2009 2010 2011 2012 ... 2018 2019 2020 2021
Data variables:
    infant mortality  (nuts2, year) float64 nan nan nan nan ... 4.2 nan nan nan

4. Combining data

The sections below generate the datasets per variable. Once these datasets are generated, the code in this section merges all these variables into one .csv file.

4.1. nuts2 data

In this subsection, we merge the data at the NUTS 2 level.

from glob import glob
ds_2 = xr.merge([xr.open_dataset(f) for f in glob('./data/*.nc')])
ds_2

<xarray.Dataset>
Dimensions:                          (age: 100, nuts2: 348, sex: 2, year: 32)
Coordinates:
  * nuts2                            (nuts2) object 'AL01' 'AL02' ... 'UKN0'
  * year                             (year) int64 1990 1991 1992 ... 2020 2021
  * age                              (age) float64 1.0 2.0 3.0 ... 99.0 100.0
  * sex                              (sex) object 'F' 'M'
Data variables:
    infant mortality                 (nuts2, year) float64 nan nan ... nan nan
    country                          (nuts2, year, sex, age) object nan ... nan
    population                       (nuts2, year, sex, age) float64 nan ... nan
    age_category                     (nuts2, year, sex, age) object nan ... nan
    bad_self_perceived_health        (nuts2, year, sex, age) float64 nan ... nan
    at risk of poverty               (nuts2, year) float64 nan nan ... nan nan
    lagged_mortality                 (year, age, nuts2, sex) float64 nan ... nan
    percentage_material_deprivation  (nuts2, year) float64 nan nan ... nan nan
    deaths                           (nuts2, year, sex, age) float64 nan ... nan
    TOOEXP                           (nuts2, year) float64 nan nan ... nan nan
    UNMET                            (nuts2, year) float64 nan nan ... nan nan
df = ds_2.to_dataframe()
df.reset_index(inplace=True)
df.rename({'level_0':'age','level_1':'nuts2','level_3':'year'},\
          axis=1,inplace=True)
df['country_code'] = df['nuts2'].str.slice(0,2)
df['nuts1'] = df['nuts2'].str.slice(0,3)
df['country'] = df['country_code'].replace(eurostat_dictionary)
df.head()
   age nuts2 sex  year  infant mortality  country  population age_category  \
0  1.0  AL01   F  1990               NaN  Albania         NaN          NaN   
1  1.0  AL01   F  1991               NaN  Albania         NaN          NaN   
2  1.0  AL01   F  1992               NaN  Albania         NaN          NaN   
3  1.0  AL01   F  1993               NaN  Albania         NaN          NaN   
4  1.0  AL01   F  1994               NaN  Albania         NaN          NaN   

   bad_self_perceived_health  at risk of poverty  lagged_mortality  \
0                        NaN                 NaN               NaN   
1                        NaN                 NaN               NaN   
2                        NaN                 NaN               NaN   
3                        NaN                 NaN               NaN   
4                        NaN                 NaN               NaN   

   percentage_material_deprivation  deaths  TOOEXP  UNMET country_code nuts1  
0                              NaN     NaN     NaN    NaN           AL   AL0  
1                              NaN     NaN     NaN    NaN           AL   AL0  
2                              NaN     NaN     NaN    NaN           AL   AL0  
3                              NaN     NaN     NaN    NaN           AL   AL0  
4                              NaN     NaN     NaN    NaN           AL   AL0  

4.2. country data

In this section, we add country level variables to the dataframe.

from glob import glob
ds_c = xr.merge([xr.open_dataset(f) for f in glob('./data/country/*.nc')])
ds_c

<xarray.Dataset>
Dimensions:                        (country: 33, country_code: 33, year: 28)
Coordinates:
  * year                           (year) int64 1992 1993 1994 ... 2018 2019
  * country                        (country) object 'Austria' ... 'United Kin...
  * country_code                   (country_code) object 'AT' 'BA' ... 'SK' 'UK'
Data variables:
    HF2_PC_CHE                     (country, country_code, year) float64 nan ...
    HF2_PPS_HAB                    (country, country_code, year) float64 nan ...
    HF3_PC_CHE                     (country, country_code, year) float64 nan ...
    HF3_PPS_HAB                    (country, country_code, year) float64 nan ...
    health expenditure per capita  (country, country_code, year) float64 nan ...
df_country = ds_c.to_dataframe()
df_country.reset_index(inplace=True)
df_country.rename({'level_0':'country','level_1':'country_code',\
                   'level_2':'year'},axis=1,inplace=True)
df_country.head()
   country country_code  year  HF2_PC_CHE  HF2_PPS_HAB  HF3_PC_CHE  \
0  Austria           AT  1992         NaN          NaN         NaN   
1  Austria           AT  1993         NaN          NaN         NaN   
2  Austria           AT  1994         NaN          NaN         NaN   
3  Austria           AT  1995         NaN          NaN         NaN   
4  Austria           AT  1996         NaN          NaN         NaN   

   HF3_PPS_HAB  health expenditure per capita  
0          NaN                            NaN  
1          NaN                            NaN  
2          NaN                            NaN  
3          NaN                            NaN  
4          NaN                            NaN  
df = df.merge(df_country, how = 'left',\
              on = ['country','year','country_code'])
# drop rows where deaths > population at Jan. 1st
# (e.g. because people moved in the year)
df = df[(df.population >= df.deaths) & (df.population > 0)]
df['mortality'] = df.deaths/df.population

df.head()
     age nuts2 sex  year  infant mortality  country  population age_category  \
192  1.0  AT11   F  1990               NaN  Austria      1345.0        Y0-15   
193  1.0  AT11   F  1991               NaN  Austria      1371.0        Y0-15   
194  1.0  AT11   F  1992               NaN  Austria      1436.0        Y0-15   
195  1.0  AT11   F  1993               NaN  Austria      1372.0        Y0-15   
196  1.0  AT11   F  1994               NaN  Austria      1349.0        Y0-15   

     bad_self_perceived_health  at risk of poverty  ...  TOOEXP  UNMET  \
192                        NaN                 NaN  ...     NaN    NaN   
193                        NaN                 NaN  ...     NaN    NaN   
194                        NaN                 NaN  ...     NaN    NaN   
195                        NaN                 NaN  ...     NaN    NaN   
196                        NaN                 NaN  ...     NaN    NaN   

     country_code  nuts1  HF2_PC_CHE HF2_PPS_HAB HF3_PC_CHE  HF3_PPS_HAB  \
192            AT    AT1         NaN         NaN        NaN          NaN   
193            AT    AT1         NaN         NaN        NaN          NaN   
194            AT    AT1         NaN         NaN        NaN          NaN   
195            AT    AT1         NaN         NaN        NaN          NaN   
196            AT    AT1         NaN         NaN        NaN          NaN   

     health expenditure per capita  mortality  
192                            NaN   0.000000  
193                            NaN   0.000000  
194                            NaN   0.000696  
195                            NaN   0.000000  
196                            NaN   0.000741  

[5 rows x 23 columns]

Finally, we save the merged NUTS 2 and country data into a .csv file.

df.to_csv('./data/data_deaths_by_age_nuts_2.csv')

Author: Jan Boone

Validate