Screencasts for Seminar Datascience for Economics Part 2

Under construction

This page contains the screencasts for the second part of the course Seminar Datascience for Economics.

The videos go through the questions in this notebook.

Table of Contents

Dealing with data

mortality

Topics we cover in this video:

  • use eurostat.get_data_df() to download variables directly into a pandas dataframe
  • use the .unique() method to find the distinct values of a variable
  • .rename() to give a column a different name
  • .replace() to change the (row) values of a column
    • note that eurostat has changed the format of its data since the video was made
    • the column that needs to be renamed is now denoted by geo\TIME_PERIOD instead of geo\time as in the video
  • use pd.melt() to "change columns into rows"
    • also here the eurostat interface has changed: the year-columns are strings (these were integers when the video was recorded)
    • this leads to 2 changes:
      • in the melt statement we use value_vars = [str(y) for y in np.arange(2011,2018)] as we need to refer to strings like "2011"
      • when we want to refer to year as an integer we need: df['year'] = df.year.astype('int')
  • use .set_index() together with .unstack() to get variables in rows "into different columns"
  • .reset_index() moves the (hierarchical) index back into columns, so we get a dataframe with a default integer index
  • and then renaming columns with df.columns = [' '.join(col).strip() for col in df.columns.values]; the sketch below puts these steps together
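
A minimal sketch of this pipeline; the dataset code 'hlth_cd_apr' and the id columns are assumptions, so adapt them to the table you actually download:

    import numpy as np
    import pandas as pd
    import eurostat

    # download the data; the dataset code is a placeholder, not necessarily
    # the table used in the video
    df = eurostat.get_data_df('hlth_cd_apr')

    # the first column is nowadays called 'geo\TIME_PERIOD'
    df.rename({'geo\\TIME_PERIOD': 'geo'}, inplace=True, axis=1)

    # the year columns are strings in the current eurostat format
    df = pd.melt(df,
                 id_vars=['unit', 'sex', 'geo'],  # assumed id columns
                 value_vars=[str(y) for y in np.arange(2011, 2018)],
                 var_name='year', value_name='mortality')
    df['year'] = df.year.astype('int')

    # turn the values of 'sex' into separate columns
    df = df.set_index(['geo', 'year', 'sex']).unstack()
    df.columns = [' '.join(col).strip() for col in df.columns.values]
    df = df.reset_index()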

Questions you should be able to answer before continuing:

  • consider the following two statements, where we replace/rename a label:
    • df['country'] = df['geo'].replace(eurostat_dictionary)
    • df.rename({'geo\\TIME_PERIOD':'geo'},inplace=True, axis=1)
  • why does the second statement use inplace = True and the first does not?

GDP

Topics we cover in this video:

  • same as above to create the dataframe for GDP
  • we call this dataframe df_n so that we can merge this data in the next step
  • use a merge statement to combine this GDP data with the mortality data above (see the sketch below)
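
A hedged sketch of such a merge, assuming both dataframes contain the key columns geo and year:

    # merge the GDP data (df_n) into the mortality dataframe df;
    # the key columns 'geo' and 'year' are assumptions
    df = df.merge(df_n, on=['geo', 'year'], how='left')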

Questions you should be able to answer before continuing:

  • why do we only need a melt statement for GDP and not unstack?

Income quantiles

Topics we cover in this video:

  • same as above

Questions you should be able to answer before continuing:

  • after merging the income quantiles data with the dataframe created above, in the video we get the columns year_x and year_y. Rename year_x to year and delete the column year_y
  • [advanced] after we merged the GDP and mortality data above, we did not get columns year_x and year_y: why this difference with the merge in this video?
    • what could have been done differently in the video to avoid the year_x and year_y columns?

lifestyle variables

Topics we cover in this video:

  • downloading the data for BMI and alcohol consumption is similar to what we did before: the columns that we want to "turn into rows" are years
  • hence we can use np.arange(2011,2021) or, even simpler, ['2014'] (if we only want one year) in the melt statement
    • in the video, year is an integer and we use [2014]
  • at the time of making the video, the Eurostat data for smoking has the country abbreviations as columns that we want to "turn into rows"
    • here we use df_n.columns[4:] in the melt statement
    • nowadays the structure of the lifestyle data is similar to what we saw above with the column geo\TIME_PERIOD
    • the melt statement should now use value_vars = ['2014'] (see the sketch below)
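
A minimal sketch of such a melt statement under the current eurostat format; the id columns and the value name are assumptions:

    import pandas as pd

    df_n = pd.melt(df_n,
                   id_vars=['unit', 'sex', 'geo'],  # assumed id columns
                   value_vars=['2014'],             # year columns are strings
                   var_name='year', value_name='smoking')
    df_n['year'] = df_n.year.astype('int')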

Questions you should be able to answer before continuing:

  • download the alcohol data on your own
  • merge the lifestyle variables into the dataframe df with all the variables we downloaded above
  • save your dataframe df into the folder data (create this folder if you do not have it yet):
    • df.to_csv('./data/health_data.csv')
  • we will use this data below.

Bayesian statistics

Introduction

Topics we cover in this video:

  • Bayes' Theorem: \(P(A|B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}\)
  • we will want to estimate parameters \(\theta\) for a model; the posterior distribution for \(\theta\) is then given by:
\begin{equation} P(\theta | \text{data}) = \frac{P(\text{data}|\theta)\,P(\theta)}{P(\text{data})} \end{equation}
  • below we will use pymc (imported as pm) to derive the posterior distribution \(P(\theta | \text{data})\):
    • the probability distribution of the parameters \(\theta\) given the data that we have
  • we use Bayes' theorem to find the probability that someone is a vampire, conditional on testing positive (a worked example follows below)
  • note that the video uses the older library pymc3, but that will give an error on the university jupyterlab server
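
As a worked example of Bayes' theorem, the vampire calculation takes a few lines; the numbers below are illustrative assumptions, not necessarily those used in the video:

    # P(positive | vampire), P(positive | mortal) and P(vampire) are assumed
    p_pos_vampire = 0.95
    p_pos_mortal = 0.01
    p_vampire = 0.001

    # law of total probability: P(positive)
    p_pos = p_pos_vampire * p_vampire + p_pos_mortal * (1 - p_vampire)

    # Bayes' theorem: P(vampire | positive)
    print(p_pos_vampire * p_vampire / p_pos)  # approx. 0.087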

Questions you should be able to answer before continuing:

  • suppose you throw a coin \(n\) times and \(p\) is the probability of "head"; what is the probability distribution of \(y\) (number of heads in \(n\) throws) called?

Simple Bayesian model

Topics we cover in this video:

  • binomial distribution
  • we have data on an experiment where a coin is thrown up to 20 times; we see how the posterior evolves after 1, 2, 3, 4, 5 and 20 throws
  • do not worry about the pymc syntax yet (this will be explained in later videos)
  • focus on the figure showing how the posterior distribution for p evolves and try to understand how different values for \(y\) affect this distribution
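
For reference, a minimal version of such a model could look as follows; the flat prior and the data (12 heads in 20 throws) are assumptions:

    import pymc as pm

    y, n = 12, 20  # assumed data: 12 heads in 20 throws

    with pm.Model() as coin_model:
        p = pm.Uniform('p', 0, 1)                       # flat prior on p
        obs = pm.Binomial('obs', n=n, p=p, observed=y)  # binomial likelihood
        trace = pm.sample()

    # posterior probability that p > 0.8
    print((trace.posterior['p'] > 0.8).mean().item())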

Questions you should be able to answer before continuing:

  • choose different values for y_values and see how this affects the development of the posterior distribution
  • calculate for this posterior the probability that \(p>0.8\) after 20 throws with the \(y\) value you have chosen here
    • is it bigger or smaller than the probability we find in the video? Why?

Estimating a Bayesian model

formulating the model

Topics we cover in this video:

  • start pymc model as follows: with pm.Model() as model_name:
  • then give the priors for the parameters of the model: slope = pm.Normal('slope', 0, 1)
  • link the model to the observed variable in the data, here df.y (column y from dataframe df):
    • obs = pm.Normal('obs', mu = mu, sigma = sigma, observed = df.y)
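
A minimal sketch of a complete model along these lines, with simulated data so that it runs on its own; the priors and variable names are assumptions:

    import numpy as np
    import pandas as pd
    import pymc as pm

    # simulated data, only to make the sketch self-contained
    rng = np.random.default_rng(0)
    df = pd.DataFrame({'x': rng.normal(size=100)})
    df['y'] = 1.0 + 0.8 * df.x + rng.normal(scale=0.5, size=100)

    with pm.Model() as linear_model:
        constant = pm.Normal('constant', 0, 1)  # priors for the parameters
        slope = pm.Normal('slope', 0, 1)
        sigma = pm.HalfNormal('sigma', 1)
        mu = constant + slope * df.x.values     # expected value of y
        obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=df.y.values)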

Questions you should be able to answer before continuing:

  • use pm.sample? to see how you can modify the number of draws from the posterior distribution

sampling from the model

Topics we cover in this video:

  • sample from the posterior with trace = pm.sample()
  • the posterior predictive gives, for each observation ('row') in your data, the prediction of the 'y' variable for that observation
  • this is useful to check the fit of the model
  • use pm.sample_posterior_predictive(trace, extend_inferencedata = True) to generate the posterior predictive
    • in the older pymc3 syntax, the statement was: pm.sample_posterior_predictive(trace, var_names = ['obs'])
    • to capture the trace and the posterior predictive in one object we used az.from_pymc3() in the video, but this can now be achieved with extend_inferencedata = True (see the sketch below)
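
Continuing with the hypothetical linear_model from the sketch above:

    with linear_model:
        trace = pm.sample()  # draws from the posterior
        pm.sample_posterior_predictive(trace, extend_inferencedata=True)

    # the posterior predictive is now stored inside trace
    print(trace.posterior_predictive)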

Questions you should be able to answer before continuing:

  • choose different parameter values in the function generating_data() and see how the posterior distributions for the parameters change
    • what happens to the posterior distributions for the parameters when you increase the number of observations n?
    • what happens if the parameter values chosen do not overlap with the priors specified in the model?
  • explain the three characteristics of the trace plots that need to be satisfied to trust the sampling process
  • what is "MCMC" the abbreviation of?

summarizing posterior distribution and model fit

Topics we cover in this video:

  • use of az.summary() to summarize the posterior in a table
  • the interpretation of r_hat in this table
  • plot the original data and the 95% intervals to judge the fit of the model
    • use of np.percentile() to find the percentiles
    • matplotlib.pyplot's vlines to plot the percentiles as a vertical line
    • note that with the new pymc syntax, the posterior_predictive is now part of trace because we said extend_inferencedata = True in pm.sample_posterior_predictive()
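
A sketch of these steps, building on the hypothetical linear_model, trace and df from the sketches above:

    import arviz as az
    import matplotlib.pyplot as plt
    import numpy as np

    print(az.summary(trace))  # contains hdi_3%, hdi_97% and r_hat

    # 95% interval of the posterior predictive for each observation;
    # 'obs' is the name given to the observed variable in the model
    ppc = trace.posterior_predictive['obs'].values
    lower, upper = np.percentile(ppc, [2.5, 97.5], axis=(0, 1))

    plt.vlines(df.x, lower, upper, color='grey', alpha=0.2)  # 95% intervals
    plt.scatter(df.x, df.y)                                  # original data
    plt.show()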

Questions you should be able to answer before continuing:

  • what do hdi_3% and hdi_97% mean?
  • what does alpha = 0.2 mean in a plot statement?

Eurostat data

Topics we cover in this video:

  • loading the data with pandas' read_csv()
  • plotting the data with a matplotlib color map
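
A short sketch; the file name matches the one we saved above, while the column names are assumptions:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv('./data/health_data.csv')

    # one colour per country from a matplotlib colour map
    countries = df.country.unique()  # 'country' column is an assumption
    for i, country in enumerate(countries):
        subset = df[df.country == country]
        plt.plot(subset.year, subset['mortality'],  # column name assumed
                 color=plt.cm.viridis(i / len(countries)))
    plt.show()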

Income inequality and standardizing data

Topics we cover in this video:

  • difference between np.log(df[income_quantiles].mean(axis = 1)) and (np.log(df[income_quantiles])).mean(axis = 1)
    • i.e. order of taking the logarithm and the mean of the quantiles
  • using dropna() to delete observations with missing values (which is, actually, a bad idea)
  • defining a function standardize() which subtracts the mean of a variable and divides by the standard deviation, and then applying this function to the variables in the dataframe (see the sketch below)
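
A minimal sketch of this function; the list of column names is an assumption:

    def standardize(x):
        """Subtract the mean and divide by the standard deviation."""
        return (x - x.mean()) / x.std()

    # apply the function column by column; the column names are assumptions
    variables = ['log_mortality', 'log_gdp', 'inequality']
    df[variables] = df[variables].apply(standardize)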

Questions you should be able to answer before continuing:

  • what are advantages of standardizing your data?

Bayesian Model

Topics we cover in this video:

  • we build a model to explain how log-mortality varies between countries and genders
  • we use 'mortality' instead of 'obs' when linking the model to the observed (standardized) variable log_mortality
  • in the old pymc3 syntax, you could use the keyword sd to specify the standard deviation of a normal distribution. In pymc the keyword is sigma, but you do not actually need to specify the keyword. Thus constant = pm.Normal('constant', 0.0, 1.0) is the same as constant = pm.Normal('constant', mu = 0.0, sigma = 1.0).
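
A hedged sketch of such a model, not the exact specification from the video; the explanatory variable and the priors are assumptions:

    import pymc as pm

    with pm.Model() as mortality_model:
        # coefficients can be negative or positive: normal priors
        constant = pm.Normal('constant', 0.0, 1.0)
        b_gdp = pm.Normal('b_gdp', 0.0, 1.0)
        # a standard deviation must be positive: half-normal prior
        sigma = pm.HalfNormal('sigma', 1.0)
        mu = constant + b_gdp * df.log_gdp.values
        mortality = pm.Normal('mortality', mu=mu, sigma=sigma,
                              observed=df.log_mortality.values)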

Questions you should be able to answer before continuing:

  • explain why some priors are based on a normal distribution and others on a half-normal distribution
  • use pm.sample? or use Google to find out what the tune keyword does

Checking model

Topics we cover in this video:

  • plot the original data together with the 95% prediction interval for log mortality

Questions you should be able to answer before continuing:

  • check the trace plots for the coefficients in variables_health
  • when variables are standardized, what does a mean coefficient of, say, -0.802 mean?
  • re-estimate the model with preventable mortality as variable to be explained and repeat all the steps above.

Interactive plot with Altair

Topics we cover in this video:

  • make an interactive plot with altair's Chart(df) combined with .encode() and .interactive()
  • define a year selection with alt.selection_single() (see the sketch below)
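
A sketch of such a chart; the column names are assumptions, and newer versions of altair prefer alt.selection_point() over alt.selection_single():

    import altair as alt

    select_year = alt.selection_single(
        fields=['year'], name='select',
        bind=alt.binding_range(min=2011, max=2017, step=1))

    chart = (alt.Chart(df)
             .mark_circle()
             .encode(x='log_gdp', y='log_mortality', color='country')
             .add_selection(select_year)
             .transform_filter(select_year)
             .interactive())
    chart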

Questions you should be able to answer before continuing:

  • make the same interactive plot with inequality on the horizontal axis

Dealing with missing data the Bayesian way

Masked arrays

Topics we cover in this video:

  • creating a masked array with np.ma.array()
  • if x is a masked array, use x.compressed() to get the non-masked values as a plain array
  • use np.ma.masked_invalid() to turn an existing array into a masked array
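
A short sketch of both constructions:

    import numpy as np

    x = np.ma.array([1.0, 2.0, 3.0, 4.0], mask=[0, 0, 1, 1])
    print(x.compressed())  # [1. 2.]: only the non-masked values

    y = np.ma.masked_invalid(np.array([1.0, np.nan, 3.0]))
    print(y.mask)          # [False  True False]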

Questions you should be able to answer before continuing:

  • if a masked array has 4 entries and mask = [0, 0, 1, 1], which entries are missing?

Estimating model with missing observations

Topics we cover in this video:

  • specify the priors for the missing variables using the observed keyword
  • use different names (e.g. capitalized) for the variables with missing values, where the missing values are drawn from the prior distribution (see the sketch below)
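
A hedged sketch of a model with missing x-values; the data and priors are assumptions. Passing a masked array as observed makes pymc treat the masked entries as parameters drawn from the prior of X:

    import numpy as np
    import pymc as pm

    x_obs = np.ma.masked_invalid(np.array([0.5, np.nan, -0.3, np.nan]))
    y_obs = np.array([0.4, 0.1, -0.3, 0.8])  # assumed data

    with pm.Model() as missing_model:
        # capitalized name for the variable with missing values
        X = pm.Normal('X', 0, 1, observed=x_obs)
        slope = pm.Normal('slope', 0, 1)
        sigma = pm.HalfNormal('sigma', 1)
        obs = pm.Normal('obs', mu=slope * X, sigma=sigma, observed=y_obs)
        trace = pm.sample()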

Questions you should be able to answer before continuing:

  • summarize the posterior distribution with a table

Bayesian time series

Covid data

Getting the data from Johns Hopkins University

Topics we cover in this video:

Questions you should be able to answer before continuing:

  • select data from a different country and different time period

Bayesian exponential model

Topics we cover in this video:

  • specifying a non-linear relationship \(y = a(1+b)^t\) to be estimated
  • sample from the prior predictive distribution; these are predictions based on the prior only, without using the data
  • plotting 50 samples from the prior-predictive to check whether the predictions look reasonable
  • use of zip() with two lists to generate "coordinates"
  • plotting the 95% prediction intervals
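
A minimal sketch of such a model with simulated data; the priors are assumptions, not those from the video:

    import matplotlib.pyplot as plt
    import numpy as np
    import pymc as pm

    # simulated daily case counts, only to make the sketch self-contained
    t = np.arange(30)
    cases = 100 * 1.1 ** t

    with pm.Model() as exp_model:
        a = pm.Normal('a', 100, 30)    # cases at t = 0 (assumed prior)
        b = pm.Normal('b', 0.1, 0.05)  # daily growth rate (assumed prior)
        sigma = pm.HalfNormal('sigma', 100)
        mu = a * (1 + b) ** t          # the non-linear relation y = a(1+b)^t
        obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=cases)
        prior_pred = pm.sample_prior_predictive()

    # plot 50 samples from the prior predictive
    for i in range(50):
        plt.plot(t, prior_pred.prior_predictive['obs'][0, i],
                 color='grey', alpha=0.3)
    plt.show()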

Questions you should be able to answer before continuing:

  • draw prior-predictive samples from a model based on \(y = e^{bt}\)
  • create triples (1,4,7), (2,5,8) etc. from the lists [1,2,3], [4,5,6] and [7,8,9]

Bayesian logistic model

Topics we cover in this video:

  • rewrite your model in terms of parameters that simplify the choice of prior distributions for these parameters
  • then use pm.Deterministic() to generate the parameter that you actually need in your model
  • here we give the posterior_predictive a new name, instead of adding it to the trace. Both options are possible in pymc
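
A sketch of this reparametrization idea for a logistic curve; the parametrization and the priors are assumptions:

    import numpy as np
    import pymc as pm

    t = np.arange(30)
    cases = 50000 / (1 + np.exp(-0.2 * (t - 15)))  # simulated data

    with pm.Model() as logistic_model:
        # priors on parameters that are easy to think about (assumptions):
        K = pm.Normal('K', 50000, 10000)  # final number of cases
        r = pm.Normal('r', 0.2, 0.1)      # growth rate
        t0 = pm.Normal('t0', 15, 5)       # day of the inflection point
        # the quantity the model actually needs, derived with Deterministic
        mu = pm.Deterministic('mu', K / (1 + pm.math.exp(-r * (t - t0))))
        sigma = pm.HalfNormal('sigma', 1000)
        obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=cases)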

Questions you should be able to answer before continuing:

  • get the trace with pm.sample()
  • check the trace plot and the summary of the trace in a table
  • get the posterior predictive distribution with pm.sample_posterior_predictive()
  • plot the original data together with the 95% posterior predictive interval

Bayesian neural network

Generating the data and the grid

Topics we cover in this video:

  • use of make_moons() from sklearn to generate "nice" data for a supervised learning algorithm
  • use pm.floatX() to get data of the correct type for pymc (and its pytensor backend); don't worry if you do not understand this, usually this step is not necessary
  • use of np.mgrid[] to generate a two dimensional grid of coordinates:
    • np.mgrid[0:1:3j,10:11:4j] generates a \(2 \times 3 \times 4\) tensor of \((x,y)\) coordinates (first dimension),
      • going from 0 to 1 in 3 steps (the complex number j indicates the number of steps instead of the step size; to see this, try np.mgrid[0:1:0.25,10:11:4j] where 0.25 denotes the step size)
      • going from 10 to 11 in 4 steps
    • then np.mgrid[0:1:3j,10:11:4j].reshape(2,-1).T
      • generates twelve (x,y) coordinates, starting with (0,10) and going to (1,11)
      • the "-1" in reshape() means: choose whatever dimension is necessary here for the reshape to work
        • to see this, try .reshape(3,-1) and .reshape(5,-1); why does the latter generate an error?
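
These statements are easy to verify interactively:

    import numpy as np

    grid = np.mgrid[0:1:3j, 10:11:4j]  # shape (2, 3, 4)
    print(grid.shape)

    grid_2d = grid.reshape(2, -1).T    # twelve (x, y) coordinates
    print(grid_2d[0], grid_2d[-1])     # [ 0. 10.] and [ 1. 11.]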

Questions you should be able to answer before continuing:

  • consider the integer pairs \((x,y) = (0,0), (1,0), (0,1), (2,0) \ldots (4,4)\); show that for 19 such pairs \((x,y)\) it is the case that \(x+y > 2\).

Code for the network

Topics we cover in this video:

  • the model follows a standard pymc model, with some additional "bells and whistles" as we are modeling a neural network:
    • the code is mainly provided as illustration; it is not necessary to fully understand the code of the model
    • but the main idea is important: the parameters/weights of a neural network can be thought of as being stochastic and hence we can consider the posterior distribution of the weights and the predictions of the network
    • weights in the first layer weights_in_1 need to "receive" the inputs of the features (columns) in data matrix \(X\) and output to the number of nodes in the first hidden layer. The shape keyword in pm.Normal() takes care of this when defining the prior for these weights
    • testval sometimes helps to get the sampling of the posterior going; it is like providing an initial guess in a function like fsolve(); it is used in the video, but not (anymore) in the code in the notebook
    • we also specify the weights of the first and second hidden layers
    • we define the activation functions and outcomes for the three layers: act_1, act_2, act_out
    • these activations will be defined in terms of our inputs (data) \(X\)
    • in addition to this, we want to plot the predictions of the network on the grid grid_2d that we defined
    • for this we use pm.Deterministic() to define the probability p, which is again based on the three activation functions but now applied to grid_2d and not to data \(X\)
    • our final classification is based on the classification probability act_out which is based on data \(X\); as there are only two classes (blue and red), this is a Bernoulli process (one draw from a binomial distribution)
    • pytensor.shared() makes sure that variables defined outside the pymc model, work inside it as well (again, do not worry about this)
  • in the video we used ADVI to sample the posterior as it is faster than MCMC; the new version of pymc allows us to use the numpyro sampler, which is also fast. If this sampler is not available, just use pm.sample() (it will take a bit more time). A stripped-down sketch of such a network follows below.
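
The following sketch illustrates the shape keyword and pm.math.dot(); it leaves out the grid predictions and other details from the notebook, and the network sizes are assumptions:

    import pymc as pm
    import pytensor
    from sklearn.datasets import make_moons

    X, y = make_moons(n_samples=200, noise=0.1)
    X = pm.floatX(X)
    n_hidden = 5

    # make the data available inside the pymc model
    X_shared = pytensor.shared(X)

    with pm.Model() as bnn:
        # shape connects the 2 features to the hidden nodes
        w_in_1 = pm.Normal('w_in_1', 0, 1, shape=(2, n_hidden))
        w_1_2 = pm.Normal('w_1_2', 0, 1, shape=(n_hidden, n_hidden))
        w_2_out = pm.Normal('w_2_out', 0, 1, shape=(n_hidden,))

        act_1 = pm.math.tanh(pm.math.dot(X_shared, w_in_1))
        act_2 = pm.math.tanh(pm.math.dot(act_1, w_1_2))
        act_out = pm.math.sigmoid(pm.math.dot(act_2, w_2_out))

        # two classes, so a Bernoulli likelihood
        obs = pm.Bernoulli('obs', p=act_out, observed=y)
        trace = pm.sample()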

Questions you should be able to answer before continuing:

  • what is the difference between A*B and pm.math.dot(A,B) for two tensors A and B? [hint: how can you find out what pm.math.dot() does?]

Evaluating the estimated network

Topics we cover in this video:

  • we plot the ELBO to see whether the optimization step in ADVI converged; we get the "inverse J" shape that we would expect
    • with the new version of pymc we do not use ADVI here and hence do not need to check the ELBO plot
  • since we draw weights from our posterior, there is not one decision boundary but a number of them; we plot the first 50 of these boundaries
    • the boundary is the contour ("iso probability line") corresponding to \(p=0.5\)
  • this shows that in some regions we are pretty sure where the boundary lies, while in other regions the boundary varies more across posterior samples

Questions you should be able to answer before continuing:

  • just for illustrative purposes: plot the 0.4 and 0.6 contours

What is a confidence interval

Topics we cover in this video:

  • we use simulations to show what a 95% confidence interval and a 95% credible interval are
  • then we check that the relevant parameter lies in the interval 95% of the time (see the sketch below)
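
A sketch of such a simulation for the confidence interval, with \(\sigma\) known; note that the code in the notebook may work with precision \(= 1/\sigma^2\) instead of \(\sigma\):

    import numpy as np
    from scipy import stats

    mu, sigma = 5.0, 2.0           # true parameters (assumptions)
    n_sample, repetitions = 50, 10000
    z = stats.norm.ppf(0.975)      # approx. 1.96

    rng = np.random.default_rng(0)
    hits = 0
    for _ in range(repetitions):
        sample = rng.normal(mu, sigma, n_sample)
        half_width = z * sigma / np.sqrt(n_sample)  # sigma assumed known
        if abs(sample.mean() - mu) <= half_width:
            hits += 1
    print(hits / repetitions)      # should be close to 0.95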

Questions you should be able to answer before continuing:

  • do the analysis with different values of n_sample, precision; for very small values of precision you may have to increase repetitions
  • if \(\sigma\) is not known, we need to estimate it from the sample; in that case, we need to use the \(t\)-distribution to calculate the confidence interval (instead of a normal distribution). If necessary, use the example here and program this case to check that the confidence interval contains \(\mu\) in 95% of the cases.

Author: Jan Boone