Screencasts for Seminar Datascience for Economics Part 2

Under construction

This page contains the screencasts for the second part of the course Seminar Datascience for Economics.

The videos go through the questions in this notebook.

Table of Contents

Dealing with data

mortality

Topics we cover in this video:

  • use eurostat.get_data_df() to download variables directly into a pandas dataframe
  • use the .unique() method to find the values of a variable
  • .rename() to give a column a different name
  • .replace() to change the (row) values of a column
  • use pd.melt() to "change columns into rows"
  • .set_index() together with .unstack() to get variables in rows "into different columns"
  • reset_index() to move the (hierarchical) index back into columns, leaving a dataframe with a plain index
  • and then renaming columns with df.columns = [' '.join(col).strip() for col in df.columns.values]
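
A minimal sketch of this pipeline; the table code 'hlth_cd_apr' and the column names are assumptions that depend on the Eurostat release, so adjust them to what you see in your own dataframe:

    import eurostat
    import pandas as pd

    # download a table directly into a dataframe (table code is an assumption)
    df = eurostat.get_data_df('hlth_cd_apr')

    # inspect which values a column takes
    print(df['sex'].unique())

    # rename the awkward 'geo\time' column
    df.rename({'geo\\time': 'geo'}, inplace=True, axis=1)

    # turn the year columns into rows
    df = pd.melt(df, id_vars=['unit', 'sex', 'geo'],
                 var_name='year', value_name='mortality')

    # move the values of 'sex' into separate columns
    df = df.set_index(['geo', 'year', 'sex']).unstack(level='sex')
    df = df.reset_index()

    # flatten the resulting column hierarchy
    df.columns = [' '.join(col).strip() for col in df.columns.values]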

Questions you should be able to answer before continuing:

  • consider the following two statements, where we replace/rename a label:
    • df['country'] = df['geo'].replace(eurostat_dictionary)
    • df.rename({'geo\\time':'geo'},inplace=True, axis=1)
  • why does the second statement use inplace = True and the first does not?

GDP

Topics we cover in this video:

  • same as above to create the dataframe for GDP
  • we call this dataframe df_n in order to be able to merge this data in the next step:
  • merge statement to combine this GDP data with the mortality data above
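
A hedged sketch of this step; the key columns 'country' and 'year' are assumptions:

    # combine the GDP data (df_n) with the mortality data (df) on shared keys
    df = df.merge(df_n, on=['country', 'year'])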

Questions you should be able to answer before continuing:

  • why do we only need a melt statement for GDP and not unstack?

Income quantiles

Topics we cover in this video:

  • same as above

Questions you should be able to answer before continuing:

  • after merging the income quantiles data with the dataframe created above, in the video we get the columns year_x and year_y. Rename year_x to year and delete the column year_y
  • [advanced] after we merged the gdp and mortality data above, we did not get columns year_x and year_y: why this difference with the merge in this video?
    • what could we have done differently in the video to avoid the year_x and year_y columns?

lifestyle variables

Topics we cover in this video:

  • downloading the data for BMI and alcohol consumption is similar to what we did before: the columns that we want to "turn into rows" are years
  • hence we can use np.arange(2011,2021) or even simpler [2014] (if we only want one year) in the melt statement
  • at the time of making the video, the Eurostat data for smoking has the country abbreviations as columns that we want to "turn into rows"
  • here we use df_n.columns[4:] in the melt statement
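
Two hedged melt sketches matching these two cases; the id_vars are assumptions:

    import numpy as np
    import pandas as pd

    # BMI/alcohol: the columns we turn into rows are years
    df_long = pd.melt(df_n, id_vars=['country'],
                      value_vars=np.arange(2011, 2021), var_name='year')

    # smoking: the columns we turn into rows are country abbreviations
    df_long = pd.melt(df_n, id_vars=list(df_n.columns[:4]),
                      value_vars=df_n.columns[4:], var_name='geo')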

Questions you should be able to answer before continuing:

  • download the alcohol data on your own
  • merge the lifestyle variables into the dataframe df with all the variables we downloaded above
  • save your dataframe df into the folder data (create this folder if you do not have it yet):
    • df.to_csv('./data/health_data.csv')
  • we will use this data below.

Bayesian statistics

Introduction

Topics we cover in this video:

  • Bayes' Theorem: \(P(A|B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}\)
  • we will want to estimate parameters \(\theta\) for a model; the posterior distribution for \(\theta\) is then given by:
\begin{equation} P(\theta | \text{data}) = \frac{P(\text{data}|\theta)\, P(\theta)}{P(\text{data})} \end{equation}
  • below we will use pymc3 (imported as pm) to derive the posterior distribution \(P(\theta | \text{data})\):
    • the probability distribution of the parameters \(\theta\) given the data that we have
  • we use Bayes' theorem to find the probability that someone is a vampire, conditional on testing positive
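
A quick numerical sketch of the vampire example; the numbers here are hypothetical and may differ from the ones in the video:

    # hypothetical inputs: P(positive | vampire), P(positive | mortal), P(vampire)
    p_pos_vampire, p_pos_mortal, p_vampire = 0.95, 0.01, 0.001

    # P(positive) by the law of total probability
    p_pos = p_pos_vampire * p_vampire + p_pos_mortal * (1 - p_vampire)

    # Bayes' theorem: P(vampire | positive)
    print(p_pos_vampire * p_vampire / p_pos)  # roughly 0.087

Even with an accurate test, the posterior probability stays small because vampires are rare: the low prior \(P(\text{vampire})\) dominates.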

Questions you should be able to answer before continuing:

  • suppose you throw a coin \(n\) times and \(p\) is the probability of "head"; what is the probability distribution of \(y\) (number of heads in \(n\) throws) called?

Simple Bayesian model

Topics we cover in this video:

  • binomial distribution
  • we have data on an experiment where a coin is thrown up to 20 times; we see how the posterior evolves after 1, 2, 3, 4, 5 and 20 throws
  • do not worry about the pymc3 syntax yet (this will be explained in later videos)
  • focus on the figure showing how the posterior distribution for p evolves and try to understand how different values for \(y\) affect this distribution
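
A minimal pymc3 sketch of such a coin model; the prior and the data (14 heads in 20 throws) are assumptions, not the exact code of the video:

    import pymc3 as pm

    y, n = 14, 20  # hypothetical: 14 heads in 20 throws

    with pm.Model() as coin_model:
        p = pm.Uniform('p', 0, 1)                       # prior for P(head)
        obs = pm.Binomial('obs', n=n, p=p, observed=y)  # likelihood
        trace = pm.sample(1000)

    # posterior probability that p > 0.8
    print((trace['p'] > 0.8).mean())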

Questions you should be able to answer before continuing:

  • choose different values for y_values and see how this affects the development of the posterior distribution
  • calculate for this posterior the probability that \(p > 0.8\) after 20 throws with the \(y\) value you have chosen here
    • is it bigger or smaller than the probability we find in the video? Why?

Estimating a Bayesian model

formulating the model

Topics we cover in this video:

  • start a pymc3 model as follows: with pm.Model() as model_name:
  • then give the priors for the parameters of the model: slope = pm.Normal('slope', 0, 1)
  • link the model to the observed variable in the data, here df.y (column y from dataframe df):
    • obs = pm.Normal('obs',mu = mu, sigma = sigma, observed = df.y)
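
Putting the three steps together in one hedged sketch; the priors, sigma, and the column names x and y are assumptions:

    import pymc3 as pm

    with pm.Model() as model_name:
        # priors for the parameters
        slope = pm.Normal('slope', 0, 1)
        intercept = pm.Normal('intercept', 0, 1)
        sigma = pm.HalfNormal('sigma', 1)
        # expected value of y given x
        mu = intercept + slope * df.x
        # link the model to the observed data
        obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=df.y)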

Questions you should be able to answer before continuing:

  • use pm.sample? to see how you can modify the number of draws from the posterior distribution

sampling from the model

Topics we cover in this video:

  • sample from the posterior with trace = pm.sample()
  • the posterior predictive gives, for each observation ('row') in your data, a prediction of the 'y' variable for that observation
  • this is useful to check the fit of the model
  • use pm.sample_posterior_predictive(trace, var_names = ['obs']) to generate the posterior predictive
  • to capture the trace and the posterior predictive into one object use az.from_pymc3()
  • this can be used to save the trace and posterior predictive together
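
A hedged sketch of these steps, reusing the model from the previous video:

    import arviz as az
    import pymc3 as pm

    with model_name:
        # draw samples from the posterior
        trace = pm.sample()
        # one prediction of 'obs' per observation per posterior draw
        ppc = pm.sample_posterior_predictive(trace, var_names=['obs'])
        # bundle trace and posterior predictive into one object
        idata = az.from_pymc3(trace=trace, posterior_predictive=ppc)

    # this object can be saved to disk and reloaded later
    idata.to_netcdf('trace.nc')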

Questions you should be able to answer before continuing:

  • choose different parameter values in the function generating_data() and see how the posterior distributions for the parameters change
    • what happens to the posterior distributions for the parameters when you increase the number of observations n?
    • what happens if the parameter values chosen do not overlap with the priors specified in the model?
  • explain the three characteristics of the trace plots that need to be satisfied to trust the sampling process
  • what is "MCMC" the abbreviation of?

summarizing posterior distribution and model fit

Topics we cover in this video:

  • use of az.summary() to summarize the posterior in a table
  • the interpretation of r_hat in this table
  • plot the original data and the 95% intervals to judge the fit of the model
    • use of np.percentile() to find the percentiles
    • matplotlib.pyplot's vlines to plot the percentiles as a vertical line
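
A hedged sketch of this fit check, continuing with the idata and ppc objects from the previous video (the x and y column names are assumptions):

    import arviz as az
    import matplotlib.pyplot as plt
    import numpy as np

    # table with means, hdi_3%, hdi_97%, r_hat, etc.
    print(az.summary(idata))

    # 95% interval per observation from the posterior predictive draws
    lower = np.percentile(ppc['obs'], 2.5, axis=0)
    upper = np.percentile(ppc['obs'], 97.5, axis=0)

    plt.scatter(df.x, df.y)                    # the original data
    plt.vlines(df.x, lower, upper, alpha=0.2)  # the 95% intervals
    plt.show()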

Questions you should be able to answer before continuing:

  • what do hdi_3% and hdi_97% mean?
  • what does alpha = 0.2 mean in a plot statement?

Eurostat data

Topics we cover in this video:

  • loading the data with pandas's read_csv()
  • plotting the data with a matplotlib color map

Income inequality and standardizing data

Topics we cover in this video:

  • difference between np.log(df[income_quantiles].mean(axis = 1)) and (np.log(df[income_quantiles])).mean(axis = 1)
    • i.e. order of taking the logarithm and the mean of the quantiles
  • using dropna() to delete observations with missing values (which is, actually, a bad idea)
  • defining a function standardize() which subtracts the mean of a variable and divides by standard deviation and then apply this function to the variables in the dataframe
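
A minimal sketch of this step; the list of variables is an assumption:

    def standardize(x):
        # subtract the mean and divide by the standard deviation
        return (x - x.mean()) / x.std()

    variables = ['log_mortality', 'log_gdp', 'inequality']  # hypothetical
    df[variables] = df[variables].apply(standardize)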

Questions you should be able to answer before continuing:

  • what are advantages of standardizing your data?

Bayesian Model

Topics we cover in this video:

  • we build a model to explain how log-mortality varies between countries and genders
  • we use 'mortality' instead of 'obs' when linking the model to the observed (standardized) variable log_mortality

Questions you should be able to answer before continuing:

  • explain why some priors are based on a normal distribution and others on a half-normal distribution
  • use pm.sample? or use google to find out what the tune keyword does

Checking model

Topics we cover in this video:

  • plot the original data together with 95% prediction interval for log mortality

Questions you should be able to answer before continuing:

  • check the trace plots for the coefficients in variables_health
  • when variables are standardized, what does a mean coefficient of, say, -0.802 mean?
  • re-estimate the model with preventable mortality as variable to be explained and repeat all the steps above.

Interactive plot with Altair

Topics we cover in this video:

  • make an interactive plot with altair's Chart(df) method combined with .encode() and .interactive()
  • define a year selection with alt.selection_single()
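
A hedged altair sketch; the column names and the year range are assumptions:

    import altair as alt

    # slider to select a year
    slider = alt.binding_range(min=2011, max=2019, step=1, name='year ')
    select_year = alt.selection_single(fields=['year'], bind=slider,
                                       init={'year': 2015})

    chart = (alt.Chart(df)
             .mark_circle(size=60)
             .encode(x='log_gdp', y='log_mortality',
                     color='country', tooltip=['country'])
             .add_selection(select_year)
             .transform_filter(select_year)
             .interactive())
    chart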

Questions you should be able to answer before continuing:

  • make the same interactive plot with inequality on the horizontal axis

Dealing with missing data the Bayesian way

Masked arrays

Topics we cover in this video:

  • creating a masked array with np.ma.array()
  • if x is a masked array, use x.compressed() to get an array with only the valid (non-masked) entries
  • use np.ma.masked_invalid() to turn an existing array into a masked array
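
A short runnable sketch of these functions:

    import numpy as np

    # mask = 1 means: this entry is missing
    x = np.ma.array([1.0, 2.0, 3.0, 4.0], mask=[0, 0, 1, 1])
    print(x.compressed())   # [1. 2.]: only the valid entries

    # turn an array with NaNs into a masked array
    y = np.ma.masked_invalid(np.array([1.0, np.nan, 3.0]))
    print(y.mask)           # [False  True False]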

Questions you should be able to answer before continuing:

  • if a masked array has 4 entries and mask = [0, 0, 1, 1], which entries are missing?

Estimating model with missing observations

Topics we cover in this video:

  • specify the priors for the missing variables using the observed keyword
  • use different names (e.g. capitalized) for the variables with missing values where the missing values are drawn from the prior distribution
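
A hedged sketch of Bayesian imputation: when observed receives a masked array, pymc3 treats the masked entries as parameters and samples them from the prior; the data and priors below are illustrative:

    import numpy as np
    import pymc3 as pm

    x_data = np.ma.masked_invalid(np.array([1.0, np.nan, 2.5, np.nan]))

    with pm.Model() as missing_model:
        mu = pm.Normal('mu', 0, 1)
        sigma = pm.HalfNormal('sigma', 1)
        # capitalized name for the variable with missing values
        X = pm.Normal('X', mu=mu, sigma=sigma, observed=x_data)
        trace = pm.sample()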

Questions you should be able to answer before continuing:

  • summarize the posterior distribution with a table

Bayesian time series

Gaussian process

GP with exponentiated quadratic kernel

Topics we cover in this video:

  • generate our own data with a trend and cyclical component
  • formulate the model with a Gaussian Process:
    • pm.gp.cov.ExpQuad() for the kernel
    • pm.gp.Marginal() for the Gaussian Process
    • .marginal_likelihood() method on the Gaussian Process to link process to the data
  • use pm.ADVI() to approximate the posterior and then .sample() to sample from this posterior
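
A hedged GP sketch; the priors are assumptions and X must be a column of time points:

    import pymc3 as pm

    X = t[:, None]  # t, y: the generated time series

    with pm.Model() as gp_model:
        ell = pm.Gamma('ell', 2, 1)        # length scale
        eta = pm.HalfNormal('eta', 1)      # amplitude
        cov = eta ** 2 * pm.gp.cov.ExpQuad(1, ls=ell)
        gp = pm.gp.Marginal(cov_func=cov)
        sigma = pm.HalfNormal('sigma', 1)  # observation noise
        y_ = gp.marginal_likelihood('y', X=X, y=y, noise=sigma)

        # approximate the posterior with ADVI, then sample from it
        advi = pm.ADVI()
        approx = advi.fit(20000)
        trace = approx.sample(1000)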

Questions you should be able to answer before continuing:

  • use MCMC to sample the posterior; does this work well?
  • generate a different time series and fit the GP to this data using the ADVI sampler

Checking fit

Topics we cover in this video:

  • use of np.c_ as documented here

Questions you should be able to answer before continuing:

  • what do you think of the out-of-sample prediction of this kernel?

GP with periodic kernel

Topics we cover in this video:

  • defining the periodic kernel with
    • pm.gp.cov.Periodic() and pm.gp.Marginal()
    • defining a linear component with pm.gp.cov.Linear()
    • adding two GP's
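
A hedged sketch of these kernel definitions, reusing X and y from the sketch above; the priors are assumptions:

    with pm.Model() as gp_periodic_model:
        eta = pm.HalfNormal('eta', 1)
        ell = pm.Gamma('ell', 2, 1)
        period = pm.Normal('period', 12, 1)  # hypothetical prior on the cycle length
        # periodic component plus a linear trend; two GPs can simply be added
        cov_per = eta ** 2 * pm.gp.cov.Periodic(1, period=period, ls=ell)
        cov_lin = pm.gp.cov.Linear(1, c=0.0)
        gp = pm.gp.Marginal(cov_func=cov_per) + pm.gp.Marginal(cov_func=cov_lin)
        sigma = pm.HalfNormal('sigma', 1)
        y_ = gp.marginal_likelihood('y', X=X, y=y, noise=sigma)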

Questions you should be able to answer before continuing:

  • finish the analysis with the ELBO-plot, sampling from the posterior, in-sample and out-of-sample predictions
  • compare the out-of-sample predictions here with the ones obtained from the exponentiated quadratic kernel: which one performs better in your view?

Covid data

Getting the data from Johns Hopkins University

Topics we cover in this video:

Questions you should be able to answer before continuing:

  • select data from a different country and different time period

Bayesian exponential model

Topics we cover in this video:

  • specifying a non-linear relationship \(y = a(1+b)^t\) to be estimated
  • sample from the prior predictive distribution; these are predictions based on the prior only, without using the data
  • plotting 50 samples from the prior-predictive to check whether the predictions look reasonable
  • use of zip() with two lists to generate "coordinates"
  • plotting the 95% prediction intervals
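
A hedged sketch of the model and the prior predictive step; the priors and the names t and y are assumptions:

    import pymc3 as pm

    with pm.Model() as exp_model:
        a = pm.Normal('a', 100, 50)   # hypothetical prior: initial level
        b = pm.Normal('b', 0.2, 0.1)  # hypothetical prior: growth rate
        sigma = pm.HalfNormal('sigma', 10)
        mu = a * (1 + b) ** t         # y = a(1+b)^t
        obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=y)

        # predictions based only on the priors, before seeing the data
        prior_pred = pm.sample_prior_predictive(50)

    # prior_pred['obs'] has shape (50, len(t)): plot these 50 curves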

Questions you should be able to answer before continuing:

  • draw prior-predictive samples from a model based on \(y = e^{bt}\)
  • create triples (1,4,7), (2,5,8) etc. from the lists [1,2,3], [4,5,6] and [7,8,9]

Bayesian logistic model

Topics we cover in this video:

  • rewrite your model in terms of parameters that simplify the choice of prior distributions for these parameters
  • then use pm.Deterministic() to generate the parameter that you actually need in your model
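
A short sketch of this reparametrization idea; the particular transformation is an assumption:

    with pm.Model() as logistic_model:
        # put the prior on a parameter that is easy to reason about ...
        log_K = pm.Normal('log_K', 10, 1)
        # ... and derive the parameter the model actually needs
        K = pm.Deterministic('K', pm.math.exp(log_K))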

Questions you should be able to answer before continuing:

  • get the trace with pm.sample()
  • check the trace plot and the summary of the trace in a table
  • and posterior predictive distribution with pm.sample_posterior_predictive()
  • plot the original data together with the 95% posterior predictive interval

Bayesian neural network

Generating the data and the grid

Topics we cover in this video:

  • use of make_moons() from sklearn to generate "nice" data for a supervised learning algorithm
  • use pm.floatX() to get data of the correct type for pymc3 (and its theano backbone); don't worry if you do not understand this: usually this step is not necessary
  • use of np.mgrid[] to generate a two dimensional grid of coordinates:
    • np.mgrid[0:1:3j,10:11:4j] generates a \(2 \times 3 \times 4\) tensor of \((x,y)\) coordinates (first dimension),
      • going from 0 to 1 in 3 steps (the imaginary suffix j turns the third slot into a number of steps instead of a step size; to see this, try np.mgrid[0:1:0.25,10:11:4j], where 0.25 denotes the step size)
      • going from 10 to 11 in 4 steps
    • then np.mgrid[0:1:3j,10:11:4j].reshape(2,-1).T
      • generates twelve (x,y) coordinates, starting with (0,10) and going to (1,11)
      • the "-1" in reshape() means: choose whatever dimension is necessary here for the reshape to work
        • to see this, try .reshape(3,-1) and .reshape(5,-1); why does the latter generate an error?
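
The same steps as a runnable sketch:

    import numpy as np

    grid = np.mgrid[0:1:3j, 10:11:4j]  # shape (2, 3, 4)
    grid_2d = grid.reshape(2, -1).T    # twelve (x, y) pairs, from (0, 10) to (1, 11)
    print(grid_2d.shape)               # (12, 2)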

Questions you should be able to answer before continuing:

  • consider the integer pairs \((x,y) = (0,0), (1,0), (0,1), (2,0), \ldots, (4,4)\); show that for 19 such pairs \((x,y)\) it is the case that \(x+y > 2\)

Code for the network

Topics we cover in this video:

  • the model follows a standard pymc3 model, with some additional "bells and whistles" as we are modeling a neural network:
    • the code is mainly provided as illustration; it is not necessary to fully understand the code of the model
    • but the main idea is important: the parameters/weights of a neural network can be thought of as being stochastic and hence we can consider the posterior distribution of the weights and the predictions of the network
    • weights in the first layer weights_in_1 need to "receive" the inputs of the features (columns) in data matrix \(X\) and output to the number of nodes in the first hidden layer. The shape keyword in pm.Normal() takes care of this when defining the prior for these weights
    • testval sometimes helps to get the sampling of the posterior going; it is like providing an initial guess in a function like fsolve()
    • we also specify the weights of the first and second hidden layers
    • we define the activation functions and outcomes for the three layers: act_1, act_2, act_out
    • these activations will be defined in terms of our inputs (data) \(X\)
    • in addition to this, we want to plot the predictions of the network on the grid grid_2d that we defined
    • for this we use pm.Deterministic() to define the probability p, which is again based on the three activation functions but now applied to grid_2d and not to data \(X\)
    • our final classification is based on the classification probability act_out which is based on data \(X\); as there are only two classes (blue and red), this is a Bernoulli process (one draw from a Binomial distribution)
    • theano.shared() makes sure that variables defined outside the pymc3 model, work inside it as well (again, do not worry about this)
  • we use ADVI to sample the posterior as it is faster than MCMC
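
A condensed, hedged version of such a network; the layer sizes, priors and names are assumptions, and it follows the structure described above rather than reproducing the notebook's code:

    import numpy as np
    import pymc3 as pm
    import theano

    n_hidden = 5
    X_shared = theano.shared(X_train)  # features, shape (n, 2)
    y_shared = theano.shared(y_train)  # labels, 0 or 1

    with pm.Model() as neural_network:
        # weights: input -> hidden 1 -> hidden 2 -> output
        w_in_1 = pm.Normal('w_in_1', 0, 1, shape=(2, n_hidden),
                           testval=np.random.randn(2, n_hidden))
        w_1_2 = pm.Normal('w_1_2', 0, 1, shape=(n_hidden, n_hidden),
                          testval=np.random.randn(n_hidden, n_hidden))
        w_2_out = pm.Normal('w_2_out', 0, 1, shape=(n_hidden,),
                            testval=np.random.randn(n_hidden))

        # activations defined in terms of the data X
        act_1 = pm.math.tanh(pm.math.dot(X_shared, w_in_1))
        act_2 = pm.math.tanh(pm.math.dot(act_1, w_1_2))
        act_out = pm.math.sigmoid(pm.math.dot(act_2, w_2_out))

        # the same network applied to the plotting grid
        g_1 = pm.math.tanh(pm.math.dot(grid_2d, w_in_1))
        g_2 = pm.math.tanh(pm.math.dot(g_1, w_1_2))
        p = pm.Deterministic('p', pm.math.sigmoid(pm.math.dot(g_2, w_2_out)))

        # two classes, hence a Bernoulli likelihood
        out = pm.Bernoulli('out', act_out, observed=y_shared)

        # ADVI is much faster than MCMC for this many parameters
        approx = pm.fit(n=30000, method='advi')
        trace = approx.sample(500)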

Questions you should be able to answer before continuing:

  • what is the difference between A*B and pm.math.dot(A,B) for two tensors A and B? [hint: how can you find out what pm.math.dot() does?]

Evaluating the estimated network

Topics we cover in this video:

  • we plot the ELBO to see whether the optimization step in ADVI converged; we get the "inverse J" shape that we would expect
  • since we draw weights from our posterior, there is not one decision boundary, but we have a number of these; we plot the first 50 of these boundaries
    • the boundary is the contour ("iso probability line") corresponding to \(p=0.5\)
  • this shows that in some regions we are pretty sure where the boundary lies, while in other regions the boundary varies more across posterior samples

Questions you should be able to answer before continuing:

  • just for illustrative purposes: plot the 0.4 and 0.6 contours; we will use color below to plot "all contours" at once

Further graphical illustrations of the estimated network

Topics we cover in this video:

  • using a color map, we plot all probability contours (iso probability lines) at the same time
  • hence, we can say things like: for this point we are 70% sure that the label is "red" where we take the average probability across posterior samples
  • to get an idea of the uncertainty across samples, we plot the standard deviation of the probabilities across posterior samples

Questions you should be able to answer before continuing:

  • in this case, the contour plots of the means \(p\) across posterior samples and of the standard deviations across samples look very similar. Explain why this is [hint: consider the expression for the standard deviation of a binomial distribution]

What is a confidence interval

Topics we cover in this video:

  • we use simulations to show what a 95% confidence interval and a credible interval are
  • then we check that the relevant parameter lies in the interval 95% of the time
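
A hedged sketch of the simulation with \(\sigma\) known; the parameter values are assumptions:

    import numpy as np

    mu, sigma, n_sample, repetitions = 0.0, 1.0, 100, 10000
    hits = 0
    for _ in range(repetitions):
        sample = np.random.normal(mu, sigma, n_sample)
        se = sigma / np.sqrt(n_sample)  # sigma is known here
        hits += (sample.mean() - 1.96 * se <= mu <= sample.mean() + 1.96 * se)
    print(hits / repetitions)  # should be close to 0.95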

Questions you should be able to answer before continuing:

  • do the analysis with different values of n_sample, precision; for very small values of precision you may have to increase repetitions
  • if \(\sigma\) is not known, we need to estimate it from the sample; in that case, we need to use the \(t\)-distribution to calculate the confidence interval (instead of a normal distribution). If necessary use the example here and program this case to check that the confidence interval contains \(\mu\) in 95% of the cases.

Author: Jan Boone