Screencasts for Seminar Datascience for Economics Part 2
Under construction
This page contains the screencasts for the second part of the course Seminar Datascience for Economics.
The videos go through the questions in this notebook.
Dealing with data
mortality
Topics we cover in this video:
- use `eurostat.get_data_df()` to download variables directly into a `pandas` dataframe
- use the `.unique()` method to find the values of a variable
- `.rename()` to give a column a different name
- `.replace()` to change the (row) values of a column
- note that eurostat has changed the format of its data since the video was made
  - the column that needs to be renamed is now denoted by `geo\TIME_PERIOD` instead of `geo\time` as in the video
- use `pd.melt()` to "change columns into rows"
  - also here the eurostat interface has changed: the year columns are strings (these were integers when the video was recorded)
  - this leads to 2 changes:
    - in the `melt` statement we use `value_vars = [str(y) for y in np.arange(2011,2018)]` as we need to refer to strings like "2011"
    - when we want to refer to year as an integer we need: `df['year'] = df.year.astype('int')`
- use `.set_index()` together with `.unstack()` to get variables in rows "into different columns"
- `reset_index()` removes the index hierarchy to get a dataframe without an index
- and then rename the columns with `df.columns = [' '.join(col).strip() for col in df.columns.values]` (a sketch of these steps follows the list)
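A minimal sketch of this pipeline, assuming the current eurostat layout; the table code and the id columns here are placeholders, not necessarily the ones used in the notebook:

```python
import eurostat
import numpy as np
import pandas as pd

df = eurostat.get_data_df('hlth_cd_asdr2')                    # placeholder table code
df.rename({'geo\\TIME_PERIOD': 'geo'}, inplace=True, axis=1)

# from wide (one column per year) to long (one row per unit-year)
df = pd.melt(df,
             id_vars=['geo', 'sex'],                          # placeholder id columns
             value_vars=[str(y) for y in np.arange(2011, 2018)],
             var_name='year', value_name='mortality')
df['year'] = df.year.astype('int')

# spread `sex` back into columns and flatten the resulting column index
df = df.set_index(['geo', 'year', 'sex']).unstack()
df = df.reset_index()
df.columns = [' '.join(col).strip() for col in df.columns.values]
```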
Questions you should be able to answer before continuing:
- consider the following two statements, where we replace/rename a label:

```python
df['country'] = df['geo'].replace(eurostat_dictionary)
df.rename({'geo\\TIME_PERIOD': 'geo'}, inplace=True, axis=1)
```

- why does the second statement use `inplace = True` and the first does not?
GDP
Topics we cover in this video:
- same as above to create the dataframe for GDP
- we call this dataframe `df_n` in order to be able to merge this data in the next step
- `merge` statement to combine this GDP data with the mortality data above (see the sketch below)
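A sketch of the merge step; the key columns here are assumptions:

```python
# combine the GDP data with the mortality data, matching rows on country and year
df = df.merge(df_n, on=['country', 'year'], how='inner')
```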
Questions you should be able to answer before continuing:
- why do we only need a `melt` statement for GDP and not `unstack`?
Income quantiles
Topics we cover in this video:
- same as above
Questions you should be able to answer before continuing:
- after merging the income quantiles data with the dataframe created above, in the video we get the columns `year_x` and `year_y`. Rename `year_x` to `year` and delete the column `year_y`
- [advanced] after we merged the GDP and mortality data above, we did not get columns `year_x` and `year_y`: why this difference with the merge in this video?
- what could we have done differently in the video to avoid the `year_x` and `year_y` columns? (see the illustration below)
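As a hint for the last question, a toy illustration of how `merge` creates these suffixes:

```python
import pandas as pd

left = pd.DataFrame({'country': ['NL'], 'year': [2014], 'gdp': [1.0]})
right = pd.DataFrame({'country': ['NL'], 'year': [2014], 'income': [2.0]})

pd.merge(left, right, on='country')            # year occurs in both frames: year_x and year_y
pd.merge(left, right, on=['country', 'year'])  # a single year column
```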
lifestyle variables
Topics we cover in this video:
- downloading the data for BMI and alcohol consumption is similar to what we did before: the columns that we want to "turn into rows" are years
  - hence we can use `np.arange(2011,2021)` or even simpler `['2014']` (if we only want one year) in the `melt` statement
  - in the video, year is an integer and we use `[2014]`
- at the time of making the video, the Eurostat data for smoking has the country abbreviations as columns that we want to "turn into rows"
  - here we use `df_n.columns[4:]` in the `melt` statement
  - nowadays the structure of the lifestyle data is similar to what we saw above, with the column `geo\TIME_PERIOD`; the `melt` statement should now use `value_vars = ['2014']` (see the sketch below)
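A minimal sketch of the single-year `melt` under the current layout; `df_n` and the id columns are placeholders:

```python
# keep one year ('2014') and turn the year column into rows
df_bmi = pd.melt(df_n,
                 id_vars=['geo', 'sex', 'bmi'],   # placeholder id columns
                 value_vars=['2014'],
                 var_name='year', value_name='share')
df_bmi['year'] = df_bmi.year.astype('int')
```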
Questions you should be able to answer before continuing:
- download the alcohol data on your own
- merge the lifestyle variables into the dataframe `df` with all the variables we downloaded above
- save your dataframe `df` into the folder `data` (create this folder if you do not have it yet): `df.to_csv('./data/health_data.csv')`
- we will use this data below.
Bayesian statistics
Introduction
Topics we cover in this video:
- Bayes' Theorem: \(P(A|B) = \frac{P(A \text{ and } B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}\)
- we will want to estimate parameters \(\theta\) for a model; the posterior distribution for \(\theta\) is then given by: \(P(\theta | \text{data}) = \frac{P(\text{data}|\theta)P(\theta)}{P(\text{data})}\)
- below we will use `pymc` (imported as `pm`) to derive the posterior distribution \(P(\theta | \text{data})\): the probability distribution of the parameters \(\theta\) given the data that we have
- we use Bayes' theorem to find the probability that someone is a vampire, conditional on testing positive (a numerical version follows below)
- note that the video uses the older library `pymc3`, but that will give an error on the university jupyterlab server
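A small numerical version of the vampire example; the test characteristics below are made-up numbers for illustration:

```python
# Bayes' theorem for the vampire test
p_pos_vampire = 0.95    # P(positive | vampire): sensitivity of the test
p_pos_mortal = 0.01     # P(positive | mortal): false-positive rate
p_vampire = 0.001       # P(vampire): the prior

p_pos = p_pos_vampire * p_vampire + p_pos_mortal * (1 - p_vampire)
p_vampire_pos = p_pos_vampire * p_vampire / p_pos   # P(vampire | positive)
print(p_vampire_pos)    # roughly 0.087: even after a positive test, a vampire is unlikely
```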
Questions you should be able to answer before continuing:
- suppose you throw a coin \(n\) times and \(p\) is the probability of "head"; what is the probability distribution of \(y\) (number of heads in \(n\) throws) called?
Simple Bayesian model
Topics we cover in this video:
- binomial distribution
- we have data on an experiment where a coin is thrown up to 20 times; we see how the posterior evolves after 1, 2, 3, 4, 5 and 20 throws
- do not worry about the `pymc` syntax yet (this will be explained in later videos)
- focus on the figure showing how the posterior distribution for `p` evolves and try to understand how different values for \(y\) affect this distribution (a minimal version of the model is sketched below)
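A minimal sketch of such a model in `pymc`, with an assumed outcome of 14 heads in 20 throws:

```python
import pymc as pm

n, y = 20, 14                                         # assumed data
with pm.Model() as coin_model:
    p = pm.Uniform('p', 0, 1)                         # prior on probability of heads
    heads = pm.Binomial('heads', n=n, p=p, observed=y)
    trace = pm.sample()

# posterior probability that p exceeds 0.8
print((trace.posterior['p'] > 0.8).mean().item())
```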
Questions you should be able to answer before continuing:
- choose different values for `y_values` and see how this affects the development of the posterior distribution
- calculate for this posterior the probability that \(p>0.8\) after 20 throws with the \(y\) value you have chosen here
  - is it bigger or smaller than the probability we find in the video? Why?
Estimating a Bayesian model
formulating the model
Topics we cover in this video:
- start a `pymc` model as follows: `with pm.Model() as model_name:`
- then give the priors for the parameters of the model, e.g. `slope = pm.Normal('slope', 0, 1)`
- link the model to the observed variable in the data, here `df.y` (column `y` from dataframe `df`): `obs = pm.Normal('obs', mu = mu, sigma = sigma, observed = df.y)` (all steps are combined in the sketch below)
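These steps combined in one sketch; the priors and the feature column `df.x` are assumptions:

```python
import pymc as pm

with pm.Model() as model_name:
    constant = pm.Normal('constant', 0, 1)   # priors for the parameters
    slope = pm.Normal('slope', 0, 1)
    sigma = pm.HalfNormal('sigma', 1)
    mu = constant + slope * df.x.values      # linear relation between x and y
    obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=df.y)
```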
Questions you should be able to answer before continuing:
- use `pm.sample?` to see how you can modify the number of draws from the posterior distribution
sampling from the model
Topics we cover in this video:
- sample from the posterior with `trace = pm.sample()`
- the posterior predictive gives for each observation ('row') in your data the prediction of the 'y' variable for this observation
  - this is useful to check the fit of the model
- use `pm.sample_posterior_predictive(trace, extend_inferencedata = True)` to generate the posterior predictive (see the sketch below)
  - in the older `pymc3` syntax, the statement was: `pm.sample_posterior_predictive(trace, var_names = ['obs'])`
  - to capture the trace and the posterior predictive into one we used `az.from_pymc3()` in the video, but this can now be achieved with `extend_inferencedata = True`
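In the new syntax the two steps look like this, continuing the sketch above:

```python
with model_name:
    trace = pm.sample()
    pm.sample_posterior_predictive(trace, extend_inferencedata=True)
# trace now contains both the posterior and the posterior predictive
```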
Questions you should be able to answer before continuing:
- choose different parameter values in the function `generating_data()` and see how the posterior distributions for the parameters change
  - what happens to the posterior distributions for the parameters when you increase the number of observations `n`?
  - what happens if the parameter values chosen do not overlap with the priors specified in the model?
- explain the three characteristics of the trace plots that need to be satisfied to trust the sampling process
- what is "MCMC" the abbreviation of?
summarizing posterior distribution and model fit
Topics we cover in this video:
- use of `az.summary()` to summarize the posterior in a table
- the interpretation of `r_hat` in this table
- plot the original data and the 95% intervals to judge the fit of the model (see the sketch below)
  - use of `np.percentile()` to find the percentiles
  - `matplotlib.pyplot`'s `vlines` to plot the percentiles as vertical lines
  - note that with the new `pymc` syntax, the `posterior_predictive` is now part of `trace` because we said `extend_inferencedata = True` in `pm.sample_posterior_predictive()`
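A sketch of the fit check, assuming the model above with observed variable 'obs' and feature `df.x`:

```python
import arviz as az
import matplotlib.pyplot as plt
import numpy as np

az.summary(trace)  # means, hdi_3%, hdi_97%, r_hat per parameter

# 95% interval of the posterior predictive per observation
ppc = trace.posterior_predictive['obs'].values   # shape: (chains, draws, n_obs)
ppc = ppc.reshape(-1, ppc.shape[-1])             # stack chains and draws
low, high = np.percentile(ppc, [2.5, 97.5], axis=0)
plt.vlines(df.x, low, high, alpha=0.2)
plt.scatter(df.x, df.y)
```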
Questions you should be able to answer before continuing:
- what do `hdi_3%` and `hdi_97%` mean?
- what does `alpha = 0.2` mean in a plot statement?
Eurostat data
Topics we cover in this video:
- loading the data with `pandas`' `read_csv()`
- plotting the data with a `matplotlib` color map (sketched below)
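A sketch of both steps; the column names are assumptions:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('./data/health_data.csv')

# one color per country, taken from a matplotlib colormap
cmap = plt.get_cmap('viridis')
countries = df.country.unique()
for i, country in enumerate(countries):
    subset = df[df.country == country]
    plt.plot(subset.year, subset.mortality, color=cmap(i / len(countries)))
```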
Income inequality and standardizing data
Topics we cover in this video:
- difference between `np.log(df[income_quantiles].mean(axis = 1))` and `(np.log(df[income_quantiles])).mean(axis = 1)`, i.e. the order of taking the logarithm and the mean of the quantiles
- using `dropna()` to delete observations with missing values (which is, actually, a bad idea)
- defining a function `standardize()` which subtracts the mean of a variable and divides by the standard deviation, and then applying this function to the variables in the dataframe (see the sketch below)
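A sketch of the standardization step; how the function is applied to the columns here is an assumption:

```python
import numpy as np

def standardize(x):
    """Subtract the mean, divide by the standard deviation."""
    return (x - x.mean()) / x.std()

# order matters: the mean of logs is not the log of the mean
mean_of_logs = np.log(df[income_quantiles]).mean(axis=1)
log_of_mean = np.log(df[income_quantiles].mean(axis=1))

df['log_income'] = standardize(log_of_mean)   # hypothetical application
```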
Questions you should be able to answer before continuing:
- what are advantages of standardizing your data?
Bayesian Model
Topics we cover in this video:
- we build a model to explain how log-mortality varies between countries and genders
- we use 'mortality' instead of 'obs' when linking the model to the observed (standardized) variable `log_mortality`
- in the old `pymc3` syntax, you could use the keyword `sd` to specify the standard deviation of a normal distribution. In `pymc` the keyword is `sigma`, but you do not actually need to specify the keyword. Thus `constant = pm.Normal('constant', 0.0, 1.0)` is the same as `constant = pm.Normal('constant', mu = 0.0, sigma = 1.0)`.
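A compressed sketch of such a model; apart from `variables_health` and the name 'mortality', the priors and variable names are assumptions:

```python
with pm.Model() as mortality_model:
    constant = pm.Normal('constant', 0.0, 1.0)
    beta = pm.Normal('beta', 0.0, 1.0, shape=len(variables_health))  # one slope per health variable
    sigma = pm.HalfNormal('sigma', 1.0)
    mu = constant + pm.math.dot(df[variables_health].values, beta)
    mortality = pm.Normal('mortality', mu=mu, sigma=sigma, observed=df.log_mortality)
```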
Questions you should be able to answer before continuing:
- explain why some priors are based on a normal distribution and others on a half-normal distribution
- use `pm.sample?` or use google to find out what the `tune` keyword does
Checking model
Topics we cover in this video:
- plot the original data together with the 95% prediction interval for log mortality
Questions you should be able to answer before continuing:
- check the trace plots for the coefficients in `variables_health`
- when variables are standardized, what does a mean coefficient of, say, `-0.802` mean?
- re-estimate the model with preventable mortality as the variable to be explained and repeat all the steps above.
Interactive plot with Altair
Topics we cover in this video:
- make an interactive plot with `altair`'s `Chart(df)` method combined with `.encode()` and `.interactive()`
- define a year selection with `alt.selection_single()` (see the sketch below)
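A minimal sketch; the column names and the slider range are assumptions:

```python
import altair as alt

# a slider selects the year to display
select_year = alt.selection_single(fields=['year'],
                                   bind=alt.binding_range(min=2011, max=2017, step=1),
                                   init={'year': 2014})
alt.Chart(df).mark_point().encode(
    x='log_gdp',
    y='log_mortality',
    color='country',
    tooltip='country'
).add_selection(select_year).transform_filter(select_year).interactive()
```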
Questions you should be able to answer before continuing:
- make the same interactive plot with `inequality` on the horizontal axis
Dealing with missing data the Bayesian way
Masked arrays
Topics we cover in this video:
- creating a masked array with `np.ma.array()`
- if `x` is a masked array, use `x.compressed()` to get a readable representation
- use `np.ma.masked_invalid()` to turn an existing array into a masked array (examples below)
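The three operations in a small example:

```python
import numpy as np

x = np.ma.array([1.0, 2.0, 3.0, 4.0], mask=[0, 0, 1, 1])  # last two entries masked
print(x.compressed())          # [1. 2.] -- only the non-masked entries

y = np.ma.masked_invalid([1.0, np.nan, 3.0])   # mask entries that are nan or inf
```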
Questions you should be able to answer before continuing:
- if a masked array has 4 entries and `mask = [0, 0, 1, 1]`, which entries are missing?
Estimating model with missing observations
Topics we cover in this video:
- specify the priors for the missing variables using the `observed` keyword
- use different names (e.g. capitalized) for the variables with missing values, where the missing values are drawn from the prior distribution (see the sketch below)
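A minimal sketch, assuming a column `alcohol` with missing values coded as `NaN`:

```python
# when `observed` is a masked array, pymc treats the masked entries as
# unknowns and samples them from the prior given here
with pm.Model() as missing_model:
    mu = pm.Normal('mu', 0, 1)
    sigma = pm.HalfNormal('sigma', 1)
    Alcohol = pm.Normal('Alcohol', mu, sigma,
                        observed=np.ma.masked_invalid(df.alcohol.values))
```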
Questions you should be able to answer before continuing:
- summarize the posterior distribution with a table
Bayesian time series
Covid data
Getting the data from Johns Hopkins University
Topics we cover in this video:
- use of the pandas API to the Johns Hopkins covid data (a sketch follows)
- selecting data from the Netherlands
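A sketch of one way to load the data; the notebook may take a slightly different route to the same csv files:

```python
import pandas as pd

# the JHU CSSE repository serves the time series as csv files
url = ('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/'
       'csse_covid_19_data/csse_covid_19_time_series/'
       'time_series_covid19_confirmed_global.csv')
covid = pd.read_csv(url)
nl = covid[covid['Country/Region'] == 'Netherlands']
```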
Questions you should be able to answer before continuing:
- select data from a different country and different time period
Bayesian exponential model
Topics we cover in this video:
- specifying a non-linear relationship \(y = a(1+b)^t\) to be estimated
- sample from the prior predictive distribution; these are predictions based on the prior only, without using the data
- plotting 50 samples from the prior-predictive to check whether the predictions look reasonable
- use of `zip()` with two lists to generate "coordinates" (a `zip()` example follows the model sketch below)
- plotting the 95% prediction intervals
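A sketch of the model and the prior predictive; the priors, and the stand-in data `t` and `cases`, are assumptions:

```python
import numpy as np
import pymc as pm

t = np.arange(50)          # day index
cases = 100 * 1.2 ** t     # stand-in for the observed case counts

with pm.Model() as exp_model:
    a = pm.Normal('a', 100, 25)      # assumed prior: initial number of cases
    b = pm.Normal('b', 0.2, 0.05)    # assumed prior: daily growth rate
    sigma = pm.HalfNormal('sigma', 10)
    mu = a * (1 + b) ** t
    obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=cases)
    prior_pred = pm.sample_prior_predictive(50)   # 50 draws from the prior predictive
```

The `zip()` trick pairs the elements of two lists into coordinate tuples:

```python
xs, ys = [1, 2, 3], [4, 5, 6]
list(zip(xs, ys))   # [(1, 4), (2, 5), (3, 6)]
```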
Questions you should be able to answer before continuing:
- draw prior-predictive samples from a model based on \(y = e^{bt}\)
- create triples `(1,4,7), (2,5,8)` etc. from the lists `[1,2,3]`, `[4,5,6]` and `[7,8,9]`
Bayesian logistic model
Topics we cover in this video:
- rewrite your model in terms of parameters that simplify the choice of prior distributions for these parameters
- then use `pm.Deterministic()` to generate the parameter that you actually need in your model
- here we give the `posterior_predictive` a new name, instead of adding it to the trace. Both options are possible in `pymc` (see the sketch below)
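A sketch, reusing `t` and `cases` from the previous sketch; the parametrization and the priors are assumptions:

```python
import pymc as pm

# priors are set on interpretable quantities, and pm.Deterministic()
# builds the curve the model actually needs
with pm.Model() as logistic_model:
    K = pm.Normal('K', 60_000, 20_000)   # assumed prior: final size of the epidemic
    r = pm.HalfNormal('r', 0.5)          # growth rate
    t0 = pm.Normal('t0', 30, 10)         # inflection point (days)
    sigma = pm.HalfNormal('sigma', 1_000)
    mu = pm.Deterministic('mu', K / (1 + pm.math.exp(-r * (t - t0))))
    obs = pm.Normal('obs', mu=mu, sigma=sigma, observed=cases)

    trace = pm.sample()
    posterior_pred = pm.sample_posterior_predictive(trace)  # kept under its own name
```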
Questions you should be able to answer before continuing:
- get the trace with `pm.sample()`
  - check the trace plot and the summary of the trace in a table
- get the posterior predictive distribution with `pm.sample_posterior_predictive()`
- plot the original data together with the 95% posterior predictive interval
Bayesian neural network
Generating the data and the grid
Topics we cover in this video:
- use of `make_moons()` from `sklearn` to generate "nice" data for a supervised learning algorithm
- use `pm.floatX()` to get data of the correct type for `pymc` (and its `pytensor` backbone); don't worry if you do not understand this, usually this step is not necessary
- use of `np.mgrid[]` to generate a two-dimensional grid of coordinates: `np.mgrid[0:1:3j,10:11:4j]` generates a \(2 \times 3 \times 4\) tensor of \((x,y)\) coordinates (first dimension),
  - going from 0 to 1 in 3 steps (the complex number `j` translates into number of steps instead of step size; to see this, try `np.mgrid[0:1:0.25,10:11:4j]` where 0.25 denotes the step size)
  - going from 10 to 11 in 4 steps
- then `np.mgrid[0:1:3j,10:11:4j].reshape(2,-1).T` generates twelve (x,y) coordinates, starting with (0,10) and going to (1,11) (see the sketch below)
  - the "-1" in `reshape()` means: choose whatever dimension is necessary here for the reshape to work
    - to see this, try `.reshape(3,-1)` and `.reshape(5,-1)`; why does the latter generate an error?
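Both steps in a short example:

```python
import numpy as np
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.2)   # two interleaved half-circles

grid = np.mgrid[0:1:3j, 10:11:4j]   # shape (2, 3, 4)
grid_2d = grid.reshape(2, -1).T     # twelve (x, y) coordinates, (0, 10) ... (1, 11)
```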
Questions you should be able to answer before continuing:
- consider the integer pairs \((x,y) = (0,0), (1,0), (0,1), (2,0), \ldots, (4,4)\); show that for 19 such pairs \((x,y)\) it is the case that \(x+y > 2\)
Code for the network
Topics we cover in this video:
- the model follows a standard `pymc` model, with some additional "bells and whistles" as we are modeling a neural network:
  - the code is mainly provided as illustration; it is not necessary to fully understand the code of the model
  - but the main idea is important: the parameters/weights of a neural network can be thought of as being stochastic and hence we can consider the posterior distribution of the weights and the predictions of the network
- weights in the first layer `weights_in_1` need to "receive" the inputs of the features (columns) in data matrix \(X\) and output to the number of nodes in the first hidden layer. The `shape` keyword in `pm.Normal()` takes care of this when defining the prior for these weights
- `testval` sometimes helps to get the sampling of the posterior going; it is like providing an initial guess in a function like `fsolve()`; it is used in the video, but not (anymore) in the code in the notebook
- we also specify the weights of the first and second hidden layers
- we define the activation functions and outcomes for the three layers: `act_1, act_2, act_out`
  - these activations will be defined in terms of our inputs (data) \(X\)
- in addition to this, we want to plot the predictions of the network on the grid `grid_2d` that we defined
  - for this we use `pm.Deterministic()` to define the probability `p`, which is again based on the three activation functions but now applied to `grid_2d` and not to data \(X\)
- our final classification is based on the classification probability `act_out`, which is based on data \(X\); as there are only two classes (blue and red), this is a Bernoulli process (one draw from a Binomial distribution)
- `pytensor.shared()` makes sure that variables defined outside the `pymc` model work inside it as well (again, do not worry about this)
- in the video we used ADVI to sample the posterior as it is faster than MCMC; the new version of `pymc` allows us to use the `numpyro` sampler which is also fast. If this sampler is not available, just use `pm.sample()` (it will take a bit more time). A compressed sketch of such a network follows below.
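A compressed sketch of such a network, reusing `X`, `y` and `grid_2d` from the sketch above; the layer sizes and priors are assumptions:

```python
import pymc as pm

n_hidden = 5   # assumed width of the hidden layers

with pm.Model() as neural_network:
    # priors for the weights; `shape` wires input size to layer size
    weights_in_1 = pm.Normal('w_in_1', 0, 1, shape=(X.shape[1], n_hidden))
    weights_1_2 = pm.Normal('w_1_2', 0, 1, shape=(n_hidden, n_hidden))
    weights_2_out = pm.Normal('w_2_out', 0, 1, shape=(n_hidden,))

    # activations for the data X ...
    act_1 = pm.math.tanh(pm.math.dot(X, weights_in_1))
    act_2 = pm.math.tanh(pm.math.dot(act_1, weights_1_2))
    act_out = pm.math.sigmoid(pm.math.dot(act_2, weights_2_out))

    # ... and the same network evaluated on the plotting grid
    g_1 = pm.math.tanh(pm.math.dot(grid_2d, weights_in_1))
    g_2 = pm.math.tanh(pm.math.dot(g_1, weights_1_2))
    p = pm.Deterministic('p', pm.math.sigmoid(pm.math.dot(g_2, weights_2_out)))

    out = pm.Bernoulli('out', act_out, observed=y)   # two classes: Bernoulli
```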
Questions you should be able to answer before continuing:
- what is the difference between `A*B` and `pm.math.dot(A,B)` for two tensors `A` and `B`? [hint: how can you find out what `pm.math.dot()` does?]
Evaluating the estimated network
Topics we cover in this video:
- we plot the ELBO to see whether the optimization step in ADVI converged; we get the "inverse J" shape that we would expect
  - with the new version of `pymc` we do not use ADVI here and hence do not need to check the ELBO plot
- since we draw weights from our posterior, there is not one decision boundary, but we have a number of these; we plot the first 50 of these boundaries (sketched below)
  - the boundary is the contour ("iso-probability line") corresponding to \(p=0.5\)
  - this shows that in some regions we are pretty sure where the boundary lies, while in other regions the boundary varies more across posterior samples
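A sketch of the boundary plot; it assumes the network above was built with a fine grid and its posterior was sampled with `trace = pm.sample()`:

```python
import matplotlib.pyplot as plt
import numpy as np

# the fine grid used for `grid_2d` in the model (range is an assumption):
xx, yy = np.mgrid[-2:3:100j, -1.5:2:100j]
# grid_2d = np.mgrid[-2:3:100j, -1.5:2:100j].reshape(2, -1).T

# one decision boundary per posterior sample: the p = 0.5 contour
p_samples = trace.posterior['p'].stack(sample=('chain', 'draw')).values
for i in range(50):   # the first 50 boundaries
    plt.contour(xx, yy, p_samples[:, i].reshape(xx.shape),
                levels=[0.5], colors='k', alpha=0.2)
plt.scatter(X[:, 0], X[:, 1], c=y)
```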
Questions you should be able to answer before continuing:
- just for illustrative purposes: plot the 0.4 and 0.6 contours
What is a confidence interval
Topics we cover in this video:
- we use simulations to show what a 95% confidence interval and a credible interval are
- then we check that the relevant parameter lies in the interval 95% of the time (see the sketch below)
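A sketch of the simulation for the known-\(\sigma\) case; the variable names only roughly mirror the notebook's:

```python
import numpy as np

# does a 95% confidence interval for the mean (known sigma) cover mu
# in 95% of repeated samples?
mu, sigma, n_sample, repetitions = 0.0, 1.0, 50, 10_000
rng = np.random.default_rng(0)

hits = 0
for _ in range(repetitions):
    sample = rng.normal(mu, sigma, n_sample)
    half_width = 1.96 * sigma / np.sqrt(n_sample)   # 95% interval, known sigma
    hits += abs(sample.mean() - mu) <= half_width
print(hits / repetitions)   # close to 0.95
```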
Questions you should be able to answer before continuing:
- do the analysis with different values of `n_sample, precision`; for very small values of `precision` you may have to increase `repetitions`
- if \(\sigma\) is not known, we need to estimate it from the sample; in that case, we need to use the \(t\)-distribution to calculate the confidence interval (instead of a normal distribution). If necessary, use the example here and program this case to check that the confidence interval contains \(\mu\) in 95% of the cases.