Screencasts for Seminar Datascience for Economics
Under construction
This page contains the screencasts for the first part of the course Seminar Datascience for Economics.
How are the screencasts supposed to help?
For the first part of the course we use this notebook. The notebook is fairly self-explanatory, but you may get stuck with some of the questions. The screencasts are meant to help you along if you do not know how to proceed. The idea is similar to what we did in Methods: Python programming for economists
Why are there screencasts?
The screencasts complement the lectures and classes. For some programming problems you need to take time to try to solve them yourself. We cannot do this in class, as some students solve them quickly (and get bored) while others need more time to solve them by themselves.
Suggested use of the screencasts
Just watching the screencasts is not going to be very useful for you. The best way to use them is as follows:
- first read the relevant section in the notebook and try to solve the questions by yourself
- if you manage to do so, you are done and there is no need to watch the screencast; if you do not manage to answer all the questions, you have the background for the screencast (which is rather short and does not motivate the underlying economic problem) and you know what problems to look for in the video
- type along with what we do in the screencast. Pause the video if you need more time to type
- experiment with the code that we give you in the video either during the video or after watching it; try different numerical values, use different syntax or a different library; if you get an error, google it to see what is going on etc.
- in general, play around with the python code; that is the best way to understand and learn the language
With screencasts there is –obviously– no interaction between teachers and students. This is a problem when you run into an error message. In class students then immediately raise their hand to understand what is happening.
As this is not possible with a screencast, here are some tips on how to deal with error messages. Also note that the screencasts are typed "live", so there will be error messages in the videos as well and you can see how we deal with these as we go along.
Python error messages:
- start reading the error message at the end; hence scroll down in your notebook to see the last line of the error message. This will usually give a clue of what is happening
- if you do not know what the last line means, use copy/paste to google it or go to stackoverflow directly
- reading "upwards" from the last line, you can trace how the error was generated
Prerequisite knowledge
Economics/Econometrics: this introduction to datascience is part of an MSc degree in Economics. Hence, we assume that you know what OLS and standard errors are, understand the difference between correlation and causality, and have seen instrumental variables before. In terms of math, you can do some basic linear algebra like multiplying a matrix and a vector and know what a (partial) derivative of a function is.
Python: you have done Methods: Python programming for economists.
Libraries we use
- numpy: basic number crunching and vector manipulation
- pandas: to work with dataframes
- statsmodels: to do OLS
- pymc: to generate random numbers and do Bayesian analysis
- tensorflow (2.0 or later): we use keras for our neural networks
- scipy: scientific python
- matplotlib: to make plots
Compare jupyter notebook/lab and emacs
Why am I using emacs
You know how to work with JupyterLab on the university servers (or on your own laptop). As with Methods: Python programming for economists, most screencasts will be made with Emacs.
Distributions of an estimator
distribution of sample average and standard deviation
generating data
Topics we cover in this video:
- some estimators have known (analytical) distributions
- this is usually not the case with advanced models in datascience
- hence we need to know how to simulate such distributions
- we first simulate the distribution of a sample average \(m\)
- we use the code tf.random.uniform([N_simulations,sample_size],0,1) to generate N_simulations times a sample of size sample_size
- this results in a dataset with N_simulations rows and sample_size columns
- use tf.random.uniform? to see how this function can be used
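A minimal sketch of this data-generating step. It assumes NumPy's random generator as a stand-in for tf.random.uniform, and the values of N_simulations and sample_size are illustrative, not the ones used in the video.

```python
import numpy as np

# Assumption: NumPy's Generator.uniform stands in for tf.random.uniform.
rng = np.random.default_rng(seed=0)

N_simulations = 1000   # illustrative value
sample_size = 10       # illustrative value

# One row per simulated sample, one column per draw within that sample.
data = rng.uniform(0, 1, size=(N_simulations, sample_size))

print(data.shape)
```

Each row of data is one simulated sample, so statistics computed along axis 1 give one value per simulation.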
Questions you should be able to answer before continuing:
- generate data that is drawn from a normal distribution with mean 0 and standard deviation 1
distribution sample mean
Topics we cover in this video:
- calculating the mean across an axis with np.mean(data, axis = 1)
- making a histogram with matplotlib's hist function; setting density and bins
- adding title and labels on the axes
Questions you should be able to answer before continuing:
- change the values for N_simulations and sample_size and plot the histogram of \(m\); what do you find? How does, for instance, the standard deviation in the histogram vary with these parameters?
standard deviation of \(m\)
Topics we cover in this video:
- we show how the standard deviation of \(m\) varies with the sample size \(n\)
- we do this by creating an empty vector vector_std and for each sample size we calculate the standard deviation of \(m\) (across our N_simulations) and then add this to the vector with vector_std.append()
- using plt.scatter we plot the sample size against this standard deviation
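The loop described above can be sketched as follows. This is an illustration under assumptions: NumPy replaces tensorflow for the random draws, the list of sample sizes is made up, and the plt.scatter call is left out so the block runs without a plotting backend.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
N_simulations = 2000
sample_sizes = [5, 10, 50, 100, 500]   # illustrative choices

vector_std = []
for n in sample_sizes:
    data = rng.uniform(0, 1, size=(N_simulations, n))
    m = np.mean(data, axis=1)          # one sample average per simulation
    vector_std.append(np.std(m))       # spread of m across the simulations

for n, s in zip(sample_sizes, vector_std):
    print(n, round(float(s), 4))
```

Passing sample_sizes and vector_std to plt.scatter would give the plot discussed in the video.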
Questions you should be able to answer before continuing:
- what do we usually call "the standard deviation of \(m\)"?
- what is the analytical expression for the relation between sample size \(n\) and the standard deviation of \(m\)?
- to find this relation, you can check wikipedia to find the expression for the variance \(V(x)\)
- or to numerically approximate it, you can use np.std(tf.random.uniform([10000],0,1))
- use python to check that both these approaches give the same answer and explain why
distribution of \(s\)
Topics we cover in this video:
- we calculate the standard deviation of the sample (of 10 draws) across columns with np.std(data, axis = 1)
- we plot this distribution of N_simulations values of \(s\) with plt.hist()
- then we use a boolean expression to calculate the probability that \(s\) is below 0.20
- since True is represented in python by 1 (and False by 0), np.sum(data_std < 0.20) gives the number of cases (out of N_simulations) where the standard deviation is below 0.20
- dividing this by N_simulations (or equivalently len(data_std)) gives us the fraction or probability that \(s<0.20\)
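A short sketch of the boolean counting trick, assuming NumPy draws in place of tensorflow and an illustrative number of simulations:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
N_simulations = 5000                   # illustrative value
data = rng.uniform(0, 1, size=(N_simulations, 10))

data_std = np.std(data, axis=1)        # one value of s per simulated sample

below = data_std < 0.20                # boolean array: True where s < 0.20
count = np.sum(below)                  # True counts as 1, False as 0
prob = count / len(data_std)           # fraction of simulations with s < 0.20

print(count, prob)
```

Since the mean of a boolean array is the share of True values, np.mean(data_std < 0.20) returns the same number.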
Questions you should be able to answer before continuing:
- calculate the probability that \(s \in [0.20,0.30]\)
distribution of a slope
generating data
Topics we cover in this video:
- how to generate \(x\) observations denoted simulated_x and then use these to generate data on \(y\) denoted simulated_y of the format \(y = \beta_0 + \beta_1 x + \varepsilon\)
Questions you should be able to answer before continuing:
- what values of \(\beta_0\) and \(\beta_1\) do we use in the video and which values in the notebook?
- how does the distribution of \(\varepsilon\) differ in the video and in the notebook? In particular, what is the standard deviation of \(\varepsilon\) in the notebook? [hint: this is not 1.0]
- generate simulated_x using a different distribution than the normal distribution, e.g. use a gamma distribution for \(x\).
doing OLS on your data
Topics we cover in this video:
- make a scatter plot for one of our N_simulations datasets
- for this dataset do an OLS regression using statsmodels (see the website for details on the syntax; but for this course you do not need to learn statsmodels)
- some libraries, like statsmodels, cannot work with tensorflow arrays directly; you can use the .numpy() method to avoid problems
- plot the estimated OLS line in the scatter plot
Questions you should be able to answer before continuing:
- in the same scatter plot, present two datasets and two estimated OLS lines; that is, work with simulated_x[0,:],simulated_y[0,:] and simulated_x[1,:],simulated_y[1,:] in the same figure
distribution of slopes
Topics we cover in this video:
- use a for-loop to run N_simulations OLS regressions
- using the .append() method to add the estimated slopes and constants to their respective lists
- then plot the slopes with a histogram
- plot the estimated OLS lines in \((x,y)\) space
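The for-loop over regressions might look like the sketch below. It is not the code from the video: np.polyfit is swapped in for statsmodels, NumPy replaces tensorflow for the draws, and the values of beta_0, beta_1, N_simulations and sample_size are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
N_simulations, sample_size = 500, 50   # illustrative values
beta_0, beta_1 = 1.0, -2.0             # hypothetical true constant and slope

simulated_x = rng.normal(0, 1, size=(N_simulations, sample_size))
epsilon = rng.normal(0, 1, size=(N_simulations, sample_size))
simulated_y = beta_0 + beta_1 * simulated_x + epsilon

slopes, constants = [], []
for i in range(N_simulations):
    # A degree-1 polyfit returns [slope, constant] of the least-squares line.
    slope, constant = np.polyfit(simulated_x[i, :], simulated_y[i, :], 1)
    slopes.append(slope)
    constants.append(constant)

print(np.mean(slopes), np.std(slopes))
```

The list slopes can be passed to plt.hist to see the simulated distribution of the estimator.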
Questions you should be able to answer before continuing:
- what is the probability that the slope is lower than \(-2.2\)?
bootstrapping
Topics we cover in this video:
- generate your own data: here two data sets \(A\) and \(B\)
- calculate the observed difference in means for the 2 data sets: \(m_{A}, m_B\)
- is the difference \(m_A - m_B\) significant?
- under the null hypothesis that \(A\) and \(B\) were drawn from the same distribution, we can concatenate our data sets \(A\) and \(B\) into one big data set denoted \(AB\)
- out of \(AB\) we generate 10,000 data sets \(\tilde A\) and 10,000 data sets \(\tilde B\) and we calculate the difference \(m_{\tilde A}-m_{\tilde B}\) 10,000 times; hence we get the distribution of our statistic, the difference in means between \(A\) and \(B\)
- then we see how likely it is that the difference exceeds the observed difference \(m_A - m_B\)
- if this is not likely, we say that it is not likely that \(A\) and \(B\) were drawn from the same distribution
- we use tf.concat to merge the two data sets and np.random.shuffle to shuffle the rows of the combined data set \(AB\) to generate new samples for \(\tilde A\) and \(\tilde B\)
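The procedure above can be sketched as follows. Assumptions: np.concatenate and rng.permutation replace tf.concat and np.random.shuffle, the data sets are normal draws, and delta is taken to be the difference in true means between \(A\) and \(B\), which may not match its role in the video. Note that reshuffling the pooled data without replacement is usually called a permutation test.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
delta = 1.0                                   # hypothetical shift in the mean of B
A = rng.normal(0.0, 1.0, size=100)
B = rng.normal(delta, 1.0, size=100)
observed = np.mean(A) - np.mean(B)

AB = np.concatenate([A, B])                   # pooled data under the null hypothesis
differences = []
for _ in range(10_000):
    shuffled = rng.permutation(AB)            # shuffled copy; AB itself is unchanged
    A_tilde, B_tilde = shuffled[:len(A)], shuffled[len(A):]
    differences.append(np.mean(A_tilde) - np.mean(B_tilde))
differences = np.array(differences)

# Share of reshuffled differences at least as extreme as the observed one.
p_value = np.mean(np.abs(differences) >= np.abs(observed))
print(observed, p_value)
```

A small p_value means that a difference as large as the observed one is unlikely if \(A\) and \(B\) come from the same distribution.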
Questions you should be able to answer before continuing:
- plot the distribution of the differences
- redo this exercise specifying different values for delta; for which values of delta (not equal to 1) do you find that the null hypothesis is not rejected?
- Suppose you have a data set with a \(y\) column and an \(x\) column. You run a regression of \(y\) on \(x\) and a constant and find that the slope on the \(x\) variable equals 0.05. How can you use bootstrapping to test whether this slope equals 0? [hint: if the slope is zero, what does this say about the rows \((x_i,y_i)\) in your data set?]
Doing your own OLS and lasso regressions
Topics we cover in this video:
- defining tensorflow vectors and using tf.concat to create a matrix \(X\) with these vectors as columns
- using tf.ones to create a column that consists of 1's
- then we define the difference between our observations and our OLS line as \(y-Xw\) where \(w\) consists of the constant, the slope of the first variable and the slope of the second variable; for now think (incorrectly) of \(Xw\) as a matrix multiplication (once we get to "broadcasting" you will see what it really is)
- we define a function loss which equals the sum of the squared differences between \(y\) and our prediction \(Xw\)
- we use optimize.fmin to minimize the loss function; fmin requires the function to be minimized and an initial guess for the variables (here \(w\)) over which we minimize the function
- this minimization gives us the OLS estimates of \(w\)
- for lasso and ridge regressions, the \(x\) and \(y\) variables need to be standardized; our \(x\) variables are standardized by the way we defined them (zero mean and standard deviation equal to one); so we only center \(y\) such that the centered variable has mean 0
- we define the loss function for a lasso regression which is a function of the coefficients \(w\) and a penalty term \(\lambda\).
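A sketch of the OLS part only, under several assumptions: NumPy arrays replace tensorflow tensors, \(Xw\) is written as a true matrix product with @, the true coefficients are made up, and np.linalg.lstsq stands in for optimize.fmin as the minimizer so the block needs only NumPy.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
n = 200
x1 = rng.normal(0, 1, size=(n, 1))
x2 = rng.normal(0, 1, size=(n, 1))
# Columns of X: a column of ones (constant), x1 and x2.
X = np.concatenate([np.ones((n, 1)), x1, x2], axis=1)
w_true = np.array([[0.0], [2.0], [0.0]])      # hypothetical constant and slopes
y = X @ w_true + rng.normal(0, 1, size=(n, 1))

def loss(w):
    """Sum of squared differences between y and the prediction Xw."""
    w = np.reshape(w, (3, 1))
    return np.sum((y - X @ w) ** 2)

# Assumption: lstsq replaces optimize.fmin; both look for the w that minimizes loss.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_hat.ravel())
print(loss(w_hat), loss(w_hat + 0.1))
```

Moving away from w_hat in any direction increases the loss, which is what a numerical minimizer such as fmin exploits when it starts from an initial guess.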
Questions you should be able to answer before continuing:
- minimize loss-lasso(w,0); which coefficients \(w\) do you find?
- are they identical to \(w\) minimizing loss(w)? Why (not)?
- try w_guess = tf.zeros([3]) and w_guess = tf.zeros([3,1]) in the code. Do both of these work as well?
- generate the data with constant and slope2 not equal to zero. Then estimate the OLS and lasso coefficients.
Causality
Fork
Topics we cover in this video:
- generate our own data where \(Z\) causes \(X\) and \(Y\) but there is no causal link between \(X\) and \(Y\)
- create a pandas dataframe with the three variables \(x,y,z\) using pd.DataFrame and specify a dictionary of the form: {'column name':variable name}
- generate OLS results from the regression \(y = b_0 + b_x x\) using statsmodels (again: you do not need to know statsmodels for this course)
- this regression shows that \(b_x\) is significantly different from 0
- this is correct: there is a strong correlation between \(x\) and \(y\)
- you may be tempted to interpret this as a causal effect of \(x\) on \(y\)
- this is not correct: the way we generated the data in python clearly shows that there is no causal effect of \(x\) on \(y\)
- A fork can be easily solved: just run the regression \(y = b_0 + b_x x + b_z z\) and this will show the unbiased estimate of \(b_x\): in our model we find that after controlling for \(z\), there is no significant effect anymore of \(x\) on \(y\).
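A minimal simulation of a fork. The coefficients in the data-generating process are hypothetical, and np.linalg.lstsq is used in place of statsmodels, so only point estimates (no standard errors) are shown.

```python
import numpy as np

rng = np.random.default_rng(seed=6)
n = 5000
z = rng.normal(size=n)
x = z + rng.normal(size=n)        # Z causes X
y = 2 * z + rng.normal(size=n)    # Z causes Y; X does not enter this equation

def ols(y, *regressors):
    """Least-squares coefficients of y on a constant and the given regressors."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_short = ols(y, x)       # regression of y on a constant and x
b_long = ols(y, x, z)     # regression of y on a constant, x and z

print(b_short[1], b_long[1])
```

The coefficient on x is clearly positive in the short regression and close to zero once z is included.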
Questions you should be able to answer before continuing:
- Write \(y = \beta_0 + \beta_x x + \beta_z z + \varepsilon\); then in the video we consider the case with \(\beta_x = 0\)
- now program your data such that \(\beta_x \neq 0\)
- first run the regression \(y = b_0 + b_x x\): do you find \(b_x = \beta_x\)?
- then run the regression \(y = b_0 + b_x x + b_z z\): how do \(b_x\) and \(\beta_x\) compare now?
Pipe
Topics we cover in this video:
- we generate data where \(X\) causes \(Z\) and \(Z\) causes \(Y\)
- hence there is a causal effect of \(X\) on \(Y\) (via \(Z\))
- running a regression \(y = b_0 + b_x x + b_z z\) shows that \(b_x\) is not significantly different from 0
- hence you could incorrectly infer that \(X\) has no causal effect on \(Y\) (but actually it does in the data that we generated)
- this regression shows that after controlling for \(Z\), \(X\) has no (additional) effect on \(Y\)
- hence with a fork the regression \(y = b_0 + b_x x + b_z z\) suggests the correct causal interpretation of \(X\) on \(Y\) but with a pipe this regression gives the wrong impression of the causal effect of \(X\) on \(Y\): so which one should you use in practice?
- your knowledge of the world should help you figure out whether you are in a fork or pipe "situation" and hence which regression gives the correct suggestion of causal effects.
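A sketch of the regression with both \(x\) and \(z\) in a pipe, with hypothetical coefficients and np.linalg.lstsq in place of statsmodels:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
n = 5000
x = rng.normal(size=n)
z = x + rng.normal(size=n)          # X causes Z
y = 2 * z + rng.normal(size=n)      # Z causes Y, so X affects Y only via Z

# Regression of y on a constant, x and z.
X = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b_x, b_z = coef[1], coef[2]

print(b_x, b_z)
```

Holding z fixed, x carries no additional information about y, so b_x is close to zero even though x does move y through z.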
Questions you should be able to answer before continuing:
- with the data generated in the video, run the regression \(y = b_0 + b_x x\): does this provide the correct value for \(b_x\)?
- generate data using \(y = \beta_0 + \beta_x x + \beta_z z + \varepsilon\) with \(\beta_x \neq 0\). Which regression gives the correct size of the causal effect of \(X\) on \(Y\)? Can you determine this correct size analytically?
Collider
Topics we cover in this video:
- we create a dataset with a collider
- we run the regression of \(Y\) on \(X\) and \(Z\) and find a negative effect of \(X\) on \(Y\)
- this is puzzling because in our data \(X\) has a positive effect on \(Z\) and \(Z\) has a positive effect on \(Y\)
- hence where is the negative effect coming from?
- when running a regression like \(y = b_0 + b_x x + b_z z\), the interpretation of \(b_x\) is the effect of \(x\) for a given value of \(z\)
- hence we plot the relation between \(X\) and \(Y\) for a given value of \(Z\): this scatter plot reveals a negative correlation
- when controlling for parent's education \(Z\), a well educated grandparent must have lived in a neighborhood that is not so great, while a grandparent (with the same \(Z\), educational achievement of the parent) who has low education her/himself must have lived in a great neighborhood. The former grandparent's grandchild lives in the same bad neighborhood and has low educational achievement, while the latter grandchild's educational achievement is boosted by the good neighborhood they live in.
- this explains the negative correlation between \(X\) and \(Y\) controlling for \(Z\)
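The mechanism can be reproduced in a small simulation. The data-generating process below is a hypothetical one chosen to match the story (grandparent education \(X\), parent education \(Z\), child education \(Y\), unobserved neighborhood quality \(U\)); it is not necessarily the one used in the video.

```python
import numpy as np

rng = np.random.default_rng(seed=8)
n = 5000
u = rng.normal(size=n)                    # neighborhood quality (not in the regression)
x = rng.normal(size=n)                    # grandparent's education
z = x + 2 * u + rng.normal(size=n)        # parent's education: raised by x and by u
y = z + 2 * u + rng.normal(size=n)        # child's education: no direct effect of x

# Regression of y on a constant, x and z (u is left out).
X = np.column_stack([np.ones(n), x, z])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b_x, b_z = coef[1], coef[2]

print(b_x, b_z)
```

For a given z, a higher x implies a lower u on average, which is why b_x comes out negative although x has no negative effect anywhere in the data-generating process.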
Questions you should be able to answer before continuing:
- run the regression of \(Y\) on \(X\) only: what effect do you find?
- include the neighborhood effects \(U\) in the dataframe and run the regression of \(Y\) on \(X, Z\) and \(U\): what effects do you find?
Tensors
Introduction
Topics we cover in this video:
- we create a "normal" data set with variables like gdp, inflation, unemployment
- this is basically a matrix: 2 dimensional data
- columns are the variables and rows the observations (e.g. countries with cross section data, or time for a given country in time series data or a combination of countries over time in panel data)
- we download the mnist data set which consists of images of handwritten numbers
- that is, one observation is a handwritten number in 2 dimensions
- hence the data is 3 dimensional: the training data consists of 60,000 observations where each observation is a two dimensional image
- tensors allow us to work with data in higher dimensions than two
Questions you should be able to answer before continuing:
- what is the dimension of train_labels?
- use inflation.shape to see that inflation is a two dimensional tensor [hint: the command returns 2 numbers]
- but what is the dimension of inflation as a vector? [hint: the distinction between dimensions as a vector and as a tensor is confusing at first, but you will get used to it]
- check what train_labels[4] is.
Creating tensors with numpy
Topics we cover in this video:
- create a tensor in numpy using the .reshape method
- using the .shape and .ndim attributes to determine what the shape of the tensor is and how many dimensions it has
- a 100-dimensional column vector \(x\) turns out to have dimension 1 as a tensor
- add a new dimension to a tensor using np.newaxis
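A small illustration of these attributes; the array sizes are chosen for the example only.

```python
import numpy as np

x = np.arange(100)                 # a vector with 100 entries
print(x.shape, x.ndim)             # (100,) 1

m = x.reshape(10, 10)              # same numbers arranged as a 10 x 10 matrix
print(m.shape, m.ndim)             # (10, 10) 2

col = x[:, np.newaxis]             # add a second dimension of length 1
print(col.shape, col.ndim)         # (100, 1) 2
```

Note that .shape and .ndim are written without parentheses, while .reshape is called like a function.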
Questions you should be able to answer before continuing:
- create a vector y = np.arange(120) and define y5 = y.reshape(1,2,3,4,5)
- what is the shape of y5? And its dimensions?
- to get a sense of what y5 looks like, try things like y5[:,0,0,0,0] and y5[0,0,:,0,0] and y5[0,0,:,:,0]
- you can also evaluate y5 itself and you get the sense of a matrix of matrices; pay attention to the square brackets [] to see how python delineates dimensions in its output when evaluating y5
Broadcasting
Topics we cover in this video:
- numpy matrix multiplication using the @ operator
- multiplying tensors using broadcasting
- multiplication, addition etc. of tensors is done element by element
- if the two tensors do not have the same shape, numpy uses broadcasting to get the tensors into the same shape
- broadcasting rules are:
- start at the last dimension of the two tensors
- these two dimensions are compatible if
- either they are equal
- or if they are not equal, at least one of them equals 1
- if this is satisfied, move a dimension "to the left" and do the same check
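The rules above can be written as a small helper. The function can_broadcast is a hypothetical name introduced for this illustration; it is not part of numpy.

```python
import numpy as np

def can_broadcast(shape_a, shape_b):
    """Check the rules above, comparing dimensions from the last one leftwards."""
    for dim_a, dim_b in zip(reversed(shape_a), reversed(shape_b)):
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False
    # Dimensions that exist in only one of the two shapes never cause a conflict.
    return True

a = np.arange(6).reshape(2, 3)
b = np.arange(3)            # shape (3,)
c = np.arange(2)            # shape (2,)

print(can_broadcast(a.shape, b.shape))   # True
print(can_broadcast(a.shape, c.shape))   # False
print((a + b).shape)                     # (2, 3)
print((a + c[:, np.newaxis]).shape)      # (2, 3)
```

Adding a and c directly raises an error; giving c a trailing dimension of length 1 with np.newaxis makes the shapes (2, 3) and (2, 1) compatible.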
Questions you should be able to answer before continuing:
- create two tensors yourself, e.g. using np.arange and .reshape, and try to add them together; create tensors where this does not work and then use np.newaxis to make the tensors compatible; check that you did this correctly by multiplying them and python should not throw an error
- check the examples in the notebook and predict –before running the code– whether they can be broadcast together or not
slicing and fancy indexing
Topics we cover in this video:
- we create a 2-dimensional tensor \(x\)
- then we select the first element as x[0,0], the last element as x[-1,-1]
- x[:,a:b:c] means we want all rows and then columns starting with index a up to (but not including) index b and we take steps c. If no value of a is specified we start at index 0, if no b is specified we go to the last element and if no c is specified we take step 1
- c = -1 implies that we reverse the order
- fancy indexing uses boolean masks to select entries from a tensor: x[x>0] selects the positive elements out of the tensor x
- this can be used to plot a function where the color varies with the value of the function
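A short illustration of these selections on a small made-up tensor:

```python
import numpy as np

x = np.arange(-4, 8).reshape(3, 4)
# x is:
# [[-4 -3 -2 -1]
#  [ 0  1  2  3]
#  [ 4  5  6  7]]

first = x[0, 0]              # first element: -4
last = x[-1, -1]             # last element: 7
every_other = x[:, 1:4:2]    # all rows, columns 1 and 3
reversed_cols = x[:, ::-1]   # all rows, columns in reverse order
positive = x[x > 0]          # boolean mask: the positive entries as a 1-dimensional array

print(first, last)
print(every_other)
print(reversed_cols)
print(positive)
```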
Questions you should be able to answer before continuing:
- the slicing and indexing questions in the notebook should now be straightforward to do
- create a 4 dimensional tensor and make selections out of the 4 dimensions
First neural network
- After all the work of understanding tensors, the notebook presents a first example of a neural network. At this point you do not need to worry about the syntax (from keras) with which we build the neural network. Just go through the notebook and run the cells. The independent variable (\(x\) variable, if you like) is now a two-dimensional figure: a handwritten number. The labels are the number that was handwritten in the figure. The network tries to predict this label based on the figure. The notebook takes you through the steps and checks accuracy.
- the \(x\) variable in the train set is called train_images which is a 3-dimensional tensor of shape \((60000, 28, 28)\). You cannot work with such \(x\) variables in an OLS regression, but the neural network has no problems with this.
Overfitting and underfitting
generating the data
Topics we cover in this video:
- when doing OLS, adding variables always improves the fit/reduces the mean squared error (mse)
- there are two dangers when you keep adding variables to a regression:
- as discussed above: you start to misinterpret the results in terms of causality
- although mse falls, your predictions become worse at some point
- overfitting is the situation where you add so many variables that your predictions start to suffer
- with underfitting you did not add enough relevant variables and your predictions are less than optimal as well
- you get an idea of the over/underfitting of a model by splitting your data into a train and test data set
- you estimate (train) your model on the train data
- and evaluate your model on the test data
- we use tf.keras.losses.mse to evaluate the model; we provide this function with two variables: the observed \(y\) values in the train data (df_train['y']) and the model prediction on the train data: model.predict(df_train)
Questions you should be able to answer before continuing:
- change the definition of the variable \(y\) by adding terms \(x^3,x^4\) to it. Then run the code again and see what happens to the development of the mse.
- change other parameters like N_observations and train_size and see the effects on the development of mse and, with the code of the next video, on performance on the test set.
overfitting
Topics we cover in this video:
- again we use tf.keras.losses.mse to evaluate the model but now with the observed \(y\) values in the test data (df_test['y']) and the model prediction on the test data: model.predict(df_test)
- adding additional (irrelevant) variables basically destroys the prediction performance of the model on the test data
- in the video the mse for the test data is minimized by including the correct variables in the model (which we know as we generated the data ourselves)
- plotting the higher order models in \((x,y)\) space shows that they try to capture idiosyncratic features of the train data that do not generalize to the test data: this is why their prediction performance deteriorates by including more variables
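A sketch of the train/test comparison, assuming polynomial regressions fitted with np.polyfit in place of the models and tf.keras.losses.mse used in the video. The data-generating process and the values of N_observations and train_size are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=9)
N_observations, train_size = 60, 30           # illustrative values
x = rng.uniform(-1, 1, size=N_observations)
# Hypothetical true model: quadratic in x plus noise.
y = 1 + 2 * x - 3 * x**2 + rng.normal(0, 0.3, size=N_observations)

x_train, y_train = x[:train_size], y[:train_size]
x_test, y_test = x[train_size:], y[train_size:]

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

mse_train_by_degree, mse_test_by_degree = [], []
for degree in range(1, 9):
    coef = np.polyfit(x_train, y_train, degree)            # estimate on train data only
    mse_train_by_degree.append(mse(y_train, np.polyval(coef, x_train)))
    mse_test_by_degree.append(mse(y_test, np.polyval(coef, x_test)))

for degree, (tr, te) in enumerate(zip(mse_train_by_degree, mse_test_by_degree), start=1):
    print(degree, round(float(tr), 4), round(float(te), 4))
```

The train mse cannot increase when higher powers of x are added; comparing it with the test mse column shows whether the extra terms help out of sample.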
Questions you should be able to answer before continuing:
- suppose you estimate a model on the train data and get great results on the test data. How can you be sure that this is no coincidence of the way you split the data into train and test data?
- is the following estimation procedure ok?
- fix the train and test data
- estimate a first version of the model on the train data and evaluate the result on the test data
- then add new variables/delete some variables, change the functional form of the equation you estimate (and later on, change the hyper parameters of the model), estimate on the train data and evaluate on the test data
- keep repeating this till you have minimized the mse on the test data
Neural network
perceptron
Topics we cover in this video:
- we use Stephen Marsland's code for the perceptron to understand how weights in a neural network are updated
- weight \(w_0\) is updated according to the rule: \(w_0 = w_0 - \eta(\hat t -t)x\) where \(\eta\) denotes the learning rate, \(\hat t\) is the current prediction of our neural network and \(t\) is the true (correct) target for the point \(x\). Hence, if \(\hat t = t\) there is no need to update \(w_0\) as far as point \(x\) is concerned.
- we illustrate with a simple graph how this updating of weights improves the prediction of our model
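The update rule can be tried on a toy problem. This is a simplified sequential sketch, not Stephen Marsland's code: the data (the logical OR of two inputs), the learning rate and the use of a constant input of 1 for the bias weight are all assumptions made for the example.

```python
import numpy as np

# Toy data: the target is the logical OR of the two inputs.
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 1, 1, 1], dtype=float)

# Prepend a constant input of 1 so that w[0] acts as the bias weight.
X = np.column_stack([np.ones(len(inputs)), inputs])
w = np.zeros(3)
eta = 0.25                                  # learning rate

def predict(X, w):
    """Predict 1 where the weighted sum of inputs is positive, else 0."""
    return (X @ w > 0).astype(float)

for epoch in range(10):
    for x_i, t_i in zip(X, targets):
        t_hat = float(x_i @ w > 0)
        # Same rule as in the text, applied to every weight; no change if t_hat == t_i.
        w = w - eta * (t_hat - t_i) * x_i

print(w, predict(X, w))
```

After a few passes over the data no point is misclassified any more, so the weights stop changing.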
Questions you should be able to answer before continuing:
- in the video we consider the case where \(w_1 > 0\); check that this updating works as well in case \(w_1 < 0\)
- another way to update the line in the video is to leave \(w_0\) unchanged and adapt \(w_1\); check that the correction works as well for \(w_1\)
- do the section on the multi-layer perceptron in the notebook; if you have trouble downloading the data, check the video on tensor classification below.
- a great way to get some intuition on the workings of a neural network is to go to the playground
- if you need a break and want to have some fun with machine learning, go and doodle
tensorflow regression
the math
Topics we cover in this video:
- define a tensorflow variable \(z\) with starting value 0: z = tf.Variable(0.0)
- define a function using tensorflow variables and other variables
- calculate the derivative of a function w.r.t. a tensorflow variable using tf.GradientTape() and the .gradient method
- using this derivative to update the tensorflow variables to minimize a function
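The logic of the update loop can be sketched in plain Python. This is a stand-in, not the tensorflow code from the video: the function f is hypothetical and its derivative is written out by hand, whereas in the video the derivative is obtained with tf.GradientTape().

```python
def f(z):
    return (z - 3.0) ** 2              # hypothetical function to minimize

def df_dz(z):
    return 2.0 * (z - 3.0)             # hand-written derivative of f

z = 0.0                                # same starting value as z = tf.Variable(0.0)
learning_rate = 0.1                    # illustrative value

for step in range(200):
    gradient = df_dz(z)
    # Step against the gradient; with a tensorflow variable this
    # would be done with z.assign_sub(learning_rate * gradient).
    z = z - learning_rate * gradient

print(z, f(z))
```

Each step moves z a fraction of the way towards the point where the derivative is zero, so z ends up very close to 3.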
Questions you should be able to answer before continuing:
- define another function and use the method described in the video to minimize this function
- adapt the method to maximize a function (in particular the part z.assign_sub) and use it to maximize the function \(f(x,y) = 10 - x^2 - y^2\).
the regression
Topics we cover in this video:
- define a function to generate our own train data
- use tensorflow functions tf.square and tf.reduce_mean to define a loss function of the difference between true data \(y\) and our prediction of \(y\)
- minimize this loss function to find the OLS estimates of the slope and intercept.
Questions you should be able to answer before continuing:
- use the function make_noisy_data defined in the video to generate other data and estimate the slope and intercept of this data
- extend the function make_noisy_data to allow for \(y =b+ m_x x + m_z z + e\) and adapt the procedure to estimate \(b,m_x\) and \(m_z\)
tensorflow classification
getting the data ready
Topics we cover in this video:
- getting the data using urllib.request
- using pandas read_csv() to read the data
- using .replace() to replace the flower names (strings) by numbers (integers)
- normalizing the features of the data set
- plot the data with different colors for each flower type
Questions you should be able to answer before continuing:
- plot the data for all combinations of features (feature 0 and 1; 0 and 2 etc.) to see which dimensions seem most helpful to classify the data into the different flower types
- compare the data normalization steps that we do here with the ones used in the notebook in the section Multi-layer perceptron: which parts are the same, which differ?
estimating the network
Topics we cover in this video:
- split the data into train and test set
- specify the network using the keras syntax
- we use two layers with 'relu' activation and the final layer with 'softmax'; this gives us prediction probabilities over the 3 flower types in our data
- we then compile the model specifying the optimizer, loss function and the metrics we would like to see during the fitting stage
- we fit the model using the train data (features and targets) and we specify the number of epochs
- as the number of epochs increases, the loss on the train data falls, but this can lead to over-fitting; later we will see how you can determine the optimal number of epochs (avoiding both over-fitting and under-fitting)
- we evaluate the model on the test set.
Questions you should be able to answer before continuing:
- which mistake is made in the video when splitting the data into a train and test set? You can increase the 'epochs' to improve the fit on the train data, but the evaluation on the test set will not really improve. [hint: in the section Multi-layer perceptron in the notebook we use the iris data for the first time. Carefully check the steps we take there: which one did we miss here? check the data to see why this step matters]
- increase the number of epochs and compare the fit on the train data with the fit on the test data
Back to our first neural network
defining the network and fitting it
Topics we cover in this video:
- loading the mnist data
- normalizing our variable
- defining the model using keras.Sequential for the different layers
- using activations relu and softmax
- in the compile step we specify the optimizer, the loss function and other metrics that we want to see when the model is fit to the data
- finally we fit the model
Questions you should be able to answer before continuing:
- when do you use relu and when softmax activations?
- what is a Dense layer?
- what is an epoch?
checking the fit
Topics we cover in this video:
- how to evaluate your fitted model on the test data
- with a classification model the prediction is an array with probabilities
- the highest probability in this array gives the most likely label for the observation
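A small illustration of how a label follows from an array of probabilities. The scores are made-up numbers for three classes, not output from the fitted model, and the softmax function is written out in NumPy rather than taken from keras.

```python
import numpy as np

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to one."""
    shifted = scores - np.max(scores)      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / np.sum(exp)

scores = np.array([1.0, 3.0, 0.5])         # made-up scores for three classes
probabilities = softmax(scores)
predicted_label = np.argmax(probabilities) # index of the highest probability

print(probabilities, predicted_label)
```

With a fitted model, the array returned by model.predict for one observation plays the role of probabilities here.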
Questions you should be able to answer before continuing:
- compare the prediction with the label for 5 different test observations.
number of epochs and overfitting
Topics we cover in this video:
- use the history of model.fit to see the model's performance as a function of the number of epochs
- plot the loss on the train data and the loss on the validation (or test) data
- the number of epochs where the validation loss "levels off" is the right number of epochs to use
- the loss on the train data keeps falling with the number of epochs beyond this point, but this is due to overfitting
Questions you should be able to answer before continuing:
- make a similar plot for model accuracy and the number of epochs
- experiment with the network architecture to see how this affects the optimal number of epochs:
- increase the number of nodes in a layer
- increase the number of layers in the network
- specify a model that clearly overfits the data
Treatment effects
IV
generating our data
Topics we cover in this video:
- we generate data with no direct (causal) effect of education on wage
- with an OLS estimation we find a positive and significant effect of education on the wage rate
- hence OLS is not the correct estimator to find the causal effect of education on wage
Questions you should be able to answer before continuing:
- define a function that generates the data, runs the OLS and returns the results as a function of the parameters alpha_w, alpha_e, beta_ew, beta_qe.
- for different values of the parameters, see what the OLS result is; e.g. what happens to the OLS estimation of the effect of education on wage in case alpha_ew equals 1.0?
IV estimate
Topics we cover in this video:
- using IV we correctly identify the causal effect of education on wage
- the first stage correctly captures the effect of the instrument q on education
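A sketch of OLS versus a two-stage IV estimate on simulated data. The data-generating process is hypothetical: an unobserved variable called ability drives both education and wage, and the roles given to beta_ew and beta_qe are assumptions about what these parameters mean in the video. np.linalg.lstsq replaces statsmodels, so no standard errors are computed.

```python
import numpy as np

rng = np.random.default_rng(seed=10)
n = 10_000
beta_ew = 0.0        # assumed role: direct effect of education on wage
beta_qe = 1.0        # assumed role: effect of the instrument q on education

ability = rng.normal(size=n)                 # unobserved confounder (hypothetical)
q = rng.normal(size=n)                       # instrument
education = beta_qe * q + ability + rng.normal(size=n)
wage = beta_ew * education + ability + rng.normal(size=n)

def ols(y, x):
    """Constant and slope from a least-squares regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

ols_estimate = ols(wage, education)[1]

# First stage: education on q; second stage: wage on fitted education.
first_stage = ols(education, q)
education_hat = first_stage[0] + first_stage[1] * q
iv_estimate = ols(wage, education_hat)[1]

print(ols_estimate, first_stage[1], iv_estimate)
```

OLS picks up the correlation created by ability, while the second stage uses only the variation in education that comes from q.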
Questions you should be able to answer before continuing:
- which properties of q make it a valid instrument?
- generate data with alpha_ew equal to 1; can the IV estimation correctly identify this parameter?
Heterogeneous treatment effects
generating our data
Topics we cover in this video:
- use of np.ones_like, np.zeros_like
- using dictionaries to define effects and functions for different groups
- python can loop over a list of strings (names for the different subgroups)
- we avoid copy/paste of code for different groups by using dictionaries together with a function
- if we change something, we only need to change it once in our code (i.e. not change it for each group which we would have to do if you copy/paste your code)
- with np.concatenate we "glue" the vectors for the groups together in columns for the dataframe
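The dictionary pattern can be sketched as follows. The group names, group sizes and values of tau are hypothetical, and the result is kept as a dictionary of columns rather than a pandas dataframe so that the block needs only NumPy.

```python
import numpy as np

rng = np.random.default_rng(seed=11)

groups = ["compliers", "never_takers"]               # hypothetical group names
n = {"compliers": 300, "never_takers": 200}          # hypothetical group sizes
tau = {"compliers": 2.0, "never_takers": 0.0}        # hypothetical treatment effects

def group_columns(group):
    """Baseline earnings and treatment effect for every individual in one group."""
    baseline = rng.normal(10, 1, size=n[group])
    effect = tau[group] * np.ones_like(baseline)      # same length as baseline
    return baseline, effect

# Loop over the group names once; a change to the function applies to all groups.
columns = {g: group_columns(g) for g in groups}

data = {
    "baseline": np.concatenate([columns[g][0] for g in groups]),
    "effect": np.concatenate([columns[g][1] for g in groups]),
}

print(data["baseline"].shape, data["effect"].mean())
```

Passing data to pd.DataFrame would turn these columns into a dataframe.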
Questions you should be able to answer before continuing:
- generate another dataframe df2 with different values for \(\beta,\tau\) and/or \(n\)
- do the analyses below as well for this dataframe and compare results to the analysis with df
what can we calculate?
Topics we cover in this video:
- with heterogeneous effects, comparing expected earnings with and without training does not give us a straightforward training effect
Questions you should be able to answer before continuing:
- compare expected earnings of individuals with and without an invitation to the training. Does this identify the training effect? [hint: use df[df.invited==1].earnings etc.]
- do the same when comparing the group (trained and invited) with the group (not trained and not invited)
three cases where we can identify the treatment effect
Topics we cover in this video:
- three scenarios where we can recover the relevant treatment effect
- calculate a conditional probability with np.sum over a dataframe column
Questions you should be able to answer before continuing:
- why does the equation of Angrist and Pischke (2009) not work if there are always takers?
Probability of treatment
generating our data
Topics we cover in this video:
- using dictionaries we generate a dataframe for different types
- we model a nudge where receiving an explicit invitation increases the probability that the training is finished successfully
Questions you should be able to answer before continuing:
- calculate using the dataframe the effect of the invitation on the training probability; i.e. the mean of trained conditional on being invited minus this mean conditional on not being invited. Check that this is close to 0.4.
effect of training on earnings
Topics we cover in this video:
- using the dataframe we determine the effect of training on earnings without observing which individuals successfully finished their training
- we only observe who received a nudge by being explicitly invited to the training
Questions you should be able to answer before continuing:
- use results_second_stage.params to see what this returns exactly and what is selected by params[1]
Part 2
For part 2 of the screencasts, go to this page.