Screencasts for Seminar Datascience for Economics

Under construction

This page contains the screencasts for the first part of the course Seminar Datascience for Economics.

How are the screencasts supposed to help?

For the first part of the course we use this notebook. The notebook is fairly self-explanatory, but you may get stuck with some of the questions. The screencasts are meant to help you along if you do not know how to proceed. The idea is similar to what we did in Methods: Python programming for economists.

Why are there screencasts?

The screencasts complement the lectures and classes. For some programming problems you need to take time to try to solve them yourself. We cannot do this in class, as some students solve them quickly (and get bored) while others need more time to solve them by themselves.

Suggested use of the screencasts

Just watching the screencasts is not going to be very useful for you. The best way to use them is as follows:

  • first read the relevant section in the notebook and try to solve the questions by yourself
  • if you manage to do so, you are done and there is no need to watch the screencast; if you do not manage to answer all the questions, you have the background for the screencast (which is rather short and does not motivate the underlying economic problem) and you know what problems to look for in the video
  • type along with what we do in the screencast. Pause the video if you need more time to type
  • experiment with the code that we give you in the video either during the video or after watching it; try different numerical values, use different syntax or a different library; if you get an error, google it to see what is going on etc.
  • in general, play around with the python code; that is the best way to understand and learn the language

With screencasts there is –obviously– no interaction between teachers and students. This is a problem when you run into an error message. In class students then immediately raise their hand to understand what is happening.

As this is not possible with a screencast, here are some tips on how to deal with error messages. Also note that the screencasts are typed "live", so there will be error messages in the videos as well and you can see how we deal with these as we go along.

Python error messages:

  • start reading the error message at the end; hence scroll down in your notebook to see the last line of the error message. This will usually give a clue of what is happening
  • if you do not know what the last line means, use copy/paste to google it or go to stackoverflow directly
  • reading "upwards" from the last line, you can trace how the error was generated
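
To see what "start at the end" means in practice, here is a small sketch that triggers an error on purpose and prints only the last line of the error message. The function average is a made-up example and does not come from the notebook.

```python
import traceback

def average(values):
    # fails for an empty list because len(values) is 0
    return sum(values) / len(values)

try:
    average([])
except ZeroDivisionError:
    # format_exc() returns the full error message as one string
    error_lines = traceback.format_exc().strip().splitlines()
    last_line = error_lines[-1]   # the line to read (and google) first
    print(last_line)
```

The lines above the last one mention average; reading upwards shows which call produced the error.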

Prerequisite knowledge

Economics/Econometrics: this introduction to datascience is part of an MSc degree in Economics. Hence, we assume that you know what OLS and standard errors are, that you understand the difference between correlation and causality, and that you have seen instrumental variables before. In terms of math, you can do some basic linear algebra, like multiplying a matrix and a vector, and you know what a (partial) derivative of a function is.

Python: you have done Methods: Python programming for economists.

Libraries we use

  • numpy: basic number crunching and vector manipulation
  • pandas: to work with dataframes
  • statsmodels: to do OLS
  • pymc: to generate random numbers and do Bayesian analysis
  • tensorflow (2.0 or later): we use keras for our neural networks
  • scipy: scientific python
  • matplotlib: to make plots

Compare jupyter notebook/lab and emacs

Why am I using emacs

You know how to work with JupyterLab on the university servers (or on your own laptop). As with Methods: Python programming for economists, most screencasts will be made with Emacs.

Distributions of an estimator

distribution of sample average and standard deviation

generating data

Topics we cover in this video:

  • some estimators have known (analytical) distributions
  • this is usually not the case with advanced models in datascience
  • hence we need to know how to simulate such distributions
  • we first simulate the distribution of a sample average \(m\)
  • we use the code tf.random.uniform([N_simulations,sample_size],0,1) to generate N_simulations times a sample of size sample_size
  • this results in a dataset with N_simulations rows and sample_size columns
  • in a notebook, type tf.random.uniform? to see how this function can be used
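
A minimal sketch of the data generation step. To keep it runnable without tensorflow, we use numpy's random generator in place of tf.random.uniform (this swap is our assumption; the video uses tensorflow). The values of N_simulations and sample_size and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily

N_simulations = 1000   # hypothetical value
sample_size = 10       # hypothetical value

# each row is one simulated sample, each column one draw from U[0,1]
data = rng.uniform(0, 1, size=(N_simulations, sample_size))
print(data.shape)
```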

Questions you should be able to answer before continuing:

  • generate data that is drawn from a normal distribution with mean 0 and standard deviation 1

distribution sample mean

Topics we cover in this video:

  • calculating the mean across an axis with np.mean(data, axis = 1)
  • making a histogram with matplotlib's hist function; setting density and bins
  • adding title and labels on the axes

Questions you should be able to answer before continuing:

  • change the values for N_simulations and sample_size and plot the histogram of \(m\); what do you find? How does, for instance, the standard deviation in the histogram vary with these parameters?

standard deviation of \(m\)

Topics we cover in this video:

  • we show how the standard deviation of \(m\) varies with the sample size \(n\)
  • we do this by creating an empty vector vector_std and for each sample size we calculate the standard deviation of \(m\) (across our N_simulations) and then add this to the vector with vector_std.append()
  • using plt.scatter we plot the sample size against this standard deviation
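
The loop described above can be sketched as follows. We again use numpy instead of tensorflow to generate the data (our assumption), and the sample sizes are hypothetical values.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
N_simulations = 2000                  # hypothetical value
sample_sizes = [10, 40, 160, 640]     # hypothetical values

vector_std = []                       # start with an empty list
for n in sample_sizes:
    data = rng.uniform(0, 1, size=(N_simulations, n))
    m = np.mean(data, axis=1)         # sample average of each simulated sample
    vector_std.append(np.std(m))      # spread of m across the simulations

for n, s in zip(sample_sizes, vector_std):
    print(n, round(float(s), 4))
```

Passing sample_sizes and vector_std to plt.scatter gives the plot from the video.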

Questions you should be able to answer before continuing:

  • what do we usually call "the standard deviation of \(m\)"?
  • what is the analytical expression for the relation between sample size \(n\) and the standard deviation of \(m\)?
    • to find this relation, you can check wikipedia to find the expression for the variance \(V(x)\)
    • or to numerically approximate it, you can use np.std(tf.random.uniform([10000],0,1))
    • use python to check that both these approaches give the same answer and explain why

distribution of \(s\)

Topics we cover in this video:

  • we calculate the standard deviation of the sample (of 10 draws) across columns with np.std(data, axis = 1)
  • we plot this distribution of N_simulations values of \(s\) with plt.hist()
  • then we use a boolean expression to calculate the probability that \(s\) is below 0.20
    • since True is represented in python by 1 (and False by 0), np.sum(data_std < 0.20) gives the number of cases (out of N_simulations) where the standard deviation is below 0.20
    • dividing this by N_simulations (or equivalently len(data_std)) gives us the fraction or probability that \(s<0.20\)
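
A sketch of the boolean trick, with numpy data standing in for the tensorflow data of the video (our assumption) and a hypothetical number of simulations:

```python
import numpy as np

rng = np.random.default_rng(seed=2)
N_simulations = 5000                              # hypothetical value
data = rng.uniform(0, 1, size=(N_simulations, 10))

data_std = np.std(data, axis=1)       # one value of s per simulated sample
below = data_std < 0.20               # array of True/False values
prob_below = np.sum(below) / N_simulations   # True counts as 1, False as 0
print(prob_below)
```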

Questions you should be able to answer before continuing:

  • calculate the probability that \(s \in [0.20,0.30]\)

distribution of a slope

generating data

Topics we cover in this video:

  • how to generate \(x\) observations denoted simulated_x and then use these to generate data on \(y\) denoted simulated_y of the format \(y = \beta_0 + \beta_1 x + \varepsilon\)

Questions you should be able to answer before continuing:

  • what values of \(\beta_0\) and \(\beta_1\) do we use in the video and which values in the notebook?
  • how does the distribution of \(\varepsilon\) differ in the video and in the notebook? In particular, what is the standard deviation of \(\varepsilon\) in the notebook? [hint: this is not 1.0]
  • generate simulated_x using a different distribution than the normal distribution, e.g. use a gamma distribution for \(x\).

doing OLS on your data

Topics we cover in this video:

  • make a scatter plot for one of our N_simulations datasets
  • for this dataset do an OLS regression using statsmodels (see the website for details on the syntax; but for this course you do not need to learn statsmodels)
  • some libraries, like statsmodels, cannot work with tensorflow arrays directly; you can use the .numpy() method to avoid problems
  • plot the estimated OLS line in the scatter plot
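
A sketch of an OLS regression on one of the simulated datasets. The video uses statsmodels; here we use np.linalg.lstsq instead so that the example only needs numpy (our swap). The values of beta_0 and beta_1 are hypothetical and are not the ones used in the video or the notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
N_simulations, sample_size = 100, 200     # hypothetical values
beta_0, beta_1 = 1.0, 0.5                 # hypothetical values

simulated_x = rng.normal(0, 1, size=(N_simulations, sample_size))
epsilon = rng.normal(0, 1, size=(N_simulations, sample_size))
simulated_y = beta_0 + beta_1 * simulated_x + epsilon

# take the first dataset and add a column of ones for the constant
x, y = simulated_x[0, :], simulated_y[0, :]
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
constant, slope = coef
print(constant, slope)
```

The estimated line constant + slope * x can then be drawn on top of the scatter plot of x and y.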

Questions you should be able to answer before continuing:

  • in the same scatter plot, present two datasets and two estimated OLS lines; that is, work with simulated_x[0,:],simulated_y[0,:] and simulated_x[1,:],simulated_y[1,:] in the same figure

distribution of slopes

Topics we cover in this video:

  • use a for-loop to run N_simulations OLS regressions
  • using the .append() method to add the estimated slopes and constants to their resp. lists
  • then plot the slopes with a histogram
  • plot the estimated OLS lines in \((x,y)\) space
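
The for-loop over the simulated datasets can be sketched like this. As before, np.linalg.lstsq replaces statsmodels (our swap) and all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
N_simulations, sample_size = 500, 50      # hypothetical values
beta_0, beta_1 = 1.0, 0.5                 # hypothetical values

simulated_x = rng.normal(0, 1, size=(N_simulations, sample_size))
simulated_y = beta_0 + beta_1 * simulated_x + rng.normal(0, 1, size=(N_simulations, sample_size))

slopes, constants = [], []
for i in range(N_simulations):
    X = np.column_stack([np.ones(sample_size), simulated_x[i, :]])
    coef, *_ = np.linalg.lstsq(X, simulated_y[i, :], rcond=None)
    constants.append(coef[0])
    slopes.append(coef[1])

# np.histogram returns the numbers that plt.hist would draw
counts, bin_edges = np.histogram(slopes, bins=20)
print(np.mean(slopes), np.std(slopes))
```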

Questions you should be able to answer before continuing:

  • what is the probability that the slope is lower than \(-2.2\)?

bootstrapping

Topics we cover in this video:

  • generate your own data: here two data sets \(A\) and \(B\)
  • calculate the observed difference in means for the 2 data sets: \(m_{A}, m_B\)
  • is the difference \(m_A - m_B\) significant?
  • under the null hypothesis that \(A\) and \(B\) were drawn from the same distribution, we can concatenate our data sets \(A\) and \(B\) into one big data set denoted \(AB\)
  • out of \(AB\) we generate 10,000 data sets \(\tilde A\) and 10,000 data sets \(\tilde B\) and we calculate the difference \(m_{\tilde A}-m_{\tilde B}\) 10,000 times; hence we get the distribution of our statistic, the difference in means, under the null hypothesis
  • then we see how likely it is that the difference exceeds the observed difference \(m_A - m_B\)
  • if this is not likely, we say that it is not likely that \(A\) and \(B\) were drawn from the same distribution
  • we use tf.concat to merge the two data sets
  • we use np.random.shuffle to shuffle the rows of the combined data set \(AB\) to generate new samples for \(\tilde A\) and \(\tilde B\)
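
A sketch of the whole procedure in numpy. We use np.concatenate and rng.permutation in place of tf.concat and np.random.shuffle (our swap), fewer resamples than in the video to keep it fast, and a two-sided comparison of absolute differences (our choice). The value of delta and the sample sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=5)
n_obs = 100                                   # hypothetical size of A and of B
delta = 1.0                                   # hypothetical shift in the mean of B
A = rng.normal(0, 1, size=n_obs)
B = rng.normal(delta, 1, size=n_obs)
observed_diff = np.mean(A) - np.mean(B)

AB = np.concatenate([A, B])                   # one big data set under the null
n_resamples = 2000
diffs = []
for _ in range(n_resamples):
    shuffled = rng.permutation(AB)            # shuffled copy of AB
    A_tilde, B_tilde = shuffled[:n_obs], shuffled[n_obs:]
    diffs.append(np.mean(A_tilde) - np.mean(B_tilde))
diffs = np.array(diffs)

# fraction of resampled differences at least as large as the observed one
p_value = np.mean(np.abs(diffs) >= np.abs(observed_diff))
print(observed_diff, p_value)
```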

Questions you should be able to answer before continuing:

  • plot the distribution of the differences
  • redo this exercise specifying different values for delta; for which values of delta (not equal to 1) do you find that the null hypothesis is not rejected?
  • Suppose you have a data set with a \(y\) column and an \(x\) column. You run a regression of \(y\) on \(x\) and a constant and find that the slope on the \(x\) variable equals 0.05. How can you use bootstrapping to test whether this slope equals 0? [hint: if the slope is zero, what does this say about the rows \((x_i,y_i)\) in your data set?]

Doing your own OLS and lasso regressions

Topics we cover in this video:

  • defining tensorflow vectors and using tf.concat to create a matrix \(X\) with these vectors as columns
  • using tf.ones to create a column that consists of 1's
  • then we define the difference between our observations and our OLS line as \(y-Xw\) where \(w\) consists of the constant, the slope of the first variable and the slope of the second variable; for now think (incorrectly) of \(Xw\) as a matrix multiplication (once we get to "broadcasting" you will see what it really is)
  • we define a function loss which equals the sum of the squared differences between \(y\) and our prediction \(Xw\)
  • we use optimize.fmin to minimize the loss function; fmin requires the function to be minimized and an initial guess for the variables (here \(w\)) over which we minimize the function
  • this minimization gives us the OLS estimates of \(w\)
  • for lasso and ridge regressions, the \(x\) and \(y\) variables need to be standardized; our \(x\) variables are standardized by the way we defined them (zero mean and standard deviation equal to one); so we only center \(y\) such that the centered variable has mean 0
  • we define the loss function for a lasso regression which is a function of the coefficients \(w\) and a penalty term \(\lambda\).
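
A sketch of both loss functions with numpy arrays instead of tensorflow vectors (our swap). Here \(Xw\) is written as the matrix product X @ w. The form of the lasso penalty (sum of absolute slopes, constant not penalized), the name loss_lasso, the value of the penalty and the data generating values are all our assumptions.

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(seed=6)
n_obs = 200
constant, slope1, slope2 = 0.0, 2.0, 0.0       # hypothetical values
x1 = rng.normal(0, 1, size=n_obs)
x2 = rng.normal(0, 1, size=n_obs)
y = constant + slope1 * x1 + slope2 * x2 + rng.normal(0, 1, size=n_obs)
X = np.column_stack([np.ones(n_obs), x1, x2])  # first column consists of 1's

def loss(w):
    # sum of squared differences between y and the prediction Xw
    return np.sum((y - X @ w) ** 2)

y_centered = y - np.mean(y)                    # centered y has mean 0

def loss_lasso(w, lam):
    # squared errors plus a penalty on the absolute value of the slopes
    return np.sum((y_centered - X @ w) ** 2) + lam * np.sum(np.abs(w[1:]))

w_guess = np.zeros(3)
settings = dict(xtol=1e-8, ftol=1e-8, maxiter=5000, maxfun=5000, disp=False)
w_ols = optimize.fmin(loss, w_guess, **settings)
w_lasso = optimize.fmin(loss_lasso, w_guess, args=(50.0,), **settings)
print(w_ols, w_lasso)
```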

Questions you should be able to answer before continuing:

  • minimize loss-lasso(w,0); which coefficients \(w\) do you find?
  • are they identical to \(w\) minimizing loss(w)? Why (not)?
  • try w_guess = tf.zeros([3]) and w_guess = tf.zeros([3,1]) in the code. Do both of these work as well?
  • generate the data with constant and slope2 not equal to zero. Then estimate the OLS and lasso coefficients.

Causality

Fork

Topics we cover in this video:

  • generate our own data where \(Z\) causes \(X\) and \(Y\) but there is no causal link between \(X\) and \(Y\)
  • create a pandas dataframe with the three variables \(x,y,z\) using pd.DataFrame and specify a dictionary of the form: {'column name':variable name}
  • generate OLS results from the regression \(y = b_0 + b_x x\) using statsmodels (again: you do not need to know statsmodels for this course)
  • this regression shows that \(b_x\) is significantly different from 0
    • this is correct: there is a strong correlation between \(x\) and \(y\)
  • you may be tempted to interpret this as a causal effect of \(x\) on \(y\)
    • this is not correct: the way we generated the data in python clearly shows that there is no causal effect of \(x\) on \(y\)
  • A fork can be easily solved: just run the regression \(y = b_0 + b_x x + b_z z\) and this will show the unbiased estimate of \(b_x\): in our model we find that after controlling for \(z\), there is no significant effect anymore of \(x\) on \(y\).
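
A sketch of a fork with made-up coefficients. The helper ols is hypothetical and uses np.linalg.lstsq instead of statsmodels and pandas (our swap), so it returns point estimates only and no standard errors.

```python
import numpy as np

def ols(y, *regressors):
    """Return [constant, slope_1, slope_2, ...] from a least-squares fit."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(seed=7)
n_obs = 5000
z = rng.normal(size=n_obs)
x = z + rng.normal(size=n_obs)      # Z causes X
y = z + rng.normal(size=n_obs)      # Z causes Y; X does not enter

b_short = ols(y, x)                 # y = b_0 + b_x x
b_long = ols(y, x, z)               # y = b_0 + b_x x + b_z z
print("b_x without z:", b_short[1])
print("b_x with z:   ", b_long[1])
```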

Questions you should be able to answer before continuing:

  • Write \(y = \beta_0 + \beta_x x + \beta_z z + \varepsilon\); then in the video we consider the case with \(\beta_x = 0\)
    • now program your data such that \(\beta_x \neq 0\)
    • first run the regression \(y = b_0 + b_x x\): do you find \(b_x = \beta_x\)?
    • then run the regression \(y = b_0 + b_x x + b_z z\): how do \(b_x\) and \(\beta_x\) compare now?

Pipe

Topics we cover in this video:

  • we generate data where \(X\) causes \(Z\) and \(Z\) causes \(Y\)
  • hence there is a causal effect of \(X\) on \(Y\) (via \(Z\))
  • running a regression \(y = b_0 + b_x x + b_z z\) shows that \(b_x\) is not significantly different from 0
    • hence you could incorrectly infer that \(X\) has no causal effect on \(Y\) (but actually it does in the data that we generated)
  • this regression shows that after controlling for \(Z\), \(X\) has no (additional) effect on \(Y\)
  • hence with a fork the regression \(y = b_0 + b_x x + b_z z\) suggests the correct causal interpretation of \(X\) on \(Y\) but with a pipe this regression gives the wrong impression of the causal effect of \(X\) on \(Y\): so which one should you use in practice?
  • your knowledge of the world should help you figure out whether you are in a fork or pipe "situation" and hence which regression gives the correct suggestion of causal effects.
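
A sketch of a pipe with made-up coefficients, showing the regression that controls for \(Z\). As an assumption on our side, the hypothetical helper ols uses np.linalg.lstsq instead of statsmodels.

```python
import numpy as np

def ols(y, *regressors):
    """Return [constant, slope_1, slope_2, ...] from a least-squares fit."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(seed=8)
n_obs = 5000
x = rng.normal(size=n_obs)
z = x + rng.normal(size=n_obs)      # X causes Z
y = z + rng.normal(size=n_obs)      # Z causes Y, so X affects Y via Z

b = ols(y, x, z)                    # y = b_0 + b_x x + b_z z
print("b_x:", b[1], "b_z:", b[2])
```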

Questions you should be able to answer before continuing:

  • with the data generated in the video, run the regression \(y = b_0 + b_x x\): does this provide the correct value for \(b_x\)?
  • generate data using \(y = \beta_0 + \beta_x x + \beta_z z + \varepsilon\) with \(\beta_x \neq 0\). Which regression gives the correct size of the causal effect of \(X\) on \(Y\)? Can you determine this correct size analytically?

Collider

Topics we cover in this video:

  • we create a dataset with a collider
  • we run the regression of \(Y\) on \(X\) and \(Z\) and find a negative effect of \(X\) on \(Y\)
    • this is puzzling because in our data \(X\) has a positive effect on \(Z\) and \(Z\) has a positive effect on \(Y\)
    • hence where is the negative effect coming from?
  • when running a regression like \(y = b_0 + b_x x + b_z z\), the interpretation of \(b_x\) is the effect of \(x\) for a given value of \(z\)
  • hence we plot the relation between \(X\) and \(Y\) for a given value of \(Z\): this scatter plot reveals a negative correlation
  • when controlling for parent's education \(Z\), a well educated grandparent must have lived in a neighborhood that is not so great, while a grandparent (with the same \(Z\), educational achievement of the parent) who has low education her/himself must have lived in a great neighborhood. The former grandparent's grandchild lives in the same bad neighborhood and has low educational achievement, while the latter grandchild's educational achievement is boosted by the good neighborhood they live in.
  • this explains the negative correlation between \(X\) and \(Y\) controlling for \(Z\)
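
A sketch of a data set with a collider. We assume that the neighborhood effect \(U\) affects both the parent's education \(Z\) and the grandchild's education \(Y\), and that \(X\) has no direct effect on \(Y\); all coefficients are made up and need not match the video. The hypothetical helper ols uses np.linalg.lstsq instead of statsmodels.

```python
import numpy as np

def ols(y, *regressors):
    """Return [constant, slope_1, slope_2, ...] from a least-squares fit."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(seed=9)
n_obs = 5000
u = rng.normal(size=n_obs)                       # neighborhood effect U
x = rng.normal(size=n_obs)                       # grandparent's education X
z = 1.0 * x + 2.0 * u + rng.normal(size=n_obs)   # parent's education Z
y = 1.0 * z + 2.0 * u + rng.normal(size=n_obs)   # grandchild's education Y

b = ols(y, x, z)                                 # U is left out of the regression
print("b_x:", b[1], "b_z:", b[2])
```

With these assumptions the estimate of \(b_x\) is negative even though every effect in the data generation is positive.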

Questions you should be able to answer before continuing:

  • run the regression of \(Y\) on \(X\) only: what effect do you find?
  • include the neighborhood effects \(U\) in the dataframe and run the regression of \(Y\) on \(X, Z\) and \(U\): what effects do you find?

Tensors

Introduction

Topics we cover in this video:

  • we create a "normal" data set with variables like gdp, inflation, unemployment
    • this is basically a matrix: 2 dimensional data
    • columns are the variables and rows the observations (e.g. countries with cross section data, or time for a given country in time series data or a combination of countries over time in panel data)
  • we download the mnist data set which consists of images of handwritten numbers
    • that is, one observation is a handwritten number in 2 dimensions
    • hence the data is 3 dimensional: the training data consists of 60,000 observations where each observation is a two dimensional image
    • tensors allow us to work with data in higher dimensions than two

Questions you should be able to answer before continuing:

  • what is the dimension of train_labels?
  • use inflation.shape to see that inflation is a two dimensional tensor [hint: the command returns 2 numbers]
  • but what is the dimension of inflation as a vector? [hint: the distinction between dimensions as a vector and as a tensor is confusing at first, but you will get used to it]
  • check what train_labels[4] is.

Creating tensors with numpy

Topics we cover in this video:

  • create a tensor in numpy using the .reshape method
  • using the .shape and .ndim attributes to determine what the shape of the tensor is and how many dimensions it has
  • a 100-dimensional column vector \(x\) turns out to have dimension 1 as a tensor
  • add a new dimension to a tensor using np.newaxis
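
A short sketch of these commands; the array sizes are arbitrary examples.

```python
import numpy as np

x = np.arange(100)                     # vector with 100 entries
print(x.shape, x.ndim)                 # (100,) 1

t = np.arange(24).reshape(2, 3, 4)     # same kind of data, arranged in 3 dimensions
print(t.shape, t.ndim)                 # (2, 3, 4) 3

x_col = x[:, np.newaxis]               # add a new dimension at the end
print(x_col.shape, x_col.ndim)         # (100, 1) 2
```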

Questions you should be able to answer before continuing:

  • create a vector y = np.arange(120) and define y5 = y.reshape(1,2,3,4,5)
    • what is the shape of y5? And how many dimensions does it have?
    • to get a sense of what y5 looks like, try things like y5[:,0,0,0,0] and y5[0,0,:,0,0] and y5[0,0,:,:,0]
    • you can also evaluate y5 itself and you get the sense of a matrix of matrices; pay attention to the square brackets [] to see how python delineates dimensions in its output when evaluating y5

Broadcasting

Topics we cover in this video:

  • numpy matrix multiplication using the @ operator
  • multiplying tensors using broadcasting
  • multiplication, addition etc. of tensors is done element by element
  • if the two tensors do not have the same shape, numpy uses broadcasting to get the tensors into the same shape
  • broadcasting rules are:
    • start at the last dimension of the two tensors
    • these two dimensions are compatible if
      • either they are equal
      • or if they are not equal, at least one of them equals 1
    • if this is satisfied, move a dimension "to the left" and do the same check
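
The rules above can be checked with a few small arrays (arbitrary examples, not the ones from the notebook):

```python
import numpy as np

M = np.arange(6).reshape(2, 3)
v = np.ones(3)
print(M @ v)                 # matrix multiplication: [ 3. 12.]
print((M * v).shape)         # element by element with broadcasting: (2, 3)

a = np.arange(3).reshape(3, 1)
b = np.arange(4)
print((a + b).shape)         # shapes (3, 1) and (4,) are compatible: (3, 4)

c = np.arange(3)
try:
    c + b                    # last dimensions are 3 and 4: not compatible
    compatible = True
except ValueError:
    compatible = False
print(compatible)            # False

fixed = c[:, np.newaxis] + b # shape (3, 1) against (4,) works again
print(fixed.shape)           # (3, 4)
```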

Questions you should be able to answer before continuing:

  • create two tensors yourself, e.g. using np.arange and .reshape, and try to add them together; create tensors where this does not work and then use np.newaxis to make the tensors compatible; check that you did this correctly by multiplying them and python should not throw an error
  • check the examples in the notebook and predict –before running the code– whether they can be broadcast together or not

slicing and fancy indexing

Topics we cover in this video:

  • we create a 2-dimensional tensor \(x\)
  • then we select the first element as x[0,0], the last element as x[-1,-1]
  • x[:,a:b:c] means we want all rows and then columns starting with index a up to (but not including) index b and we take steps c. If no value of a is specified we start at index 0, if no b is specified we go to the last element and if no c is specified we take step 1
  • c = -1 implies that we reverse the order
  • fancy indexing uses boolean masks to select entries from a tensor: x[x>0] selects the positive elements out of the tensor x
  • this can be used to plot a function where the color varies with the value of the function
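
A sketch of these selections on a small array of our own choosing:

```python
import numpy as np

x = np.arange(-5, 7).reshape(3, 4)
# x is
# [[-5 -4 -3 -2]
#  [-1  0  1  2]
#  [ 3  4  5  6]]

print(x[0, 0], x[-1, -1])    # first and last element: -5 6
print(x[:, 1:4:2])           # all rows, columns 1 and 3
print(x[:, ::-1])            # all rows, columns in reverse order
print(x[x > 0])              # boolean mask: [1 2 3 4 5 6]
```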

Questions you should be able to answer before continuing:

  • the slicing and indexing questions in the notebook should now be straightforward to do
  • create a 4 dimensional tensor and make selections out of the 4 dimensions

First neural network

  • After all the work of understanding tensors, the notebook presents a first example of a neural network. At this point you do not need to worry about the syntax (from keras) with which we build the neural network. Just go through the notebook and run the cells. The independent variable (\(x\) variable, if you like) is now a two-dimensional figure: a handwritten number. The labels are the number that was handwritten in the figure. The network tries to predict this label based on the figure. The notebook takes you through the steps and checks accuracy.
  • the \(x\) variable in the train set is called train_images which is a 3-dimensional tensor of shape \((60000, 28, 28)\). You cannot work with such \(x\) variables in an OLS regression, but the neural network has no problems with this.

Overfitting and underfitting

generating the data

Topics we cover in this video:

  • when doing OLS, adding variables never worsens the fit on the data you estimate on: the mean squared error (mse) on that data falls or stays the same
  • there are two dangers when you keep adding variables to a regression:
    • as discussed above: you start to misinterpret the results in terms of causality
    • although mse falls, your predictions become worse at some point
  • overfitting is the situation where you add so many variables that your predictions start to suffer
  • with underfitting you did not add enough relevant variables and your predictions are less than optimal as well
  • you get an idea of the over/underfitting of a model by splitting your data into a train and test data set
    • you estimate (train) your model on the train data
    • and evaluate your model on the test data
  • we use tf.keras.losses.mse to evaluate the model; we provide this function with two variables: the observed \(y\) values in the train data (df_train['y']) and the model prediction on the train data: model.predict(df_train)

Questions you should be able to answer before continuing:

  • change the definition of the variable \(y\) by adding terms \(x^3,x^4\) to it. Then run the code again and see what happens to the development of the mse.
  • change other parameters like N_observations and train_size and see the effects on the development of mse and with the code of the next video on performance on the test set.

overfitting

Topics we cover in this video:

  • again we use tf.keras.losses.mse to evaluate the model but now with the observed \(y\) values in the test data (df_test['y']) and the model prediction on the test data: model.predict(df_test)
  • adding additional (irrelevant) variables basically destroys the prediction performance of the model on the test data
  • in the video the mse for the test data is minimized by including the correct variables in the model (which we know as we generated the data ourselves)
  • plotting the higher order models in \((x,y)\) space shows that they try to capture idiosyncratic features of the train data that do not generalize to the test data: this is why their prediction performance deteriorates by including more variables
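
A sketch of the train/test comparison. We fit polynomials of increasing degree with np.polyfit and compute the mse by hand, in place of the model and tf.keras.losses.mse used in the video (our swap). The data generating process, N_observations and train_size are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=11)
N_observations, train_size = 60, 30            # hypothetical values
x = rng.uniform(-1, 1, size=N_observations)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.5, size=N_observations)

x_train, y_train = x[:train_size], y[:train_size]
x_test, y_test = x[train_size:], y[train_size:]

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

degrees = list(range(1, 9))                    # highest power of x in the model
mse_train, mse_test = [], []
for d in degrees:
    coefs = np.polyfit(x_train, y_train, deg=d)       # estimate on train data
    mse_train.append(mse(y_train, np.polyval(coefs, x_train)))
    mse_test.append(mse(y_test, np.polyval(coefs, x_test)))

for d, tr, te in zip(degrees, mse_train, mse_test):
    print(d, round(float(tr), 3), round(float(te), 3))
```

The train mse cannot rise when a higher power is added; whether the test mse rises depends on the data, which is what the comparison is meant to reveal.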

Questions you should be able to answer before continuing:

  • suppose you estimate a model on the train data and get great results on the test data. How can you be sure that this is no coincidence of the way you split the data into train and test data?
  • is the following estimation procedure ok?
    • fix the train and test data
    • estimate a first version of the model on the train data and evaluate the result on the test data
    • then add new variables/delete some variables, change the functional form of the equation you estimate (and later on, change the hyper parameters of the model), estimate on the train data and evaluate on the test data
    • keep repeating this till you have minimized the mse on the test data

Neural network

perceptron

Topics we cover in this video:

  • we use Stephen Marsland's code for the perceptron to understand how weights in a neural network are updated
  • weight \(w_0\) is updated according to the rule: \(w_0 = w_0 - \eta(\hat t -t)x\) where \(\eta\) denotes the learning rate, \(\hat t\) is the current prediction of our neural network and \(t\) is the true (correct) target for the point \(x\). Hence, if \(\hat t = t\) there is no need to update \(w_0\) as far as point \(x\) is concerned.
  • we illustrate with a simple graph how this updating of weights improves the prediction of our model
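
A minimal sketch of the updating rule, applied to all weights at once. This is not Stephen Marsland's code: the data (logical OR), the constant input of 1 for the first weight, the starting weights and the learning rate are our own assumptions.

```python
import numpy as np

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 1, 1, 1], dtype=float)        # logical OR

# add a constant input of 1, so that w[0] acts as the intercept
X = np.column_stack([np.ones(len(inputs)), inputs])
w = np.zeros(3)
eta = 0.25                                           # learning rate

def predict(X, w):
    return (X @ w > 0).astype(float)

for epoch in range(10):
    for x_point, t in zip(X, targets):
        t_hat = float(x_point @ w > 0)               # current prediction
        w = w - eta * (t_hat - t) * x_point          # no change if t_hat equals t

print(w, predict(X, w))
```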

Questions you should be able to answer before continuing:

  • in the video we consider the case where \(w_1 > 0\); check that this updating works as well in case \(w_1 < 0\)
  • another way to update the line in the video is to leave \(w_0\) unchanged and adapt \(w_1\); check that the correction works as well for \(w_1\)
  • do the section on the multi-layer perceptron in the notebook; if you have trouble downloading the data, check the video on tensorflow classification below.
  • a great way to get some intuition on the workings of a neural network is to go to the playground
  • if you need a break and want to have some fun with machine learning, go and doodle

tensorflow regression

the math

Topics we cover in this video:

  • define a tensorflow variable \(z\) with starting value 0: z = tf.Variable(0.0)
  • define a function using tensorflow variables and other variables
  • calculate the derivative of a function w.r.t. a tensorflow variable using tf.GradientTape() and the .gradient method
  • using this derivative to update the tensorflow variables to minimize a function
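
A sketch of the updating idea in plain python. In the video the derivative comes from tf.GradientTape(); here we approximate it numerically so that the example runs without tensorflow (our swap). The function f, the learning rate and the number of steps are made up.

```python
def f(z):
    return (z - 3.0) ** 2 + 1.0        # minimized at z = 3

def derivative(func, z, h=1e-6):
    # central difference approximation of the derivative of func at z
    return (func(z + h) - func(z - h)) / (2 * h)

z = 0.0                                # starting value, as in tf.Variable(0.0)
learning_rate = 0.1
for step in range(200):
    grad = derivative(f, z)
    z = z - learning_rate * grad       # plays the role of z.assign_sub(learning_rate * grad)

print(z)
```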

Questions you should be able to answer before continuing:

  • define another function and use the method described in the video to minimize this function
  • adapt the method to maximize a function (in particular the part z.assign_sub) and use it to maximize the function \(f(x,y) = 10 - x^2 - y^2\).

the regression

Topics we cover in this video:

  • define a function to generate our own train data
  • use tensorflow functions tf.square and tf.reduce_mean to define a loss function of the difference between true data \(y\) and our prediction of \(y\)
  • minimize this loss function to find the OLS estimates of the slope and intercept.
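
A sketch of the same regression in numpy. The version of make_noisy_data below is our own hypothetical one and may differ from the function in the video; np.mean and ** 2 replace tf.reduce_mean and tf.square, and the gradients are written out by hand instead of coming from tf.GradientTape() (our swaps).

```python
import numpy as np

def make_noisy_data(m=0.1, b=0.3, n=200, seed=0):
    """Hypothetical data generator: y = m x + b + noise."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0, 1, size=n)
    noise = rng.normal(0, 0.01, size=n)
    return x, m * x + b + noise

x_train, y_train = make_noisy_data()

m_hat, b_hat = 0.0, 0.0
learning_rate = 0.1
for step in range(500):
    residual = y_train - (m_hat * x_train + b_hat)
    loss = np.mean(residual ** 2)                 # mean squared difference
    grad_m = -2 * np.mean(residual * x_train)     # derivative of loss w.r.t. m_hat
    grad_b = -2 * np.mean(residual)               # derivative of loss w.r.t. b_hat
    m_hat -= learning_rate * grad_m
    b_hat -= learning_rate * grad_b

print(m_hat, b_hat, loss)
```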

Questions you should be able to answer before continuing:

  • use the function make_noisy_data defined in the video to generate other data and estimate the slope and intercept of this data
  • extend the function make_noisy_data to allow for \(y =b+ m_x x + m_z z + e\) and adapt the procedure to estimate \(b,m_x\) and \(m_z\)

tensorflow classification

getting the data ready

Topics we cover in this video:

  • getting the data using urllib.request
  • using pandas' read_csv() to read the data
  • using .replace() to replace the flower names (strings) by numbers (integers)
  • normalizing the features of the data set
  • plot the data with different colors for each flower type

Questions you should be able to answer before continuing:

  • plot the data for all combinations of features (feature 0 and 1; 0 and 2 etc.) to see which dimensions seem most helpful to classify the data into the different flower types
  • compare the data normalization steps that we do here with the ones used in the notebook in the section Multi-layer perceptron: which parts are the same, which differ?

estimating the network

Topics we cover in this video:

  • split the data into train and test set
  • specify the network using the keras syntax
  • we use two layers with 'relu' activation and a final layer with 'softmax'; this gives us prediction probabilities over the 3 flower types in our data
  • we then compile the model specifying the optimizer, loss function and the metrics we would like to see during the fitting stage
  • we fit the model using the train data (features and targets) and we specify the number of epochs
    • as the number of epochs increases, the loss on the train data falls, but this can lead to over-fitting; later we will see how you can determine the optimal number of epochs (avoiding both over-fitting and under-fitting)
  • we evaluate the model on the test set.

Questions you should be able to answer before continuing:

  • which mistake is made in the video when splitting the data into a train and test set? You can increase the 'epochs' to improve the fit on the train data, but the evaluation on the test set will not really improve. [hint: in the section Multi-layer perceptron in the notebook we use the iris data for the first time. Carefully check the steps we take there: which one did we miss here? check the data to see why this step matters]
  • increase the number of epochs and compare the fit on the train data with the fit on the test data

Back to our first neural network

defining the network and fitting it

Topics we cover in this video:

  • loading the mnist data
  • normalizing our variable
  • defining the model using keras.Sequential for the different layers
  • using activations relu and softmax
  • in the compile step we specify the optimizer, the loss function and other metrics that we want to see when the model is fit to the data
  • finally we fit the model

Questions you should be able to answer before continuing:

  • when do you use relu and when softmax activations?
  • what is a Dense layer?
  • what is an epoch?

checking the fit

Topics we cover in this video:

  • how to evaluate your fitted model on the test data
  • with a classification model the prediction is an array with probabilities
  • the highest probability in this array gives the most likely label for the observation
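
A sketch of how an array of probabilities turns into a predicted label. The scores are made-up numbers for two observations and three classes, not output of the network in the video, and softmax is written out by hand.

```python
import numpy as np

def softmax(scores):
    # subtract the row maximum for numerical stability, then normalize
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0]])                  # made-up values
probabilities = softmax(scores)                       # each row sums to 1
predicted_labels = np.argmax(probabilities, axis=1)   # position of the highest probability
print(predicted_labels)                               # [0 2]
```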

Questions you should be able to answer before continuing:

  • compare the prediction with the label for 5 different test observations.

number of epochs and overfitting

Topics we cover in this video:

  • use the history of model.fit to see the model's performance as a function of the number of epochs
  • plot the loss on the train data and the loss on the validation (or test) data
  • the number of epochs where the validation loss "levels off" is the right number of epochs to use
  • the loss on the train data keeps falling with the number of epochs beyond this point, but this is due to overfitting

Questions you should be able to answer before continuing:

  • make a similar plot for model accuracy and the number of epochs
  • experiment with the network architecture to see how this affects the optimal number of epochs:
    • increase the number of nodes in a layer
    • increase the number of layers in the network
  • specify a model that clearly overfits the data

Treatment effects

IV

generating our data

Topics we cover in this video:

  • we generate data with no direct (causal) effect of education on wage
  • with an OLS estimation we find a positive and significant effect of education on the wage rate
  • hence OLS is not the correct estimator to find the causal effect of education on wage

Questions you should be able to answer before continuing:

  • define a function that generates the data, runs the OLS and returns the results as a function of the parameters alpha_w, alpha_e, beta_ew, beta_qe.
  • for different values of the parameters, see what the OLS result is; e.g. what happens to the OLS estimation of the effect of education on wage in case alpha_ew equals 1.0?

IV estimate

Topics we cover in this video:

  • using IV we correctly identify the causal effect of education on wage
  • the first stage correctly captures the effect of the instrument q on education
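
A sketch of OLS versus IV on simulated data. The data generating process (an unobserved variable ability that drives both education and wage, and an instrument q that only affects education) and all coefficients are our assumptions and need not match the video. The hypothetical helper ols uses np.linalg.lstsq instead of statsmodels.

```python
import numpy as np

def ols(y, *regressors):
    """Return [constant, slope_1, slope_2, ...] from a least-squares fit."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(seed=12)
n_obs = 10000
ability = rng.normal(size=n_obs)                  # unobserved confounder
q = rng.normal(size=n_obs)                        # instrument
education = q + ability + rng.normal(size=n_obs)
true_effect = 0.0                                 # no causal effect of education
wage = true_effect * education + ability + rng.normal(size=n_obs)

b_ols = ols(wage, education)[1]                   # biased by ability

first_stage = ols(education, q)                   # effect of q on education
education_hat = first_stage[0] + first_stage[1] * q
b_iv = ols(wage, education_hat)[1]                # second stage

print("OLS:", b_ols, "IV:", b_iv)
```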

Questions you should be able to answer before continuing:

  • which properties of q make it a valid instrument?
  • generate data with alpha_ew equal to 1; can the IV estimation correctly identify this parameter?

Heterogeneous treatment effects

generating our data

Topics we cover in this video:

  • use of np.ones_like, np.zeros_like
  • using dictionaries to define effects and functions for different groups
  • python can loop over a list of strings (names for the different subgroups)
  • we avoid copy/paste of code for different groups by using dictionaries together with a function
    • if we change something, we only need to change it once in our code (i.e. not change it for each group which we would have to do if you copy/paste your code)
  • with np.concatenate we "glue" the vectors for the groups together in columns for the dataframe
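
A sketch of the dictionary approach. The group names, group sizes, effects and the earnings equation are hypothetical and only illustrate the technique; they are not the ones from the video.

```python
import numpy as np

rng = np.random.default_rng(seed=13)

groups = ["compliers", "always_takers", "never_takers"]     # hypothetical names
n = {"compliers": 500, "always_takers": 200, "never_takers": 300}
tau = {"compliers": 2.0, "always_takers": 1.0, "never_takers": 0.0}
takes_training = {
    "compliers": lambda invited: invited.copy(),             # train only if invited
    "always_takers": lambda invited: np.ones_like(invited),  # always train
    "never_takers": lambda invited: np.zeros_like(invited),  # never train
}

def make_group(name):
    # one function for all groups: a change here applies to every group
    invited = rng.integers(0, 2, size=n[name])
    trained = takes_training[name](invited)
    earnings = 10.0 + tau[name] * trained + rng.normal(size=n[name])
    return invited, trained, earnings

columns = [make_group(name) for name in groups]              # loop over the names
invited = np.concatenate([c[0] for c in columns])
trained = np.concatenate([c[1] for c in columns])
earnings = np.concatenate([c[2] for c in columns])
print(invited.shape, trained.shape, earnings.shape)
```

The three concatenated vectors can then serve as columns of a dataframe.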

Questions you should be able to answer before continuing:

  • generate another dataframe df2 with different values for \(\beta,\tau\) and/or \(n\)
  • do the analyses below as well for this dataframe and compare results to the analysis with df

what can we calculate?

Topics we cover in this video:

  • with heterogeneous effects, comparing expected earnings with and without training does not give us a straightforward training effect

Questions you should be able to answer before continuing:

  • compare expected earnings of individuals with and without an invitation to the training. Does this identify the training effect? [hint: use df[df.invited==1].earnings etc.]
  • do the same when comparing the group (trained and invited) with the group (not trained and not invited)

three cases where we can identify the treatment effect

Topics we cover in this video:

  • three scenarios where we can recover the relevant treatment effect
  • calculate a conditional probability with np.sum over a dataframe column
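
A sketch of a conditional probability computed with np.sum. The two small arrays are made up and stand in for dataframe columns.

```python
import numpy as np

invited = np.array([1, 1, 1, 0, 0, 1, 0, 1])
trained = np.array([1, 0, 1, 0, 0, 1, 0, 0])

# P(trained | invited): trained individuals among the invited, divided by all invited
prob_trained_given_invited = np.sum(trained[invited == 1]) / np.sum(invited == 1)
print(prob_trained_given_invited)   # 0.6
```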

Questions you should be able to answer before continuing:

  • why does the equation of Angrist and Pischke (2009) not work if there are always takers?

Probability of treatment

generating our data

Topics we cover in this video:

  • using dictionaries we generate a dataframe for different types
  • we model a nudge where receiving an explicit invitation increases the probability that the training is finished successfully

Questions you should be able to answer before continuing:

  • calculate using the dataframe the effect of the invitation on the training probability; i.e. the mean of trained conditional on being invited minus this mean conditional on not being invited. Check that this is close to 0.4.

effect of training on earnings

Topics we cover in this video:

  • using the dataframe we determine the effect of training on earnings without observing which individuals successfully finished their training
  • we only observe who received a nudge by being explicitly invited to the training

Questions you should be able to answer before continuing:

  • use results_second_stage.params to see what this returns exactly and what is selected by params[1]

Part 2

For part 2 of the screencasts, go to this page.

Author: Jan Boone