Seminar Data Science for Economics

This website contains the material for the MSc course Seminar Data Science for Economics

This website is under construction for 2019/2020

This year the course is taught by Jan Boone and Madina Kurmangaliyeva.

Course description

In the course Data Science for Economics you will learn how to deal with big (unstructured) data to answer policy questions using recent advances in machine learning.

Whether you are a public policy-maker or a data analyst at a private firm, you may often be tasked with providing answers to inherently causal questions. For example: "What is the optimal pricing and price differentiation scheme for insurance products?" or "Who should receive state benefits?". In such cases you want to use large individual-level data to provide convincing policy prescriptions. However, this poses several challenges, such as obtaining the data, cleaning it, collaborating with your co-workers, and, most importantly, deciding which (out of many) variables to use in your analysis.

This course will help you to overcome those challenges. The "traditional" econometrics you have learnt in previous classes provides you with a solid foundation for answering causal questions. The machine learning toolkit, which you will also learn to apply in this class, is primarily aimed at providing the best prediction rather than at answering causal questions. Combined, however, they will help you to work with high-dimensional datasets, where there are more variables than observations and you do not know at the start which variables and interaction terms to include.

In this course, we give an introduction to data-project management (e.g. git, Sublime), tensors (data with more than 2 dimensions), data simulation, neural networks, and cross-validation. The goal is to get you up to speed with new developments in data science applicable to economic analysis.

We use data simulation to get the main intuitions across, starting from the difference between correlation and causality, the use of instrumental variables, and how to deal with heterogeneous treatment effects. We cover the estimation of neural networks using training, test, and validation sets, cross-validation, and machine learning to find the best instrumental variables.
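As a minimal sketch of what we mean by data simulation (the data generating process and variable names here are made up for illustration, not taken from the course notebook): a confounder makes two variables correlate even though one has no causal effect on the other.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# ability confounds schooling and wage: it drives both
ability = rng.normal(size=n)
schooling = ability + rng.normal(size=n)   # schooling has NO causal effect on wage here
wage = 2 * ability + rng.normal(size=n)

# yet schooling and wage are strongly correlated
print(np.corrcoef(schooling, wage)[0, 1])  # roughly 0.6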

Finally, we will cover post-selection inference for causal effects (as in Belloni, Chernozhukov, and Hansen 2013).

In terms of software, we will be using Python, Google's TensorFlow, and R. Students must have followed Econometrics 1 and the Python track in AEA 1.

For this course we use the following resources:

Organisation of the course

Lecture schedule

| week | day | date       | time        | teacher | room      | topics                                   | datacamp                   |
|------|-----|------------|-------------|---------|-----------|------------------------------------------|----------------------------|
| 5    | Tue | 2020-01-28 | 10:45–12:30 | Jan     | CubeZ 16  | distributions, bootstrapping             | statistical simulation 1,2 |
| 5    | Thu | 2020-01-30 | 10:45–12:30 | Jan     | CubeZ 221 | doing your own OLS                       | statistical simulation 3,4 |
| 6    | Tue | 2020-02-04 | 10:45–12:30 | Jan     | CubeZ 223 | causality                                | keras 1,2                  |
| 6    | Thu | 2020-02-06 | 10:45–12:30 | Jan     | CubeZ 16  | tensors, first neural network            | keras 3,4                  |
| 7    | Tue | 2020-02-11 | 10:45–12:30 | Jan     | CubeZ 16  | over/underfitting                        | tensorflow 1,2             |
| 7    | Thu | 2020-02-13 | 10:45–12:30 | Jan     | CubeZ 15  | neural network                           | tensorflow 3,4             |
| 8    | Tue | 2020-02-18 | 10:45–12:30 | Jan     | CubeZ 222 | neural network                           |                            |
| 8    | Thu | 2020-02-20 | 10:45–12:30 | Jan     | CubeZ 214 | treatment effects                        |                            |
| 10   | Tue | 2020-03-03 | 10:45–12:30 | Madina  | CubeZ 16  | regularization                           |                            |
| 10   | Thu | 2020-03-05 | 10:45–12:30 | Madina  | CubeZ 212 | post-regularization for causal inference |                            |
| 11   | Tue | 2020-03-10 | 10:45–12:30 | Madina  | CubeZ 222 | decision trees                           |                            |
| 11   | Thu | 2020-03-12 | 10:45–12:30 | Madina  | CubeZ 212 | boosting, bagging, random forest         |                            |
| 12   | Tue | 2020-03-17 | 10:45–12:30 | Madina  | CubeZ 17  | causal trees                             |                            |
| 12   | Thu | 2020-03-19 | 10:45–12:30 | Madina  | CubeZ 16  | double machine learning                  |                            |
| 13   | Tue | 2020-03-24 | 10:45–12:30 | Madina  | CubeZ 223 | data collection                          | regex, scraping            |
| 13   | Thu | 2020-03-26 | 10:45–12:30 | Madina  | CubeZ 215 | data cleaning                            | tidyverse                  |
  • the "topics" column is an indication of what will be discussed in each week; we will see how fast we go
  • for the first part, taught by Jan, we will go through this notebook
  • the speed at which we go through the notebook will be faster than with AEA 1; hence, prepare for class by going through the notebook beforehand
  • for the second part, taught by Madina, we will go through a series of lectures and tutorials based on the textbook (see above) and the latest techniques in the field of machine learning for causality (e.g., double machine learning, causal trees).
  • if everything goes according to plan, we will dedicate a week for learning data collection and data processing skills to prepare you for the final assignment.

First Lecture

Assignment 1

Do the following four steps:

  • if you did AEA 1, you already have a github account, otherwise create a github account
  • fill in this google form before Friday 7 February 2020
    • sign in with your @tilburguniversity.edu email address and password
  • go to
    • jupyter lab
      • IT suggests that you use the Firefox browser to access jupyter lab
      • sometimes it helps to access jupyter lab with an incognito/private window
    • or, if all else fails, you can use google's colab
  • create a new python notebook and type the following code in the first cell:
%%bash

git clone https://github.com/janboone/msc_datascience.git
  • then press the Shift and Enter keys at the same time
  • this creates a folder msc_datascience on the server that contains the material for the python part of the course
  • Note: you can only run this command once. If you run it again, you get an error since the folder already exists.
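If you want to refresh the material later without hitting that error, you can pull inside the existing folder instead of cloning again (a minimal sketch, using the same %%bash cell magic as above):

%%bash

# update the existing clone instead of re-cloning
git -C msc_datascience pull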

Final assignment

  • instructions for the final assignment can be found below.

Datacamp

From Datacamp, do the following courses for the first part of the course

A couple of notes on these datacamp courses:

The statistical simulation course starts with very simple statistical concepts, but things rapidly become more challenging. The focus of our seminar will not be on statistical simulation per se; we will use it to understand the properties of estimators. Hence, it is important to understand the "flow" of specifying a statistical process and then repeating it 10,000 times to understand its properties. You will also learn how to use numpy's statistical functions from numpy.random.

The point of this Datacamp course, for us, is to become comfortable with modelling data generating processes, not the specific applications considered in the course.
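As a minimal sketch of this simulate-then-repeat flow (the data generating process and numbers are made up for illustration):

import numpy as np

rng = np.random.default_rng(42)

# one "experiment": estimate p from a sample of 100 coin flips with true p = 0.3;
# repeating it 10,000 times reveals the sampling distribution of the estimator
estimates = rng.binomial(n=100, p=0.3, size=10_000) / 100

print(estimates.mean())  # close to the true p = 0.3
print(estimates.std())   # simulated standard error, close to sqrt(0.3 * 0.7 / 100) ≈ 0.046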

You may not have seen the get method of a dictionary. Here you see it in action in a simple example (borrowed from stackoverflow):

sentence="The quick brown fox jumped over the lazy dog."
characters={}

for character in sentence:
    characters[character] = characters.get(character, 0) + 1

print(characters)
{'T': 1, 'h': 2, 'e': 4, ' ': 8, 'q': 1, 'u': 2, 'i': 1, 'c': 1, 'k': 1, 'b': 1, 'r': 2, 'o': 4, 'w': 1, 'n': 1, 'f': 1, 'x': 1, 'j': 1, 'm': 1, 'p': 1, 'd': 2, 'v': 1, 't': 1, 'l': 1, 'a': 1, 'z': 1, 'y': 1, 'g': 1, '.': 1}

characters is a dictionary with as key a character (including "space") from the sentence and as value the number of times that character has occurred up till then. If a character appears for the first time, get cannot find it in the dictionary characters and returns the default value (here specified as 0). If the character has already appeared, say, 3 times before, get returns the value 3 and we add 1, so the new value equals 4.
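As an aside, the standard library's collections.Counter does this particular count in one line; the get idiom above is still worth knowing for general dictionaries:

from collections import Counter

sentence = "The quick brown fox jumped over the lazy dog."
print(Counter(sentence))  # same character counts as the get() loop above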

If you run into other functions that you are not familiar with, you can use "?", like in:

np.random.binomial?

Also, you can google!

Things to take away from this course:

  • how to use random variables in python
  • how to create samples out of a population (e.g. by using np.random.choice)
  • how to model statistical processes (data generating processes)
  • how to use resampling methods like bootstrapping (see the sketch after this list)
  • how to use permutation testing
  • how to use simulation for power analysis
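For example, a minimal bootstrap of the sample mean using np.random.choice (the data here is simulated for illustration):

import numpy as np

np.random.seed(0)
data = np.random.normal(loc=5, scale=2, size=200)  # pretend this is your observed sample

# resample with replacement 10,000 times, recomputing the mean each time
boot_means = [np.mean(np.random.choice(data, size=len(data), replace=True))
              for _ in range(10_000)]

# the spread of the bootstrap means approximates the standard error of the sample mean
print(np.std(boot_means))  # close to 2 / sqrt(200) ≈ 0.14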

This keras course is "hands on" and has a lot of applications. If you prefer a course with some more background on the math of neural networks, you can do this one instead.

Note that for this keras course Chapter 4 is fun but optional.

The tensorflow course gives some more background on the syntax used in tensorflow that we also use in class. All the keras commands you learn in the keras course are easily applied under tensorflow.
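As a minimal sketch of what that means in practice (assuming TensorFlow 2.x, where keras is bundled as tf.keras; the layer sizes here are arbitrary):

from tensorflow import keras

# the Sequential API from the keras course works unchanged under tf.keras
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()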

Deadlines

The deadline for the final assignment is: Friday June 19th 2020 at 23:59.

The resit deadline for the assignment is: Friday August 14th, 2020. Let us know by email that you have submitted your assignment for the resit.

Questions

If you have questions/comments about this course, go to the issues page, open a new issue (with the green "New issue" button), and type your question. Use an informative title (e.g. not "question", but "question about the second assignment"). Then go to the next box ("Leave a comment"), type your question, and click on "Submit new issue". We will answer your question as quickly as possible.

The advantages of the issue page include:

  • if you have a question, other students may have it as well; this way we answer it where everyone can see it. Before asking, you may also want to check whether your question was asked/answered before on the issue page
  • we answer your question more quickly than when you email us
  • you increase your knowledge of github!

Only when you need to include privately sensitive information ("my cat has passed away") should you send an email.

In order to post issues, you need to create a github account (which you need anyway to follow this course).

Note that if your question is related to another issue, you can react to the earlier issue and leave a comment in that "conversation".

Assessment material

We have a separate page with all relevant assessment material.

Final Assignment

  • You can do the final assignment alone or with at most one other student (i.e. the maximum group size is 2).
  • for the deadline of the python assignment, see Deadlines above
  • on Canvas we give you the link to the github repository with the assignment_notebook.ipynb
  • to submit your final assignment:
    • do not change the name of the assignment_notebook.ipynb notebook
    • fill in this google form
    • push the final notebook on the github classroom repository

What we are looking for

The idea of the assignment is that you report your findings in a transparent way that can easily be verified/reproduced by others. The intended audience is your fellow students. They should be able to understand the code you write together with the explanations that you give for this code.

The following ingredients will be important when we evaluate your assignment:

  • Find a "big dataset", e.g. on kaggle, but other sources are fine as well
    • download the data to your repository (in a separate folder "data") and
    • in your notebook create a link to the website of the data source
  • Start with a clear and transparent question.
  • Briefly motivate why this question is interesting.
  • Explain the methods that you use to answer the question.
    • are your methods based on correlations (only)?
    • do they allow you to make claims about causality?
  • Give the answer that you find (as a preview).
  • Mention the main assumptions that you need to get this answer.
  • Use graphs to introduce your data.
  • If you use equations, use latex to make them easy to read.
  • Explain your code; the reader (think of your fellow students) must be able to easily follow what you are doing.
  • How well does your model fit the data?
    • what methods do you use to evaluate this?
  • Present a clear conclusion/answer to your question.
  • Include some discussion of what you find and elements on which you need additional information.

Two remarks:

  • you can copy code from the web; but
    • make sure that you explain the code that you use so that another student of the course understands it and can use it;
    • give the reference of the code that you copy;
  • use common sense: it is not always necessary to have a full blown economic model, but we do expect you to think!
    • in the past we had students looking at the effect of age on income in sports. "Theory" suggests that this relation is hump-shaped: 5-year-olds and 80-year-olds tend not to earn a lot of money as elite athletes. The students presented a scatter plot with a clear hump shape, and then wrote "now we do a linear regression". For each step that you program, ask yourself why this step makes sense and then explain this in your notebook.

Resit of final assignment

The resit of the final assignment needs to be a new project compared to the one you handed in before. The easiest way to achieve this is to choose a new research question and a new data set. You can use the same data only if you make sure that the research question and analysis are sufficiently different from before.

Simply adjusting your first submission based on our feedback will not be enough.

Otherwise, follow the procedure above on how to submit the assignment and fill in the google form.

Authors: Jan Boone and Madina Kurmangaliyeva