Seminar Data Science for Economics

This website contains the material for the MSc course Seminar Data Science for Economics

This website is under construction for 2023/2024

This year the course is taught by: Jan Boone

Course description

In the course Data Science for Economics you will learn how to deal with big data and how to use simulations to answer policy questions using recent advances in machine learning.

Whether you are a public policy-maker or a data analyst at a private firm, you will often be tasked with answering inherently causal questions. For example: "What is the optimal pricing scheme for our products?" or "Who should receive state benefits?". In such cases you want to use large datasets to provide convincing policy prescriptions. However, there are several challenges along the way, such as obtaining the data, cleaning it, collaborating with your co-workers, and, most importantly, deciding which variables to use in your analysis.

This course will help you overcome those challenges. The "traditional" econometrics you have learned in previous classes provides you with a solid basis for answering causal questions. The machine learning toolkit, which you will also learn to apply in this class, is primarily targeted at providing the best prediction rather than at answering causal questions. Combined, however, the two help you work with high-dimensional datasets, where there can be more variables than observations and you do not know at the start which variables and interaction terms to include.

In this course, we give an introduction to data-project management, tensors (data with more than two dimensions), data simulation, neural networks, and cross-validation. The goal is to get you up to speed with new developments in data science applicable to economic analysis.

We use data simulation to get the main intuitions across, starting from the difference between correlation and causality, the use of instrumental variables, and how to deal with heterogeneous treatment effects. We also cover the estimation of neural networks using training, test, and validation sets, and Bayesian estimation techniques.
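To give a flavour of this simulation-based approach, here is a minimal sketch (our own illustration, not course material) of a data generating process in which x and y are correlated through an unobserved confounder z, even though x has no causal effect on y:

import numpy as np

np.random.seed(1)
n = 100_000

z = np.random.normal(size=n)      # unobserved confounder
x = z + np.random.normal(size=n)  # x is driven by z, but does not cause y
y = z + np.random.normal(size=n)  # y is also driven by z

# x and y are clearly correlated (around 0.5) although neither causes the other
print(np.corrcoef(x, y)[0, 1])

A regression of y on x would pick up the confounder rather than a causal effect; this is exactly the kind of pitfall that simulations make easy to see.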

In terms of software, we will be using Python, PyMC3, and Google's TensorFlow.

For this course we are very happy to partner with Datacamp: register for Datacamp

Required Prerequisites

Students should make sure they have followed Econometrics 1 and the mandatory course Methods: Python programming for economists.

Screencasts

For this course a series of screencasts is available: there are screencasts for part 1 and part 2.

Organisation of the course

| day | date       | time        | staff | room   | datacamp                                    | topics                                       |
|-----|------------|-------------|-------|--------|---------------------------------------------|----------------------------------------------|
| Thu | 2024-02-01 | 10:45-12:30 | Jan   | WZ 202 | statistical simulation 1-4                  | Introduction to the course                   |
| Thu | 2024-02-08 | 10:45-12:30 | Jan   | WZ 202 | keras 1-4                                   | part 1: distributions, bootstrapping         |
| Thu | 2024-02-22 | 10:45-12:30 | Jan   | WZ 202 | tensorflow 1-4                              | OLS, causality                               |
| Tue | 2024-02-27 | 08:45-10:30 | Jan   | WZ 202 |                                             | tensors, first neural network                |
| Thu | 2024-02-29 | 10:45-12:30 | Jan   | WZ 202 |                                             | over/under fitting, neural network           |
| Tue | 2024-03-05 | 08:45-10:30 | Jan   | WZ 202 |                                             | treatment effects                            |
| Thu | 2024-03-07 | 10:45-12:30 | Jan   | WZ 202 |                                             | part 2: dealing with data                    |
| Tue | 2024-03-12 | 08:45-10:30 | Jan   | WZ 202 |                                             | Bayesian statistics                          |
| Thu | 2024-03-14 | 10:45-12:30 | Jan   | WZ 202 | https://www.youtube.com/watch?v=TMmSESkhRtI | estimating a Bayesian model, missing data    |
| Tue | 2024-03-19 | 08:45-10:30 | Jan   | WZ 202 |                                             | Bayesian time series, Bayesian neural network |
| Thu | 2024-03-21 | 10:45-12:30 | Jan   | WZ 202 |                                             | confidence interval and Q&A                  |
  • we will see how fast we go; the column "topics" is an indication of what will be discussed each week
  • for the first part, you will go through this notebook
  • for the second part, through this one
  • there is also a Datacamp course on git: this is recommended but not mandatory

First Lecture

Assignment 1

Do the following steps:

  • if you did Methods: Python programming for economists, you already have a github account; otherwise, create a github account
  • go to
    • jupyter lab
      • IT suggests that you use the Firefox browser to access jupyter lab
      • sometimes it helps to access jupyter lab with an incognito/private window
    • or –if all else fails– you can use google's colab
  • create a new python notebook and type the following code in the first cell:
%%bash

git clone https://github.com/janboone/msc_datascience.git
  • then press the Shift key and Enter key at the same time
  • this creates a folder msc_datascience on the server, containing the material for the Python part of the course.
  • Note: you can only run this command once. If you run it again, you get an error since the folder already exists.
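  • if you later want to refresh the folder (for example, after we update the material), one option is to pull instead of clone; a cell along the following lines should work (a suggestion from us, not an official step):

%%bash

git -C msc_datascience pull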

Final assignment

  • instructions for the final assignment can be found below.

Datacamp

From Datacamp, do the following courses for the first part of the course:

A couple of notes on these datacamp courses:

The statistical simulation course starts with very simple statistical concepts, but things rapidly become more challenging. The focus of our seminar is not on statistical simulation per se; we use simulation to understand the properties of estimators. Hence, it is important to understand the "flow" of specifying a statistical process and then repeating it 10,000 times to learn about its properties. You also learn how to use numpy's statistical functions from numpy.random.

The point of this Datacamp course, for us, is to become comfortable with modeling data generating processes, not the specific applications considered in the course.
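As a minimal sketch of this "flow" (our own illustration, not taken from the Datacamp course): specify a data generating process, estimate a parameter on one simulated sample, and repeat this 10,000 times to see the sampling distribution of the estimator:

import numpy as np

np.random.seed(0)

def simulate_slope(n=100):
    """Draw one sample from the DGP y = 2x + noise and estimate the slope."""
    x = np.random.normal(size=n)
    y = 2.0 * x + np.random.normal(size=n)
    slope, intercept = np.polyfit(x, y, 1)  # least-squares fit of a line
    return slope

# repeat the process 10,000 times
estimates = np.array([simulate_slope() for _ in range(10_000)])
print(estimates.mean(), estimates.std())  # mean close to 2, standard deviation around 0.1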

You may not have seen the get method of a dictionary. Here you see it in action in a simple example (borrowed from stackoverflow):

sentence="The quick brown fox jumped over the lazy dog."
characters={}

for character in sentence:
    characters[character] = characters.get(character, 0) + 1

print(characters)

characters is a dictionary whose keys are the characters (including the space) in the sentence and whose values equal the number of times each character has occurred up to that point. If a character occurs for the first time, get cannot find it in the dictionary characters and returns the default value (here specified as 0). If the character has occurred, say, 3 times before, get returns the value 3 and we add 1, so the new value equals 4.

If you run into other functions that you are not familiar with, you can use "?", like in:

np.random.binomial?

Also, you can google, or use ChatGPT or Bard.

Things to take away from this course:

  • how to use random variables in python
  • how to create samples out of a population (e.g. by using np.random.choice)
  • how to model statistical processes (data generating processes)
  • how to use resampling methods like bootstrapping
  • how to use permutation testing
  • how to use simulation for power analysis
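To illustrate the resampling points above (again a sketch of ours, using only numpy): bootstrap a confidence interval for a sample mean with np.random.choice, and run a simple permutation test for a difference in means:

import numpy as np

np.random.seed(42)
sample = np.random.normal(loc=1.0, scale=2.0, size=200)  # pretend this is observed data

# bootstrap: resample with replacement and recompute the mean many times
boot_means = np.array([
    np.random.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(10_000)
])
print(np.percentile(boot_means, [2.5, 97.5]))  # 95% bootstrap interval for the mean

# permutation test: shuffle group labels to get the distribution under "no difference"
a = np.random.normal(loc=0.0, size=100)
b = np.random.normal(loc=0.3, size=100)
observed = b.mean() - a.mean()
pooled = np.concatenate([a, b])
perm_diffs = []
for _ in range(10_000):
    shuffled = np.random.permutation(pooled)
    perm_diffs.append(shuffled[100:].mean() - shuffled[:100].mean())
print((np.abs(perm_diffs) >= abs(observed)).mean())  # two-sided p-value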

This keras course is "hands on" and has a lot of applications. If you prefer a course with some more background on the math of neural networks, you can do this one instead.

Note that for this keras course Chapter 4 is fun but optional.

The tensorflow course gives some more background on the syntax used in tensorflow that we also use in class. All the keras commands you learn in the keras course are easily applied under tensorflow.
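To give an idea of the style of code involved (a minimal sketch of ours, not an excerpt from the Datacamp material), a small keras network for a regression problem looks roughly like this:

import numpy as np
import tensorflow as tf

# synthetic regression data: y depends linearly on three features plus noise
np.random.seed(0)
X = np.random.normal(size=(1000, 3)).astype("float32")
y = X @ np.array([1.0, -2.0, 0.5], dtype="float32")
y += 0.1 * np.random.normal(size=1000).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(3,)),
    tf.keras.layers.Dense(1),  # one output: the predicted y
])
model.compile(optimizer="adam", loss="mse")

# hold out 20% of the data to monitor over/under fitting while training
model.fit(X, y, epochs=10, validation_split=0.2, verbose=0)
print(model.evaluate(X, y, verbose=0))  # in-sample mean squared error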

For the second part of the course, you can do the following datacamp courses:

if you want to clone this repository in jupyter lab, run the following code on the server:

%%bash

git clone https://github.com/fonnesbeck/intro_stat_modeling_2017.git
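To get a feel for what pymc3 code looks like (a minimal sketch of ours, not taken from that repository), here is a Bayesian estimate of the mean and standard deviation of normally distributed data:

import numpy as np
import pymc3 as pm

np.random.seed(0)
data = np.random.normal(loc=1.0, scale=2.0, size=100)

with pm.Model():
    # priors for the unknown parameters
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    # likelihood of the observed data
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
    trace = pm.sample(1000, tune=1000, return_inferencedata=True)

print(pm.summary(trace))  # posterior means should be close to 1.0 and 2.0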

Deadlines

The deadline for the final assignment is: Friday June 14th 2024 at 23:59.

The resit deadline for the assignment is: Friday August 16th, 2024. Let us know by email that you have submitted your assignment for the resit. Further, follow the instructions below on how to submit an assignment on github, fill in the google form, etc.

Questions

If you have questions/comments about this course, go to the issues page and open a new issue (with the green "New issue" button). Use a title that is informative (e.g. not "question", but "question about the second assignment"). Then go to the next box ("Leave a comment"), type your question, and click on "Submit new issue". We will answer your question as quickly as possible.

The advantages of the issue page include:

  • if you have a question, other students may have it as well; this way we answer questions where everyone can see them. Also, before asking a question, you may want to check whether it was asked/answered before on the issue page
  • we answer your question more quickly than when you email us
  • you increase your knowledge of github!

Only when you need to include private, sensitive information ("my cat has passed away") should you send an email.

In order to post issues, you need to create a github account (which you need anyway to follow this course).

Note that if your question is related to another issue, you can react to the earlier issue and leave a comment in that "conversation".

Assessment material

We have a separate page with all relevant assessment material.

Final Assignment

  • You can do the final assignment alone or with at most one other student (i.e. the maximum group size is 2).
  • for the deadline of the assignment, see Deadlines above
  • on Canvas we give you the link to the github repository with the assignment_notebook.ipynb
  • to submit your final assignment:
    • do not change the name of the assignment_notebook.ipynb notebook
    • fill in this google form
    • push the final notebook on the github classroom repository

what we are looking for

The idea of the assignment is that you report your findings in a transparent way that can easily be verified/reproduced by others. The intended audience is your fellow students. They should be able to understand the code you write together with the explanations that you give for this code.

The following ingredients will be important when we evaluate your assignment:

  • Create a "big dataset" from an economic organization providing data; think of:
  • Data handling:
    • download the data to your repository (in a separate folder "data") and
    • in your notebook create a link to the website of the data source
    • give the code showing how you merged separate datasets into the one big dataset that you use
    • explain what you did (including the code) and why: which data cleaning steps take you from the downloaded files to the data that you use in the analysis
  • Start your analysis with a clear and transparent question.
  • Briefly motivate why this question is interesting.
  • Explain the methods that you use to answer the question.
    • are your methods based on correlations (only)?
    • do they allow you to make claims about causality?
  • Give the answer that you find (as a preview).
  • Mention the main assumptions that you need to get this answer.
  • Use graphs to introduce your data.
  • If you use equations, use latex to make them easy to read.
  • Explain your code; the reader –think of your fellow students– must be able to easily follow what you are doing.
  • How well does your model fit the data?
    • what methods do you use to evaluate this?
  • Present a clear conclusion/answer to your question.
  • Include some discussion of what you find and elements on which you need additional information.

Three remarks:

  • you can copy code from the web; but
    • make sure that you explain the code that you use so that another student of the course understands it and can use it;
    • give the reference of the code that you copy;
  • use common sense: it is not always necessary to have a full-blown economic model, but we do expect you to think!
    • in the past we had students looking at the effect of age on income in sports; "theory" suggests that this relation is hump-shaped: 5-year-olds and 80-year-olds tend not to earn a lot of money as elite athletes; the students presented a scatter plot with a clear hump shape; and then they wrote "now we do a linear regression".
    • for each step that you program, ask yourself why this step makes sense and then explain this in your notebook.
  • show us what you have learned during this course; hence use a number of topics we discussed in your final assignment, for example:
    • simulate data to verify the estimation techniques that you use (see the sketch after this list)
    • download your data in the notebook using a python API
    • use pandas to merge different datasets, clean your data, create new variables
    • explain clearly what the causal relations are in your analysis
    • use methods like: ridge and lasso regressions, neural network, Bayesian analysis
      • explain why you use these methods to answer your research question (what are pros and cons of the methods)
      • explain the choices that you make within a method (think of the number of layers and epochs in a neural network)
      • use more than one method and compare the results:
        • discuss what is different and why
    • simply downloading an existing dataset and estimating a neural network on this will not be enough to get a passing grade
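To make the first suggestion above concrete (simulating data to verify an estimation technique), here is a sketch of ours using the closed-form ridge estimator in numpy; because you simulated the data yourself, you know the true coefficients and can check directly whether the method recovers them:

import numpy as np

np.random.seed(0)
n, p = 50, 20
X = np.random.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]  # only the first three of twenty variables matter
y = X @ beta + np.random.normal(scale=0.5, size=n)

# ridge estimator in closed form: (X'X + lambda I)^(-1) X'y
lam = 1.0
ridge_beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
print(np.round(ridge_beta[:5], 2))  # close to [2, -1, 0.5, 0, 0]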

resit of final assignment

The resit of the final assignment needs to be a new project compared to the one you handed in before. The easiest way to achieve this is to choose a new research question and a new dataset. You can use the same data only if you make sure that the research question and analysis are sufficiently different from before.

Simply adjusting your first submission based on our feedback will not be enough.

Apart from this, follow the procedure above on how to submit the assignment and fill in the google form.

Author: Jan Boone