Seminar Data Science for Economics
This website contains the material for the MSc course Seminar Data Science for Economics
This website is under construction for 2024/2025
This year the course is taught by:
- Jan Boone
Course description
In the course Data Science for Economics you will learn how to deal with big data and how to use simulations and recent advances in machine learning to answer policy questions.
Whether you are a public policy-maker or a data analyst at a private firm, you might be tasked with providing answers to questions like "What is the optimal pricing scheme for our products?" or "Who should receive state benefits?". In such cases you want to use large datasets to provide convincing policy prescriptions. However, this comes with several challenges, such as obtaining the data, cleaning it, collaborating with your co-workers, and, most importantly, deciding which variables to use in your analysis.
This course will help you overcome those challenges. The "traditional" econometrics you have learned in previous classes provides you with solid knowledge for answering causal questions. The machine learning toolkit, which you will also learn to apply in this class, is primarily targeted at providing the best prediction rather than at answering causal questions. However, when combined, the two will help you to work with high-dimensional datasets, where there can be more variables than observations and you do not know at the start which variables and interaction terms to include.
In this course, we give an introduction to data-project management, tensors (data with more than 2 dimensions), data simulation, neural networks and cross-validation. The goal is to get you up to speed with developments in data science applicable to economic analysis.
We use data simulation to get the main intuitions across starting from the difference between correlation and causality, the use of instrumental variables and how to deal with heterogeneous treatment effects. We cover the estimation of neural networks using training, test and validation sets and Bayesian estimation techniques.
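To give a flavour of what "using data simulation to get the intuition across" can look like, here is a minimal sketch. It is not taken from the course notebooks: the data generating process, the variable names and all numbers are assumptions chosen for illustration. An unobserved confounder u drives both x and y, so the naive regression slope is biased, while an instrument z that only affects x recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 100_000

# Hypothetical data generating process (names and numbers are assumptions):
# u is an unobserved confounder, z is an instrument that only affects x.
u = rng.normal(size=n)
z = rng.normal(size=n)
x = z + u + rng.normal(size=n)
true_effect = 2.0
y = true_effect * x + 3.0 * u + rng.normal(size=n)

# Naive OLS slope of y on x: biased because u drives both x and y.
ols_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# IV (Wald) estimator: cov(z, y) / cov(z, x) only uses the variation in x caused by z.
iv_slope = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"true effect: {true_effect}, OLS: {ols_slope:.2f}, IV: {iv_slope:.2f}")
```

In this particular setup the naive slope converges to 2 + 3 * cov(x, u) / var(x) = 3, whereas the IV estimate should be close to the true effect of 2. Because we wrote down the data generating process ourselves, we know the right answer and can check which estimator finds it.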
In terms of software we will be using python, pymc and google's tensorflow.
For this course we are very happy that we partner with Datacamp: register for Datacamp
Required Prerequisites
Students should make sure that they have followed Econometrics 1 and the mandatory course Methods: Python programming for economists.
Table of Contents
Screencasts
Organisation of the course
day | date | time | staff | room | topic | datacamp | assignment |
---|---|---|---|---|---|---|---|
Tue | 2025-01-28 | 12:45-14:30 | Jan | Cube 221 | Introduction to the course | statistical simulation 1-4 | |
Tue | 2025-02-04 | 12:45-14:30 | Jan | Cube 221 | part 1: distributions, bootstrapping | keras 1-4 | |
Tue | 2025-02-11 | 12:45-14:30 | Jan | Cube 221 | OLS, causality | tensorflow 1-4 | |
Tue | 2025-02-18 | 12:45-14:30 | Jan | Cube 221 | tensors, first neural network | ||
Thu | 2025-02-20 | 14:45-16:30 | Jan | Cube 221 | over/under fitting, neural network | ||
Tue | 2025-02-25 | 12:45-14:30 | Jan | Cube 221 | treatment effects | ||
Thu | 2025-02-27 | 14:45-16:30 | Jan | Cube 221 | part 2: dealing with data | ||
Tue | 2025-03-11 | 12:45-14:30 | Jan | Cube 221 | Bayesian statistics | ||
Thu | 2025-03-13 | 14:45-16:30 | Jan | Cube 221 | estimating a Bayesian model, missing data | https://www.youtube.com/watch?v=TMmSESkhRtI | |
Tue | 2025-03-18 | 12:45-14:30 | Jan | Cube 221 | Bayesian time series, Bayesian neural network | ||
Thu | 2025-03-20 | 14:45-16:30 | Jan | Cube 17 | confidence interval and Q&A | ||
Fri | 2025-04-25 | deadline | | | | | |
Fri | 2025-06-20 | resit | | | | | |
- we will see how fast we go; the column "topic" is an indication of what will be discussed in each week
- for the first part, you will go through this notebook
- for the second part, through this one
- there is also a Datacamp course on git: this is recommended but not mandatory
First Lecture
Assignment 1
Do the following steps:
- if you did Methods: Python programming for economists, you already have a github account, otherwise create a github account
- go to jupyter lab
- IT suggests that you use the Firefox browser to access jupyter lab
- sometimes it helps to access jupyter lab with an incognito/private window
- or, if all else fails, you can use google's colab
- create a new python notebook and type the following code in the first cell:
%%bash
git clone https://github.com/janboone/msc_datascience.git
- then press the Shift key and the Enter key at the same time
- this creates a folder msc_datascience on the server that contains the material for the python part of the course
- Note: you can only run this command once. If you run it again, you get an error since the folder already exists.
Final assignment
- instructions for the final assignment can be found below.
Datacamp
From Datacamp, do the following courses for the first part of the course
A couple of notes on these datacamp courses:
The statistical simulation course starts with very simple statistical concepts, but things rapidly become more challenging. The focus of our seminar will not be on statistical simulation per se, but we will use it to understand the properties of estimators. Hence, it is important to understand the "flow" of having a statistical process and then repeating it 10,000 times to understand its properties. You also learn how to use numpy's statistical functions from numpy.random.
The point for us of this Datacamp course is to become comfortable with modeling data generating processes, not to master the specific applications considered in the course.
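The "flow" of defining a statistical process and repeating it 10,000 times can be sketched as follows. This is only an illustration under our own assumptions: the exponential distribution, the sample size and the seed are arbitrary choices, and we use numpy's default_rng interface rather than any specific function from the Datacamp course.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n_repetitions = 10_000
sample_size = 50

# One "statistical process": draw a sample of 50 observations from an exponential
# distribution with mean 2 and compute the sample mean.
# size=(n_repetitions, sample_size) repeats this process 10,000 times in one call.
samples = rng.exponential(scale=2.0, size=(n_repetitions, sample_size))
sample_means = samples.mean(axis=1)

# Properties of the estimator "sample mean", learned from the repetitions.
print("average of the sample means:", sample_means.mean())
print("standard deviation of the sample means:", sample_means.std())
print("theoretical standard error:", 2.0 / np.sqrt(sample_size))
```

The average of the sample means should be close to the true mean of 2 and their spread close to the theoretical standard error. That is the sense in which repetition reveals the properties of an estimator.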
You may not have seen the get method of a dictionary. Here you see it in action in a simple example (borrowed from stackoverflow):
sentence = "The quick brown fox jumped over the lazy dog."
characters = {}
for character in sentence:
    characters[character] = characters.get(character, 0) + 1
print(characters)
characters is a dictionary in which each key is a character (including "space") from the sentence and each value equals the number of times the character has occurred up till then. If a character occurs for the first time, get cannot find it in the dictionary characters and returns the default value (here specified as 0). If a character has occurred, say, 3 times before, get returns the value 3 and we add 1, so the new value equals 4.
If you run into other functions that you are not familiar with, you can use "?", like in:
np.random.binomial?
You can also google the function or ask chatgpt or bard.
Things to take away from this course:
- how to use random variables in python
- how to create samples out of a population (e.g. by using np.random.choice)
- how to model statistical processes (data generating processes)
- how to use resampling methods like bootstrapping
- how to use permutation testing
- how to use simulation for power analysis
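Two of the takeaways above, bootstrapping and permutation testing, can be sketched in a few lines of numpy. The example below is our own illustration, not code from the Datacamp course: the two groups, their sizes and the true difference of 0.5 are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Hypothetical observed data: two groups whose true means differ by 0.5.
control = rng.normal(loc=0.0, scale=1.0, size=200)
treated = rng.normal(loc=0.5, scale=1.0, size=200)
observed_diff = treated.mean() - control.mean()

# Bootstrap: resample each group with replacement and recompute the difference.
n_boot = 5_000
boot_diffs = np.empty(n_boot)
for i in range(n_boot):
    boot_treated = rng.choice(treated, size=treated.size, replace=True)
    boot_control = rng.choice(control, size=control.size, replace=True)
    boot_diffs[i] = boot_treated.mean() - boot_control.mean()
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])

# Permutation test: shuffle the pooled data to simulate a world with no effect.
pooled = np.concatenate([treated, control])
n_perm = 5_000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[:treated.size].mean() - shuffled[treated.size:].mean()
p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))

print(f"observed difference: {observed_diff:.3f}")
print(f"95% bootstrap interval: [{ci_low:.3f}, {ci_high:.3f}]")
print(f"permutation p-value: {p_value:.4f}")
```

The bootstrap answers "how uncertain is my estimate?", while the permutation test answers "how likely is a difference this large if the group labels did not matter?".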
This keras course is "hands on" and has a lot of applications. If you prefer a course with some more background on the math of neural networks, you can do this one instead.
Note that for this keras course Chapter 4 is fun but optional.
The tensorflow course gives some more background on the syntax used in tensorflow that we also use in class. All the keras commands you learn in the keras course are easily applied under tensorflow.
For the second part of the course, you can do the following datacamp courses:
- Resources for pymc can be found here.
- a good video to start with is this one where one of the developers of pymc, Christopher Fonnesbeck, goes over the notebooks in this repository: https://github.com/fonnesbeck/intro_stat_modeling_2017
- if you want to clone this repository in jupyter lab, run the following code on the server:
%%bash
git clone https://github.com/fonnesbeck/intro_stat_modeling_2017.git
new version of pymc
A change compared to previous years is that the university servers have upgraded pymc3 to pymc. This affects some aspects of the pymc syntax used in the youtube link above and in our screencasts. Where relevant we will point this out on the screencast page and in our jupyter notebooks.
The new version of pymc is imported as import pymc as pm; no longer as import pymc3 as pm. This has a number of implications:
- in the video and notebook we use import pymc3 as pm, but this will give an error when you run it on the jupyterlab server; hence use pymc instead of pymc3
- the syntax for drawing samples from a distribution in pymc has changed: pm.Normal.dist(0,1).random(size=10) will give an error
- now use the following two lines:
x = pm.Normal.dist(0,1)
pm.draw(x, draws=10)
See this blog for more information.
Other resources
- Other useful skills for datascience you may want to look at:
Deadlines
The deadline for the final assignment is: Friday April 25th 2025 at 23:59.
The resit deadline for the assignment is: Friday June 20th, 2025. Let us know by email that you have submitted your assignment for the resit. Further, follow the instructions below on how to submit an assignment on github and fill in the google form etc.
Questions
If you have questions or comments about this course, go to the issues page and open a new issue (with the green "New issue" button). Use a title that is informative (e.g. not "question", but "question about instrumental variables"). Then go to the next box ("Leave a comment") and type your question. Finally, click on "Submit new issue". We will answer your question as quickly as possible.
The advantages of the issue page include:
- if you have a question, other students may have it as well; on the issue page we answer questions so that everyone can see them. Also, before asking a question, you may want to check whether it was asked/answered before on the issue page
- we answer your question more quickly than when you email us
- you increase your knowledge of github!
Only when you need to include private or sensitive information ("my cat has passed away") should you send an email instead.
In order to post issues, you need to create a github account (which you need anyway to follow this course).
Note that if your question is related to another issue, you can react to the earlier issue and leave a comment in that "conversation".
Assessment material
We have a separate page with all relevant assessment material.
Final Assignment
- You can do the final assignment alone or with at most one other student (i.e. the maximum group size is 2).
- for the deadline of the assignment, see Deadlines above
- on Canvas we give you the link to the github repos. with the assignment_notebook.ipynb
- to submit your final assignment:
- do not change the name of the assignment_notebook.ipynb notebook
- fill in this google form
- push the final notebook on the github classroom repository
what we are looking for
The idea of the assignment is that you report your findings in a transparent way that can easily be verified/reproduced by others. The intended audience is your fellow students. They should be able to understand the code you write together with the explanations that you give for this code.
The following ingredients will be important when we evaluate your assignment:
- Create a "big dataset" from an economic organization providing data; think of:
- OECD
- World Bank (recall that we use a python API to access this data in Methods: Python programming; this you can use as well, of course)
- IMF
- Federal Reserve
- European Union
- European Central Bank
- statistical office of your own country, e.g. Statistics Netherlands
- if you want to use another economic data source, ask us first
- Data handling:
- download the data to your repos. (in a separate folder "data")
- in your notebook create a link to the website of the data source
- give the code that you used to merge separate datasets into the one big dataset that you use
- explain (including the code) which data cleaning steps you took to get from the downloads to the data that you use in the analysis, and why you took them
- Start your analysis with a clear and transparent question.
- Briefly motivate why this question is interesting.
- Explain the methods that you use to answer the question.
- are your methods based on correlations (only)?
- do they allow you to make claims about causality?
- Give the answer that you find (as a preview).
- Mention the main assumptions that you need to get this answer.
- Use graphs to introduce your data
- If you use equations, use latex to make them easy to read.
- Explain your code; the reader (think of your fellow students) must be able to easily follow what you are doing.
- How well does your model fit the data?
- what methods do you use to evaluate this?
- Present a clear conclusion/answer to your question.
- Include some discussion of what you find and elements on which you need additional information.
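As an illustration of the data-handling ingredient above, merging separate datasets and documenting the cleaning steps might look like the sketch below. The two small tables, their column names and all values are made up for this example; in your assignment the data would be read from the files in your "data" folder (e.g. with pd.read_csv).

```python
import pandas as pd

# Hypothetical stand-ins for two downloaded files (all values are made up).
gdp = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2020, 2021, 2020, 2021],
    "gdp_per_capita": [52.0, 57.0, 45.0, None],
})
unemployment = pd.DataFrame({
    "country": ["A", "A", "B", "C"],
    "year": [2020, 2021, 2020, 2020],
    "unemployment_rate": [4.9, 4.2, 5.8, 3.6],
})

# Merge on country and year; indicator=True adds a column "_merge" that records
# whether a row was found in the left table, the right table or both.
merged = gdp.merge(unemployment, on=["country", "year"], how="outer", indicator=True)
print(merged["_merge"].value_counts())

# Cleaning step: keep rows present in both sources and drop rows with missing values.
clean = merged[merged["_merge"] == "both"].drop(columns="_merge").dropna()
print(clean)
```

Using an outer merge with the indicator column is one way to make transparent which observations you lose in the merge, so that you can explain in your notebook why they are dropped.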
Three remarks:
- you can copy code from the web; but
- make sure that you explain the code that you use so that another student of the course understands it and can use it;
- give the reference of the code that you copy;
- use common sense: it is not always necessary to have a full blown economic model, but we do expect you to think!
- in the past we had students looking at the effect of age on income in sports; "theory" suggests that this relation is hump-shaped: 5 year olds and 80 year olds tend not to earn a lot of money as elite athletes; the students presented a scatter plot with a clear hump-shape; then they wrote "now we do a linear regression", even though a straight line cannot capture a hump-shaped relation.
- for each step that you program, ask yourself why this step makes sense and then explain this in your notebook.
- show us what you have learned during this course; hence use a number of topics we discussed in your final assignment, for example:
- simulate data to verify the estimation techniques that you use
- download your data in the notebook using a python API
- use pandas to merge different datasets, clean your data, create new variables
- explain clearly what the causal relations are in your analysis
- use methods like: ridge and lasso regressions, neural network, Bayesian analysis
- explain why you use these methods to answer your research question (what are pros and cons of the methods)
- explain the choices that you make within a method (think of the number of layers and epochs in a neural network)
- use more than one method and compare the results:
- discuss what is different and why
- simply downloading an existing dataset and estimating a neural network on this will not be enough to get a passing grade
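To illustrate two of the remarks above (use common sense about functional form, and simulate data to verify your estimation technique), here is a small sketch. The hump-shaped data generating process, with income peaking at age 30, is an assumption made up for this example and not based on real data.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical data generating process: income peaks at age 30 (numbers are made up).
age = rng.uniform(15, 45, size=500)
income = 100 - 0.5 * (age - 30) ** 2 + rng.normal(scale=5, size=age.size)


def r_squared(y, y_hat):
    """Share of the variance in y explained by the fitted values y_hat."""
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


# Fit a straight line and a quadratic by least squares.
linear_coefs = np.polyfit(age, income, deg=1)
quadratic_coefs = np.polyfit(age, income, deg=2)

r2_linear = r_squared(income, np.polyval(linear_coefs, age))
r2_quadratic = r_squared(income, np.polyval(quadratic_coefs, age))

# Age at which the fitted quadratic a*age^2 + b*age + c peaks: -b / (2a).
estimated_peak = -quadratic_coefs[1] / (2 * quadratic_coefs[0])

print(f"R^2 linear: {r2_linear:.3f}, R^2 quadratic: {r2_quadratic:.3f}")
print(f"estimated peak age: {estimated_peak:.1f} (true value in the simulation: 30)")
```

Because the data are simulated, we know the true peak and can verify that the quadratic specification recovers it, while the linear fit explains almost none of the variation. A check like this is the kind of step you can explain in your notebook before applying the same method to real data.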
resit of final assignment
The resit of the final assignment needs to be a new project compared to the one you handed in before. The easiest way to achieve this is to choose a new research question and a new data set. You can use the same data if you make sure that research question and analysis are sufficiently different from before.
Simply adjusting your first submission based on our feedback will not be enough.
Apart from this, follow the procedure above on how to submit the assignment and fill in the google form.