Seminar Data Science for Economics

This website contains the material for the MSc course Seminar Data Science for Economics

This website is under construction for 2019/2020

This year the course is taught by Jan Boone and Madina Kurmangaliyeva.

Course description

In the course Data Science for Economics you will learn how to deal with big (unstructured) data to answer policy questions using recent advances in machine learning.

Whether you are a public policy-maker or a data analyst at a private firm, you may often be tasked with providing answers to inherently causal questions. For example: "What is the optimal pricing and price differentiation scheme for insurance products?" or "Who should receive state benefits?". In such cases you want to use large individual-level data to provide convincing policy prescriptions. However, this poses several challenges, such as obtaining the data, cleaning it, collaborating with your co-workers, and, most importantly, deciding which (out of many) variables to use in your analysis.

This course will help you to overcome those challenges. The "traditional" econometrics you have learnt in previous classes provides you with a solid foundation for answering causal questions. The machine learning toolkit, which you will also learn to apply in this class, is primarily aimed at providing the best prediction rather than at answering causal questions. Combined, however, they will help you to work with high-dimensional datasets, where there are more variables than observations and you do not know at the start which variables and interaction terms to include.

In this course, we give an introduction to data-project management (e.g. git, Sublime), tensors (data with more than 2 dimensions), data simulation, neural networks, and cross-validation. The goal is to get you up to speed with new developments in data science applicable to economic analysis.

We use data simulation to get the main intuitions across, starting from the difference between correlation and causality, the use of instrumental variables, and how to deal with heterogeneous treatment effects. We cover the estimation of neural networks using training, test, and validation sets, cross-validation, and machine learning to find the best instrumental variables.
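As a minimal sketch of what we mean by data simulation (the data generating process and variable names here are made up for illustration, not taken from the course notebook): a confounder makes two variables correlate even though one has no causal effect on the other.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# ability confounds schooling and wage: it drives both
ability = rng.normal(size=n)
schooling = ability + rng.normal(size=n)   # schooling has NO causal effect on wage here
wage = 2 * ability + rng.normal(size=n)

# yet schooling and wage are strongly correlated
print(np.corrcoef(schooling, wage)[0, 1])  # roughly 0.6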

Finally, we will cover post-selection inference for causal effects (as in Belloni, Chernozhukov, and Hansen 2013).

In terms of software, we will be using Python, Google's TensorFlow, and R. Students must have followed Econometrics 1 and the Python track in AEA 1.

For this course we use the following resources:

Organisation of the course

Lecture schedule

| week | day | date       | time        | teacher | room      | topics                                   | datacamp                   |
|------|-----|------------|-------------|---------|-----------|------------------------------------------|----------------------------|
| 5    | Tue | 2020-01-28 | 10:45–12:30 | Jan     | CubeZ 16  | distributions, bootstrapping             | statistical simulation 1,2 |
| 5    | Thu | 2020-01-30 | 10:45–12:30 | Jan     | CubeZ 221 | doing your own OLS                       | statistical simulation 3,4 |
| 6    | Tue | 2020-02-04 | 10:45–12:30 | Jan     | CubeZ 223 | causality                                | keras 1,2                  |
| 6    | Thu | 2020-02-06 | 10:45–12:30 | Jan     | CubeZ 16  | tensors, first neural network            | keras 3,4                  |
| 7    | Tue | 2020-02-11 | 10:45–12:30 | Jan     | CubeZ 16  | over/underfitting                        | tensorflow 1,2             |
| 7    | Thu | 2020-02-13 | 10:45–12:30 | Jan     | CubeZ 15  | neural network                           | tensorflow 3,4             |
| 8    | Tue | 2020-02-18 | 10:45–12:30 | Jan     | CubeZ 222 | neural network                           |                            |
| 8    | Thu | 2020-02-20 | 10:45–12:30 | Jan     | CubeZ 214 | treatment effects                        |                            |
| 10   | Tue | 2020-03-03 | 10:45–12:30 | Madina  | CubeZ 16  | regularization                           |                            |
| 10   | Thu | 2020-03-05 | 10:45–12:30 | Madina  | CubeZ 212 | post-regularization for causal inference |                            |
| 11   | Tue | 2020-03-10 | 10:45–12:30 | Madina  | CubeZ 222 | decision trees                           |                            |
| 11   | Thu | 2020-03-12 | 10:45–12:30 | Madina  | CubeZ 212 | boosting, bagging, random forest         |                            |
| 12   | Tue | 2020-03-17 | 10:45–12:30 | Madina  | CubeZ 17  | causal trees                             |                            |
| 12   | Thu | 2020-03-19 | 10:45–12:30 | Madina  | CubeZ 16  | double machine learning                  |                            |
| 13   | Tue | 2020-03-24 | 10:45–12:30 | Madina  | CubeZ 223 | data collection                          | regex, scraping            |
| 13   | Thu | 2020-03-26 | 10:45–12:30 | Madina  | CubeZ 215 | data cleaning                            | tidyverse                  |
  • the "topics" column is an indication of what will be discussed in each week; we will see how fast we go
  • for the first part, taught by Jan, we will go through this notebook
  • the speed at which we go through the notebook will be faster than with AEA 1; hence, prepare for class by going through the notebook beforehand
  • for the second part, taught by Madina, we will go through a series of lectures and tutorials based on the textbook (see above) and the latest techniques in the field of machine learning for causality (e.g., double machine learning, causal trees).
  • if everything goes according to plan, we will dedicate a week for learning data collection and data processing skills to prepare you for the final assignment.

First Lecture

Assignment 1

Do the following four steps:

  • if you did AEA 1, you already have a github account, otherwise create a github account
  • fill in this google form before Friday 7 February 2020
    • sign in with your @tilburguniversity.edu email address and password
  • go to
    • jupyter lab
      • IT suggests that you use the Firefox browser to access jupyter lab
      • sometimes it helps to access jupyter lab with an incognito/private window
    • or, if all else fails, you can use google's colab
  • create a new python notebook and type the following code in the first cell:
%%bash

git clone https://github.com/janboone/msc_datascience.git
  • then press the Shift and Enter keys at the same time
  • this creates a folder msc_datascience on the server that contains the material for the python part of the course
  • Note: you can only run this command once. If you run it again, you get an error since the folder already exists.
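If you want to refresh the material later without hitting that error, you can pull inside the existing folder instead of cloning again (a minimal sketch, using the same %%bash cell magic as above):

%%bash

# update the existing clone instead of re-cloning
git -C msc_datascience pull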

Final assignment

  • instructions for the final assignment can be found below.

Datacamp

From Datacamp, do the following courses for the first part of the course

A couple of notes on these datacamp courses:

The statistical simulation course starts with very simple statistical concepts, but things rapidly become more challenging. The focus of our seminar will not be on statistical simulation per se; we will use it to understand the properties of estimators. Hence, it is important to understand the "flow" of specifying a statistical process and then repeating it 10,000 times to understand its properties. You will also learn how to use numpy's statistical functions from numpy.random.

The point of this Datacamp course, for us, is to become comfortable with modelling data generating processes, not the specific applications considered in the course.
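As a minimal sketch of this simulate-then-repeat flow (the data generating process and numbers are made up for illustration):

import numpy as np

rng = np.random.default_rng(42)

# one "experiment": estimate p from a sample of 100 coin flips with true p = 0.3;
# repeating it 10,000 times reveals the sampling distribution of the estimator
estimates = rng.binomial(n=100, p=0.3, size=10_000) / 100

print(estimates.mean())  # close to the true p = 0.3
print(estimates.std())   # simulated standard error, close to sqrt(0.3 * 0.7 / 100) ≈ 0.046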

You may not have seen the get method of a dictionary. Here you see it in action in a simple example (borrowed from stackoverflow):

sentence="The quick brown fox jumped over the lazy dog."
characters={}

for character in sentence:
    characters[character] = characters.get(character, 0) + 1

print(characters)
{'T': 1, 'h': 2, 'e': 4, ' ': 8, 'q': 1, 'u': 2, 'i': 1, 'c': 1, 'k': 1, 'b': 1, 'r': 2, 'o': 4, 'w': 1, 'n': 1, 'f': 1, 'x': 1, 'j': 1, 'm': 1, 'p': 1, 'd': 2, 'v': 1, 't': 1, 'l': 1, 'a': 1, 'z': 1, 'y': 1, 'g': 1, '.': 1}

characters is a dictionary with as key a character (including "space") from the sentence and as value the number of times that character has occurred up till then. If a character appears for the first time, get cannot find it in the dictionary characters and returns the default value (here specified as 0). If the character has already appeared, say, 3 times before, get returns the value 3 and we add 1, so the new value equals 4.
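As an aside, the standard library's collections.Counter does this particular count in one line; the get idiom above is still worth knowing for general dictionaries:

from collections import Counter

sentence = "The quick brown fox jumped over the lazy dog."
print(Counter(sentence))  # same character counts as the get() loop above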

If you run into other functions that you are not familiar with, you can use "?", like in:

np.random.binomial?

Also, you can google!

Things to take away from this course:

  • how to use random variables in python
  • how to create samples out of a population (e.g. by using np.random.choice)
  • how to model statistical processes (data generating processes)
  • how to use resampling methods like bootstrapping (see the sketch after this list)
  • how to use permutation testing
  • how to use simulation for power analysis
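For example, a minimal bootstrap of the sample mean using np.random.choice (the data here is simulated for illustration):

import numpy as np

np.random.seed(0)
data = np.random.normal(loc=5, scale=2, size=200)  # pretend this is your observed sample

# resample with replacement 10,000 times, recomputing the mean each time
boot_means = [np.mean(np.random.choice(data, size=len(data), replace=True))
              for _ in range(10_000)]

# the spread of the bootstrap means approximates the standard error of the sample mean
print(np.std(boot_means))  # close to 2 / sqrt(200) ≈ 0.14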

This keras course is "hands on" and has a lot of applications. If you prefer a course with some more background on the math of neural networks, you can do this one instead.

Note that for this keras course Chapter 4 is fun but optional.

The tensorflow course gives some more background on the syntax used in tensorflow that we also use in class. All the keras commands you learn in the keras course are easily applied under tensorflow.
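As a minimal sketch of what that means in practice (assuming TensorFlow 2.x, where keras is bundled as tf.keras; the layer sizes here are arbitrary):

from tensorflow import keras

# the Sequential API from the keras course works unchanged under tf.keras
model = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()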

Deadlines

The deadline for the final assignment is: Friday June 19th 2020 at 23:59.

The resit deadline for the assignment is: Friday August 14th, 2020. Let us know by email that you have submitted your assignment for the resit.

Questions

If you have questions/comments about this course, go to the issues page, open a new issue (with the green "New issue" button), and type your question. Use an informative title (e.g. not "question", but "question about the second assignment"). Then go to the next box ("Leave a comment"), type your question, and click on "Submit new issue". We will answer your question as quickly as possible.

The advantages of the issue page include:

  • if you have a question, other students may have it as well; this way we answer it where everyone can see it. Before asking, you may also want to check whether your question was asked/answered before on the issue page
  • we answer your question more quickly than when you email us
  • you increase your knowledge of github!

Only when you need to include privately sensitive information ("my cat has passed away") should you send an email.

In order to post issues, you need to create a github account (which you need anyway to follow this course).

Note that if your question is related to another issue, you can react to the earlier issue and leave a comment in that "conversation".

Assessment material

We have a separate page with all relevant assessment material.

Final Assignment

  • You can do the final assignment alone or with at most one other student (i.e. the maximum group size is 2).
  • for the deadline of the python assignment, see Deadlines above
  • on Canvas we give you the link to the github repository with the assignment_notebook.ipynb
  • to submit your final assignment:
    • do not change the name of the assignment_notebook.ipynb notebook
    • fill in this google form
    • push the final notebook on the github classroom repository

What we are looking for

The idea of the assignment is that you report your findings in a transparent way that can easily be verified/reproduced by others. The intended audience is your fellow students. They should be able to understand the code you write together with the explanations that you give for this code.

The following ingredients will be important when we evaluate your assignment:

  • Find a "big dataset", e.g. on kaggle, but other sources are fine as well
    • download the data to your repository (in a separate folder "data") and
    • in your notebook create a link to the website of the data source
  • Start with a clear and transparent question.
  • Briefly motivate why this question is interesting.
  • Explain the methods that you use to answer the question.
    • are your methods based on correlations (only)?
    • do they allow you to make claims about causality?
  • Give the answer that you find (as a preview).
  • Mention the main assumptions that you need to get this answer.
  • Use graphs to introduce your data.
  • If you use equations, use latex to make them easy to read.
  • Explain your code; the reader (think of your fellow students) must be able to easily follow what you are doing.
  • How well does your model fit the data?
    • what methods do you use to evaluate this?
  • Present a clear conclusion/answer to your question.
  • Include some discussion of what you find and elements on which you need additional information.

Two remarks:

  • you can copy code from the web; but
    • make sure that you explain the code that you use so that another student of the course understands it and can use it;
    • give the reference of the code that you copy;
  • use common sense: it is not always necessary to have a full blown economic model, but we do expect you to think!
    • in the past we had students looking at the effect of age on income in sports. "Theory" suggests that this relation is hump-shaped: 5-year-olds and 80-year-olds tend not to earn a lot of money as elite athletes. The students presented a scatter plot with a clear hump shape, and then wrote "now we do a linear regression". For each step that you program, ask yourself why this step makes sense and then explain this in your notebook.

Resit of final assignment

The resit of the final assignment needs to be a new project compared to the one you handed in before. The easiest way to achieve this is to choose a new research question and a new data set. You can use the same data only if you make sure that the research question and analysis are sufficiently different from before.

Simply adjusting your first submission based on our feedback will not be enough.

Otherwise, follow the procedure above on how to submit the assignment and fill in the google form.

Authors: Jan Boone and Madina Kurmangaliyeva