Below is a table with problem sets. Hand in your problem sets through Canvas if they are R assignments.

Problem sets with deadlines

PS #             Deadline
0                -
1                Sep 12 at 11:59
2                R: Sep 30 at 11:59, non-R: Oct 4 at 08:30
3                Oct 11 at 08:30
4                Oct 18 at 08:30
5                Nov 1 at 08:30
6                Nov 11 at 17:00
Paper approval   Nov 17 at 11:20
7                Nov 22 at 08:30
8                Nov 29 at 17:00
Paper            Dec 2 at 12:30

PS 0: Math

This is a list of informal questions posed throughout the lectures (and some new ones). This PS is ungraded.

  1. Let \(x_1,\cdots,x_n\) be a finite sequence, and let \(\bar{x}\) be its average. Prove that \(\sum_i (x_i - \bar{x})=0\).

  2. Consider the setting in 1. Prove that \[\sum_i (x_i - \bar{x})^2 = \left(\sum_i x_i^2\right) - n \bar{x}^2.\]

  3. Consider the sequence \((x_i=\frac{1}{i^2})\). What is its limit? Prove it.

  4. Does the sequence \((x_i = (-1)^i)\) have a limit?

  5. What is the limit of the sequence with terms \(\frac{(-1)^i}{i^2}\)?

  6. Let \(\{a_{1},\cdots,a_{n}\}\) and \(\{b_{1},\cdots,b_{n}\}\) be two sets of numbers, and let \(c\) be a constant. Which one of the following two statements is true? Provide a proof.
    • \(\sum_{i=1}^{n}ca_{i}=nc\sum_{i=1}^{n}a_{i}\)
    • \(\sum_{i=1}^{n}ca_{i}=c\sum_{i=1}^{n}a_{i}\)
  7. Which one of the following two statements is true? Provide a proof.
    • \(\left(\sum_{i=1}^{n}a_{i}\right)\times\left(\sum_{i=1}^{n}b_{i}\right)=\sum_{i=1}^{n}\sum_{i=1}^{n}a_{i}b_{i}\)
    • \(\left(\sum_{i=1}^{n}a_{i}\right)\times\left(\sum_{i=1}^{n}b_{i}\right)=\sum_{i=1}^{n}\sum_{j=1}^{n}a_{i}b_{j}\)
  8. Using the property of the log-function that \(\log(ab)=\log(a)+\log(b)\), prove that \(\log(x^{c})=c\log\left(x\right)\), for \(x>0\) and any non-negative integer \(c\).

  9. The linear function \(y=\beta_{0}+\beta_{1}x\) has a constant marginal effect: \(\frac{\Delta y}{\Delta x}\) does not depend on \(x\). The log-log function \(\log(y)=\beta_{0}+\beta_{1}\log(x)\) has approximately constant elasticity: \(\frac{\Delta y/y}{\Delta x/x}\approx\beta_{1}\) does not depend on \(x\). For the function \(y=\beta_{0}+\beta_{1}\log(x)\), show that the semi-elasticity does not depend on \(x\).

  10. Come up with a real-world example where the relationship between \(y\) and \(x\) can be argued to be of the form \(y=\beta_{0}+\beta_{1}\log(x)\) rather than of the linear or log-log form.

PS 1: Intro

Hand in the .Rmd file that contains your markdown document, and the resulting .html file, through Canvas. I consider this to be a written assignment, so make sure to add your name and student ID, and to add plenty of words to your code.

  1. Write down one line of R code that would generate a vector with the first 50 odd numbers, and name that object “odd”. Do the same for the first 50 even numbers, and call that object “even”. What do you know about the difference between the sums of those two sequences, from the properties of summation? Verify this using R. Hint: try

  2. Use a for loop for this exercise. For an arbitrary \(S\), generate \(S\) sequences and verify that they satisfy \(\sum_i (x_i - \bar{x})=0\). The sequences should have length 100, and be random draws from the standard Normal distribution. Try


    or use Google.

  3. Use R to show what happens to the sequence \(x_i\) with \(x_i = 1/i\) as \(i \to \infty\). What happens to \(\sum_{i=1}^{n}\frac{1}{i}\) as \(n\to\infty\)?

  4. Use R to generate a table that demonstrates how well the approximation \(\Delta\log(x)\approx\frac{\Delta x}{x}\) works for different values of \(x\) and \(\Delta x\).
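
The loop-and-draw pattern that question 2 relies on can be sketched as follows. This is a minimal illustration and not a solution: the names S, n, and seq_means are placeholders chosen here, and the sketch stores the mean of each sequence rather than carrying out the verification that the question asks for.

```r
set.seed(1)                 # make the random draws reproducible

S <- 5                      # number of sequences (arbitrary, illustrative)
n <- 100                    # length of each sequence

seq_means <- numeric(S)     # preallocate storage, one slot per sequence

for (s in seq_len(S)) {
  x <- rnorm(n)             # n draws from the standard Normal distribution
  seq_means[s] <- mean(x)   # store a summary of the s-th sequence
}

seq_means
```

Preallocating the result vector and looping over seq_len(S) keeps the loop correct even when S is changed to a different value.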

PS 2: Probability and Stats


  1. Come up with a real-world example in economics of a random variable \(X\) that is part-discrete, part-continuous, in the sense that the sample space is uncountable, but contains at least one point \(x_{k}\) for which \(P(X=x_{k})>0.\) Draw the pdf and cdf of this random variable.

For the following proofs, assume that \(X,Y,Z\) are discrete random variables with \(k\) outcomes, i.e. they have sample spaces \(\left\{ x_{1},\cdots,x_{k}\right\}\) resp. \(\left\{ y_{1},\cdots,y_{k}\right\}\), \(\left\{ z_{1},\cdots,z_{k}\right\}\).

  2. Prove property (CE.4’) in Wooldridge, Appendix B.
    • Edition 4, page 736
    • Edition 5, page 744
  3. Prove property (CV.1) in Wooldridge, Appendix B.
    • Edition 4, page 737
    • Edition 5, page 745
  4. In the context of property (CE.6), prove that \[E\left[ \left[Y-\mu(X)\right]^{2}|X\right] \leq E\left[ \left[Y-g(X)\right]^{2}|X\right]\] implies \[E\left[ \left[Y-\mu(X)\right]^{2}\right] \leq E\left[\left[Y-g(X)\right]^{2}\right].\]


  5. Wooldridge, Exercise C.2;
  6. Wooldridge, Exercise C.3;
  7. Wooldridge, Exercise C.4, skip (iv).

R: Monty Hall

You will be writing a program that simulates the Monty Hall problem. You must write functions to do so.

The Monty Hall problem is formulated as follows:

You are in the final stage of a game show. Three doors are in front of you. Behind one of the doors is the car, and behind the other two are bicycles. You want the car. You can open one door, and will receive what is behind that door. Initially, you choose a door (say the one on the left). The game host, who knows where the car is, now opens another door which has a bicycle (say the one in the middle). He then asks you: “Would you prefer to switch to the door on the right?”.

  8. Would you? Use probability theory to show that switching (or staying) maximizes the probability that you will win the car.

Using R, we will solve a more general problem. For this more general version, there are \(k\) doors, and \(m\) of them have cars behind them. Initially, you pick a door. Then, the game show host opens one of the doors that does not contain a car. All of those eligible doors have equal probability of being opened by the game show host. Now, you are allowed to change your mind and pick another door.

  9. Design a simulation experiment in R that answers this question.
    • Comment your code so that somebody else would understand what you are doing, even if they were not aware what the question is;
    • Start with \(k=3,m=1\);
    • Enforce \(m<k-1\). (Why?)
    • For simulation, you can use sample.

For ex-sample,

k <- 7
## [1] 3

returns a randomly selected integer between 1 and 7. Be careful when passing a single integer to sample:

##  [1] 2 2 5 5 5 5 2 5 3 5
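
The caveat about single integers can be illustrated with a short sketch. As documented in ?sample, when the first argument is a single number n >= 1, sample draws from 1:n rather than from the set that contains only n. The vector doors and the helper safe_pick are hypothetical names introduced here for illustration; they are not part of the assignment.

```r
set.seed(42)

k <- 7
sample(1:k, 1)              # one randomly selected integer between 1 and k

# Pitfall: a length-one numeric vector is treated as an upper bound.
doors <- c(5)               # intended meaning: "the only eligible door is 5"
sample(doors, 1)            # draws from 1:5, so it may return 1, 2, 3, 4 or 5

# Hypothetical helper: sample an index, so that x is always treated as a set.
safe_pick <- function(x) {
  x[sample.int(length(x), 1)]
}

safe_pick(doors)            # always 5
safe_pick(c(2, 3, 5))       # one of 2, 3, 5
```

This matters in the simulation whenever the set of doors that the host may open shrinks to a single door.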

PS 3: Matrices


Choose six out of the following nine.

  1. Prove that the identity matrix of size \(n\) is positive definite.

  2. Is the \(n\times n\) matrix of ones, \(A=[a_{ij}],\) with \(a_{ij}=1\) for all \(i,j\), positive definite? If yes: prove it. If no, explain why.

  3. Prove that, for any square matrix \(A\), \(\text{tr}\left(A\right)=\text{tr}\left(A'\right)\). What can you say if \(A\) is not square?

  4. Let X be any \(m\times n\) matrix. Show that \(X'X\) is positive semi-definite.

  5. Let \(X\) be an \(n\times k\) matrix with \(\text{rank}\left(X\right)=k<n\). Let \(P=X\left(X'X\right)^{-1}X'.\) Show that \(P\) is idempotent.

  6. Let \(P\) be an idempotent matrix. Show that \(M=I-P\) is idempotent.

  7. Wooldridge, p. 798, Exercise D.1;

  8. Wooldridge, p. 798, Exercise D.2;

  9. Wooldridge, p. 798, Exercise D.6 (i).
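
A numerical check is no substitute for the proofs requested above, but it can help catch mistakes before you start writing. The sketch below uses an arbitrary simulated matrix X (the dimensions and the seed are assumptions made for illustration) and checks the claims in questions 4 to 6 for that one matrix only.

```r
set.seed(123)

n <- 6
k <- 2
X <- cbind(1, rnorm(n))                 # n x k matrix with full column rank

# P = X (X'X)^{-1} X'
P <- X %*% solve(crossprod(X)) %*% t(X)
M <- diag(n) - P

# all.equal() compares up to numerical tolerance; exact equality (==) would
# be too strict for floating-point matrix products.
isTRUE(all.equal(P %*% P, P))           # numerical check that P is idempotent
isTRUE(all.equal(M %*% M, M))           # same check for M = I - P

# Smallest eigenvalue of X'X: a numerical look at positive semi-definiteness
min(eigen(crossprod(X), symmetric = TRUE)$values)
```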


At the end of week 4’s lab instructions, you created a scatterplot of the relationship between Canada’s wealth and income inequality, for the period 1991-1997. You also approximated the relationship between wealth and income inequality using your ols function.

  10. Let \(\hat{x}\) denote the approximate solution of the linear system relating wealth to income inequality. Compute \(\hat{b}=A \hat{x}\). Add A[,2],bhat to the plot.

  11. Improve the plot by adding top1 and top5 to the plot, and by adding the same plot for the US next to it (use facet_wrap).

  12. Apply your ols function to the other five relationships (you have already done Canada+top10; now add Canada+top1, Canada+top5, US+top1, US+top5, US+top10). Interpret the differences between the approximate solutions you find.

  13. Use mutate and summarize in the dplyr package to construct wid_decade, the data at a decadal level. Repeat the analysis above, for as much decadal data as is available in the original data set, for US and Canada. Compare the approximate solutions for all six cases ({Canada,US} x {top1,top5,top10}) to those based on annual data (your answers to 10 and 12). What do you find?

PS 4: Matrix OLS

  1. Wooldridge, Exercise E.3
  2. Wooldridge, Exercise E.4, (i)-(iii)
  3. Wooldridge, Exercise E.5

PS 5: Twins and the returns to education


The remaining problem sets are (partial) replications of empirical papers in economics. Here are some additional instructions for handing in your replications:

  1. Hand in the .Rmd and .html file, as well as the data used by your script (unless your R code loads the data directly from the web).

  2. Your markdown file must contain at least the following:
    • an explanation of what you do
    • an explanation of why you do it
    • embed code, and results
    • an interpretation of the results you obtain
  3. You are also expected to make a comparison with the results in the paper that you are attempting to replicate. You will rarely get exactly the same results as the original paper, see e.g. Business Insider.
    • if you get slightly different results, comment on it
    • if you get results that are more than half a reported standard error away from the results in the paper, you have uncovered an error by the authors, or you are making a mistake - explain!
    • always report robust standard errors
    • focus on point estimates - do not worry about getting exactly the same standard errors as the authors
  4. These assignments may involve a substantial amount of work. Plan wisely.

  5. The deadline will be strictly enforced: just send me whatever you have by the deadline. Start your final .Rmd compilation and upload more than one hour before the deadline: you will run into trouble otherwise and miss the deadline.
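
What “robust standard errors” means can be sketched in base R. The example below uses simulated data and computes a heteroskedasticity-robust covariance matrix by hand. All variable names are placeholders, and the choice of the HC1 variant is an assumption made here: papers differ in which variant they report, which is one reason not to expect identical standard errors. In practice, packages such as sandwich provide these estimators; the manual version is only meant to show what is being computed.

```r
set.seed(2024)

# Simulated data with heteroskedastic errors (illustrative only)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 1 + abs(x))

fit <- lm(y ~ x)
X <- model.matrix(fit)                 # n x k regressor matrix
e <- residuals(fit)                    # OLS residuals
k <- ncol(X)

bread <- solve(crossprod(X))           # (X'X)^{-1}
meat  <- crossprod(X * e)              # sum over i of e_i^2 x_i x_i'
V_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

robust_se    <- sqrt(diag(V_hc1))      # heteroskedasticity-robust
classical_se <- sqrt(diag(vcov(fit)))  # assumes homoskedasticity

cbind(estimate = coef(fit), classical_se, robust_se)
```

The point estimates are the same in both columns of standard errors; only the estimated precision changes.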

Data and project management

Let me mention some principles of good data and project management, which I am adapting from Brian Krauth’s version of this course.

The most important principle is to always remember that you will make mistakes, and that you will not remember what you have done. When you run into a problem, the key to fixing it will be the ability to retrace your steps.

  • Set aside a directory specifically for each project. Keep everything you do together in that directory (don’t just save it to the desktop, for example).
  • Anytime you get raw data from an external source, keep an unaltered copy of the original file, and document where it came from. If you are going to edit the file, make a copy under a new name, and edit the copy.
  • When creating new files, choose informative names, but avoid spaces in names.
  • When manipulating and setting up data, write script files in R Markdown instead of doing things by hand (or by menus).


Ashenfelter and Krueger’s “Estimates of the Economic Returns to Schooling from a New Sample of Twins” was published in the American Economic Review in 1994, and uses a sample of twins to estimate the returns to education. You can find the paper and the data online. The New York Times discussed the paper in 1992.

In this problem set, do not worry about the measurement error part of the paper. We will revisit this in PS7.

  1. What is the goal of the paper? What is the empirical strategy? Summarize it in four sentences.

  2. Write down the model underlying the results. This should involve a fixed effect, \(\alpha_{i}\), where \(i\) refers to the \(i\)-th pair of twins. Explain what \(\alpha_{i}\) captures.

  3. Load the data from the .dta. Use the rio package, the haven package, or the foreign package.

  4. Load the data from the .csv-file and check whether the two data sets are the same. Use rio or readr. Note: depending on how you load the data, this page may be useful.

  5. Continue with the data from the .dta file. Replicate Table 1, as far as the data allows. Comment on the parts that you cannot replicate.

  6. Replicate Table 3, columns (i) and (v).

  7. Formulate one criticism of the paper in your own words. In light of your criticism, do you expect the findings of the paper to be an over- or underestimate of the returns to schooling?

  8. Based on your replication, what do you believe about the relationship between education and wages?

Move on to PS6 directly: it is the most time-consuming problem set of this semester.

PS 6: Democracy and income

This problem set will involve a substantial amount of work due to data management and transformation. I highly recommend thoroughly reading a few chapters of Hadley Wickham’s R for Data Science book, e.g. Chapters 9, 10, 11, 12, 13, 15 - or at least skimming them and having them available for future reference. The dplyr package will be extremely useful this week.

We will be replicating an influential paper by Daron Acemoglu and coauthors. In this paper, an attempt is made to find a causal link from democracy to income.

Acemoglu’s website is here, the paper is here. If this topic and type of approach is interesting to you, you can find his book here.

To download the data and load it into R, download the zip (contains an Excel file) from Acemoglu’s website. To load the data, use Google to find help on how to load spreadsheet data into R. One approach is to select the particular sheet in the Excel file, save it as .csv, and load it as a text file. Alternatively, you can use “XLConnect”, “xlsx”, or “gdata” (use Google). Another approach is to use the instructions in the Piketty part of this course.

  1. Read the abstract and introduction to this paper. If you are interested in the (causal) effect of democracy on income, what is the problem with running a regression with income as your dependent variable and democracy as your explanatory variable?

  2. Replicate Figure 1. Your Figure does not have to look exactly the same, but it should contain the same information. Hint: look carefully at the sheets in the spreadsheet before you go all-in on transforming data.

  3. To check whether the data import went well, replicate
    • The “pooled OLS” column in Table 3 (column 1). Hint: how can the variable called “sample” help here?
    • Table 1, Panel B.
  4. What happens to your results for pooled OLS (previous question) if you throw out all South American countries?

  5. Replicate Table 3, columns (2) and (7). Focus on the rows with regression coefficients and the number of observations (so: skip “cumulative effect”, the R-squared, etc.). Replicate each column three times (see note 1 at the end of this page):
    • Once using the plm package (see note 2 at the end of this page)
    • Once by implementing the fixed effects estimator using your own code (you can use functions in dplyr, the function lm, and/or R’s matrix operations, but you cannot use factor, as.factor, or external packages that implement panel data estimators. Exception: you can use factor for including the year dummies):
      1. compute the appropriate transformation
      2. apply OLS to the transformed data.
    • Once by including, on top of all the other regressors in your model, one dummy variable for each country. Use factor(code) as an additional variable. When you report the results, do not output all of the coefficients, but hide the coefficients for factor(code) and factor(year).
  6. For Table 3, column 2, what do you get if you use the first-difference estimator instead of the fixed effects estimator? Use plm, then use a manual differencing transformation.

  7. Discuss and explain, in your own words, the difference between your findings in Table 3, column 1 and Table 3, column 2.

  8. Discuss and explain, in your own words, the difference between your findings in Table 3, column 2 and Table 3, column 7.

  9. Discuss and explain, in your own words, the difference between Table 3, column 2, and the results for the first difference estimator in question 6.
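
The two-step recipe in question 5 (compute the transformation, then apply OLS) can be sketched on a toy panel. Everything below is simulated and the variable names are placeholders; the sketch has no year dummies and no lagged variables, so it shows the mechanics of the within transformation and is not a replication of Table 3.

```r
set.seed(7)

# Simulated balanced panel (illustrative only): 4 units, 5 periods
id    <- rep(1:4, each = 5)
alpha <- rep(c(-1, 0, 1, 2), each = 5)    # unit-specific fixed effects
x     <- alpha + rnorm(20)                # regressor correlated with the effect
y     <- alpha + 0.5 * x + rnorm(20)

# Within transformation: subtract unit-specific means.
# ave() returns, for every observation, the mean within its group.
x_dm <- x - ave(x, id)
y_dm <- y - ave(y, id)

# OLS on the transformed data (no intercept: demeaning removes it)
b_within <- coef(lm(y_dm ~ x_dm - 1))[["x_dm"]]

# Equivalent: OLS with one dummy variable per unit
b_lsdv <- coef(lm(y ~ x + factor(id)))[["x"]]

c(within = b_within, lsdv = b_lsdv)       # the two slope estimates coincide
```

The slope estimates agree, but the standard errors that lm reports for the demeaned regression do not account for the estimated unit means, so its degrees of freedom differ from those of the dummy-variable regression.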

If you want to know more about this topic, you can use Google Scholar to browse the 1050 papers that cite the paper you just replicated. Two notable recent contributions are this paper providing further evidence in favor of the findings, and this paper which looks at the same question using interesting new methods.

PS 7: Colonial origins

You are going to look at another paper by Acemoglu. We are once again going to investigate the role of institutions in promoting growth by studying a very ingenious paper that Acemoglu published in 2001. The paper and the data can be found through this website. The paper is ingenious mainly because it suggests an original instrument. The equation of interest is equation (1), page 1378. The main empirical result is Table 4, Base sample (2).

Below, when reproducing the instrumental variable estimates, please do so in two ways:

  • using ivreg in the AER package
  • using lm two times in a row.
  1. Why can OLS estimates of \(\alpha\) in equation (1) not be interpreted as causal?

  2. What is the instrumental variable that the authors suggest to deal with the problem in Q1?

  3. Which number(s) in Table 4, and which Figure in the paper, are evidence that the suggested instrument is relevant?

  4. Is there empirical evidence that the instrument is valid? If yes: state the evidence. If no: why not?

  5. Replicate the 0.94 in Table 4, Column 1, and the 1.00 in Table 4, Column 2. Do not worry about standard errors and about the remaining results in the table.

  6. Now write 4 sentences to convince me that the suggested instrument is not valid.
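
The “lm two times in a row” approach can be sketched on simulated data. The data-generating process and all names below are assumptions made for illustration; they have nothing to do with the variables in the paper.

```r
set.seed(11)

# Simulated data (illustrative only): u is an unobserved confounder
n <- 500
z <- rnorm(n)                       # instrument
u <- rnorm(n)
x <- 0.8 * z + u + rnorm(n)         # endogenous regressor
y <- 1 + 0.5 * x + u + rnorm(n)     # outcome; true coefficient on x is 0.5

# First stage: regress the endogenous regressor on the instrument
first_stage <- lm(x ~ z)
x_hat <- fitted(first_stage)

# Second stage: regress the outcome on the first-stage fitted values
second_stage <- lm(y ~ x_hat)
b_2sls <- coef(second_stage)[["x_hat"]]

# With one instrument and one regressor this equals cov(z, y) / cov(z, x)
b_iv <- cov(z, y) / cov(z, x)

c(two_lm = b_2sls, ratio = b_iv, ols = coef(lm(y ~ x))[["x"]])
```

The point estimate from the second lm call is the 2SLS estimate, but the standard errors that lm prints for the second stage are not valid, because its residuals are computed with x_hat instead of x. This is one reason the question above asks only for the point estimates; ivreg in the AER package computes the standard errors from the correct residuals.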

The final questions are a throwback to PS5.

You can use the data and code from the previous assignment.

  7. What is the reason that the authors use an instrument for the difference in education?

  8. Convince me that the instrument that they use is relevant.

  9. Convince me that the instrument is exogenous.

  10. Replicate the coefficient estimate for “Own education” in Table 3, Column VI.

  11. How do you interpret the difference in the coefficient estimates between Column V and Column VI in Table 3?

PS 8: Mafia and public spending

The final hand-in assignment is about a 2014 American Economic Review publication that tries to estimate the effect of public spending through the use of an exogenous change in anti-corruption legislation. The methods used by the paper include both panel data and instrumental variables estimation. You can find the paper, data, and other info here. Read the readme file and the paper’s introduction carefully before you start replicating.

We are interested in the results up to and including page 2201, i.e. everything up to and including Section IV A. For each of the replication questions, write one paragraph interpreting the results.

  1. Summarize the empirical strategy of this paper in two paragraphs. What is the endogeneity problem? How do the authors deal with it? What do the fixed effects capture? Why is this a good instrument to use (emphasis on validity)?

  2. Generate a Table that looks like Table 1, but counts “Number of city councils dismissed because of either resignation by elected officials or special cases of ineligibility of the mayor.” (Resignation). Then do the same for Others.

  3. Replicate Table 4, Columns 1 (OLS) and 4 (2SLS, Second stage). Read the footnote carefully.

  4. Replicate Table 5, Column 1.

Notes to PS 6, question 5:

  1. There is a lot of online help for panel data, including help on the plm and the dplyr package. The book I recommend on the course website can be very helpful. There is also this link and the book Applied Econometrics with R that is available for free through SFU library.

  2. Watch the index and method options.