In this course, you will learn how to use R, a statistical computing environment. RStudio is the graphical user interface for R that we will be using.

You will be working on R in the Thursday labs, from 09:30-11:20.

### Week 1

#### Swirl

Swirl is software that runs from within R, and that teaches you how to use R.

Follow the instructions in Step 1 through 4 here to install R, Rstudio, and swirl. The final step (4) asks you to run

swirl()

and go through the first steps of the swirl program. Once you are asked which course you want to choose, pick “R Programming”. Go through the programming course.

#### R Markdown

Your hand-in assignments, whenever they involve working in R, should be handed in using R Markdown (you will hand in a .Rmd file and a .html file). To learn about using RMarkdown, start here and go through lessons 1, 2, and 3.

#### Assignment

This is a good time to start working on Problem set 1.

### Week 2

#### Functions

The best way to learn to work with functions is to try it. We will practice in the context of probability theory. Before you move on, please make sure that you are very comfortable with section 9 of the swirl course “R Programming”, which you should have completed last week.

#### Probability theory

The following is adapted from Irene Botosaru’s version of BUEC 333, which in turn is based on material from this course.

For this assignment we will create a joint probability table and use it to compute marginal and conditional probabilities, expectations and conditional expectations, variances, and pmf’s and CDF’s.

We have two random variables, $$X$$ and $$Y$$, with their respective sample spaces: $\Omega_x = \{0,5,10\}$ $\Omega_y = \{0,5,10,15\}$

The joint probability table for these random variables is given by:

           | P(Y=0) | P(Y=5) | P(Y=10) | P(Y=15)
|--------|--------|--------|---------|--------
|P(X=0)  | 0.02   | 0.06   | 0.02    | 0.10
|--------|--------|--------|---------|--------
|P(X=5)  | 0.04   | 0.15   | 0.20    | 0.10
|--------|--------|--------|---------|--------
|P(X=10) | 0.01   | 0.15   | 0.14    | 0.01
1. Generate this matrix in R, and call it p. It should look as follows:
##      [,1] [,2] [,3] [,4]
## [1,] 0.02 0.06 0.02 0.10
## [2,] 0.04 0.15 0.20 0.10
## [3,] 0.01 0.15 0.14 0.01

Your matrix p stores all joint probabilities. If you wanted to display P(X=5 and Y=20), which is the element on row 2, column 3 of matrix p, you write:

p[2,3]
## [1] 0.2
1. Display $$P(X=0,Y=5)$$.

To check that matrix p contains the pmf of $$X$$ and $$Y$$, you know that if you add all the elements of p you should get $$1$$. To see that this is the case, type:

sum(p)
## [1] 1

Remember that, for example, the marginal probability of $$X=0$$ is given by: $P(X=0) = P(X=0,Y=0) + P(X=0,Y=5) + P(X=0,Y=10) + P(X=0,Y=15).$

1. Using one line of code, compute the marginal probability distribution of $$X$$ as a $$3 \times 1$$ vector.

Instead of writing your own code, you could have used the “apply” technology. Using “apply” is a useful tool for advanced tasks in R, so I am giving you the below code for future reference.

px <- apply(p,1,sum) ## create marginal probabilities for X
px                   ## display these marginal probabilities
## [1] 0.20 0.49 0.31

The apply function says: Take matrix p, and, row by row (“1” means “by row”), compute the sum of the elements in the row. This function is very powerful since it computes the sums for all rows at once.

1. Use “apply” to compute the marginal probability distribution of $$Y$$.

Now, we will use your new skills of writing functions in R to compute conditional probabilities. Remember that the definition of conditional probabilities for discrete random variables is $P(X=x|Y=y) = \frac{P(X=x,Y=y)}{P(Y=y)}.$

1. Write a function “cond_prob” that, for any matrix that represents a joint probability distribution, for any value of the random variable that runs across the columns, returns the conditional probability distribution.

For the matrix p in our running example, we should have - for the second value of y

cond_prob(p,2)
## [1] 0.1666667 0.4166667 0.4166667
1. Try your function on p. Try it on another joint probability distribution, call it q, that you create yourself.

To compute expectations and variances we need to enter the values of $$X$$ and $$Y$$:

x<- c(0,5,10)
y<- c(0,5,10,15)
1. Write functions, “EX” and “VarX”, that return the expectation and variance of $$X$$. The function should accept a joint probability matrix and a vector that lists the values, such as “x” and “y” above.

Finally, we move to conditional expectation. Remember that, for example, $E(X|Y=5) = 0P(X=0|Y=5) + 5P(X=5|Y=5) + 10P(X=10|Y=5).$

1. Write a function that returns the conditional expectation as a function of the joint probability table, a vector of values for x, a vector of values of y, and the value (not the index: use “?which”) of $$Y$$ to condition on.

2. Construct two joint probability matrices and associated outcomes for $$X$$ and $$Y$$, and test your functions above.

#### Monty Hall

This is a good time to start working on the R part of PS2.

### Week 3

#### swirl, ctd

Although not required, I recommend the “Statistical Inference” course in swirl. Install it by writing

install_from_swirl("Statistical_Inference")

and then going through the “Statistical Inference” course.

#### Monte Carlo methods

Last week’s Monty Hall question required you to do some simulations. This week, we will look at Monte Carlo simulations in more detail. For an idea of why these can be useful outside of analyzing statistical methods (below), see this post.

Go to the course website and find the .Rmd file for week 3’s lecture slides. Download the file, open it in RStudio, and grab the function “dice_estimators” when it first occurs (slide “What does this function do?”).

Once you load it, run it and you should see something like:

dice_estimators(5,3)
##      [,1] [,2] [,3]
## [1,]  1.8    2  2.5
## [2,]  2.4    2  2.5
## [3,]  4.6    6  5.0

This is a poor example of programming, for two reasons. First, when you write a function, it should do exactly one thing. This function does multiple things: it generates data, and then applies several estimators. Second, for-loops in R should generally be avoided.

1. Write a function, “throw_dice”, that accepts as arguments the “sample size” n and the “number of replications” S, and returns an $$n \times S$$ matrix of dice throw outcomes. Generate one matrix of dice throws (n=20,S=10) and call it “simthrows”.

Here is my “simthrows”:

simthrows
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    2    2    1    3    2    5    3    5    2     5
##  [2,]    1    1    3    3    1    6    4    4    5     2
##  [3,]    1    3    5    6    1    6    1    4    1     3
##  [4,]    3    6    2    5    6    6    3    6    5     2
##  [5,]    5    6    3    3    4    6    5    3    6     3
##  [6,]    3    6    2    4    3    2    4    4    3     2
##  [7,]    3    3    1    3    6    2    6    5    6     3
##  [8,]    2    5    5    5    1    6    1    5    1     2
##  [9,]    5    5    3    1    1    6    1    2    1     3
## [10,]    3    1    3    6    4    3    6    1    3     6
## [11,]    2    4    3    3    4    4    5    6    1     3
## [12,]    1    6    1    3    6    3    4    5    3     5
## [13,]    6    5    5    2    4    5    3    6    2     6
## [14,]    6    6    4    5    2    6    5    1    3     5
## [15,]    3    2    1    4    6    5    1    6    6     1
## [16,]    6    5    2    3    1    6    1    6    3     4
## [17,]    4    3    3    1    4    4    3    6    5     1
## [18,]    5    1    6    4    4    2    6    6    4     4
## [19,]    3    3    4    4    6    2    4    2    4     3
## [20,]    5    4    1    6    5    5    3    1    3     4

Every column is one realization of a random sample. Next, we will compute estimators for each replication (column in “sim_throws”)

1. Compute the mean for the 7th replication. Compute the mean for the 5th replication.

Monte Carlo methods are a great place for getting comfortable with “apply”. Before you try to answer the next question, type “?apply” or have a look at this post.

1. Use “apply” to compute “theta1_hat”, the vector ($$S$$ entries) of sample means.

Here is my “theta1_hat”:

##  [1] 3.45 3.85 2.90 3.70 3.55 4.50 3.45 4.20 3.35 3.35

How did we manage to obtain theta1_hat?

• we have a function that computes the estimator for a column (“mean”)
• use apply to apply that function to each column in a matrix

Let us do the same thing for a second estimator. The estimator is defined as follows:

$\hat{\theta_2} = \frac{1}{n} \sum_i 1\{X_i \leq 3\} (X_i+1) + 1\{X_i \geq 4\}(X_i-1)$

1. Write a function that computes the estimator for one random sample.

2. Use “apply” to compute $$\hat{\theta}_2$$ for all replications in one go.

Finally, we can use the idea of Monte Carlo simulation to say something about the relative performance of our estimators.

1. Using your simulations, can you say something about whether the estimators are unbiased?

2. Which estimator is more efficient? Hint: use “var”.

The idea behind Monte Carlo simulation is that, if $$S$$ is sufficiently large, the distribution of the estimators generated by simulation is close to the distribution of the estimators.

1. What happens if there is an unfair die? Say that $$P(X=1)=1/2$$ and $$P(X=x)=1/10$$ if $$x>1$$. Again: is $$\hat{\theta}_2$$ unbiased?

#### Property plim.1

Now let us use MC methods to verify property PLIM.1 in Wooldridge, Appendix C. The property states:

Let $$\theta$$ be a parameter, let $$g$$ be a continuos function, and define a new parameter, $$\gamma = g(\theta)$$. Suppose that $$\hat{\theta}$$ is consistent for $$\theta$$. Then $$g(\hat{\theta})$$ is consistent for $$\gamma$$.

We will investigate this by considering estimation of the variance, as in Wooldridge, Appendix E. Let $$Y$$ be a random variable with mean $$\mu_Y$$ and variance $$\sigma^2_Y$$. For a random sample $$(Y_1,\cdots,Y_n)$$ on $$Y$$, define the following estimator of the variance: $S^2_n = \frac{1}{n} \sum_i \left( Y_i - \bar{Y} \right)^2.$

Note that this differs from the standard definition of $$S_n$$, which divides by $$n-1$$, not $$n$$.

1. Write a function that computes $$S_n^2$$ for a vector representing a random sample.

2. Generate a random sample of size 10 from the Chi-Squared distribution with 3 degrees of freedom, and non-centrality parameter 0 (see ?rchisq). Apply your estimator to it.

To check whether your estimate was close to the parameter it estimates (the variance of this distribution can be found on Wikipedia), we can run a simulation study.

1. Generate a matrix with 1000 samples of size 10, and apply your estimator. How does its distribution compare to the true answer? Does it look like the estimator is unbiased?

2. Repeat this procedure for sample sizes 100, 1000, 10000.

3. Repeat steps 1-4 for $$S_n$$, and compare it to the known standard deviation.

4. Compare the performance of $$S_n$$ under draws from the chi-squared to draws from the standard Normal.

### Week 4: Working with data

#### Capital

In his 2014 book “Capital in the twenty-first century”, Thomas Piketty investigates the relationship between capital and income inequality using a large cross-country data set.

You have probably read Piketty’s book. In case you have forgotten what it is about, please have a look at this 3 minute summary before starting this week’s R exercises.

There are many ways in which you can import data into R. The rio package will make your life a lot easier. Please install rio now.

install.packages("haven")
install.packages("rio")

Next, go to the World Income Database, the data base compiled and used by Piketty (with coauthors) to get some data.

1. Countries: select Canada, Netherlands, United Kingdom, and United States
2. Years: selects 1855 through 2015
3. Select variables
• Income > Top income shares > Top income shares, then click top 10%, top 5%, and top 1%.
• Wealth > Private wealth > Net private wealth

Then, click Retrieve. A table will show up. Click Download, which will download a file to your system, named report.xlsx.

Figuring out how to load data into R can be challenging (even with rio) and will be an integral part of your future assignments. For the current tutorial, the following command gets the job done:

library(rio)
wid <- import("report.xlsx",which=2,skip=1)[,c(1,2,3,7,11,15)]

The which argument tells rio (through readxl) which sheet to use (the data set has four sheets) and the skip=1 switch tells it to skip the first row. It stores the resulting data in a data.frame.

Have a look at your data by using summary, head, and tail, or by clicking your data frame in the Environment tab in RStudio.

It will be easier to work with this data if we assign shorter names. Run the following code to rename your variables

colnames(wid) <- c("country","year",
"top10","top5",
"top1","wealth")

Verify that you changed the variable names.

#### Manipulating and plotting

We will use the fantastic dplyr, magrittr, and ggplot2 packages to manipulate and plot the data. Please have a look at Hadley Wickham’s introduction to dplyr. If you prefer a video version, it is on youtube. Do not forget to install the packages:

install.packages("magrittr")
install.packages("dplyr")
install.packages("ggplot2")

library("magrittr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library("ggplot2")

As a first step, we isolate the data for Canada after World War 2. Store the resulting data as widCA.

widCA <- filter(wid,
year>1945)

To start plotting, initialize the plot by telling ggplot2 what the data is, and what will go on the horizontal and vertical axes.

CA_plot <- ggplot(data=widCA,aes(x=year,y=top10))

Now, tell ggplot to plot the time series.

CA_plot + geom_point()
## Warning: Removed 11 rows containing missing values (geom_point).

Income inequality, as measured by the share of total income received by the top 10%, has sharply increased starting from the end of the 1980’s. Let us check whether the other two income share measures tell a similar story.

CA_plot + geom_point() +
geom_point(aes(y=top5),colour="red") +
geom_point(aes(y=top1),colour="blue")
## Warning: Removed 11 rows containing missing values (geom_point).

## Warning: Removed 11 rows containing missing values (geom_point).

## Warning: Removed 11 rows containing missing values (geom_point).

Use your Googling and ? skills to add a title (ggtitle), update the label on the vertical axis (ylab), and make sure that the horizontal axis does not extend beyond the year 2000 (xlim).

Now, we will focus on all four countries in our data set. Construct a new data frame wid70 (use filter) that contains all countries, for the time period after 1970. I print the first and last ten rows so that you can compare.

##    country year top10  top5 top1   wealth
## 1   Canada 1970 37.92 24.22 8.97 1136.548
## 2   Canada 1971 37.83 24.07 8.87 1208.131
## 3   Canada 1972 37.55 23.84 8.75 1287.204
## 4   Canada 1973 37.02 23.65 8.80 1355.431
## 5   Canada 1974 37.38 23.82 8.81 1362.736
## 6   Canada 1975 37.27 23.71 8.74 1397.886
## 7   Canada 1976 36.74 22.99 8.08 1432.797
## 8   Canada 1977 36.17 22.43 7.74 1513.211
## 9   Canada 1978 35.77 22.17 7.60 1616.563
## 10  Canada 1979 35.57 22.11 7.71 1698.479
##           country year top10  top5  top1   wealth
## 168 United States 2006 45.50 33.59 18.06 67705.58
## 169 United States 2007 45.67 33.84 18.33 68655.19
## 170 United States 2008 45.96 33.78 17.89 60706.17
## 171 United States 2009 45.47 32.81 16.68 54544.52
## 172 United States 2010 46.35 33.73 17.45 57130.80
## 173 United States 2011 46.63 33.98 17.47 58713.37
## 174 United States 2012 47.76 35.35 18.88 61345.56
## 175 United States 2013 46.70 34.10 17.43 68569.77
## 176 United States 2014 47.34 34.77 17.98       NA
## 177 United States 2015 47.81 35.30 18.39       NA

One way to visualize the data, and to check whether the four countries have a similar trend, is to colour the dots according to the country.

ggplot(data=wid70,aes(x=year,y=top10,colour=country)) + geom_point()
## Warning: Removed 70 rows containing missing values (geom_point).

• Where is the data for the United Kingdom?
• Do the observed patterns (differences in levels and changes across countries) make sense?

Another way to plot the data for different countries is by facetting, see facet_wrap. Recreate the following matrix of plots.

## Warning: Removed 97 rows containing missing values (geom_point).

## Warning: Removed 97 rows containing missing values (geom_point).

Next, we make this graph more interesting by adding the data on wealth, and using the entire time period.

Now that you are somewhat familiar with ggplot’s functionality, it may be useful to keep this cheatsheet handy.

We will use qplot, which provides a shortcut into ggplot’s functionality, see your cheatsheet.

qplot(x=wealth,y=top5,colour=country,data=wid)
## Warning: Removed 354 rows containing missing values (geom_point).

It is hard to interpret this plot, for a number of reasons. For one, the wealth measurement is not comparable across countries (CAD v USD v EUR) or across time (inflation). We should have kept the last variable in our data frame, which we threw away in the very beginning.

Reload the data, and redo the plot with wealth as a percentage of national income (this is the measure displayed in the graphs around 1:00 into the video that summarizes Piketty’s book). This time around, we will use the period 1980-now, and we will omit the UK data.

wid <- import("report.xlsx",which="Series-layout A",skip=1)[,c(1,2,3,7,11,21)]
colnames(wid) <- c("country","year",
"top10","top5",
"top1","wealth_perc")
wid <- filter(wid,country != "United Kingdom",
year>1979,
year<2001)

Redo the plot.

qplot(x=wealth_perc,y=top10,colour=country,data=wid)
## Warning: Removed 17 rows containing missing values (geom_point).

The data for the Netherlands looks quite different than that for Canada and the US.

Let us zoom back into Canada. We will use the path geom to see how the variables moved together over time.

wid_CA <- filter(wid,country == "Canada")
qplot(x=wealth_perc,y=top10,colour=country,data=wid_CA,geom="path")

Next, we will try to find a relationship between wealth and income inequality. To do that, we are making a small detour.

#### Intermezzo: Matrices

Matrices are useful tools for (approximately) solving linear systems of equations. If a system $Ax=b$ has a unique solution, then it can be represented as $x = A^{-1}b.$

In R, the inverse of a matrix can be computed by using solve. To try, use R to solve the linear system $x_1 = 2,2 x_1 + x_2 = 5.$

##      [,1] [,2]
## [1,]    1    0
## [2,]    2    1
## [1] 2 5
##      [,1]
## [1,]    2
## [2,]    1

We have also looked at approximate solutions to linear systems, for cases where no solution exists. In that case, if $$A$$ has full column rank, an approximate solution can be computed as $\hat{x} = P_A b = (A'A)^{-1}A'b.$

If an exact solution exists, then it will recover it:

xhat <- solve(t(A)%*%A)%*%t(A)%*%b
xhat
##      [,1]
## [1,]    2
## [2,]    1

An example of a system for which no exact solution exists is $x_1 =1,x_1=2.$ Use R to determine the approximate solution.

##      [,1]
## [1,]  1.5

Does that make sense?

Now, write a function ols, that accepts as arguments a matrix $$A$$ and a vector $$b$$, and returns the vector xhat that is the approximate solution.

Test your function ols on the above two examples.

#### Wealth and inequality

Let us go back to the data set on inequality and capital. We restrict our attention to a subperiod.

wid_CA_sub <- filter(wid_CA,year>1990,year<1998)
qplot(x=wealth_perc,y=top10,colour=country,data=wid_CA_sub,geom="path")

This plot suggests there exists an (approximate!) linear relationship between the two variables. In other words, approximately $top10_t = x_1 + x_2 wealth_t.$ That relationship, then, is a linear system of equations: $Ax=b$ where $$x=(x_1,x_2)$$ is the unknown relationship between the two variables, and every element of $$b$$ is the income inequality in a given year:

$b = \left[ \begin{matrix} top10_{1991} \\ top10_{1991} \\ ... \\ top10_{1997} \\ \end{matrix} \right]$

• How can you construct $$A$$, so that $$Ax=b$$ represents the linear relationship between income inequality and wealth?

• What is the rank of $$A$$?
• Is $$A$$ positive definite?
• Is $$A'A$$ positive definite?
• Is $$A'A$$ invertible?
• Does $$Ax=b$$ have an exact solution?
• Compute the (approximate?) solution to $$Ax=b$$
• Inspect $$\hat{x}$$. What is the relationship between $$\hat{x}$$ and the scatterplot?

### Week 5: OLS simulations

You will be conducting a simulation study to uncover some properties of the OLS estimator.

By way of introduction, consider the following simulation study for two estimators of the sample mean: \begin{aligned} \hat{\theta}_1 &= \frac{1}{n} \sum_{i=1}^n X_i \\ \hat{\theta}_2 &= \frac{1}{n-10} \sum_{i=1}^{n-5} X_i \\ \end{aligned}

This simulation study makes use of the foreach package, which is a convenient tool for simulation studies like this one. There is accessible information online. Alternatively, you can use regular for loops or the apply functionality in R.

library(foreach)
## foreach: simple, scalable parallel programming from Revolution Analytics
## Use Revolution R for scalability, fault tolerance and more.
## http://www.revolutionanalytics.com
S <- 100
results <- foreach(s=1:S, .combine=rbind) %:%
foreach(n=c(25,50,100,150), .combine=rbind) %do% {

Xn <- runif(n,min=0,max=10)
theta1hat <- mean(Xn)
theta2hat <- sum(Xn[1:(n-5)]) / (n-10)

return(matrix(c(s,n,1,theta1hat,
s,n,2,theta2hat),
nrow=2,ncol=4,
byrow = TRUE))

}

For every run, we return two rows. Each row contains:

• the number of simulation run s
• a parameter related to the design of the simulation study - the sample size n
• which estimator
• realization of that estimator for that simulation

We can turn these results into a dataframe and use ggplot to visualize the results.

library(ggplot2)
results <- as.data.frame(results)
colnames(results) <- c("s","n","estimator","estimate")
results$estimator <- as.factor(results$estimator)
ggplot(data=results,aes(colour=estimator)) +
geom_density(aes(x=estimate)) +
facet_wrap(~n,nrow=2)

Some questions:

• What happens when $$n$$ becomes larger?
• What happens to the sample mean?
• What happens to the different between the two estimators
• What happens when you increase $$S$$?
• What happens when you change the distribution from which you draw $$X_n$$?

Now, I want you to design a small simulation study, similar to the one above, for the OLS estimator with a constant term and one regressor $$x_1$$. You have already implemented the OLS estimator: it is your ols function from last week.

First, what are the design parameters in this study? Which parameters can you change, and what effect do you think they have on your simulation results?

• sample size $$n$$
• number of simulations $$S$$
• distribution of the error terms
• value of $$\beta_1$$

In your simulations, compare the OLS estimator to the following estimator, where $$y_n$$ is the vector containing the dependent variables $$(y_1,\cdots,y_n)$$ and $$x_n = (x_1,\cdots,x_n)$$:

$\tilde{\beta_1} = \frac{max(y_n)-min(y_n)}{max(x_n)-min(x_n)}$

Can you find circumstances under which this estimator outperforms the OLS estimator?

### Week 6: Mincer + heteroskedasticity

Have a look at Wooldridge’s Examples 7.6 and 8.1 about log hourly wages.

1. Obtain the data and load it into R.
2. Construct the dummy variables for the different groups.
3. Using R’s lm, estimate both regression in 7.6. Interpret your results, paying particular attention to the fact that the dependent variable is in logs, that some explanatory variables are binary, and that other variables enter quadratically.
4. Write a new function lm_robust, that returns the results from lm, but with the standard errors replaced by the robust standard errors. This function will be very valuable in the upcoming replication exercises. You will find help in this week’s slides, as well as online.
5. Use your new function to replicate the analysis in Example 8.1.
• Do the robust standard errors matter for inference?
• Can you replicate the results in 8.1?

### Week 7: Fast food

This week, you will conduct an optional practice replication. You will do this by reinvestigation the role of minimum wages on employment.

Visit this page for a data set description, with a reference to the paper that analyzes it.

2. Load the data into R using read_csv in the readr package.

3. Attempt to replicate Table 4, column 1. Replicate it once for full-time employees, and once for part-time employees.

I obtain:

 Dependent variable: dEMPFT dEMPPT (1) (2) STATE 3.449*** -0.917 (1.316) (1.314) Constant -2.737** 0.526 (1.182) (1.181) Observations 392 396 Note: p<0.1; p<0.05; p<0.01
1. Do you have enough information to obtain the exact results in Table 4, Column (i)? Explain.

2. Can you replicate the results in column (iii)?

3. What other variables would you like to have in the data before you make up your mind about the effect of minimum wages on employment?

4. Make a plot that conveys the main message of the paper.

5. What did you learn about the relationship between minimum wages and employment?