---
title: "Instrumental variables"
output:
  beamer_presentation:
    toc: true
    slide_level: 2
    fig_height: 3.5
    fig_width: 6
header-includes:
- \usepackage{amsmath}
---
# Readings
## Prep
```{r}
library(tidyverse)
```
## Readings
- These notes supplement the readings
- Chapters 15 and 16
    - skip 15.6, 15.7
    - skip 16.4, 16.5, 16.6
## Overview
4 hours:
1. Introduction to endogeneity
2. Introduction to instrumental variables
3. Examples
4. Estimation
# Endogeneity
## Linear model
**Regression equation**
$$ Y = X \beta + u $$
or
$$ Y_i = \sum_k \beta_k X_{ik} + u_i $$
plus **assumptions**:
1. Random sampling
2. No multicollinearity
## Linear model (2)
\includegraphics[scale=0.04]{question-mark.png}
What's missing?
## Endogeneity
The $k$-th regressor is **exogenous** if $E(u_i | X_{ik})=0$.
It is **endogenous** if $E(u_i | X_{ik})\neq 0$.
## Diagram
Diagram for $k=1$.
[causal diagram when E4 holds]
## Diagram (2)
[scatterplot with $E(u_i|X_i)>0$ and the resulting bias in $\hat{\beta}$, in ``sketch92.png``]
# Omitted variables
## Problem
- Existence of $X_2$ with $\beta_2 \neq 0$ and $Cov(X_1,X_2)\neq 0$.
- Algebra: true model and estimated model
True model:
$$ Y_i = X_{1i} \beta_1 + X_{2i} \beta_2 + u_i$$
with
$$ \beta_2 >0$$
and
$$Cov(X_{1i},X_{2i})>0$$
## Problem (2)
Estimated model:
$$ Y_i = X_{1i} \beta_1 + \tilde{u}_i$$
so that $$\tilde{u}_i = ?\cdots?$$
## Problem (3)
\includegraphics[scale=0.04]{question-mark.png}
- In the **estimated model**, is the error term uncorrelated with the regressor?
- Is the OLS estimator for $\beta_1$ in the estimated model unbiased?
[solution and wage example: ``sketch-9.3.png``]
## Diagram
[causal diagram: you obtain direct effect of $X_1$ and indirect effect through $X_2$]
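
## Omitted variables: simulation sketch

A minimal simulation of the omitted variables problem, under assumed parameter values ($\beta_1 = 1$, $\beta_2 = 2$, and a positive covariance between the regressors). None of these numbers come from the readings. It compares the slope from the estimated (short) model with the slope from the true (long) model.

```{r}
# Simulated data; all parameter values are illustrative assumptions
set.seed(1)
n  <- 10000
x2 <- rnorm(n)
x1 <- 0.5 * x2 + rnorm(n)          # Cov(X1, X2) = 0.5 > 0, Var(X1) = 1.25
y  <- 1 * x1 + 2 * x2 + rnorm(n)   # true model: beta_1 = 1, beta_2 = 2

fit_long <- lm(y ~ x1 + x2)                 # true model, includes X2
b_long   <- coef(fit_long)
b_short  <- unname(coef(lm(y ~ x1))["x1"])  # estimated model, omits X2

# Short slope = long slope + (coefficient on X2) * Cov(X1, X2) / Var(X1)
b_formula <- unname(b_long["x1"] + b_long["x2"] * cov(x1, x2) / var(x1))
c(short = b_short, long = unname(b_long["x1"]), formula = b_formula)
```

In this design the short regression picks up the direct effect of $X_1$ plus the indirect effect through $X_2$; the population value of the short slope is $1 + 2 \cdot 0.5 / 1.25 = 1.8$.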
## Solution (1)
Add the omitted variable.
## Solution (2)
If you have repeated measurements, you can use fixed effects.
# Types of endogeneity
## Endogeneity: types
1. Omitted variables
2. Measurement error
3. Selection
4. Simultaneity
## Measurement error
[from Chapter 9] Effect on GPA of
- family income:
$$ colGPA_i = \beta_0 + \beta_1 faminc_i^* + \beta_2 hsGPA_i + \beta_3 SAT_i + u_i $$
- smoking marijuana:
$$ colGPA_i = \beta_0 + \beta_1 smoked_i^* + \beta_2 hsGPA_i + \beta_3 SAT_i + u_i $$
## Measurement error (2)
\includegraphics[scale=0.04]{question-mark.png}
- Why may $faminc^*$ be mismeasured?
- Why may $smoked^*$ be mismeasured?
- How about measurement error in your problem sets?
## ME: Algebra
Again, we distinguish the true model from the estimated model:
- **True model** is $$Y_i = \beta_0 + \beta_1 X_i^* + u_i$$
- **Estimated model** is $$Y_i = \beta_0 + \beta_1 X_i + \tilde{u}_i$$
- Mismeasurement: $$X_i = X_i^* + v_i$$
## ME: Algebra (2)
We would like to estimate the true model, but we do not have access to $X_i^*$.
Start from the true model and replace:
$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i^* + u_i \\
&= \beta_0 + \beta_1 (X_i-v_i) + u_i \\
&= \beta_0 + \beta_1 X_i + (u_i - \beta_1 v_i) \\
&= \beta_0 + \beta_1 X_i + \tilde{u}_i.
\end{aligned}
$$
## ME: Algebra (3)
\includegraphics[scale=0.04]{question-mark.png}
Is $\tilde{u}_i \perp X_i$?
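
## ME: simulation sketch

One way to explore the question is a small simulation of the setup $X_i = X_i^* + v_i$. The variances and $\beta_1 = 2$ are assumptions chosen for illustration, and $v_i$ is assumed independent of $X_i^*$ and $u_i$.

```{r}
# Simulated data; all parameter values are illustrative assumptions
set.seed(2)
n      <- 10000
x_star <- rnorm(n)                   # true regressor X*, variance 1
v      <- rnorm(n)                   # measurement error, variance 1
x      <- x_star + v                 # observed regressor X = X* + v
y      <- 1 + 2 * x_star + rnorm(n)  # true model: beta_0 = 1, beta_1 = 2

u_tilde <- y - 1 - 2 * x             # composite error u - beta_1 * v
cov_ux  <- cov(u_tilde, x)           # population value: -beta_1 * Var(v) = -2
b_me    <- unname(coef(lm(y ~ x))["x"])  # population value: 2 * 1 / (1 + 1) = 1
c(cov_u_tilde_x = cov_ux, ols_slope = b_me)
```

Under these assumptions the composite error is correlated with the observed regressor, and the OLS slope is pulled toward zero.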
## Simultaneity
[causal diagram]
## Simultaneity: example
Housing prices and savings:
1. Savings decisions affect what you spend on housing
2. How much you spend on housing affects the amount you save
Say:
$$
\begin{aligned}
housing_t &= \alpha_1 savings_t + \beta_{10} + \beta_{11} inc_t + \beta_{12} educ_t + \beta_{13} age_t + u_{1t} \\
savings_t &= \alpha_2 housing_t + \beta_{20} + \beta_{21} inc_t + \beta_{22} educ_t + \beta_{23} age_t + u_{2t}
\end{aligned}
$$
## Simultaneity: endogeneity
Let's perturb the housing market (say, a tax on foreign buyers). Holding everything else constant:

- $u_{1t}$ goes up
- $housing_t$ goes up ($\times 1$)
- $savings_t$ goes up ($\times \alpha_2$)

Therefore, in the housing price equation, $u_{1t}$ is correlated with $savings_t$, which means that $savings_t$ is endogenous.
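
## Simultaneity: simulation sketch

A stripped-down version of the two-equation system, solved for its reduced form. The controls ($inc_t$, $educ_t$, $age_t$) and intercepts are dropped, and $\alpha_1 = 0.5$, $\alpha_2 = 0.4$ are assumed values, so this is only a sketch of the mechanism.

```{r}
# Simulated data; alpha values and unit error variances are assumptions
set.seed(3)
n  <- 10000
a1 <- 0.5   # effect of savings on housing
a2 <- 0.4   # effect of housing on savings
u1 <- rnorm(n)
u2 <- rnorm(n)

# Reduced form: solve the two equations jointly for housing and savings
housing <- (u1 + a1 * u2) / (1 - a1 * a2)
savings <- (a2 * u1 + u2) / (1 - a1 * a2)

cov_su1 <- cov(savings, u1)   # population value: a2 / (1 - a1 * a2) = 0.5
b_ols   <- unname(coef(lm(housing ~ savings))["savings"])
c(cov_savings_u1 = cov_su1, ols_alpha1 = b_ols, true_alpha1 = a1)
```

With these values, $savings_t$ is correlated with $u_{1t}$, and OLS on the housing equation does not recover $\alpha_1$.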
## Selection: example
Angrist's (1990) paper on wages for Vietnam veterans.
$$ Y_i = \beta_0 + \beta_1 D_i + u_i $$
where
- $Y_i$ is the log of wages
- $D_i$ is a binary variable indicating whether individual $i$ fought in Vietnam
- the regression equation includes many other (control) variables, which we omit here for convenience
## Selection: remark
- This is a very common case, where we are interested in the effect of a **policy** $D$ on an outcome variable $Y$. The literature that studies such policy interventions is called **program evaluation**.
- Main concern: individuals who choose $D_i = 1$ are different from those who choose $D_i = 0$. This is called **self-selection**.
- Problematic if the reason for choosing $D_i=1$ is correlated with $u_i$, in which case we are back to the **omitted variables** case.
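
## Selection: simulation sketch

A hypothetical illustration of self-selection, not based on Angrist's data: an unobserved trait $a_i$ is part of $u_i$ and also drives the choice of $D_i$. The true effect $\beta_1 = -0.1$ and the selection rule are assumptions made for this sketch.

```{r}
# Simulated data; the selection rule and all values are assumptions
set.seed(4)
n <- 10000
a <- rnorm(n)                         # unobserved trait, part of u
d <- as.numeric(a + rnorm(n) > 0)     # individuals with high a choose D = 1
y <- 1 - 0.1 * d + a + rnorm(n)       # true beta_1 = -0.1

b_ols <- unname(coef(lm(y ~ d))["d"]) # difference in means between groups
c(ols = b_ols, true = -0.1)
```

Because the groups differ in $a_i$, the simple comparison of means mixes the policy effect with the difference in the unobserved trait.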
## Selection: Vietnam
\includegraphics[scale=0.04]{question-mark.png}
In Angrist's setting, what is the selection problem? In other words:
1. What is in $u_i$?
2. Why is $u_i$ correlated with $D_i$?
# Instrumental variables
## IV: idea
- **Instrumental variables** methods work whenever $X$ is endogenous, not only when the cause is omitted variables.
\bigskip
- Breakthrough in empirical economics.
- Originally from the 1950s, to deal with simultaneity in supply and demand
- Popularized in microeconometrics in the 1990s.
## IV: idea (2)
[causal diagram, ``sketch-9.3.png``]
- an IV is a quantity outside of the model
- has a priori nothing to do with $Y$
- is correlated only with $X$
- replaces E.4: $$E(u_i|X_i) \neq 0,~ E(u_i|Z_i)=0$$
## IV: idea (3)
[flipped causal diagram, ``sketch-9.6.png``]
## IV: requirements
For an instrument to be useful, we need that
- $Z$ predicts the endogenous variable $X$ (**relevance**)
    - testable
- $Z$ is uncorrelated with $u$ (**validity**)
    - not testable
More care is required when additional (exogenous, endogenous) regressors are involved: _next week_
## Vietnam
- Angrist (1990)
- Binary $X$ and $Z$:
$$\log(wage_i) = \beta_0 + \beta_1 veteran_i + u_i$$
- Instrument: $Z_i = 1\{\text{lottery draw} < \text{cutoff}\}$
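
## Vietnam: binary instrument sketch

With a binary instrument, the ratio $Cov(Y_i,Z_i)/Cov(X_i,Z_i)$ reduces to a ratio of differences in group means. The sketch below uses simulated data only; the draft-eligibility share, the first stage, and the effect $\beta_1 = -0.15$ are assumptions, not Angrist's estimates.

```{r}
# Simulated data; nothing here is taken from Angrist (1990)
set.seed(5)
n <- 20000
z <- rbinom(n, 1, 0.5)                               # hypothetical draft eligibility
a <- rnorm(n)                                        # unobserved trait in u
x <- as.numeric(0.8 * z - 0.5 * a + rnorm(n) > 0.5)  # veteran status
y <- 2 - 0.15 * x + 0.5 * a + rnorm(n, sd = 0.5)     # log wage, true beta_1 = -0.15

# Difference in mean outcome over difference in mean treatment, by Z
wald <- (mean(y[z == 1]) - mean(y[z == 0])) /
        (mean(x[z == 1]) - mean(x[z == 0]))
iv   <- cov(y, z) / cov(x, z)                        # same number, covariance form
ols  <- unname(coef(lm(y ~ x))["x"])
c(wald = wald, iv = iv, ols = ols)
```

In this design OLS overstates the wage penalty because veterans differ in $a_i$, while the instrument-based ratio stays close to the assumed effect.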
## Vietnam: instrument properties
\includegraphics[scale=0.04]{question-mark.png}
- Is $Z_i$ relevant?
- Is $Z_i$ valid?
## Vietnam: exogeneity
Possible threat to validity: with a high draft number you are unlikely to be drafted, so companies are less worried about you leaving and will invest more in your education.
## IV: algebra
- One endogenous regressor $X_i$
- One instrument $Z_i$
$$Y_i = \beta_0 + \beta_1 X_i + u_i$$
and $Cov(X_i,Z_i) \neq 0,~E(u_i|Z_i)=0$.
\bigskip
**Note**:
- Not necessarily $E(u_i|X_i)=0$
- If $E(u_i|X_i)=0$ then pick $Z_i=X_i$
## IV: Algebra (2)
Take the covariance with $Z_i$ on LHS and RHS
[whiteboard,``sketch-9.5.png``]
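
One way to write out this step, using only the model and the two instrument conditions from the previous slide:

$$
\begin{aligned}
Cov(Y_i,Z_i) &= Cov(\beta_0 + \beta_1 X_i + u_i, Z_i) \\
&= \beta_1 Cov(X_i,Z_i) + Cov(u_i,Z_i) \\
&= \beta_1 Cov(X_i,Z_i),
\end{aligned}
$$

because $E(u_i|Z_i)=0$ implies $Cov(u_i,Z_i)=0$. Since $Cov(X_i,Z_i) \neq 0$, we can divide: $$\beta_1 = \frac{Cov(Y_i,Z_i)}{Cov(X_i,Z_i)}$$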
## IV: Algebra (3)
For the IV estimator,
$$
\begin{aligned}
\hat{\beta}_{1,iv} &= \frac{\widehat{Cov}(Y_i,Z_i)}{\widehat{Cov}(X_i,Z_i)} \\
&= \frac{\widehat{Cov}(Y_i,Z_i) / \widehat{Var}(Z_i)}{\widehat{Cov}(X_i,Z_i) / \widehat{Var}(Z_i)}
\end{aligned}
$$
so that it is the ratio of the slope coefficient in ``lm(Y~Z)`` to the slope coefficient in ``lm(X~Z)``.
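
## IV: Algebra (R sketch)

The ratio can be checked numerically. The data below are simulated under assumed values (true $\beta_1 = 2$, a confounder $a_i$ inside $u_i$); the variable names are made up for this sketch.

```{r}
# Simulated data; all parameter values are illustrative assumptions
set.seed(7)
n <- 5000
z <- rnorm(n)                         # instrument
a <- rnorm(n)                         # unobserved confounder, part of u
x <- 0.7 * z + a + rnorm(n)           # endogenous regressor
y <- 1 + 2 * x - 1.5 * a + rnorm(n)   # true beta_1 = 2

b_cov   <- cov(y, z) / cov(x, z)                                  # covariance ratio
b_ratio <- unname(coef(lm(y ~ z))["z"] / coef(lm(x ~ z))["z"])    # ratio of slopes
b_ols   <- unname(coef(lm(y ~ x))["x"])                           # biased by a
c(cov_ratio = b_cov, slope_ratio = b_ratio, ols = b_ols)
```

The two IV expressions agree up to floating-point error, since both slopes share the denominator $\widehat{Var}(Z_i)$.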
## IV: Algebra (4)
[corresponding picture, ``sketch-9.7.png``]
# Hour 3: Examples
## Examples: intro
- Sequence of examples
- Structure:
0. The model is $$Y_i = \beta_0 + \beta_1 X_i + W_i \gamma + u_i, ~ E(u_i | W_i, Z_i) = 0$$
1. List $(Y_i, X_i, Z_i, W_i)$
2. Why do we care about the **causal** effect of $X_i$ on $Y_i$?
3. What is in $u_i$?
4. Why is $E(u_i | X_i ) \neq 0$, i.e. what is the **endogeneity problem**?
5. Why is $Cov(Z_i,X_i) \neq 0$: instrument **relevance**
6. Why is $E(u_i | Z_i) = 0$: instrument **validity**
## Examples: intro (2)
Once you believe 6. and have checked 5., you can use ``ivreg(Y ~ X + W | Z + W)`` (from the ``AER`` package) to estimate the causal effect of $X_i$ on $Y_i$. Theory: later.
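
## Examples: intro (R sketch)

Without relying on ``ivreg``, the same point estimate can be obtained in two ``lm`` steps. This is a sketch on simulated data with one control $W_i$; the coefficients are assumptions, and the standard errors reported by the second ``lm`` would not be the correct IV standard errors.

```{r}
# Simulated data; all parameter values are illustrative assumptions
set.seed(8)
n <- 5000
w <- rnorm(n)                               # exogenous control W
z <- rnorm(n)                               # instrument Z
a <- rnorm(n)                               # unobserved, part of u
x <- 0.6 * z + 0.4 * w + a + rnorm(n)       # endogenous regressor X
y <- 1 + 2 * x + 0.5 * w - a + rnorm(n)     # true beta_1 = 2

first  <- lm(x ~ z + w)                     # step 5: check relevance of Z
x_hat  <- fitted(first)                     # predicted X from Z and W
b_2sls <- unname(coef(lm(y ~ x_hat + w))["x_hat"])
b_ols  <- unname(coef(lm(y ~ x + w))["x"])
c(first_stage_z = unname(coef(first)["z"]), two_stage = b_2sls, ols = b_ols)
```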
## Examples: source
The following examples, and several others, can be found in Murray (2004).
## Example 0: Colonial origins
**PS7**
## Example 1: Twin study
**PS7**. Ashenfelter and Krueger (1994).
1. $(Y,X,Z)$:
    - $Y$: $\Delta$ income
    - $X$: $\Delta$ education
    - $Z$: $\Delta$ education (cross-reported)
## Example 1 (2)
\includegraphics[scale=0.04]{question-mark.png}
2. Why do we care about the causal effect of education on income?
3. What is in $u_i$?
## Example 1 (3)
\includegraphics[scale=0.04]{question-mark.png}
4. Why is $\Delta$ education endogenous?
5. Why is $Cov(Z_i,X_i) \neq 0$?
6. Why is $E(u_i | Z_i) = 0$?
## Example 1 (F)
\includegraphics[scale=0.04]{question-mark.png}
**Finding.**
## Example 1 (A)
3. Things that influence wages other than education
4. Measurement error
5. Twins probably know something about each other's education level.
6. Valid: guessing with mean-0 error, not related to other factors, _after controlling for true education_
7. What if I think my brother's education is higher **because** he has high income?
## Example 2: Public housing
\includegraphics[scale=0.4]{sketch-9-8.png}
## Example 2: Public housing
\includegraphics[scale=0.4]{sketch-9-9.png}
## Example 2: Public housing
Currie and Yelowitz (2000).
- Survey data from the U.S. (households with two children)
- Regressions of school performance on indicator for "does family live in public housing": negative correlation
## Example 2 (2)
1. $(Y,X,Z)$:
    - $Y$: school performance
        - repeat grade
        - change schools
    - $X$: living in public housing
    - $Z$: sex composition: two boys / boy-girl / two girls
    - $W$: empty
2. ??? Why do we care about the causal effect of $X$ on $Y$?
3. ??? What is in $u_i$?
## Example 2 (3)
"Public housing" is a program where you pay a fee of 25% of your income, and you are allocated an apartment depending on your family (gender) composition.
## Example 2 (4)
\includegraphics[scale=0.04]{question-mark.png}
4. Why is "public housing" indicator endogenous?
5. Why is $Cov(Z_i,X_i) \neq 0$?
6. Why is $E(u_i | Z_i) = 0$?
## Example 2 (F)
**Finding.**
- Households with one boy and one girl are 24% more likely to live in public housing
- IV estimates: better outcomes if you live in public housing (11% less likely to be "left behind")
## Example 2 (A)
2. Equal opportunity to education: if public housing has a negative effect on school performance, and you care about equality of opportunity, then that is a motivation for housing vouchers.
4. Public housing tenants may have unmeasured traits that contribute to poor academic performance.
5. Having a boy and a girl means you need two rooms, and public housing becomes more attractive, since public housing comes at a fixed cost.
6. Whether you have a boy or a girl is determined by chance. It is correlated with your decision to live in the projects, but it is independent of everything else, including $u_i$.
## Example 3: Incarceration and crime
Atlantic:
\includegraphics[scale=0.4]{sketch-9-10.png}
## Example 3: Incarceration and crime
ABC news: it's complicated
\includegraphics[scale=0.4]{sketch-9-11.png}
## Example 3: Incarceration and crime
- Levitt (1996)
- U.S. state-level data
1. $(Y,X,Z)$:
    - $Y$: crime rate
    - $X$: incarceration rate
    - $Z$: was a lawsuit filed against the state for prison overcrowding?
    - $W$: empty
## Example 3: Incarceration and crime
\includegraphics[scale=0.04]{question-mark.png}
2. Why do we care about the causal effect of $X_i$ on $Y_i$?
3. What is in $u_i$?
4. Why is "incarceration rate" endogenous?
5. Why is $Cov(Z_i,X_i) \neq 0$?
6. Why is $E(u_i | Z_i) = 0$?
## Example 3 (F)
**Findings.**
- OLS: negative, small
- IV: 2-3 times the size of OLS: incarceration works
## Example 3 (A)
4.
5. When faced with a lawsuit, states preemptively reduce incarceration rates. If the lawsuits are won, they must reduce incarceration rates further.
## Example 4: education and income
\includegraphics[scale=0.4]{sketch-9-12.png}
## Example 4: Canada
Compulsory schooling age:

- Manitoba, Ontario and New Brunswick: 18
- Others: 16

Who is right?
## Example 4: education and income
- Angrist and Krueger (1991)
- U.S. individual level data on men
1. $(Y,X,Z)$:
    - $Y$: income
    - $X$: years of education
    - $Z$: quarter of birth [Jan-Mar; Apr-Jun; Jul-Sep; Oct-Dec]
    - $W$: empty
## Example 4
\includegraphics[scale=0.04]{question-mark.png}
2. Why do we care about the causal effect of $X$ on $Y$?
3. What is in $u_i$?
4. Why is "education" endogenous?
5. Why is $Cov(Z_i,X_i) \neq 0$?
6. Why is $E(u_i | Z_i) = 0$?
## Example 4 (F)
- Findings similar to OLS
- Conclusion: endogeneity is not much of a problem
- Spurred "weak instruments" literature in microeconometrics
## Example 4 (A)
2. Government needs to set compulsory schooling age
3.
4.
5. Correlation:
    - To enter first grade in Sep 2005, you must turn 6 before Jan 1, 2006.
    - If you are born early in the year, you start school at an older age:
        - born Dec 30, 1990: start school in Sep 1996
        - born Jan 2, 1991: start school in Sep 1997
    - You are allowed to leave school as soon as you turn 16 (or 17, depending on the state).
    - Born earlier in the year: you will leave school, on average, with less education
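
## Example 4: schooling arithmetic (R sketch)

The entry and leaving rules can be turned into a small calculation. The function name ``schooling_at_16`` is made up for this sketch, and it assumes a stylized rule: school starts on Sep 1 of the calendar year in which you turn 6, and you leave on your 16th birthday.

```{r}
# Stylized rule, assumed for illustration (ignores Feb 29 birthdays)
schooling_at_16 <- function(birth) {
  birth <- as.Date(birth)
  year  <- as.integer(format(birth, "%Y"))
  start <- as.Date(paste0(year + 6, "-09-01"))                  # first day of school
  leave <- as.Date(paste0(year + 16, format(birth, "-%m-%d")))  # 16th birthday
  as.numeric(leave - start) / 365.25                            # years in school
}

# Born three days apart, on either side of the Jan 1 cutoff
s <- schooling_at_16(c("1990-12-30", "1991-01-02"))
s
```

Under this rule the child born on Jan 2 has had roughly one year less schooling at age 16 than the child born on Dec 30.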
# Hour 4: Two stage least squares
## Model
$$
\begin{aligned}
Y_i &= X_i \beta + u_i \\
0 &= E(u_i|Z_i)
\end{aligned}
$$
where $X_i = (X_{1i},X_{2i})$ and
$Z_i = (X_{2i},...)$.
## Overview
1. Fix $\dim(Z_i) = \dim(X_i) = K$
- show $\beta = E(Z_i'X_i)^{-1}E(Z_i'Y_i)$
- show analog estimator is consistent
2. Consider $\dim(Z_i)>\dim(X_i)$
- reduce dimension to $K$ through rotation $A'$
- creates instruments $AZ_i$
- apply 1.
3. Two stage least squares
- corresponds to $A = ...$
- intuition
4. Topics:
- $X=Z$
- weak instruments
- standard errors
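
## Overview: analog estimator (R sketch)

For the case $\dim(Z_i) = \dim(X_i)$, the sample analog of $E(Z_i'X_i)^{-1}E(Z_i'Y_i)$ can be computed directly with matrices. The data are simulated under assumed values $\beta = (2, -1)$, with $X_{1i}$ endogenous and $X_{2i}$ serving as its own instrument; ``z1`` is a hypothetical excluded instrument.

```{r}
# Simulated data; all parameter values are illustrative assumptions
set.seed(10)
n  <- 5000
x2 <- rnorm(n)                               # exogenous regressor
z1 <- rnorm(n)                               # excluded instrument for x1
a  <- rnorm(n)                               # unobserved, part of u
x1 <- 0.8 * z1 + 0.3 * x2 + a + rnorm(n)     # endogenous regressor
y  <- 2 * x1 - 1 * x2 - a + rnorm(n)         # true beta = (2, -1), no intercept

X <- cbind(x1, x2)
Z <- cbind(z1, x2)

# Sample analog: solve (Z'X) b = Z'y
beta_iv  <- drop(solve(crossprod(Z, X), crossprod(Z, y)))
beta_ols <- drop(solve(crossprod(X), crossprod(X, y)))
rbind(iv = beta_iv, ols = beta_ols)
```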