---
title: "Panel data"
output:
  beamer_presentation:
    toc: true
    keep_tex: true
    slide_level: 2
    fig_height: 3.5
    fig_width: 6
header-includes:
  - \usepackage{amsmath}
---

# Intro
## Prep
```{r}
library(tidyverse)
```
## Panel data
A **panel data** set is a data set that has two dimensions:

- cross-sectional
- time
## PD: Example
```{r}
data("CigarettesSW",package="AER")
qplot(tax,packs,data=CigarettesSW) + facet_wrap(~year)
```
## PD: Example (2)
```{r}
data("PSID7682",package="AER")
head(PSID7682 %>% select(id,year,wage,education),20)
```
## Repeated measurements
Panel data gives us **repeated measurements** on the same unit.
\bigskip
- multiple **time periods** for a country
- identical twins
- multiple babies to a mother
- multiple outcome variables
- opponents in a tournament
## Why use panel data?
1. Panel data sets contain **more information**
    - smaller standard errors than comparable cross-sections
2. Estimate how (conditional) distributions change over time
3. Real innovation:
    - estimate models with heterogeneous individuals
    - Under E.1-E.4, all individuals are the same, conditional on $X_i$
    - $E[u_i \mid X_i]=0$: people are identical, in expectation
    - Repeated measurements allow you to control for the fact that people are different:
    - **unobserved heterogeneity**
## Panel data model: sampling
We have a large random sample of size $n$ on
$$ (Y_i,X_i) = ((Y_{i1},\cdots,Y_{iT}),(X_{i1},\cdots,X_{iT})),$$
for a total number of $nT$ observations.
## Panel data model: equation
The **regression equation** for a panel data model looks like:
$$ Y_{it} = h(X_{it},\beta,\alpha_i,\gamma_t,u_{it}),$$
where, in addition to the extra time index on $Y$, $X$, and $u$, we have unit-specific parameters $\alpha_i$ and time-specific parameters $\gamma_t$.
## Unobserved heterogeneity
The unobserved heterogeneity, or "fixed effect", $\alpha_i$ allows individuals to

- structural models: have different production functions / utility functions
- reduced form: have systematically different APEs

Economists are generally unwilling to assume $\alpha_i \perp X_i$. The following examples provide some motivation.
# Examples
## Intro
(from ECON435)
- To understand why panel data can be useful: examples.
- In each example: $$ y_{it} = \alpha_i + X_{it} \beta + u_{it}$$
- Repeated measurements for each $i$ across $t$
- $u_{it}$ will be an error term as in the first part of this course
- $\alpha_i$ represents a unit-specific error term:
    - unobserved (not in $X$) characteristics of the individual
    - time-invariant
- Questions:
    - What is in $\alpha_i$?
    - Is it correlated with $X_{it}$?
## Hint:
- Yes, it is correlated with $X_{it}$.
- (But: Why?)
- Preview of panel data models:
    - $E(u_{it}+\alpha_i \mid X_{it}) \neq 0$
## Example 1: Birthweight and smoking
- Data on
    - birthweight of newborns $(y_{it})$
    - smoking behavior of their mother $(X_{1,it})$
- $(i,t)$ refers to the $t$th newborn of mother $i$

Linear panel data model:
$$y_{it} = \alpha_i + X_{it} \beta + u_{it}$$
where $X_{it}$ includes

- $X_{1,it}$,
- age,
- income,
- education
## Example 1: Questions
\includegraphics[scale=0.04]{questionmark.png}
1. What are other factors that could influence a baby's birthweight?
2. Do you believe that those factors do not change over time?
3. Do you believe that those factors are correlated with smoking?
## Example 1: Answers
1. genetics, diet, exercise, healthy behavior, diligence in precautions for the baby's outcomes
2. genetics: yes; diligence: probably not; healthy behavior: probably pretty persistent
3. genetics: tricky; it should be different genes that make you tall and that make you smoke

The _unobserved heterogeneity_ can consist of _specific_ omitted variables, or of _vaguer_ concepts such as "health awareness" or "healthy behavior".
## Example 2: Skytrain
- Data on housing prices in two periods, $t=1,2$
- In between, a skytrain is built
- $D_i$ is a binary indicator: is the skytrain a <5 min walk from house $i$?

Relationship:
$$y_{it} = \alpha_i + D_{i} \beta + X_{it} \gamma + u_{it}$$
where $X_{it}$ is a vector of the characteristics of the house:

- number of bedrooms,
- bathrooms, and
- square footage, and the
- year that it was built
- ...
## Example 2: Skytrain: Questions
1. What does $\alpha_i$ capture?
2. Why is it correlated with $D_i$?
## Example 2: Skytrain: Answers
1. Amenities (parks, schools, shopping); location (density, ...)
2. Think about the decision to build the skytrain station. There are several ways in which this decision-making process can induce correlation between $\alpha_i$ and $D_{i}$. If the social planner is trying to develop a new neighbourhood, they may be looking for a spot with cheap (c.p.) plots. Alternatively, they may be targeting areas with high density, or with a lot of amenities, because that likely increases the usage rate. These are two opposing mechanisms. It is unlikely that they cancel out exactly.
Compare the **incinerator** example in Wooldridge.
## Example 3: Texting bans
- Existing literature: $$P(\text{death} \mid \text{driving} + \text{phone}) = 4 \times P(\text{death} \mid \text{driving} + \text{no phone})$$
- People continue to text. Why?
$$ Y_{i,m} = \alpha_i + \delta_m + X_{im}\beta + \omega B_{im} + u_{im} $$
where:
- $i$ is state, $m$ is month
- $Y$ is (log of) traffic fatalities
- $X$ includes
    - population
    - proportion male
    - unemployment
    - gas tax
- $B$: is a texting ban in place?
## Example 3: Texting bans (2)
- What's in $\alpha_i$?
- Correlated with $X$?
- Correlated with $B$?
## Example 3: Texting bans
Finding: $\hat{\omega} = 0.0374.$

- Interpret this finding.

Details:

- No effect for "weak bans"
- No effect except for single-occupancy vehicles
- Effect starts when findings are announced, disappears four months after the ban is in effect
## Example 4: Mafia and public spending
From PS8, AER (2014)
$$ Y_{it} = \alpha_i + G_{it} \beta + \gamma_t + X_{it} \delta + u_{it}$$

- $i$ is an Italian province, $t$ is a year (1990-1999)
- $Y_{it}$ is the rate of growth
- $G_{it}$ is government spending on infrastructure in province $i$
- $X_{it}$: controls
## Example 4: Mafia
In terms of growth rates:

- What is captured by $\gamma_t$?
- What is captured by $\alpha_i$?

Why is $\alpha_i$ correlated with $G_{it}$?
\bigskip
Note: Paper uses panel data and IV.
## Example 5: A communitycollege teacher like me
Fairlie et al., AER (2014)
$$ Y_{ic} = \alpha_i + \lambda_c + \beta_1 Z_{ic} + X_{ic} \gamma + u_{ic}$$
where

- $Y_{ic}$:
    - dropped course?
    - passed course | finishing
    - grade | finishing
    - good grade? | finishing
    - enrolled in a similar course subsequently?
- $Z_{ic}$ is an indicator for whether student $i$ and the instructor of course $c$ are part of the same minority
- $X_{ic}$ is a vector of controls
- What does $\alpha_i$ capture?
- What does $\lambda_c$ capture?
## Example 5: A
$\lambda_c$ and $\alpha_i$

> control[s] for instructor fixed effects and minority-specific course fixed effects. The former controls for the possibility that minority students take courses from instructors who have systematically different grading policies from other instructors, while the latter controls for selection by comparative advantage where minority students are drawn to courses that are a particularly good match.

p. 2574, Fairlie, Hoffmann, Oreopoulos (2014)
## Example 5: Findings
No findings without fixed effects. With fixed effects:

- dropped course?: $0.02^{***}$
- passed course | finishing: $0.012$
- grade | finishing: $0.054^{***}$
- good grade? | finishing: $0.024^{***}$
- enrolled in a similar course subsequently? $0.013^{*}$
## Example 6: Income and democracy
**PS6**
$$ \text{democracy}_{it} = \alpha_i + \text{GDP}_{it} \beta + u_{it}$$
1. Reverse causality?
2. What is in $\alpha_i$?
# Does it matter?
## Introduction: beertax example
(from Stock and Watson)
```{r}
## Load the fatality data
library(haven)
beer_fatality <- read_dta("fatality.dta")
summary(beer_fatality)
```
## beertax:plot
```{r}
library("ggplot2")
qplot(beertax,mrall,data=beer_fatality)
```
## beertax: OLS
```{r}
## Scale up fatalities for readability
beer_fatality$mrall <- beer_fatality$mrall*100000
ols_reg <- lm(mrall~beertax,data=beer_fatality)
summary(ols_reg)
```
## beertax: Plot for each year:
```{r}
p <- qplot(beertax,mrall,data=beer_fatality)
p <- p + facet_wrap(~year)
p
```
## beertax: Fixed effects results
```{r}
# Fixed effects estimator
fe_reg <- lm(mrall~beertax+as.factor(state),data=beer_fatality)
library(stargazer)
stargazer(ols_reg,fe_reg,
type = "text",
keep = "beertax",
keep.stat = c("n"))
```
## Takeaways
1. What is in $\alpha_i$?
    - "unobserved heterogeneity", or:
    - "fixed effect", or:
    - an intercept specific to the cross-section unit, or:
    - an omitted variable that does not change over time
2. Key feature of panel data in economics: **un**observed heterogeneity is generally correlated with the **observ**ables $X_{it}$
## Unobserved heterogeneity: problem
- 1: Geometrically: _sketch81_ and _sketch82_
- 2: Information from within and between dimension
- 3: Between information not reliable when $\alpha_i \not\perp X_i$
# Incidental parameters problem
## Unobserved heterogeneity
To solve the issue that $\alpha_i$, when treated as a RV, is correlated with $X_i$ in most economic applications, we can treat it as a parameter to be estimated along with $\beta$.
This leads to the **incidental parameters problem**:
- the size of the parameter space grows with $n$
- not covered by the standard extremum estimation proof
- inconsistency for conventional estimators of the common parameters
    - unless $T \to \infty$ or
    - the model is linear + static
## Example
(from Manuel Arellano)
$$ X_{it} \sim \mathcal{N}(\alpha_i,\sigma^2),~i=1,\cdots,n,~t=1,\cdots,T $$
where $T \geq 2$, $n \to \infty$ and sampling is iid across $(i,t)$.
## Example: likelihood
- This is a fully specified model for $X_{it}$: use MLE.
- Objective: consistent estimator for the common parameter $\sigma^2$

The log-likelihood is given by
$$\mathcal{L}(\sigma,(\alpha_i)_i) \propto -nT \log \sigma -
\sum_i \sum_t \frac{1}{2} \left( \frac{X_{it}-\alpha_i}{\sigma} \right)^2$$
## Example: individual parameter estimates
Solve FOC for $\hat{\alpha}_i$:
$$ 0 = \frac{1}{\hat{\sigma}^2} \sum_t \left( X_{it} - \hat{\alpha}_i\right) $$
so that
$$ \hat{\alpha}_i = \frac{1}{T} \sum_t X_{it} \equiv \bar{X}_i$$
## Example: common parameters
Solve the FOC for $\hat{\sigma}$:
$$ 0 = -\frac{nT}{\hat{\sigma}} + \frac{1}{\hat{\sigma}^3} \sum_i \sum_t \left(X_{it}-\hat{\alpha}_i\right)^2$$
so that
$$ \hat{\sigma}^2 = \frac{1}{nT} \sum_i \sum_t \left(X_{it}-\bar{X}_i\right)^2$$
## Example: inconsistency
$$
\begin{aligned}
\hat{\sigma}^2 &\stackrel{p}{\to} E \frac{1}{T} \sum_t \left(X_{it}-\bar{X}_i\right)^2 \\
&= \frac{T-1}{T} \sigma^2
\end{aligned}
$$
where the convergence follows from the LLN. For the equality, remember the degrees-of-freedom correction for the sample variance from undergrad.

Conclusion:

- inconsistent by a factor of 2 if $T=2$
- the inconsistency disappears as $T \to \infty$
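## Example: simulation check

The limit above can be checked by a quick simulation. A minimal sketch (sample sizes and variable names are illustrative): draw $X_{it}$ from the model with $T=2$ and $\sigma^2=4$, so the MLE should land near $\sigma^2/2 = 2$.

```{r}
## Incidental parameters problem for T = 2:
## sigma2_hat converges to (T-1)/T * sigma^2, not sigma^2.
set.seed(1)
n <- 100000; T <- 2; sigma2 <- 4
alpha <- rnorm(n, sd = 3)                  # incidental parameters alpha_i
X <- matrix(rnorm(n * T, mean = alpha, sd = sqrt(sigma2)), nrow = n)
alpha_hat <- rowMeans(X)                   # MLE of each alpha_i: Xbar_i
sigma2_hat <- mean((X - alpha_hat)^2)      # MLE of sigma^2
sigma2_hat                                 # near 2 = sigma2/2, not 4
```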
## IP: solutions
Solutions to the incidental parameters problem:
0. Bias corrections: $\times T / (T1)$
1. Transforming the model to remove $\alpha_i$
2. Finding a sufficient statistic for $\alpha_i$
3. Large$T$ solutions + biasreduction
## Taxonomy
Solutions to the IP problem tend to be model- or class-specific. The remainder of this lecture gives you a menu of models and explains how the IP problem has been solved in those settings.

1. Relationship $(\alpha_i,X_i)$: "fixed effects" v "random effects"
2. Number of time periods $T$: "fixed-$T$" v "large-$T$"
3. Linear v nonlinear
4. Dynamic v static
5. One-dimensional v multidimensional heterogeneity; additivity

see ``sketch72.png``
# The static linear model
## Model equation
In the static linear model, the regression equation states that for all $i=1,\cdots,n$ and for all $t=1,\cdots,T$:
$$ Y_{it} = X_{it} \beta + \alpha_i + u_{it} $$
Stack these equations for each $i$ across $t$ into
$$ Y_i = X_i \beta + \alpha_i \iota_T + u_i = X_i \beta + v_i$$
where:

- $Y_i$ is a $T \times 1$ vector consisting of the dependent variables for unit $i$
- $X_i$ is a $T \times K$ matrix
- $\iota_T$ is a $T \times 1$ column vector of ones
- $v_i$ is the **composite error**
## Model
To turn this equation into a model, we need to impose distributional assumptions on
$$(\alpha_i,X_i,u_i)$$
We will do this (unconventionally) at the end of this lecture. We will first define a class of estimators.
## Estimators
The object of interest is $\beta$. All estimators for $\beta$ will be of the form
$$ \hat{\beta}_A = \left( \sum_i X_i' A' A X_i \right)^{-1}
\left( \sum_i X_i' A' A Y_i \right)$$
This corresponds to the OLS estimator in the **transformed** linear model
$$ AY_i = AX_i \beta + Av_i $$
## Estimator 1: Pooled OLS

- "Run a regression of $Y$ on $X$."
- Corresponds to $A=I_T$
- Ignores the panel aspect
\bigskip
Preview: consistency requires ... ?
## Estimator 2: LSDV

- "Add a dummy variable for each country"
- Corresponds to estimation in
$$
\begin{aligned}
Y_i &= X_i \beta + \sum_j 1\{i = j\} \alpha_j \iota_T + u_i \\
&= X_i \beta + D_i \alpha + u_i
\end{aligned}
$$
where $D_i$ is a $T \times n$ matrix in which each column corresponds to a dummy variable for a country.
## Estimator 2: LSDV: FWL
By FWL, we know that regressing $y$ on $X_1$ and $X_2$ is equivalent to

1. regressing $y$ on $X_1$, call the residuals $y^*$
2. regressing $X_2$ on $X_1$, call the residuals $X_2^*$
3. regressing $y^*$ on $X_2^*$

The equivalence is for $\hat{\beta}_2$.
## Estimator 2: LSDV: FWL
Running OLS of $y$ on the dummies $\{1\{i=j\}\}_j$ and $X$ is equivalent to:

1. $y$ on $\{1\{i=j\}\}_j$, call the residuals $My$
2. $X$ on $\{1\{i=j\}\}_j$, call the residuals $MX$
3. $My$ on $MX$
## Estimator 2: LSDV: FWL (2)
- coefficient estimates are $(Z'Z)^{-1} Z'y$, so the
- predictions are $Z(Z'Z)^{-1} Z'y$, so the
- residuals are $$y - Z(Z'Z)^{-1} Z'y = [ I - Z(Z'Z)^{-1} Z' ] y$$
## Estimator 2: LSDV: FWL (3)
Here $Z$ is the matrix of unit dummies, so $Z'Z = T I_n$ and
$$Z(Z'Z)^{-1} Z' = \frac{1}{T} Z Z'$$
## Estimator 2: LSDV: FWL (4)
Conclusion:
$$ I - Z(Z'Z)^{-1} Z' =
\left( \begin{matrix} 1-1/T & -1/T & -1/T & 0 & 0 & 0 \\
-1/T & 1-1/T & -1/T & 0 & 0 & 0 \\
-1/T & -1/T & 1-1/T & 0 & 0 & 0 \\
0 & 0 & 0 & 1-1/T & -1/T & -1/T \\
0 & 0 & 0 & -1/T & 1-1/T & -1/T \\
0 & 0 & 0 & -1/T & -1/T & 1-1/T
\end{matrix} \right)$$
so that the LSDV estimator for $\hat{\beta}$ is equivalent to $\hat{\beta}_{A_\text{FE}}$ with
$$A_\text{FE} = I_T - \frac{1}{T} \iota_T \iota_T'$$
## Estimator 2: LSDV + FE
An alternative way to think about the LSDV estimator.
In
$$Y_{it} = X_{it}\beta + \alpha_i + u_{it}$$
take averages across time on both sides of the equality to obtain
$$ \bar{Y}_i = \bar{X}_i \beta + \alpha_i + \bar{u}_i $$
Now subtract the latter from the former to obtain
$$ \tilde{Y}_i = \tilde{X}_i \beta + \tilde{u}_i $$
where $\tilde{Y}_{it} = Y_{it} - \bar{Y}_i$, and similarly for $\tilde{X}$ and $\tilde{u}$.
## Estimator 2: FE
- We use the time-invariance of $\alpha_i$ to eliminate it from the (transformed) model.
- The relevant property of the transformation matrix $A$ is that $A \iota_T = 0$.
- If $A \iota_T = 0$, then $Av_i = Au_i$

_Preview_ Assumption for consistency of FE: $E(Au_i \mid AX_i) = 0$. No restriction on $(\alpha_i,X_i)$.
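## Estimator 2: FE: numerical check

A small numerical check, on simulated data (all names and sizes here are illustrative), that the within (demeaning) transformation and LSDV give the same $\hat{\beta}$:

```{r}
## LSDV and the within (demeaning) estimator give the same beta_hat.
set.seed(42)
n <- 50; T <- 4
id <- rep(1:n, each = T)
alpha <- rep(rnorm(n), each = T)
x <- alpha + rnorm(n * T)                  # regressor correlated with alpha_i
y <- 2 * x + alpha + rnorm(n * T)
beta_lsdv <- coef(lm(y ~ x + factor(id)))["x"]
x_w <- x - ave(x, id)                      # within transformation:
y_w <- y - ave(y, id)                      #   subtract unit means
beta_fe <- coef(lm(y_w ~ 0 + x_w))["x_w"]
all.equal(unname(beta_lsdv), unname(beta_fe))   # TRUE
```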
## Estimator 3: FD
An alternative way to exploit the timeinvariance of the unobserved heterogeneity is
$$
\begin{aligned}
Y_{it} &= X_{it}\beta + \alpha_i + u_{it} \\
Y_{i,t-1} &= X_{i,t-1}\beta + \alpha_i + u_{i,t-1} \\
\Delta Y_{it} &= \Delta X_{it} \beta + \Delta u_{it}
\end{aligned}
$$
The resulting (differenced) equation also does not feature $\alpha_i$.
## Estimator 3: FD: Transformation
The corresponding transformation is
$$ A_\text{FD} = \left( \begin{matrix} -1 & 1 & 0 & 0 \\
0 & -1 & 1 & 0 \\
0 & 0 & -1 & 1
\end{matrix} \right)$$
which also satisfies $A \iota_T = 0$.
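## Estimator 3: FD: check

A one-line check, with illustrative numbers, that this transformation produces first differences and annihilates $\iota_T$:

```{r}
## A_FD applied to a T = 4 vector returns its first differences,
## and A_FD %*% iota_T = 0, so the fixed effect drops out.
A_FD <- rbind(c(-1, 1, 0, 0),
              c(0, -1, 1, 0),
              c(0, 0, -1, 1))
y <- c(2, 5, 4, 7)
all(A_FD %*% y == diff(y))        # TRUE
all(A_FD %*% rep(1, 4) == 0)      # TRUE
```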
## Estimator 3
Some other possibilities:
$$
\begin{aligned}
A_\text{LD} &= \left( \begin{matrix} 1 & 0 & 0 & -1 \\
0 & 1 & 0 & -1 \\
0 & 0 & 1 & -1
\end{matrix} \right) \\
A_\text{FO} &= \left( \begin{matrix} 1 & -1/3 & -1/3 & -1/3 \\
0 & 1 & -1/2 & -1/2 \\
0 & 0 & 1 & -1
\end{matrix} \right)
\end{aligned}
$$
## Estimator 4: Between
Between estimator sets
$$A_\text{BE} = \frac{1}{T} \iota_T \iota_T'$$
 Note that $A_\text{BE} = \iota_T$
 Corresponds to running a regression on country averages
## Estimator 5: Random effects
The random effects estimator is the optimal linear combination of the between and within (FE) estimators.
## Consistency
``sketch73.png``

1. Consistency of OLS
2. Discuss exogeneity and multicollinearity assumptions in the model $$AY_i = AX_i \beta + Av_i$$
## What's next
- We have seen the importance of assumptions such as
$$ E(u_{is} \mid X_{it})=0 \quad \forall t\leq s$$
- Verify for a given example: texting bans
- We can build estimators from the ground up using those assumptions as a starting point.
- Do this using GMM

So, first: a review of GMM
# GMM for panels
## GMM: Review
1. structure
2. examples
3. consistency and asymptotic normality
4. optimal weight matrix and efficiency
## Panel data GMM
- An observation is a $T$-vector
- Every $(s,t)$ exogeneity condition gives a moment
    - sequential: $\frac{1}{2} T (T+1) K$
    - strict exogeneity: $T^2K$
- Combine a transformation with exogeneity conditions
## Panel data GMM: example
[FD + strict exogeneity]
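One way to spell this out (a sketch in my notation, not the source's own derivation): the FD transformation removes $\alpha_i$, and strict exogeneity, $E(u_{is} \mid X_{it}) = 0$ for all $s,t$, makes every $X_{it}$ a valid instrument for every differenced equation:
$$ E\left[ X_{it}' \left( \Delta Y_{is} - \Delta X_{is}\,\beta \right) \right] = 0,
\qquad s = 2,\cdots,T,\quad t = 1,\cdots,T, $$
for a total of $(T-1)TK$ moment conditions.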
## Dynamic panels
A useful application of this framework is in dynamic panels.
- Model
- Why does pooled OLS fail?
- Why does FD fail?
- Absence of serial correlation generates instruments!
## Dynamic panels
[Arellano-Bond]
## Dynamic panels
- What if there is first-order serial correlation?
- What if $\rho = 0$?

## Conclusion
0. Linear static model + random effects: OLS or RE
1. Linear static model + fixed effects: FD or FE
2. Linear dynamic model: Arellano-Bond
3. Nonlinear static model?
# Nonlinear panel data
## Setting
We will study a specific class of models:

- nonlinear: $$E(Y_{it} \mid X_i,\alpha_i)=h(\alpha_i + X_{it}\beta)$$
- fixed $T$
- fixed effects: no restriction on $(\alpha_i,X_i)$
- strict exogeneity: $u_{is} \perp X_{it}$ for all $s,t$
## Main question
- Q: "Can we formulate a consistent estimator for $\beta$?"
- Concern: the incidental parameters problem
## Binary choice
- We will focus on the binary choice model
- The solution to the IP problem for binary choice applies to a handful of related models
- Generally, different nonlinear models require different solutions
## Binary choice: model
For each $t \in \{1,\cdots,T\}$,
$$
\begin{aligned}
Y_{it}^* &= \alpha_i + X_{it} \beta + u_{it} \\
Y_{it} &= 1\{Y_{it}^* \geq 0\} \\
u_{it} \mid X_i,\alpha_i &\sim F(u \mid \cdot)
\end{aligned}
$$
_Time homogeneity_: $\alpha_i,\beta,F$ do not depend on $t$.
## Binary choice: model (2)
- available: a cross-section for consistent estimation of $L(Y_i,X_i)$
- fixed $T$
- fixed effects: no assumptions on $(\alpha_i,X_i)$
- strict exogeneity
- error terms are identically distributed
## Binary choice model: identification
Informal statement of results in Chamberlain (2010):
- Bounded regressors: identification fails if $F \neq \Lambda$
- Unbounded regressors (à la Manski (1988, Assumption 2(c))):
    - identification holds more generally
    - $\sqrt{n}$-consistency only if $F=\Lambda$
## Binary choice: logit: model
For each $t \in \{1,\cdots,T\}$,
$$
\begin{aligned}
Y_{it}^* &= \alpha_i + X_{it} \beta + u_{it} \\
Y_{it} &= 1\{Y_{it}^* \geq 0\} \\
u_{i} \mid X_i,\alpha_i &\sim \text{IID LOG}(0,1)
\end{aligned}
$$
where the "IID"-ness refers to the $t$ dimension: error terms are independent across $t$
## Binary choice: logit: ML (1)
To show: maximum likelihood estimator is inconsistent for $\beta$ due to the incidental parameters (IP) problem.
Parameters to be estimated:
 $\beta$
 $(\alpha_i,\,i=1,\cdots,n)$  IPs
\bigskip
Derivation follows Arellano's notes.
## Binary choice: logit: ML (2)
$\alpha_i$ enters only through the likelihood contribution for individual $i$,
$$ \sum_t \left\{ Y_{it} \ln \Lambda(X_{it}\beta + \alpha_i) + (1-Y_{it}) \ln (1-\Lambda(X_{it} \beta + \alpha_i)) \right\} $$
so the ML estimator $\hat{\alpha}_i$ solves the FOC
$$ \sum_t \left\{ Y_{it} (1-\Lambda(X_{it}\hat{\beta} + \hat{\alpha}_i)) - (1-Y_{it}) \Lambda(X_{it} \hat{\beta} + \hat{\alpha}_i) \right\} = 0 $$

- Q: Verify the above, using your logit skills from cross-sectional binary choice, remembering that $\Lambda'=\Lambda(1-\Lambda)$.
## Binary choice: logit: ML (3)
- Proceed with
    - $T=2$
    - $X_{i1}=0,X_{i2}=1$.
- Leads to an analytical solution, leveraging $X_{it}=X_t$

The FOC simplifies to:
$$ Y_{i1} (1-\Lambda(\hat{\alpha}_i)) - (1-Y_{i1}) \Lambda( \hat{\alpha}_i) + Y_{i2} (1-\Lambda(\hat{\beta} + \hat{\alpha}_i)) - (1-Y_{i2}) \Lambda( \hat{\beta} + \hat{\alpha}_i) = 0 $$
implying
$$ Y_{i1} + Y_{i2} = \Lambda( \hat{\alpha}_i) + \Lambda( \hat{\beta} + \hat{\alpha}_i)$$
## Binary choice: logit: ML (4)
- Split by cases.
- For **switchers**, with $Y_{i1}+Y_{i2}=1$, use logit symmetry
- Leads to
$$
\hat{\alpha}_i = \begin{cases} -\infty & \text{ if }Y_{i1}+Y_{i2}=0 \\
-\hat{\beta}/2 & \text{ if }Y_{i1}+Y_{i2}=1 \\
+\infty & \text{ if }Y_{i1}+Y_{i2}=2
\end{cases}
$$
- Only switchers are informative about $\beta$
- Q: Intuition?
## Binary choice: logit: ML (5)
- Next: maximize the log likelihood with respect to $\beta$
- Plug in $\hat{\alpha}_{i}=-\hat{\beta}/2$ to obtain
$$
\begin{aligned}
\mathcal{L}_{n}\left(\hat{\beta}\right) &= \sum_{i}1\left\{ Y_{i1}+Y_{i2}=1\right\} \times \\
& \left[Y_{i1}\ln\Lambda\left(-\hat{\beta}/2\right)+\left(1-Y_{i1}\right)\ln\left[1-\Lambda\left(-\hat{\beta}/2\right)\right] \right. \\
&+ \left. Y_{i2}\ln\Lambda\left(\hat{\beta}/2\right)+\left(1-Y_{i2}\right)\ln\left[1-\Lambda\left(\hat{\beta}/2\right)\right]\right].
\end{aligned}
$$
- For the switchers (effective sample) with $y_{i1}+y_{i2}=1$, $$y_{i2}=1-y_{i1}$$
- By the symmetry of the logit CDF,
\[
\Lambda\left(-\hat{\beta}/2\right)=1-\Lambda\left(\hat{\beta}/2\right),
\]
## Binary choice: logit: ML (6)
Then
$$
\begin{aligned}
\mathcal{L}_{n}\left(\hat{\beta}\right) &= \sum_{i}1\left\{ Y_{i1}+Y_{i2}=1\right\} \times \\
& \left[\left(1-Y_{i2}\right)\ln\left(1-\Lambda\left(\hat{\beta}/2\right)\right)+Y_{i2}\ln\Lambda\left(\hat{\beta}/2\right) \right. \\
&+ \left. Y_{i2}\ln\Lambda\left(\hat{\beta}/2\right)+\left(1-Y_{i2}\right)\ln\left[1-\Lambda\left(\hat{\beta}/2\right)\right]\right] \\
&= 2 \sum_{i}1\left\{ Y_{i1}+Y_{i2}=1\right\}\left(Y_{i2} \ln \Lambda\left(\hat{\beta}/2\right) + (1-Y_{i2}) \ln \left(1-\Lambda\left(\hat{\beta}/2\right)\right)\right)
\end{aligned}
$$
## Binary choice: logit: ML (7)
- The FOC is (**verify**)
$$\sum_{i}1\left\{ Y_{i1}+Y_{i2}=1\right\}\left(\Lambda\left(\hat{\beta}/2\right) - Y_{i2} \right) = 0 $$
- The ML estimator for $\beta$ sets
$$\Lambda\left(\hat{\beta}/2\right) = \frac{\sum_{i}1\left\{ Y_{i1}+Y_{i2}=1\right\}Y_{i2}}{\sum_{i}1\left\{ Y_{i1}+Y_{i2}=1\right\}} \equiv \hat{p} $$
- Note that the sample proportion $$\hat{p} \to p \equiv P(Y_{i1}=0,Y_{i2}=1 \mid Y_{i1}+Y_{i2}=1)$$
## Binary choice: logit: ML (8)
Detour:

- Investigate $p=P(Y_{i1}=0,Y_{i2}=1 \mid Y_{i1}+Y_{i2}=1,X_i,\alpha_i)$
- Will suppress the dependence on $(\alpha_i,X_i)$

Prep:

1. Remember that $$P(Y_{i1}=0) = 1-\Lambda(\alpha_i)$$
2. Also, $$P(Y_{i2}=1) = \Lambda(\alpha_i+\beta)$$
3. Because of serial independence,
$$
\begin{aligned}
P(Y_{i1}=0,Y_{i2}=1) &= P(Y_{i1}=0)P(Y_{i2}=1) \\
&= (1-\Lambda(\alpha_i)) \Lambda(\alpha_i+\beta)
\end{aligned}
$$
4. Similarly, $$P(Y_{i1}=1,Y_{i2}=0) = \Lambda(\alpha_i)(1-\Lambda(\alpha_i+\beta))$$
## Binary choice: logit: ML (9)
Finally,
$$
\begin{aligned}
p &= \frac{P(Y_{i1}=0,Y_{i2}=1,Y_{i1}+Y_{i2}=1)}{P(Y_{i1}+Y_{i2}=1)} \\
&= \frac{P(Y_{i1}=0,Y_{i2}=1)}{P(Y_{i1}=0,Y_{i2}=1)+P(Y_{i1}=1,Y_{i2}=0)} \\
&= \frac{(1-\Lambda(\alpha_i)) \Lambda(\alpha_i+\beta)}{(1-\Lambda(\alpha_i)) \Lambda(\alpha_i+\beta)+\Lambda(\alpha_i) (1-\Lambda(\alpha_i+\beta))}
\end{aligned}
$$
 Q: Show that $\alpha_i$ drops out!
## Binary choice: logit: ML (10)
Using $\Lambda(u) = \exp(u)/(1+\exp(u))$ and $1-\Lambda(u) = 1/(1+\exp(u)),$
$$
\begin{aligned}
p &= \frac{\exp(\alpha_i + \beta)}{\exp(\alpha_i + \beta) + \exp(\alpha_i)} \\
&= \Lambda(\beta).
\end{aligned}
$$
1. Does not depend on $\alpha_i$
2. Does not equal $\Lambda(\beta/2)$
## Binary choice: logit: ML (11)
Conclusion for the maximum likelihood estimator:
$$\hat{\beta} \stackrel{p}{\to} 2 \beta$$
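## Binary choice: logit: ML: simulation

This limit can be verified by simulation; the sketch below (illustrative names and sizes) uses the closed form $\Lambda(\hat{\beta}/2)=\hat{p}$ derived above rather than a brute-force dummy-variable logit:

```{r}
## T = 2 logit with X_{i1} = 0, X_{i2} = 1: the fixed-effects MLE
## satisfies Lambda(beta_hat/2) = p_hat, so beta_hat = 2 * qlogis(p_hat),
## which converges to 2 * beta rather than beta.
set.seed(7)
n <- 200000; beta <- 1
alpha <- rnorm(n)                            # incidental parameters
y1 <- rbinom(n, 1, plogis(alpha))            # period 1, X_{i1} = 0
y2 <- rbinom(n, 1, plogis(alpha + beta))     # period 2, X_{i2} = 1
p_hat <- mean(y2[y1 + y2 == 1])              # share of (0,1) among switchers
beta_mle <- 2 * qlogis(p_hat)                # near 2 * beta = 2
```

Dropping the factor of 2, `qlogis(p_hat)` is consistent for $\beta$: the idea behind the conditional MLE below.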
## Binary choice: logit: ML (12)
1. See Abrevaya (1997, Economics Letters) for a proof that $\hat{\beta} \stackrel{p}{\to} 2\beta$ in a more general context, with an arbitrary number of regressors that can take any values, not just $\{0,1\}$.
2. Open question: what is the bias when $T>2$? We know that $\text{plim}\, \hat{\beta} \in [\frac{T}{T-1} \beta, 2 \beta]$, but that's all. If you can show it, you can send it to Abrevaya's journal. I suspect he will publish it.
3. The inconsistency is severe, and likely decreases with $T$.
4. Easy fix for $T=2$: $\tilde{\beta} = \hat{\beta}/2.$ However, the inconsistency is only known exactly for this special case, so the fix does not generalize.
5. Differencing? Does not work in nonlinear models.
## Binary choice: logit: CMLE (1)
- The solution involves finding a **sufficient statistic** $h(Y_i)$ for the incidental parameter $\alpha_i$, i.e.
$$ P(Y_i = y \mid h(Y_i),X_i,\alpha_i) = P(Y_i = y \mid h(Y_i),X_i) $$
- We saw one on the road to showing the inconsistency of MLE: $h(Y_i)=\sum_t Y_{it}$
- Conditional MLE
    - Details: Andersen (1973, JASA), Chamberlain (1980, REStud)
- Efficiency: maintained if $P$ is in the exponential family (Hahn, 1998, ET)
    - includes binary choice logit
    - includes the linear regression model with normal errors
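## Binary choice: logit: CMLE: computation

In practice, the CMLE for the fixed-effects logit can be computed as a conditional logistic regression. A sketch on simulated data (illustrative names and sizes), assuming the recommended `survival` package is available; `strata(id)` conditions on $\sum_t Y_{it}$ within each unit:

```{r}
## Conditional MLE for the fixed-effects logit via survival::clogit.
library(survival)
set.seed(3)
n <- 5000; T <- 2; beta <- 1
id <- rep(1:n, each = T)
alpha <- rep(rnorm(n), each = T)
x <- 0.5 * alpha + rnorm(n * T)            # regressor correlated with alpha_i
y <- rbinom(n * T, 1, plogis(alpha + beta * x))
cmle <- clogit(y ~ x + strata(id))         # conditions on sum_t Y_it
coef(cmle)["x"]                            # consistent estimate of beta
```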
## Binary choice: logit: CMLE (2)
- Treat the case with $T=2$, general $X_i$:
    - see ``Panel data  Binary choice CMLE.lyx``
- The algebra for general $T$ is more involved, but the conclusion is similar.
    - see ``sketch74,sketch75``
## Binary choice: logit: CMLE (3)
- This derivation suggests a new criterion function to maximize, which yields an extremum estimator.
- That criterion function will be concave.
- To establish consistency and asymptotic normality, we would then have to check identification, etc.
- The key insight is that the incidental parameters problem has been avoided by conditioning on the sufficient statistic.
## Binary choice: Manski
The above trick is specific to the logit. Can it be extended?

- Not to other, known distribution functions
- However, at the cost of identifying $\beta$ only up to scale, we can deal with the unknown-$F$ case
- This approach turns the binary choice model into Manski's semiparametric binary choice model
## Binary choice: Manski: model
$$
\begin{aligned}
Y_{it}^{*} & =\alpha_{i}+X_{it}\beta-U_{it}\\
Y_{it} & = 1(Y_{it}^{*} \geq 0)\\
U_{it} \mid \alpha_{i},X_{i} & \sim F(u \mid X)
\end{aligned}
$$
## Binary choice: Manski: ID
Coming up:
\begin{equation}
\text{med}\left(d_{i2}-d_{i1} \mid \alpha_i,X_{i},d_{i1}+d_{i2}=1\right)=\text{sgn}\left(\Delta X_{i}\beta\right)\label{eq:med_sgn}
\end{equation}

- Link to Manski's cross-sectional model
- Identical distributions (not necessarily independent) result in median zero
## Binary choice: partial effects
Despite having a consistent estimator for $\beta$, we do not have an estimator for the APE or marginal effects. Why?
## Binary choice: partial effects (2)
1. The partial effect depends on $\alpha_i$, for which no consistent estimator is available!
2. **Partial identification** of a partial effect:
    - Chernozhukov, Fernandez-Val, Hahn, Newey (2013, ECTA)
    - Chernozhukov et al. (2015, JoE)
## Binary choice: extending the results
Talk about ordered logit / transformation paper.
## Nonlinear panels: other models
- count regression
- binomial choice
- ordered logit
- censored choice
- ...