Panel Data
Contents
- Framework
- Clustered Standard Errors
- Many FEs
- Implementation in R: felm command
Introduction
- Panel data has observations on \(n\) cross-sectional units at \(T\) time periods: \((X_{it}, Y_{it}\)
- Examples:
- Person \(i\)’s income in year \(t\).
- Vote share in county \(i\) for the presidential election year \(t\).
- Country \(i\)’s GDP in year \(t\).
- Panel data is useful because
- More variation (both cross-sectional and temporal variation)
- Can deal with time-invariant unobserved factors.
- (Not focus in this course) Dynamics of individual over time.
Overview
- Consider the model \[
y_{it} = \beta' x_{it} + \epsilon_{it}, E[\epsilon_{it} | x_{it} ] = 0
\] where \(x_{it}\) is a k-dimensional vector
- If there is no correlation between \(x_{it}\) and \(\epsilon_{it}\), you can estimate the model by OLS (pooled OLS)
- A natural concern here is the omitted variable bias.
- We now consider that \(\epsilon_{it}\) is written as \[
\epsilon_{it} = \alpha_i + u_{it}
\] where \(\alpha_i\) is called unit fixed effect, which is the time-invariant unobserved heterogeneity.
- With panel data, we can control for the unit fixed effects by incorporating the dummy variable for each unit \(i\)! \[
y_{it} = \beta' x_{it} + \gamma_2 D2_i + \cdots + \gamma_n Dn_i + u_{it}
\] where \(Dl_i\) takes 1 if \(l=i\).
- Notice that we cannot do this for the cross-section data!
- We often write the model with unit FE as \[
y_{it} = \beta' x_{it} + \alpha_i + u_{it}
\]
Framework
- The fixed effects model \[
y_{it} = \beta' x_{it} + \alpha_i + u_{it}
\]
- Assumptions:
- \(u_{it}\) is uncorrelated with \((x_{i1},\cdots, x_{iT})\), that is \(E[u_{it}|x_{i1},\cdots, x_{iT} ] = 0\)
- \((Y_{it}, x_{it})\) are independent across individual \(i\).
- No outliers
- No Perfect multicollinarity
- Let’s discuss Assumptions 1, 2, and 4 in detail.
- Assumption 1 is weaker than the assumption in OLS, because the time-invariant factor \(\alpha_i\) is captured by the fixed effect.
- Example: Unobserved ability is caputured by \(\alpha_i\).
- Assumption 2 allows for serial correlation (i.e., \(Cov(x_{it},x_{it'} ) \neq 0\) ) within individual \(i\).
- This is related to the cluster-robust standard error.
- Assumption 4 seems as usual, but it has an important role in panel data analysis.
- Consider the following regression with unit FE \[
wage_{it} = \beta_0 + \beta_1 experience_{it} + \beta_2 male_{i} + \beta_3 white_{i} + \alpha_i + u_{it}
\] where \(experience_{it}\) measures how many years worker \(i\) has worked before at time \(t\).
- In the regression above, we have multicollinearity issue because of \(male_{i}\) and \(white_{i}\).
- Intuitively, we cannot estimate the coefficient \(\beta_2\) and \(\beta_3\) because those time-invariant variables are completely captured by the unit fixed effect \(\alpha_i\).
FE, FE, and FE
- In addition to unit FE, you can also add time fixed effects (FE) \[
y_{it} = \beta' x_{it} + \alpha_i + \gamma_t + u_{it}
\]
- The regression above controls for both time-invariant individual heterogeneity and (unobserved) aggregate year shock.
- Panel data is useful to capture various unobserved shock by including fixed effects.
Panel + IV
- You can use IV regression with panel data. This is PS5.
Standard Errors
- In the cross-section data, we need to use the heteroskedasticity robust standard error.
- Remember: Heteroskedasticity means \(Var(u_i | x_i) = \sigma(x_i)\).
- In the panel data setting, we need to consider the autocorrelation of the variable, that is the correlation between \(u_{it}\) and \(u_{it'}\) across periods for each individual \(i\).
- The current standard is to use so-called cluster-robust standard error.
- The cluster is unit \(i\). The observations within cluster are allowed to be freely correlated.
- Cluster-robust standard error takes care for such correlation.
- I will explain how to deal with this in R.