12 Instrumental Variable 1: Framework

12.1 Introduction: Endogeneity Problem and its Solution

When \(Cov(x_k, \epsilon)=0\) does not hold, we have an endogeneity problem.
- We call such \(x_k\) an endogenous variable.
In this chapter, I introduce an instrumental variable estimation method, a solution to this issue.
The lecture plan
1. More on endogeneity issues
2. Framework
3. Implementation in R
4. Examples

12.2 Examples of Endogeneity Problem

Here, I explain a bit more about endogeneity problems.
1. Omitted variable bias
2. Measurement error
3. Simultaneity

12.2.1 More on Omitted Variable Bias

Remember the wage regression equation (true model) \[ \begin{aligned} \log W_{i} &= \beta_{0}+\beta_{1}S_{i}+\beta_{2}A_{i}+u_{i} \\ E[u_{i}|S_{i},A_{i}] &= 0 \end{aligned} \] where \(W_{i}\) is the wage, \(S_{i}\) is years of schooling, and \(A_{i}\) is ability.
Suppose that you omit \(A_i\) and run the following regression instead. \[ \log W_{i} = \alpha_{0}+\alpha_{1} S_{i} + v_i \] Notice that \(v_i = \beta_2 A_i + u_i\), so \(S_i\) and \(v_i\) are likely to be correlated whenever ability is correlated with schooling.
You might want to add more and more additional variables to capture the effect of ability.
- Test scores, GPA, SAT scores, etc…
However, can you be sure that \(S_i\) is indeed exogenous after adding many control variables?
Multivariate regression cannot deal with unobserved heterogeneity that affects both wages and years of schooling.
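To make the bias concrete, here is a minimal simulation sketch in Python (standard library only). All numbers are hypothetical choices for illustration, not estimates from any data: \(\beta_1 = 0.1\), \(\beta_2 = 0.5\), and schooling is generated as \(12 + 2A_i\) plus noise. Under these assumed values the short-regression slope converges to \(\beta_1 + \beta_2 Cov(S_i, A_i)/Var(S_i) = 0.1 + 0.5 \times 2/5 = 0.3\), not to \(\beta_1\).

```python
import random

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

random.seed(0)
n = 20000
beta1, beta2 = 0.10, 0.50  # hypothetical true coefficients

ability = [random.gauss(0, 1) for _ in range(n)]
# Schooling depends on ability, so omitting ability makes schooling endogenous.
school = [12 + 2 * a + random.gauss(0, 1) for a in ability]
log_wage = [1.0 + beta1 * s + beta2 * a + random.gauss(0, 0.5)
            for s, a in zip(school, ability)]

# Short regression: log wage on schooling only (ability omitted).
alpha1_hat = ols_slope(school, log_wage)
print(f"true beta1 = {beta1:.3f}, short-regression slope = {alpha1_hat:.3f}")
```

The gap between the two printed numbers is the omitted variable bias in this simulated setting.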

12.2.2 Measurement error

Measurement error in variables
- Reporting error, respondent does not understand the question, etc…
Consider the regression \[ y_{i}=\beta_{0}+\beta_{1}x_{i}^{*}+\epsilon_{i} \]
Here, we only observe \(x_{i}\) with error: \[ x_{i}=x_{i}^{*}+e_{i}\] where \(e_{i}\) is measurement error.
- \(e_{i}\) is independent of \(\epsilon_i\) and \(x_i^*\) (called classical measurement error)
- You can think of \(e_i\) as noise added to the data.
The regression equation is \[ \begin{aligned} y_{i} = & \ \beta_{0}+\beta_{1}(x_{i}-e_{i})+\epsilon_{i} \\ = & \ \beta_{0}+\beta_{1}x_{i}+(\epsilon_{i}-\beta_{1}e_{i}) \end{aligned} \]
Then \(x_{i}\) is correlated with the composite error \(\epsilon_{i}-\beta_{1}e_{i}\), because both contain \(e_i\). This violates the mean independence assumption.
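As a sketch of the consequence, assume in addition that \(Cov(x_i^*, \epsilon_i)=0\) (an assumption added here for the derivation). Under classical measurement error, \[ Cov(x_i, \epsilon_i - \beta_1 e_i) = -\beta_1 Var(e_i), \] and the OLS slope from regressing \(y_i\) on \(x_i\) converges to \[ \frac{Cov(x_i, y_i)}{Var(x_i)} = \beta_1 \frac{Var(x_i^*)}{Var(x_i^*) + Var(e_i)}, \] which is closer to zero than \(\beta_1\) whenever \(Var(e_i) > 0\). This is often called attenuation bias.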

12.2.3 Simultaneity (or reverse causality)

The dependent variable and the explanatory (endogenous) variable are determined simultaneously.
Consider the demand and supply curves \[ \begin{aligned} q^{d} &= \beta_{0}^{d}+\beta_{1}^{d}p+\beta_{2}^{d}x+u^{d} \\ q^{s} &= \beta_{0}^{s}+\beta_{1}^{s}p+\beta_{2}^{s}z+u^{s} \end{aligned} \] where \(x\) shifts demand and \(z\) shifts supply.
The equilibrium price and quantity are determined by \(q^{d}=q^{s}\).
In this case, \[ p=\frac{(\beta_{2}^{s}z-\beta_{2}^{d}x)+(\beta_{0}^{s}-\beta_{0}^{d})+(u^{s}-u^{d})}{\beta_{1}^{d}-\beta_{1}^{s}} \] so the price depends on both \(u^{d}\) and \(u^{s}\), implying correlation between the price and the error term in either equation.
Put differently, the data points we observe are the intersections of these supply and demand curves.
How can we distinguish demand and supply?
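The problem can be illustrated with a small simulation sketch in Python (standard library only). The parameter values are hypothetical: a demand slope of \(-1\), a supply slope of \(+1\), and no demand shifter (\(\beta_2^d = 0\)). Equilibrium prices are generated from the price formula implied by \(q^d = q^s\), and then quantity is regressed on price by OLS.

```python
import random

random.seed(1)
n = 20000
# Hypothetical structural parameters (chosen for illustration only):
# demand: q = 10 - 1.0 * p + u_d        (no demand shifter, beta_2^d = 0)
# supply: q =  2 + 1.0 * p + z + u_s
b0d, b1d = 10.0, -1.0
b0s, b1s, b2s = 2.0, 1.0, 1.0

prices, quantities = [], []
for _ in range(n):
    z = random.gauss(0, 1)
    u_d = random.gauss(0, 1)
    u_s = random.gauss(0, 1)
    # Equilibrium price from q^d = q^s.
    p = (b2s * z + (b0s - b0d) + (u_s - u_d)) / (b1d - b1s)
    q = b0d + b1d * p + u_d  # equilibrium quantity, read off the demand curve
    prices.append(p)
    quantities.append(q)

mp = sum(prices) / n
mq = sum(quantities) / n
cov_pq = sum((p - mp) * (q - mq) for p, q in zip(prices, quantities)) / n
var_p = sum((p - mp) ** 2 for p in prices) / n
ols_slope = cov_pq / var_p
print(f"OLS slope of q on p: {ols_slope:.3f} (demand {b1d}, supply {b1s})")
```

With these assumed values the OLS slope converges to \(-1/3\), which is neither the demand slope nor the supply slope.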

12.3 Idea of IV Regression

Let’s start with a simple case. \[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \] and \(Cov(x_i, \epsilon_i) \neq 0\).
Now, we consider another variable \(z_i\), which we call an instrumental variable (IV).
The instrumental variable \(z_i\) should satisfy the following two conditions:
1. Independence: \(Cov(z_i, \epsilon_i) = 0\). No correlation between IV and error.
2. Relevance: \(Cov(z_i, x_i) \neq 0\). There should be correlation between IV and endogenous variable \(x_i\).
Idea: Use the variation in \(x_i\) induced by the instrument \(z_i\) to estimate the direct (causal) effect of \(x_i\) on \(y_i\), that is, \(\beta_1\).
More on this:
1. Intuitively, the OLS estimator captures the correlation between \(x\) and \(y\).
2. If there is no correlation between \(x\) and \(\epsilon\), it captures the causal effect \(\beta_1\).
3. If not, the OLS estimator captures both the direct and the indirect effect, and the latter is the bias.
4. Now, let’s capture the variation in \(x\) due to the instrument \(z\).
  - Such variation should exist under the relevance assumption.
  - Such variation should not be correlated with the error under the independence assumption.
5. By looking at the correlation between this variation and \(y\), you can get the causal effect \(\beta_1\).
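The steps above can be written in one line for the simple model. Taking the covariance of both sides of \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\) with \(z_i\) gives \[ Cov(z_i, y_i) = \beta_1 Cov(z_i, x_i) + Cov(z_i, \epsilon_i) = \beta_1 Cov(z_i, x_i), \] where the last equality uses independence. Relevance lets us divide by \(Cov(z_i, x_i)\): \[ \beta_1 = \frac{Cov(z_i, y_i)}{Cov(z_i, x_i)}. \] Replacing the population covariances with sample covariances gives the IV estimator in this one-instrument case.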

Figure: Idea of IV

12.4 Formal Framework and Estimation

12.4.1 Model

We now introduce a general framework with multiple endogenous variables and multiple instruments.
The general instrumental variables regression model is \[ \begin{aligned} Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_K X_{Ki} + \beta_{K+1} W_{1i} + \dots + \beta_{K+R} W_{Ri} + u_i, \end{aligned} \] for \(i=1,\dots,n\), where
- \(Y_i\) is the dependent variable
- \(\beta_0,\dots,\beta_{K+R}\) are \(1+K+R\) unknown regression coefficients
- \(X_{1i},\dots,X_{Ki}\) are \(K\) endogenous regressors: \(Cov(X_{ki}, u_i) \neq 0\) for all \(k\).
- \(W_{1i},\dots,W_{Ri}\) are \(R\) exogenous regressors, which are uncorrelated with \(u_i\): \(Cov(W_{ri}, u_i) = 0\) for all \(r\).
- \(u_i\) is the error term
- \(Z_{1i},\dots,Z_{Mi}\) are \(M\) instrumental variables
I will discuss conditions for valid instruments later.

12.4.2 Estimation by Two Stage Least Squares (2SLS)

We can estimate the above model by Two Stage Least Squares (2SLS).
Step 1: First-stage regression(s)
- Run an OLS regression of each endogenous variable (\(X_{1i},\dots,X_{Ki}\)) on all instrumental variables (\(Z_{1i},\dots,Z_{Mi}\)), all exogenous variables (\(W_{1i},\dots,W_{Ri}\)), and an intercept.
- Compute the fitted values (\(\widehat{X}_{1i},\dots,\widehat{X}_{Ki}\)).
Step 2: Second-stage regression
- Regress the dependent variable \(Y_i\) on the fitted values of all endogenous regressors (\(\widehat{X}_{1i},\dots,\widehat{X}_{Ki}\)), all exogenous variables (\(W_{1i},\dots,W_{Ri}\)), and an intercept using OLS.
- This gives \(\widehat{\beta}_{0}^{TSLS},\dots,\widehat{\beta}_{K+R}^{TSLS}\), the 2SLS estimates of the model coefficients.

12.4.2.1 Intuition

Why does this work? Let’s go back to the simple example with 1 endogenous variable and 1 IV.
In the first stage, we estimate
\[ x_i = \pi_0 + \pi_1 z_i + v_i \] by OLS and obtain the fitted value \(\widehat{x}_i = \widehat{\pi}_0 + \widehat{\pi}_1 z_i\).
In the second stage, we estimate \[ y_i = \beta_0 + \beta_1 \widehat{x}_i + u_i \]
Since \(\widehat{x}_i\) depends only on \(z_i\), which is uncorrelated with the error, the second stage estimates \(\beta_1\) consistently (2SLS is generally not unbiased in finite samples).
Can you see the importance of both the independence and the relevance assumptions here? (More formal discussion later.)
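The two stages can be sketched by hand in Python (standard library only) for the case of one endogenous variable and one instrument. The data-generating process below is hypothetical: \(\beta_1 = 2\), the instrument enters \(x_i\) with coefficient 1, and \(x_i\) also loads on \(\epsilon_i\) so that it is endogenous. The helper `ols` is an illustrative function written for this sketch, not part of any library.

```python
import random

def ols(x, y):
    """Return (intercept, slope) from OLS of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

random.seed(2)
n = 20000
beta0, beta1 = 1.0, 2.0  # hypothetical true parameters

z = [random.gauss(0, 1) for _ in range(n)]
eps = [random.gauss(0, 1) for _ in range(n)]
# x depends on z (relevance) and on eps (endogeneity); z is independent of eps.
x = [0.5 + 1.0 * zi + 0.8 * ei + random.gauss(0, 1) for zi, ei in zip(z, eps)]
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, eps)]

# Naive OLS of y on x: inconsistent because Cov(x, eps) != 0.
_, b_ols = ols(x, y)

# First stage: regress x on z and compute fitted values.
pi0_hat, pi1_hat = ols(z, x)
x_hat = [pi0_hat + pi1_hat * zi for zi in z]

# Second stage: regress y on the fitted values.
_, b_2sls = ols(x_hat, y)

print(f"true beta1 = {beta1}, OLS = {b_ols:.3f}, 2SLS = {b_2sls:.3f}")
```

In this simulated setting the OLS slope is pushed above \(\beta_1\), while the 2SLS slope is close to it. One caveat when running the second stage by hand: the usual OLS standard errors from that regression are not correct for 2SLS, because they are computed from residuals based on \(\widehat{x}_i\) rather than \(x_i\). Dedicated IV routines correct for this.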

12.4.3 Conditions for Valid IVs in a general framework

12.4.3.1 Necessary condition

Depending on the number of IVs, we have three cases:
1. Over-identification: \(M > K\)
2. Just identification: \(M=K\)
3. Under-identification: \(M < K\)
The necessary condition is \(M \geq K\).
- We should have at least as many IVs as endogenous variables!

12.4.3.2 Sufficient condition

How about sufficiency?
In a general framework, the sufficient conditions for valid instruments are as follows.
1. Independence: \(Cov( Z_{mi}, u_i) = 0\) for all \(m\).
2. Relevance: In the second stage regression, the variables \[ \left( \widehat{X}_{1i},\dots,\widehat{X}_{Ki}, W_{1i},\dots,W_{Ri}, 1 \right) \] are not perfectly multicollinear.
What does the relevance condition mean?
In the simple example above, the first stage is
\[ x_i = \pi_0 + \pi_1 z_i + v_i \] and the second stage is \[ y_i = \beta_0 + \beta_1 \widehat{x}_i + u_i \]
The second stage would have perfect multicollinearity if \(\pi_1 = 0\) (i.e., \(\widehat{x}_i = \pi_0\)), because the fitted value would then be constant and collinear with the intercept.
Back to the general case, the first stage for \(X_{ki}\) can be written as \[ X_{ki} = \pi_0 + \pi_1 Z_{1i} + \cdots + \pi_M Z_{Mi} + \pi_{M+1} W_{1i} + \cdots + \pi_{M+R} W_{Ri} + v_{ki}, \] and at least one of \(\pi_1 , \cdots, \pi_M\) should be non-zero.
Intuitively speaking, the instruments should be correlated with the endogenous variables after controlling for the exogenous variables.

12.5 Check Instrument Validity

12.5.1 Relevance

Instruments are weak if those instruments explain little variation in the endogenous variables.
Weak instruments lead to
1. Imprecise estimates (higher standard errors).
2. A sampling distribution of the estimator that is poorly approximated by the normal distribution, even in large samples.
Here is a rule of thumb to check the relevance condition.
Consider the case with one endogenous variable \(X_{1i}\).
Run the first-stage regression
\[ X_{1i} = \pi_0 + \pi_1 Z_{1i} + \cdots + \pi_M Z_{Mi} + \pi_{M+1} W_{1i} + \cdots + \pi_{M+R} W_{Ri} + v_i \]
and test the null hypothesis \[ \begin{aligned} H_0 & : \pi_1 = \cdots = \pi_M = 0 \\ H_1 & : \text{otherwise} \end{aligned} \]
- This is an F test (a test of a joint hypothesis).
If we can reject the null with a large F statistic, weak instruments are less of a concern.
A rule of thumb is that the first-stage F statistic should be larger than 10.
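As a minimal sketch of this check, the Python code below (standard library only) computes the first-stage F statistic in the simplest case: one instrument, no exogenous controls, and homoskedastic errors. These simplifications are assumptions of the sketch. With a single restriction, the F statistic equals the squared t statistic on \(\pi_1\). The function name `first_stage_f` and the simulated data are hypothetical.

```python
import random

def first_stage_f(z, x):
    """F statistic for H0: pi_1 = 0 in x = pi_0 + pi_1 * z + v (one instrument)."""
    n = len(z)
    mz, mx = sum(z) / n, sum(x) / n
    szz = sum((a - mz) ** 2 for a in z)
    szx = sum((a - mz) * (b - mx) for a, b in zip(z, x))
    pi1 = szx / szz
    pi0 = mx - pi1 * mz
    ssr = sum((b - pi0 - pi1 * a) ** 2 for a, b in zip(z, x))
    sigma2 = ssr / (n - 2)          # homoskedastic error variance estimate
    se_pi1 = (sigma2 / szz) ** 0.5  # standard error of pi1
    return (pi1 / se_pi1) ** 2      # with one restriction, F = t^2

random.seed(3)
n = 500
z = [random.gauss(0, 1) for _ in range(n)]
# Relevant instrument: x moves one-for-one with z.
x_strong = [1.0 * zi + random.gauss(0, 1) for zi in z]
# Irrelevant instrument: x does not depend on z at all.
x_weak = [0.0 * zi + random.gauss(0, 1) for zi in z]

f_strong = first_stage_f(z, x_strong)
f_weak = first_stage_f(z, x_weak)
print(f"F (relevant instrument)   = {f_strong:.1f}")
print(f"F (irrelevant instrument) = {f_weak:.1f}")
```

With several instruments or with exogenous controls, the same idea applies, but the F statistic has to be computed from the joint restriction on \(\pi_1,\dots,\pi_M\) rather than from a single t statistic.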

12.5.2 Independence (Instrument exogeneity)

Arguing for independence is hard, and it is a key part of empirical analysis.
Justification of this assumption depends on the context, institutional features, etc.
We will see this through examples in the next chapter.
