10 Linear Regression 3: Discussions on OLS Assumptions
10.1 Introduction
- Remember that we have four assumptions in OLS estimation
- Random sample: \(\{ Y_i , X_{i1}, \ldots, X_{iK} \}\) is an i.i.d. sample - i.i.d.: independently and identically distributed
- \(\epsilon_i\) has zero conditional mean \[
E[ \epsilon_i | X_{i1}, \ldots, X_{iK}] = 0
\]
- This implies \(Cov(X_{ik}, \epsilon_i) = 0\) for all \(k\). (or \(E[\epsilon_i X_{ik}] = 0\))
- No correlation between error term and explanatory variables.
- Large outliers are unlikely:
- The random variable \(Y_i\) and \(X_{ik}\) have finite fourth moments.
- No perfect multicollinearity:
- There is no linear relationship between explanatory variables.
- The OLS estimator has desirable properties (unbiasedness, consistency, asymptotic normality) under these assumptions.
- In this chapter, we study the role of these assumptions.
- In particular, we focus on the following two assumptions
- No correlation between \(\epsilon_{i}\) and \(X_{ik}\)
- No perfect multicollinearity
10.2 Endogeneity problem
- When \(Cov(x_k, \epsilon)=0\) does not hold, we have an endogeneity problem.
- We call such \(x_k\) an endogenous variable.
- There are several cases in which an endogeneity problem arises:
- Omitted variable bias
- Measurement error
- Simultaneity
- Sample selection
- Here, I focus on the omitted variable bias.
10.2.1 Omitted variable bias
- Consider the wage regression equation (true model) \[ \begin{aligned} \log W_{i} &= \beta_{0}+\beta_{1}S_{i}+\beta_{2}A_{i}+u_{i} \\ E[u_{i}|S_{i},A_{i}] &= 0 \end{aligned} \] where \(W_{i}\) is the wage, \(S_{i}\) is the years of schooling, and \(A_{i}\) is ability.
- What we want to know is \(\beta_1\), the effect of the schooling on the wage holding other things fixed. Also called the returns from education.
- An issue is that we do not often observe the ability of a person directly.
- Suppose that you omit \(A_i\) and run the following regression instead. \[ \log W_{i} = \alpha_{0}+\alpha_{1} S_{i} + v_i \]
- Notice that \(v_i = \beta_2 A_i + u_i\), so \(S_i\) and \(v_i\) are likely to be correlated.
- The OLS estimator \(\hat\alpha_1\) is biased: \[
E[\hat\alpha_1] = \beta_1 + \beta_2\frac{Cov(S_i, A_i)}{Var(S_i)} \]
- You can also say \(\hat\alpha_1\) is not consistent for \(\beta_1\), i.e., \[ \hat{\alpha}_{1}\overset{p}{\longrightarrow}\beta_{1}+\beta_{2}\frac{Cov(S_{i},A_{i})}{Var(S_{i})} \]
- This is known as the omitted variable bias formula.
- Omitted variable bias depends on:
  1. The effect of the omitted variable (\(A_i\) here) on the dependent variable: \(\beta_2\)
  2. The correlation between the omitted variable and the included explanatory variable.
- This is super-important: You can make a guess regarding the direction and the magnitude of the bias!!
- This is crucial when you read an empirical paper and do an empirical exercise.
Here is the summary table, where \(x_1\) is the included variable, \(x_2\) is the omitted variable, and \(\beta_2\) is the coefficient on \(x_2\):

|  | \(Cov(x_1, x_2) > 0\) | \(Cov(x_1, x_2) < 0\) |
|---|---|---|
| \(\beta_2 > 0\) | Positive bias | Negative bias |
| \(\beta_2 < 0\) | Negative bias | Positive bias |
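The bias formula can be checked with a small simulation. This is a sketch with made-up parameter values (\(\beta_1 = 0.10\), \(\beta_2 = 0.05\)) and schooling constructed to be positively correlated with ability, so the table predicts positive bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True model: log W = b0 + b1*S + b2*A + u (coefficients are hypothetical).
b0, b1, b2 = 1.0, 0.10, 0.05
A = rng.normal(size=n)                 # unobserved ability
S = 12 + 2 * A + rng.normal(size=n)    # schooling, Cov(S, A) > 0 by construction
u = rng.normal(size=n)
logW = b0 + b1 * S + b2 * A + u

# Short regression of log W on S alone (A omitted), intercept included.
X = np.column_stack([np.ones(n), S])
alpha = np.linalg.lstsq(X, logW, rcond=None)[0]

# Omitted variable bias formula: plim(alpha1) = b1 + b2 * Cov(S, A) / Var(S)
predicted = b1 + b2 * np.cov(S, A)[0, 1] / np.var(S, ddof=1)
print(alpha[1], predicted)  # both close to 0.10 + 0.05 * 2/5 = 0.12
```

The estimate exceeds the true \(\beta_1 = 0.10\), matching the "positive bias" cell of the table since both \(\beta_2 > 0\) and \(Cov(S_i, A_i) > 0\) here.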
10.2.2 Correlation vs. Causality
- Omitted variable bias is related to a well-known argument of “Correlation or Causality”.
- Example: Does education indeed affect your wage, or does unobserved ability affect both education and wage, leading to a correlation between education and wage?
- See my lecture note from Intermediate Seminar (Fall 2018) for the details.
10.3 Multicollinearity issue
10.3.1 Perfect Multicollinearity
- If one of the explanatory variables is a linear combination of other variables, we have perfect multicollinearity.
- In this case, you cannot estimate all the coefficients.
- For example, \[ y_i = \beta_0 + \beta_1 x_1 + \beta_2\cdot x_2 + \epsilon_i \] and \(x_2 = 2x_1\).
- These explanatory variables are collinear. You are not able to estimate both \(\beta_1\) and \(\beta_2\).
- To see this, the above model can be written as \[ y_i = \beta_0 + \beta_1 x_1 + \beta_2\cdot2x_1 + \epsilon_i \] and this is the same as \[ y_i = \beta_0 + (\beta_1 + 2 \beta_2 ) x_1 + \epsilon_i \]
- You can estimate the composite term \(\beta_1 + 2 \beta_2\) as a coefficient on \(x_1\), but not \(\beta_1\) and \(\beta_2\) separately.
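A quick numerical illustration (a sketch with hypothetical coefficients 0.5 and 0.3): with \(x_2 = 2x_1\), the design matrix loses a rank, so \(\beta_1\) and \(\beta_2\) are not separately identified, but the composite \(\beta_1 + 2\beta_2\) still is:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(size=n)
x2 = 2 * x1                      # exact linear dependence: x2 = 2*x1
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

# The design matrix [1, x1, x2] has rank 2, not 3.
X = np.column_stack([np.ones(n), x1, x2])
print(np.linalg.matrix_rank(X))  # 2

# X'X is singular, so the usual OLS formula fails; lstsq returns one of the
# infinitely many minimizers, but the composite b1 + 2*b2 is pinned down.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
composite = beta[1] + 2 * beta[2]
print(composite)                 # close to 0.5 + 2*0.3 = 1.1
```

Any solver that returns a particular solution (here, the minimum-norm one) splits the composite between \(\beta_1\) and \(\beta_2\) arbitrarily; only their combination \(\beta_1 + 2\beta_2\) is meaningful.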
10.3.1.1 Some Intuition
- Intuitively speaking, the regression coefficients are estimated by capturing how the variation of the explanatory variable \(x\) affects the variation of the dependent variable \(y\)
- Since \(x_1\) and \(x_2\) move together perfectly, we cannot tell how much of the variation of \(y\) is due to \(x_1\) or to \(x_2\), and hence we cannot separate \(\beta_1\) and \(\beta_2\).
10.3.1.2 Dummy variable
- Consider the dummy variables that indicate male and female. \[ male_{i}=\begin{cases} 1 & if\ male\\ 0 & if\ female \end{cases},\ female_{i}=\begin{cases} 1 & if\ female\\ 0 & if\ male \end{cases} \]
- If you put both male and female dummies into the regression, \[ y_i = \beta_0 + \beta_1 female_i + \beta_2 male_i + \epsilon_i \]
- Since \(male_i + female_i = 1\) for all \(i\), we have perfect multicollinearity.
- You should always omit the dummy variable of one of the groups in the linear regression.
- For example, \[ y_i = \beta_0 + \beta_1 female_i + \epsilon_i \]
- In this case, \(\beta_1\) is interpreted as the effect of being female in comparison with male.
- The omitted group is the basis for the comparison.
- You should do the same thing when you deal with multiple groups such as \[ \begin{aligned} freshman_{i}&=\begin{cases} 1 & if\ freshman\\ 0 & otherwise \end{cases} \\ sophomore_{i}&=\begin{cases} 1 & if\ sophomore\\ 0 & otherwise \end{cases} \\ junior_{i}&=\begin{cases} 1 & if\ junior\\ 0 & otherwise \end{cases} \\ senior_{i}&=\begin{cases} 1 & if\ senior\\ 0 & otherwise \end{cases} \end{aligned} \] and \[ y_i = \beta_0 + \beta_1 freshman_i + \beta_2 sophomore_i + \beta_3 junior_i + \epsilon_i \]
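The dummy variable trap can be seen numerically. In this sketch the group effect \(-0.4\) is a made-up value; including both dummies alongside the intercept makes the columns sum to the intercept column, so the design matrix is rank-deficient:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
male = rng.integers(0, 2, size=n)
female = 1 - male                      # male_i + female_i = 1 for every i
y = 2.0 - 0.4 * female + rng.normal(size=n)  # -0.4 is a hypothetical effect

# Intercept + both dummies: female + male equals the intercept column,
# so the matrix has rank 2 instead of 3 (perfect multicollinearity).
X_bad = np.column_stack([np.ones(n), female, male])
print(np.linalg.matrix_rank(X_bad))    # 2

# Omit one group (male); its mean is absorbed by the intercept.
X_ok = np.column_stack([np.ones(n), female])
beta = np.linalg.lstsq(X_ok, y, rcond=None)[0]
print(beta[1])                         # close to -0.4
```

The coefficient on the female dummy recovers the female-male difference, with the omitted group (male) serving as the basis for comparison.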
10.3.2 Imperfect multicollinearity.
- Even when the explanatory variables are not perfectly collinear, the correlation between them might be very high, which we call imperfect multicollinearity.
- How does this affect the OLS estimator?
- To see this, we consider the following simple model (with homoskedasticity) \[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i, \quad V(\epsilon_i) = \sigma^2 \]
- You can show that the conditional variance (not the asymptotic variance) is given by \[ V(\hat\beta_1 | X) = \frac{\sigma^{2}}{N\cdot\hat{V}(x_{1i})\cdot(1-R_{1}^{2})} \] where \(\hat V(x_{1i})\) is the sample variance \[ \hat V(x_{1i}) =\frac{1}{N}\sum(x_{1i}-\bar{x}_{1})^{2} \] and \(R_{1}^{2}\) is the R-squared in the following regression of \(x_1\) on \(x_2\): \[ x_{1i} = \pi_0 + \pi_1 x_{2i} + u_i \]
- You can see that the variance of the OLS estimator \(\hat{\beta}_{1}\) is small if
- \(N\) is large (i.e., more observations!)
- \(\hat V(x_{1i})\) is large (more variation in \(x_{1i}\)!)
- \(R_{1}^{2}\) is small.
- Here, high \(R_{1}^{2}\) means that \(x_{1i}\) is explained well by the other variables in a linear way.
- The extreme case is \(R_{1}^{2}=1\), that is, \(x_{1i}\) is a linear combination of the other variables, implying perfect multicollinearity!
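The variance formula can be verified against the standard matrix expression \(\sigma^2 (X'X)^{-1}\). This is a sketch where the correlation of about 0.9 between \(x_{1i}\) and \(x_{2i}\) is an arbitrary choice to mimic imperfect multicollinearity:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
sigma = 1.0
x2 = rng.normal(size=n)
x1 = 0.9 * x2 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # highly correlated

# Exact conditional variance of beta1_hat from sigma^2 * (X'X)^{-1}.
X = np.column_stack([np.ones(n), x1, x2])
V_exact = sigma**2 * np.linalg.inv(X.T @ X)[1, 1]

# The formula in the text: sigma^2 / (N * Vhat(x1) * (1 - R1^2)),
# where R1^2 comes from regressing x1 on x2 (with an intercept).
Z = np.column_stack([np.ones(n), x2])
resid = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
R1sq = 1 - resid.var() / x1.var()     # np.var uses 1/N, matching the text
V_formula = sigma**2 / (n * x1.var() * (1 - R1sq))

print(V_exact, V_formula)             # the two expressions agree
```

Raising the correlation toward 1 drives \(R_1^2\) toward 1 and blows up both expressions, which is exactly the imperfect multicollinearity problem.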
10.4 Lesson for an empirical analysis
- We often say the variation of the variable of interest is important in an empirical analysis.
- This has two meanings:
- exogenous variation (i.e., uncorrelated with error term)
- large variance
- The former is a key for the mean independence assumption.
- The latter is a key for precise estimation (smaller standard error).
- If we have more variation, the standard error of the OLS estimator is small, meaning that we can precisely estimate the coefficient.
- The variation of the variable after controlling for other factors that affect \(y\) is also crucial (corresponding to \(1-R_1^2\) above).
- If you do not include other variables (say \(x_2\) above), you will have omitted variable bias.
- To address research questions using data, it is important to find good variation in the explanatory variable that you want to focus on. This is often called an identification strategy.
- An identification strategy is context-specific. To have a good identification strategy, you should be familiar with the background of your study.