10 Linear Regression 3: Discussions on OLS Assumptions
10.1 Introduction
- Remember that we have four assumptions in OLS estimation
- Random sample: \(\{ Y_i , X_{i1}, \ldots, X_{iK} \}\) is an i.i.d. sample - i.i.d.: independently and identically distributed
- \(\epsilon_i\) has zero conditional mean \[
E[ \epsilon_i | X_{i1}, \ldots, X_{iK}] = 0
\]
- This implies \(Cov(X_{ik}, \epsilon_i) = 0\) for all \(k\). (or \(E[\epsilon_i X_{ik}] = 0\))
- No correlation between error term and explanatory variables.
- Large outliers are unlikely:
- The random variable \(Y_i\) and \(X_{ik}\) have finite fourth moments.
- No perfect multicollinearity:
- There is no linear relationship between explanatory variables.
- The OLS estimator has desirable properties (unbiasedness, consistency, asymptotic normality) under these assumptions.
- In this chapter, we study the role of these assumptions.
- In particular, we focus on the following two assumptions
- No correlation between \(\epsilon_{i}\) and \(X_{ik}\)
- No perfect multicollinearity
10.2 Endogeneity problem
- When \(Cov(x_k, \epsilon)=0\) does not hold, we have an endogeneity problem.
- We call such \(x_k\) an endogenous variable.
- There are several cases in which an endogeneity problem arises:
- Omitted variable bias
- Measurement error
- Simultaneity
- Sample selection
- Here, I focus on the omitted variable bias.
10.2.1 Omitted variable bias
- Consider the wage regression equation (true model) \[ \begin{aligned} \log W_{i} &= \beta_{0}+\beta_{1}S_{i}+\beta_{2}A_{i}+u_{i} \\ E[u_{i}|S_{i},A_{i}] &= 0 \end{aligned} \] where \(W_{i}\) is the wage, \(S_{i}\) is the years of schooling, and \(A_{i}\) is ability.
- What we want to know is \(\beta_1\), the effect of the schooling on the wage holding other things fixed. Also called the returns from education.
- An issue is that we do not often observe the ability of a person directly.
- Suppose that you omit \(A_i\) and run the following regression instead. \[ \log W_{i} = \alpha_{0}+\alpha_{1} S_{i} + v_i \]
- Notice that \(v_i = \beta_2 A_i + u_i\), so \(S_i\) and \(v_i\) are likely to be correlated.
- The OLS estimator \(\hat\alpha_1\) is biased: \[
E[\hat\alpha_1] = \beta_1 + \beta_2\frac{Cov(S_i, A_i)}{Var(S_i)} \]
- You can also say \(\hat\alpha_1\) is not consistent for \(\beta_1\), i.e., \[ \hat{\alpha}_{1}\overset{p}{\longrightarrow}\beta_{1}+\beta_{2}\frac{Cov(S_{i},A_{i})}{Var(S_{i})} \]
- This is known as the omitted variable bias formula.
- Omitted variable bias depends on:
  1. The effect of the omitted variable (\(A_i\) here) on the dependent variable: \(\beta_2\)
  2. The correlation between the omitted variable and the included explanatory variable.
- This is super-important: You can make a guess regarding the direction and the magnitude of the bias!!
- This is crucial when you read an empirical paper and do an empirical exercise.
Here is the summary table, where \(x_1\) is the included variable, \(x_2\) is the omitted variable, and \(\beta_2\) is the coefficient on \(x_2\):

|  | \(Cov(x_1, x_2) > 0\) | \(Cov(x_1, x_2) < 0\) |
|---|---|---|
| \(\beta_2 > 0\) | Positive bias | Negative bias |
| \(\beta_2 < 0\) | Negative bias | Positive bias |
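The bias formula can be checked with a small simulation. This is a sketch with made-up parameter values (\(\beta_1 = 0.10\), \(\beta_2 = 0.05\)) and schooling constructed to be positively correlated with ability, so the table predicts positive bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True model: log W = b0 + b1*S + b2*A + u (coefficients are hypothetical).
b0, b1, b2 = 1.0, 0.10, 0.05
A = rng.normal(size=n)                 # unobserved ability
S = 12 + 2 * A + rng.normal(size=n)    # schooling, Cov(S, A) > 0 by construction
u = rng.normal(size=n)
logW = b0 + b1 * S + b2 * A + u

# Short regression of log W on S alone (A omitted), intercept included.
X = np.column_stack([np.ones(n), S])
alpha = np.linalg.lstsq(X, logW, rcond=None)[0]

# Omitted variable bias formula: plim(alpha1) = b1 + b2 * Cov(S, A) / Var(S)
predicted = b1 + b2 * np.cov(S, A)[0, 1] / np.var(S, ddof=1)
print(alpha[1], predicted)  # both close to 0.10 + 0.05 * 2/5 = 0.12
```

The estimate exceeds the true \(\beta_1 = 0.10\), matching the "positive bias" cell of the table since both \(\beta_2 > 0\) and \(Cov(S_i, A_i) > 0\) here.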
10.2.2 Correlation vs. Causality
- Omitted variable bias is related to a well-known argument of “Correlation or Causality”.
- Example: Does education indeed affect your wage, or does unobserved ability affect both education and wage, leading to a correlation between education and wage?
- See my lecture note from Intermediate Seminar (Fall 2018) for the details.
10.3 Multicollinearity issue
10.3.1 Perfect Multicollinearity
- If one of the explanatory variables is a linear combination of other variables, we have perfect multicollinearity.
- In this case, you cannot estimate all the coefficients.
- For example, \[ y_i = \beta_0 + \beta_1 x_1 + \beta_2\cdot x_2 + \epsilon_i \] and \(x_2 = 2x_1\).
- These explanatory variables are collinear. You are not able to estimate both \(\beta_1\) and \(\beta_2\).
- To see this, the above model can be written as \[ y_i = \beta_0 + \beta_1 x_1 + \beta_2\cdot2x_1 + \epsilon_i \] and this is the same as \[ y_i = \beta_0 + (\beta_1 + 2 \beta_2 ) x_1 + \epsilon_i \]
- You can estimate the composite term \(\beta_1 + 2 \beta_2\) as a coefficient on \(x_1\), but not \(\beta_1\) and \(\beta_2\) separately.
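A quick numerical illustration (a sketch with hypothetical coefficients 0.5 and 0.3): with \(x_2 = 2x_1\), the design matrix loses a rank, so \(\beta_1\) and \(\beta_2\) are not separately identified, but the composite \(\beta_1 + 2\beta_2\) still is:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(size=n)
x2 = 2 * x1                      # exact linear dependence: x2 = 2*x1
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(size=n)

# The design matrix [1, x1, x2] has rank 2, not 3.
X = np.column_stack([np.ones(n), x1, x2])
print(np.linalg.matrix_rank(X))  # 2

# X'X is singular, so the usual OLS formula fails; lstsq returns one of the
# infinitely many minimizers, but the composite b1 + 2*b2 is pinned down.
beta = np.linalg.lstsq(X, y, rcond=None)[0]
composite = beta[1] + 2 * beta[2]
print(composite)                 # close to 0.5 + 2*0.3 = 1.1
```

Any solver that returns a particular solution (here, the minimum-norm one) splits the composite between \(\beta_1\) and \(\beta_2\) arbitrarily; only their combination \(\beta_1 + 2\beta_2\) is meaningful.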
10.3.1.1 Some Intuition
- Intuitively speaking, the regression coefficients are estimated by capturing how the variation of the explanatory variable \(x\) affects the variation of the dependent variable \(y\)
- Since \(x_1\) and \(x_2\) move together perfectly, we cannot tell how much of the variation of \(y\) is due to \(x_1\) or to \(x_2\), and hence we cannot separate \(\beta_1\) and \(\beta_2\).
10.3.1.2 Dummy variable
- Consider the dummy variables that indicate male and female. \[ male_{i}=\begin{cases} 1 & if\ male\\ 0 & if\ female \end{cases},\ female_{i}=\begin{cases} 1 & if\ female\\ 0 & if\ male \end{cases} \]
- If you put both male and female dummies into the regression, \[ y_i = \beta_0 + \beta_1 female_i + \beta_2 male_i + \epsilon_i \]
- Since \(male_i + female_i = 1\) for all \(i\), we have perfect multicollinearity.
- You should always omit the dummy variable of one of the groups in the linear regression.
- For example, \[ y_i = \beta_0 + \beta_1 female_i + \epsilon_i \]
- In this case, \(\beta_1\) is interpreted as the effect of being female in comparison with male.
- The omitted group is the basis for the comparison.
- You should do the same thing when you deal with multiple groups such as \[ \begin{aligned} freshman_{i}&=\begin{cases} 1 & if\ freshman\\ 0 & otherwise \end{cases} \\ sophomore_{i}&=\begin{cases} 1 & if\ sophomore\\ 0 & otherwise \end{cases} \\ junior_{i}&=\begin{cases} 1 & if\ junior\\ 0 & otherwise \end{cases} \\ senior_{i}&=\begin{cases} 1 & if\ senior\\ 0 & otherwise \end{cases} \end{aligned} \] and \[ y_i = \beta_0 + \beta_1 freshman_i + \beta_2 sophomore_i + \beta_3 junior_i + \epsilon_i \]
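The dummy variable trap can be seen numerically. In this sketch the group effect \(-0.4\) is a made-up value; including both dummies alongside the intercept makes the columns sum to the intercept column, so the design matrix is rank-deficient:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
male = rng.integers(0, 2, size=n)
female = 1 - male                      # male_i + female_i = 1 for every i
y = 2.0 - 0.4 * female + rng.normal(size=n)  # -0.4 is a hypothetical effect

# Intercept + both dummies: female + male equals the intercept column,
# so the matrix has rank 2 instead of 3 (perfect multicollinearity).
X_bad = np.column_stack([np.ones(n), female, male])
print(np.linalg.matrix_rank(X_bad))    # 2

# Omit one group (male); its mean is absorbed by the intercept.
X_ok = np.column_stack([np.ones(n), female])
beta = np.linalg.lstsq(X_ok, y, rcond=None)[0]
print(beta[1])                         # close to -0.4
```

The coefficient on the female dummy recovers the female-male difference, with the omitted group (male) serving as the basis for comparison.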
10.3.2 Imperfect multicollinearity.
- Even when the explanatory variables are not perfectly collinear, the correlation between them might be very high, which we call imperfect multicollinearity.
- How does this affect the OLS estimator?
- To see this, we consider the following simple model (with homoskedasticity) \[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i, \quad V(\epsilon_i) = \sigma^2 \]
- You can show that the conditional variance (not the asymptotic variance) is given by \[ V(\hat\beta_1 | X) = \frac{\sigma^{2}}{N\cdot\hat{V}(x_{1i})\cdot(1-R_{1}^{2})} \] where \(\hat V(x_{1i})\) is the sample variance \[ \hat V(x_{1i}) =\frac{1}{N}\sum(x_{1i}-\bar{x}_{1})^{2} \] and \(R_{1}^{2}\) is the R-squared in the following regression of \(x_1\) on \(x_2\): \[ x_{1i} = \pi_0 + \pi_1 x_{2i} + u_i \]
- You can see that the variance of the OLS estimator \(\hat{\beta}_{1}\) is small if
- \(N\) is large (i.e., more observations!)
- \(\hat V(x_{1i})\) is large (more variation in \(x_{1i}\)!)
- \(R_{1}^{2}\) is small.
- Here, high \(R_{1}^{2}\) means that \(x_{1i}\) is explained well by the other variables in a linear way.
- The extreme case is \(R_{1}^{2}=1\), that is, \(x_{1i}\) is a linear combination of the other variables, implying perfect multicollinearity!
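The variance formula can be verified against the standard matrix expression \(\sigma^2 (X'X)^{-1}\). This is a sketch where the correlation of about 0.9 between \(x_{1i}\) and \(x_{2i}\) is an arbitrary choice to mimic imperfect multicollinearity:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
sigma = 1.0
x2 = rng.normal(size=n)
x1 = 0.9 * x2 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)  # highly correlated

# Exact conditional variance of beta1_hat from sigma^2 * (X'X)^{-1}.
X = np.column_stack([np.ones(n), x1, x2])
V_exact = sigma**2 * np.linalg.inv(X.T @ X)[1, 1]

# The formula in the text: sigma^2 / (N * Vhat(x1) * (1 - R1^2)),
# where R1^2 comes from regressing x1 on x2 (with an intercept).
Z = np.column_stack([np.ones(n), x2])
resid = x1 - Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
R1sq = 1 - resid.var() / x1.var()     # np.var uses 1/N, matching the text
V_formula = sigma**2 / (n * x1.var() * (1 - R1sq))

print(V_exact, V_formula)             # the two expressions agree
```

Raising the correlation toward 1 drives \(R_1^2\) toward 1 and blows up both expressions, which is exactly the imperfect multicollinearity problem.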
10.4 Lesson for an empirical analysis
- We often say the variation of the variable of interest is important in an empirical analysis.
- This has two meanings:
- exogenous variation (i.e., uncorrelated with error term)
- large variance
- The former is a key for the mean independence assumption.
- The latter is a key for precise estimation (smaller standard error).
- If we have more variation, the standard error of the OLS estimator is small, meaning that we can precisely estimate the coefficient.
- The variation of the variable after controlling for other factors that affect \(y\) is also crucial (corresponding to \(1-R_1^2\) above).
- If you do not include other variables (say \(x_2\) above), you will have omitted variable bias.
- To address research questions using data, it is important to find good variation in the explanatory variable that you want to focus on. This is often called an identification strategy.
- An identification strategy is context-specific. To have a good identification strategy, you should be familiar with the background of your study.