1 Introduction

1.1 OLS Assumptions

\[ Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_K X_{iK} + \epsilon_i \]

  1. Random sample: \(\{ Y_i , X_{i1}, \ldots, X_{iK} \}\) is an i.i.d. sample
    • i.i.d.: independently and identically distributed
  2. \(\epsilon_i\) has zero conditional mean \[ E[ \epsilon_i | X_{i1}, \ldots, X_{iK}] = 0 \]
    • This implies \(Cov(X_{ik}, \epsilon_i) = 0\) for all \(k\) (equivalently, \(E[\epsilon_i X_{ik}] = 0\)).
    • No correlation between error term and explanatory variables.
  3. Large outliers are unlikely:
    • The random variables \(Y_i\) and \(X_{ik}\) have finite fourth moments.
  4. No perfect multicollinearity:
    • There is no exact linear relationship between the explanatory variables.

  • The OLS estimator has desirable properties (unbiasedness, consistency, asymptotic normality) under these assumptions; the simulation sketch below illustrates this.
  • In this chapter, we study the role of these assumptions.
  • In particular, we focus on the following two assumptions:
    1. No correlation between \(\epsilon_i\) and \(X_{ik}\)
    2. No perfect multicollinearity
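
  • To make the setup concrete, here is a minimal simulation sketch in Python (the data-generating process, coefficient values, and sample size are illustrative assumptions, not part of the notes): data drawn to satisfy assumptions 1–4, with OLS recovering the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
beta = np.array([1.0, 2.0, -0.5])            # (beta_0, beta_1, beta_2), assumed values

X = rng.normal(size=(N, 2))                  # i.i.d. regressors, no collinearity
eps = rng.normal(size=N)                     # E[eps | X] = 0, finite fourth moments
y = beta[0] + X @ beta[1:] + eps

X_design = np.column_stack([np.ones(N), X])  # prepend a constant column
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)                              # close to (1.0, 2.0, -0.5)
```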

2 Endogeneity

2.1 Endogeneity problem

  • When \(Cov(X_{ik}, \epsilon_i) = 0\) does not hold, we have an endogeneity problem.
    • We call such an \(X_{ik}\) an endogenous variable.
  • There are several situations in which an endogeneity problem arises:
    1. Omitted variable bias
    2. Measurement error
    3. Simultaneity
    4. Sample selection
  • Here, I focus on omitted variable bias.

2.2 Omitted variable bias

  • Consider the wage regression equation (true model) \[ \begin{aligned} \log W_{i} &= \beta_{0}+\beta_{1}S_{i}+\beta_{2}A_{i}+u_{i} \\ E[u_{i} \mid S_{i},A_{i}] &= 0 \end{aligned} \] where \(W_{i}\) is the wage, \(S_{i}\) is years of schooling, and \(A_{i}\) is ability.
  • What we want to know is \(\beta_1\), the effect of schooling on the wage holding other things fixed, also called the returns to education.
  • An issue is that we often do not observe a person's ability directly.

  • Suppose that you omit \(A_i\) and run the following regression instead. \[ \log W_{i} = \alpha_{0}+\alpha_{1} S_{i} + v_i \]
    • Notice that \(v_i = \beta_2 A_i + u_i\), so \(S_i\) and \(v_i\) are likely to be correlated.
  • The OLS estimator \(\hat\alpha_1\) is biased: \[ E[\hat\alpha_1] = \beta_1 + \beta_2\frac{Cov(S_i, A_i)}{Var(S_i)} \]
  • You can also say that \(\hat\alpha_1\) is not consistent for \(\beta_1\), i.e., \[ \hat{\alpha}_{1}\overset{p}{\longrightarrow}\beta_{1}+\beta_{2}\frac{Cov(S_{i},A_{i})}{Var(S_{i})} \]
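
  • A simulation sketch of this bias (the coefficients and the dependence of \(S_i\) on \(A_i\) are illustrative assumptions): the short regression of \(\log W_i\) on \(S_i\) alone converges to \(\beta_1 + \beta_2 Cov(S_i, A_i)/Var(S_i)\), not to \(\beta_1\).

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000
beta0, beta1, beta2 = 1.0, 0.10, 0.05        # assumed true coefficients

A = rng.normal(size=N)                       # unobserved ability
S = 12 + 2 * A + rng.normal(size=N)          # schooling correlated with ability
logW = beta0 + beta1 * S + beta2 * A + rng.normal(scale=0.1, size=N)

# Short regression: omit A and regress log W on S only
X = np.column_stack([np.ones(N), S])
alpha_hat, *_ = np.linalg.lstsq(X, logW, rcond=None)

bias = beta2 * np.cov(S, A)[0, 1] / np.var(S, ddof=1)
print(alpha_hat[1], beta1 + bias)            # both approx. 0.12, not 0.10
```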

2.3 Omitted variable bias formula

  • Omitted variable bias depends on
    1. The effect of the omitted variable (\(A_i\) here) on the dependent variable: \(\beta_2\)
    2. The correlation between the omitted variable and the included explanatory variable: \(Cov(S_i, A_i)\)
  • Summary table (\(x_1\): included, \(x_2\): omitted; \(\beta_2\) is the coefficient on \(x_2\)):

    |                 | \(Cov(x_1, x_2) > 0\) | \(Cov(x_1, x_2) < 0\) |
    |-----------------|-----------------------|-----------------------|
    | \(\beta_2 > 0\) | Positive bias         | Negative bias         |
    | \(\beta_2 < 0\) | Negative bias         | Positive bias         |
  • You can thus make a guess about the direction of the bias!
  • This is crucial when reading an empirical paper and doing your own empirical analysis.

2.4 Correlation vs. Causality

  • Omitted variable bias is related to the well-known question of “correlation or causality”.
  • Example: Does education indeed affect your wage, or does unobserved ability affect both education and the wage, leading to a correlation between education and wage?

3 Multicollinearity issue

3.1 Perfect Multicollinearity

  • Perfect multicollinearity: one of the explanatory variables is a linear combination of the other variables.
    • In this case, you cannot estimate all the coefficients.
  • For example, \[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i \] with \(x_{2i} = 2x_{1i}\).
  • We cannot estimate both \(\beta_1\) and \(\beta_2\), as the sketch below shows.
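
  • A minimal sketch (with assumed numbers): when \(x_{2i} = 2x_{1i}\), the design matrix is rank-deficient, so there is no unique coefficient vector.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
x1 = rng.normal(size=N)
x2 = 2 * x1                                  # perfect multicollinearity
X = np.column_stack([np.ones(N), x1, x2])

print(np.linalg.matrix_rank(X))              # 2, not 3: one column is redundant
# lstsq still returns *a* solution (the minimum-norm one), but it is not
# unique: any (b1, b2) with b1 + 2*b2 held fixed fits the data equally well.
```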

3.2 Some Intuition

  • Intuitively speaking, regression coefficients are estimated by capturing how variation in the explanatory variable \(x\) relates to variation in the dependent variable \(y\).
  • Since \(x_1\) and \(x_2\) move together perfectly, we cannot say how much of the variation in \(y\) is due to \(x_1\) or to \(x_2\), so \(\beta_1\) and \(\beta_2\) cannot be estimated separately.

3.3 Example: Dummy variable

  • Consider dummy variables that indicate male and female: \[ male_{i}=\begin{cases} 1 & if\ male\\ 0 & if\ female \end{cases},\ female_{i}=\begin{cases} 1 & if\ female\\ 0 & if\ male \end{cases} \]
  • Suppose you put both the male and female dummies into the regression: \[ y_i = \beta_0 + \beta_1 female_i + \beta_2 male_i + \epsilon_i \]
  • Since \(male_i + female_i = 1\) for all \(i\), we have perfect multicollinearity.

  • You should always omit the dummy variable for one of the groups.
  • For example, \[ y_i = \beta_0 + \beta_1 female_i + \epsilon_i \]
  • In this case, \(\beta_1\) is interpreted as the effect of being female in comparison with being male.
    • The omitted group is the basis for the comparison.

  • You should do the same thing when you deal with multiple groups, such as \[ \begin{aligned} freshman_{i}&= \begin{cases} 1 & if\ freshman\\ 0 & otherwise \end{cases} \\ sophomore_{i}&= \begin{cases} 1 & if\ sophomore\\ 0 & otherwise \end{cases} \\ junior_{i}&= \begin{cases} 1 & if\ junior\\ 0 & otherwise \end{cases} \\ senior_{i}&= \begin{cases} 1 & if\ senior\\ 0 & otherwise \end{cases} \end{aligned} \] and \[ y_i = \beta_0 + \beta_1 freshman_i + \beta_2 sophomore_i + \beta_3 junior_i + \epsilon_i \] Here the senior dummy is omitted and serves as the basis for comparison (see the sketch below).
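
  • A sketch of the dummy variable trap and its fix (simulated data with assumed coefficients): including all group dummies alongside the intercept makes the design matrix singular, while dropping one dummy restores full rank.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 1_000
female = rng.integers(0, 2, size=N).astype(float)
male = 1.0 - female                          # male_i + female_i = 1 for all i
y = 1.0 - 0.2 * female + rng.normal(size=N)  # assumed effect of female: -0.2

X_trap = np.column_stack([np.ones(N), female, male])
X_ok = np.column_stack([np.ones(N), female])

print(np.linalg.matrix_rank(X_trap))         # 2: constant = female + male
b, *_ = np.linalg.lstsq(X_ok, y, rcond=None)
print(b)                                     # approx. (1.0, -0.2), male as base group
```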

3.4 Imperfect multicollinearity

  • Even if not perfectly collinear, the correlation between explanatory variables might be very high; we call this imperfect multicollinearity.
  • How does this affect the OLS estimator?
  • To see this, we consider the following simple model (with homoskedasticity) \[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i, V(\epsilon_i) = \sigma^2 \]

  • You can show that the conditional variance (not the asymptotic variance) is given by \[ V(\hat\beta_1 | X) = \frac{\sigma^{2}}{N\cdot\hat{V}(x_{1i})\cdot(1-R_{1}^{2})} \] where \(\hat V(x_{1i})\) is the sample variance \[ \hat V(x_{1i}) =\frac{1}{N}\sum_i (x_{1i}-\bar{x}_{1})^{2} \] and \(R_{1}^{2}\) is the R-squared in the following regression of \(x_1\) on \(x_2\): \[ x_{1i} = \pi_0 + \pi_1 x_{2i} + u_i \]

  • The variance of the OLS estimator \(\hat{\beta}_{1}\) is small if
    1. \(N\) is large (i.e., more observations!)
    2. \(\hat V(x_{1i})\) is large (more variation in \(x_{1i}\)!)
    3. \(R_{1}^{2}\) is small.
  • Here, a high \(R_{1}^{2}\) means that \(x_{1i}\) is well explained by the other variables in a linear way.
    • The extreme case is \(R_{1}^{2}=1\), that is, \(x_{1i}\) is a linear combination of the other variables, implying perfect multicollinearity!!
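
  • A sketch that checks the variance formula numerically (the data-generating process and correlation values are assumptions): as the correlation between \(x_{1i}\) and \(x_{2i}\) rises, \(R_1^2\) rises and \(V(\hat\beta_1 | X)\) blows up.

```python
import numpy as np

rng = np.random.default_rng(4)
N, sigma2 = 1_000, 1.0

for rho in [0.0, 0.5, 0.9, 0.99]:
    x1 = rng.normal(size=N)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=N)
    # With one other regressor, R_1^2 equals the squared correlation of x1 and x2
    r1_sq = np.corrcoef(x1, x2)[0, 1] ** 2
    var_beta1 = sigma2 / (N * x1.var() * (1 - r1_sq))
    print(f"rho={rho:4.2f}  R1^2={r1_sq:.3f}  V(beta1_hat|X)={var_beta1:.6f}")
```

  • With \(\rho = 0.99\) the variance is roughly fifty times larger than with \(\rho = 0\), even though \(N\) and \(\hat V(x_{1i})\) are unchanged.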

4 Research Design, Identification Strategy

4.1 Guide for causal analysis

  • Suppose that you want to know the causal effect of \(X\) on \(Y\)
  • The variation of the variable of interest \(X\) is important.
  • “Variation” here has two meanings:
    1. exogenous variation (i.e., variation uncorrelated with the error term)
    2. large variance of the variable
  • The former is key to the mean independence assumption (no bias).
  • The latter is key to precise estimation (smaller standard errors).

4.2 Point 1: Exogeneity of \(X\)

  • Mean independence is key to unbiased estimation.
  • This is hard to argue, as it requires reasoning about unobserved factors.
  • Strategy 1: Add control variables
    • The variable of interest should be uncorrelated with the error conditional on other variables (confounders).
    • How many variables do we need to add?
  • Strategy 2: Find exogenous variation.
    • Randomized control trial (field experiment)
    • Natural experiment: the variable of interest is determined as if it were in an experiment.
    • Instrumental variable estimation: use another variable \(Z\) that is exogenous (see the sketch below).
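
  • A minimal sketch of the instrumental variable idea (the instrument \(z\) and all coefficients below are illustrative assumptions): \(x\) is endogenous because it shares an unobserved component with the error, while \(z\) shifts \(x\) but is unrelated to the error, so the simple IV estimator recovers the true effect where OLS does not.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 100_000
u = rng.normal(size=N)                        # unobserved confounder
z = rng.normal(size=N)                        # exogenous instrument
x = 0.8 * z + u + rng.normal(size=N)          # x is endogenous: Cov(x, u) > 0
y = 1.0 + 0.5 * x + u + rng.normal(size=N)    # assumed true effect of x: 0.5

beta_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)     # biased upward
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]     # simple IV estimator
print(beta_ols, beta_iv)                      # approx. 0.88 vs. approx. 0.50
```

  • In practice one would use two-stage least squares with proper standard errors from a dedicated package; this sketch only shows the identification logic.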

4.3 Point 2: Enough variation of \(X\)

  • With more variation in \(X\), we can estimate the coefficient more precisely.
  • The variation remaining after controlling for other factors that affect \(y\) is also crucial.
    • Remember \(1-R_1^2\) above.
  • If you include many control variables to deal with omitted variable bias, you may end up with no independent variation in \(X\).
  • In that case, you cannot estimate the effect of \(X\) from the data.

4.4 Summary

  • To address research questions using data, it is important to find a good source of variation in the explanatory variable you focus on.
  • This is often called an identification strategy or research design.
  • An identification strategy is context-specific; you should be familiar with the background of your study.