8 Linear Regression 1: Theory
8.1 Regression framework
- Let \(Y_i\) be the dependent variable and \(X_{ik}\) be the \(k\)-th explanatory variable.
- We have \(K\) explanatory variables (along with a constant term)
- \(i\) is an index for observations. \(i = 1,\cdots, N\).
- Data (sample): \(\{ Y_i , X_{i1}, \ldots, X_{iK} \}_{i=1}^N\)
- The linear regression model is defined as \[ Y_{i}=\beta_{0}+\beta_{1}X_{i1}+\cdots+\beta_{K}X_{iK}+\epsilon_{i} \]
- \(\epsilon_i\): error term (unobserved)
- \(\beta\): coefficients
- Assumptions for Ordinary Least Squares (OLS) estimation
- Random sample: \(\{ Y_i , X_{i1}, \ldots, X_{iK} \}\) is an i.i.d. sample
- i.i.d.: independently and identically distributed
- \(\epsilon_i\) has zero conditional mean \[ E[ \epsilon_i | X_{i1}, \ldots, X_{iK}] = 0 \]
- Large outliers are unlikely: The random variables \(Y_i\) and \(X_{ik}\) have finite fourth moments.
- No perfect multicollinearity: There is no exact linear relationship between the explanatory variables.
- OLS estimators are the minimizers of the sum of squared residuals: \[ \min_{\beta_0, \cdots, \beta_K} \frac{1}{N} \sum_{i=1}^N (Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_K X_{iK}))^2 \]
- Using matrix notation, we have the following analytical formula for the OLS estimator \[ \hat{\beta} = (X'X)^{-1} X'Y \] where \[ \underbrace{X}_{N\times (K+1)}=\left(\begin{array}{cccc} 1 & X_{11} & \cdots & X_{1K}\\ \vdots & \vdots & & \vdots\\ 1 & X_{N1} & \cdots & X_{NK} \end{array}\right),\underbrace{Y}_{N\times 1}=\left(\begin{array}{c} Y_{1}\\ \vdots\\ Y_{N} \end{array}\right),\underbrace{\beta}_{(K+1)\times 1}=\left(\begin{array}{c} \beta_{0}\\ \beta_{1}\\ \vdots\\ \beta_{K} \end{array}\right) \]
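- As a quick check of the formula, here is a minimal R sketch that computes \((X'X)^{-1}X'Y\) directly and compares it with lm() (simulated data; the variable names and coefficient values are illustrative, not from the lecture):

```r
# OLS via the matrix formula versus lm() on simulated data.
set.seed(123)
N  <- 500
x1 <- rnorm(N)
x2 <- rnorm(N)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(N)

X <- cbind(1, x1, x2)                          # N x (K+1) matrix with a constant
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'Y

drop(beta_hat)            # matrix-formula estimates
coef(lm(y ~ x1 + x2))     # should match
```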
8.2 Theoretical Properties of OLS estimator
- We briefly review the theoretical properties of the OLS estimator.
- Unbiasedness: Conditional on the explanatory variables \(X\), the expectation of the OLS estimator \(\hat{\beta}\) is equal to the true value \(\beta\). \[ E[\hat{\beta} | X] = \beta \]
- Consistency: As the sample size \(N\) goes to infinity, the OLS estimator \(\hat{\beta}\) converges to \(\beta\) in probability \[ \hat{\beta}\overset{p}{\longrightarrow}\beta \]
- Asymptotic normality: We will discuss this later.
8.3 Interpretation and Specifications of Linear Regression Model
- Remember that \[ Y_{i}=\beta_{0}+\beta_{1}X_{i1}+\cdots+\beta_{K}X_{iK}+\epsilon_{i} \]
- The coefficient \(\beta_k\) captures the effect of \(X_k\) on \(Y\) ceteris paribus (holding all other variables constant)
- Equivalently, \[ \frac{\partial Y}{\partial X_k} = \beta_k \] if \(X_k\) is a continuous variable.
- If we can estimate \(\beta_k\) without bias, we obtain the causal effect of \(X_k\) on \(Y\).
- This is, of course, a very difficult task. We will discuss this in more detail later.
- We will see several specifications that are frequently used in empirical analysis:
  1. nonlinear terms
  2. log specification
  3. dummy (categorical) variables
  4. interaction terms
8.3.1 Nonlinear term
- We can capture non-linear relationship between \(Y\) and \(X\) in a linearly additive form \[ Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \beta_3 X_i^3 + \epsilon_i \]
- As long as the error term \(\epsilon_i\) appears in an additively linear way, we can estimate the coefficients by OLS.
- Multicollinearity could be an issue if we include many polynomial terms (see later).
- You can use other nonlinear transformations such as \(\log(x)\) and \(\sqrt{x}\), as in the sketch below.
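- A minimal R sketch of a cubic specification estimated by OLS (simulated data; the variable names and coefficient values are illustrative):

```r
# Polynomial terms entered via I() so that lm() treats x^2 and x^3 as
# additional regressors (simulated data for illustration).
set.seed(1)
x <- runif(200, 0, 5)
y <- 1 + 0.5 * x - 0.3 * x^2 + 0.02 * x^3 + rnorm(200)

fit_poly <- lm(y ~ x + I(x^2) + I(x^3))
summary(fit_poly)
```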
8.3.2 log specification
- We often use log variables for both the dependent and independent variables.
- Using log changes the interpretation of the coefficient \(\beta\) in terms of scales, as summarized in the table below (an R sketch follows the table).
Dependent variable | Explanatory variable | Interpretation |
---|---|---|
\(Y\) | \(X\) | A 1 unit increase in \(X\) causes a \(\beta\) unit change in \(Y\) |
\(\log Y\) | \(X\) | A 1 unit increase in \(X\) causes a \(100 \beta \%\) change in \(Y\) |
\(Y\) | \(\log X\) | A \(1\%\) increase in \(X\) causes a \(\beta / 100\) unit change in \(Y\) |
\(\log Y\) | \(\log X\) | A \(1\%\) increase in \(X\) causes a \(\beta \%\) change in \(Y\) |
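- The four specifications written as lm() formulas; this is a minimal sketch in which `wage`, `educ`, and the data frame `df` are illustrative placeholders, not the lecture's data:

```r
# Simulated placeholder data so the formulas below run as-is.
set.seed(8)
df <- data.frame(educ = sample(8:18, 300, replace = TRUE))
df$wage <- exp(1 + 0.08 * df$educ + rnorm(300, sd = 0.3))

lm(wage ~ educ, data = df)             # level-level: beta = unit change in wage
lm(log(wage) ~ educ, data = df)        # log-level:   100*beta = % change in wage
lm(wage ~ log(educ), data = df)        # level-log:   beta/100 = unit change per 1% increase in educ
lm(log(wage) ~ log(educ), data = df)   # log-log:     beta = elasticity
```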
8.3.3 Dummy variable
- A dummy variable takes only the values 0 and 1. It is used to express qualitative information.
- Example: Dummy variable for race \[ white_{i}=\begin{cases} 1 & \text{if white}\\ 0 & \text{otherwise} \end{cases} \]
- The coefficient on a dummy variable captures the difference of the outcome \(Y\) between categories
- Consider the linear regression \[ Y_i = \beta_0 + \beta_1 white_i + \epsilon_i \] The coefficient \(\beta_1\) captures the difference of \(Y\) between white and non-white people.
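- A minimal R sketch of a dummy-variable regression (simulated data; the variable names are illustrative):

```r
# The coefficient on a 0/1 dummy estimates the difference in mean y
# between the two groups (simulated data for illustration).
set.seed(9)
white <- rbinom(500, 1, 0.6)
y     <- 2 + 1.5 * white + rnorm(500)

fit <- lm(y ~ white)
coef(fit)["white"]   # estimated difference in y between white and non-white
# For a categorical variable with several levels, lm(y ~ factor(race))
# builds the dummies automatically.
```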
8.3.4 Interaction term
- You can add the interaction of two explanatory variables in the regression model.
- For example: \[ wage_i = \beta_0 + \beta_1 educ_i + \beta_2 white_i + \beta_3 educ_i \times white_i + \epsilon_i \] where \(wage_i\) is the earnings of person \(i\) and \(educ_i\) is the years of schooling for person \(i\).
- The effect of \(educ_i\) is \[ \frac{\partial wage_i}{\partial educ_i} = \beta_1 + \beta_3 white_i, \]
- This allows for heterogeneous effects of education across races.
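- A minimal R sketch of the interaction specification (simulated data; the names and coefficient values are illustrative):

```r
# In an R formula, educ * white expands to educ + white + educ:white.
set.seed(10)
educ  <- sample(8:18, 1000, replace = TRUE)
white <- rbinom(1000, 1, 0.5)
wage  <- 5 + 1.0 * educ + 2 * white + 0.5 * educ * white + rnorm(1000)

fit_int <- lm(wage ~ educ * white)
coef(fit_int)
# Effect of educ for non-white: coefficient on educ
# Effect of educ for white:     coefficient on educ + coefficient on educ:white
```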
8.4 Measures of Fit
- We often use \(R^2\) as a measure of the model fit.
- Denote the fitted value as \(\hat{y}_i\) \[
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_K X_{iK}
\]
- Also called prediction from the OLS regression.
- \(R^2\) is defined as \[ R^2 = \frac{SSE}{TSS}, \] where \[ SSE = \sum_i (\hat{y}_i - \bar{y})^2, \quad TSS = \sum_i (y_i - \bar{y})^2 \]
- \(R^2\) captures the fraction of the variation of \(Y\) explained by the regression model.
- Adding variables always (weakly) increases \(R^2\).
- In a regression model with multiple explanatory variables, we often use the adjusted \(R^2\), which adjusts for the number of explanatory variables: \[ \bar{R}^2 = 1 - \frac{N-1}{N-(K+1)} \frac{SSR}{TSS} \] where \[ SSR = \sum_i (\hat{y}_i - y_i)^2 \ (= \sum_i \hat{\epsilon}_i^2 ) \]
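- A minimal R sketch that computes \(R^2\) and \(\bar{R}^2\) from these definitions and checks them against summary.lm() (simulated data; names are illustrative):

```r
# R^2 and adjusted R^2 from the definitions above (simulated data).
set.seed(2)
N  <- 300
x1 <- rnorm(N); x2 <- rnorm(N)
y  <- 1 + x1 - 0.5 * x2 + rnorm(N)

fit   <- lm(y ~ x1 + x2)
y_hat <- fitted(fit)

SSE <- sum((y_hat - mean(y))^2)   # explained sum of squares
TSS <- sum((y - mean(y))^2)       # total sum of squares
SSR <- sum(residuals(fit)^2)      # sum of squared residuals

c(SSE / TSS, summary(fit)$r.squared)                      # R^2, should agree
c(1 - (N - 1) / (N - length(coef(fit))) * SSR / TSS,
  summary(fit)$adj.r.squared)                             # adjusted R^2, should agree
```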
8.5 Statistical Inference
- Notice that the OLS estimators are random variables. They depend on the data, which are random variables drawn from some population distribution.
- We can conduct statistical inference regarding those OLS estimators:
  1. Hypothesis testing
  2. Constructing confidence intervals
- I first explain the sampling distribution of the OLS estimators.
8.5.1 Distribution of the OLS estimators based on asymptotic theory
- Deriving the exact (finite-sample) distribution of the OLS estimators is very hard.
- The OLS estimators depend on the data \(Y_i, X_i\) in a complex way.
- We typically do not know the distribution of \(Y\) and \(X\).
- We rely on asymptotic arguments. We approximate the sampling distribution of the OLS estimator based on the central limit theorem.
- Under the OLS assumptions, the OLS estimator has asymptotic normality \[ \sqrt{N}(\hat{\beta}-\beta)\overset{d}{\rightarrow}N\left(0,V \right) \] where \[ \underbrace{V}_{(K+1)\times(K+1)} = E[\mathbf{x}_{i}\mathbf{x}_{i}']^{-1}E[\mathbf{x}_{i}\mathbf{x}_{i}'\epsilon_{i}^{2}]E[\mathbf{x}_{i}\mathbf{x}_{i}']^{-1} \] and \[ \underbrace{\mathbf{x}_{i}}_{(K+1)\times1}=\left(\begin{array}{c} 1\\ X_{i1}\\ \vdots\\ X_{iK} \end{array}\right) \]
- We can approximate the distribution of \(\hat{\beta}\) by \[ \hat{\beta} \sim N(\beta, V / N) \]
- The above is the joint distribution. Let \(V_{ij}\) be the \((i,j)\) element of the matrix \(V\).
- Each individual coefficient \(\hat\beta_k\) then approximately follows \[ \hat\beta_k \sim N(\beta_k, V_{kk} / N ) \]
8.5.1.1 Estimation of Asymptotic Variance
- \(V\) is an unknown object and needs to be estimated.
- Consider the estimator \(\hat{V}\) for \(V\) using sample analogues \[ \hat{V}=\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_{i}\mathbf{x}_{i}'\right)^{-1}\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_{i}\mathbf{x}_{i}'\hat{\epsilon}_{i}^{2}\right)\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_{i}\mathbf{x}_{i}'\right)^{-1} \] where \(\hat{\epsilon}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_K X_{iK})\) is the residual.
- Technically speaking, \(\hat{V}\) converges to \(V\) in probability. (The proof is beyond the scope of this course.)
- We often use the (asymptotic) standard error \(SE(\hat{\beta}_k) = \sqrt{\hat{V}_{kk} / N }\).
- The standard error is an estimator of the standard deviation of the OLS estimator \(\hat{\beta}_k\).
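- A minimal R sketch of this sandwich-type variance estimator, checked against sandwich::vcovHC() (assumes the sandwich package is installed; simulated data with illustrative names):

```r
# Heteroskedasticity-robust variance estimate V_hat and the implied SEs.
set.seed(3)
N  <- 1000
x1 <- rnorm(N)
y  <- 1 + 2 * x1 + rnorm(N) * (1 + 0.5 * abs(x1))   # heteroskedastic errors

fit  <- lm(y ~ x1)
X    <- model.matrix(fit)   # N x (K+1) design matrix, including the constant
ehat <- residuals(fit)

bread <- solve(crossprod(X) / N)   # (1/N sum x_i x_i')^{-1}
meat  <- crossprod(X * ehat) / N   # 1/N sum x_i x_i' e_i^2
V_hat <- bread %*% meat %*% bread

sqrt(diag(V_hat) / N)                               # SE(beta_k) = sqrt(V_kk / N)
sqrt(diag(sandwich::vcovHC(fit, type = "HC0")))     # should match
```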
8.5.2 Hypothesis testing
- The OLS estimator is a random variable.
- You might want to test a particular hypothesis regarding those coefficients.
- Does \(x\) really affect \(y\)?
- Does the production technology exhibit constant returns to scale?
- Here I explain how to conduct hypothesis testing.
- Step 1: Consider the null hypothesis \(H_{0}\) and the alternative hypothesis \(H_{1}\) \[ H_{0}:\beta_{1}=k,\quad H_{1}:\beta_{1}\neq k \] where \(k\) is a known number that you choose.
- Step 2: Define the t-statistic \[ t_{n}=\frac{\hat{\beta}_1-k}{SE(\hat{\beta}_1)} \]
- Step 3: We reject \(H_{0}\) at the \(\alpha\)-percent significance level if \[ |t_{n}|>C_{\alpha/2} \] where \(C_{\alpha/2}\) is the upper \(\alpha/2\) critical value (the \(1-\alpha/2\) quantile) of the standard normal distribution.
- We say we fail to reject \(H_0\) if the above does not hold.
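- A minimal R sketch of these three steps for \(H_0:\beta_1 = 0\) (simulated data; for simplicity it uses the default lm() standard errors rather than robust ones):

```r
# t-test of H0: beta_1 = 0 built from the estimate and its standard error.
set.seed(4)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

fit <- lm(y ~ x)
b   <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]

t_stat <- (b - 0) / se       # Step 2 with k = 0
abs(t_stat) > qnorm(0.975)   # Step 3: reject H0 at the 5% level if TRUE
```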
8.5.2.1 Caveats on Hypothesis Testing
- We often say \(\hat{\beta}\) is statistically significant at the \(5\%\) level if \(|t_{n}|>1.96\) when we set \(k=0\).
- Statistical significance alone is not enough to make an argument in empirical analysis.
- Magnitude of the coefficient is also important.
- Case 1: A small but statistically significant coefficient.
- As the sample size \(N\) gets large, the \(SE\) decreases, so even an economically negligible effect can become statistically significant.
- Case 2: Large but statistically insignificant coefficient.
- The variable might have an important (economically meaningful) effect.
- But you may not be able to estimate the effect precisely with the sample at hand.
8.5.2.2 F test
- We often test a composite hypothesis that involves multiple parameters such as \[ H_{0}:\beta_{1} + \beta_2 = 0,\ H_{1}:\beta_{1} + \beta_2 \neq 0 \]
- We use an F test in such a case (details to be added); a sketch of one way to run it in R follows.
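- A minimal sketch using car::linearHypothesis() for the composite null above (assumes the car package is installed; this is one common approach, not necessarily the one covered later; simulated data):

```r
# F test of H0: beta_1 + beta_2 = 0 via a linear restriction.
set.seed(5)
x1 <- rnorm(300); x2 <- rnorm(300)
y  <- 1 + 0.8 * x1 - 0.8 * x2 + rnorm(300)

fit <- lm(y ~ x1 + x2)
car::linearHypothesis(fit, "x1 + x2 = 0")   # reports the F statistic and p-value
```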
8.5.3 Confidence interval
- 95% confidence interval \[ CI_{n} =\left\{ k:\left|\frac{\hat{\beta}_{1}-k}{SE(\hat{\beta}_{1})}\right|\leq1.96\right\} =\left[\hat{\beta}_{1}-1.96\times SE(\hat{\beta}_{1}),\ \hat{\beta}_{1}+1.96\times SE(\hat{\beta}_{1})\right] \]
- Interpretation: If you draw many samples (dataset) and construct the 95% CI for each sample, 95% of those CIs will include the true parameter.
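- A minimal R sketch constructing the 95% CI from the formula and comparing it with confint() (simulated data; confint() uses t critical values, so the two agree closely but not exactly):

```r
# Normal-approximation 95% confidence interval for the slope.
set.seed(6)
x <- rnorm(500)
y <- 1 + 0.5 * x + rnorm(500)

fit <- lm(y ~ x)
b   <- coef(summary(fit))["x", "Estimate"]
se  <- coef(summary(fit))["x", "Std. Error"]

c(b - 1.96 * se, b + 1.96 * se)   # CI from the formula
confint(fit, "x", level = 0.95)   # built-in CI (t-based)
```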
8.5.4 Homoskedasticity vs Heteroskedasticity
- So far, we have not imposed any assumption on the variance of the error term \(\epsilon_i\).
- The error term \(\epsilon_{i}\) has heteroskedasticity if \(Var(\epsilon_{i}|X_{i})\) depends on \(X_{i}\).
- Otherwise, we say \(\epsilon_{i}\) has homoskedasticity.
- This has an important implication on the asymptotic variance.
- Remember the asymptotic variance \[ \underbrace{V}_{(K+1)\times(K+1)} = E[\mathbf{x}_{i}\mathbf{x}_{i}']^{-1}E[\mathbf{x}_{i}\mathbf{x}_{i}'\epsilon_{i}^{2}]E[\mathbf{x}_{i}\mathbf{x}_{i}']^{-1} \] Standard errors based on this formula are called heteroskedasticity-robust standard errors.
- If homoskedasticity holds, then \[ V = E[\mathbf{x}_{i}\mathbf{x}_{i}']^{-1}\sigma^{2} \] where \(\sigma^2 = Var(\epsilon_i)\).
- In many statistical packages (including R and Stata), the standard errors for the OLS estimators are calculated under the homoskedasticity assumption by default.
- However, if the error is heteroskedastic, the standard errors computed under the homoskedasticity assumption are invalid (and often too small).
- In OLS, we should always use heteroskedasticity-robust standard errors.
- We will see how to fix this in R; a brief preview is sketched below.
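- A minimal preview of heteroskedasticity-robust standard errors using the sandwich and lmtest packages (assumes both are installed; HC1 is the small-sample-adjusted variant that matches Stata's robust option; simulated data):

```r
# Robust (sandwich) standard errors for an lm() fit.
set.seed(7)
x <- rnorm(400)
y <- 1 + x + rnorm(400) * (1 + abs(x))   # heteroskedastic errors

fit <- lm(y ~ x)
summary(fit)                                                          # default (homoskedastic) SEs
lmtest::coeftest(fit, vcov = sandwich::vcovHC(fit, type = "HC1"))     # robust SEs
```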