Regression 2: Implementation in R

class: center, middle, inverse, title-slide

# Regression 2: Implementation in R
### Instructor: Yuta Toyama
### Last updated: 2021-06-22

---

class: title-slide-section, center, middle
name: logistics
# Introduction

---
## Acknowledgement
This note is based on  "Introduction to Econometrics with R". https://www.econometrics-with-r.org/index.html

---
## Preliminary: packages

- We use the following packages:
    - `AER` : 
    - `dplyr` : data manipulation
    - `stargazer` : output of regression results

```r
# Install package if you have not done so 
# install.packages("AER")
# install.packages("dplyr")
# install.packages("stargazer")
# install.packages("texreg")
# install.packages("estimatr")

# load packages
library("AER")
library("dplyr")
library("stargazer")
library("texreg")
library("estimatr")
```

---
## Empirical setting: Data from California School

- Question: How does the student-teacher ratio affects test scores?
- We use data from California school, which is included in `AER` package. 
    - See here for the details: https://www.rdocumentation.org/packages/AER/versions/1.2-6/topics/CASchools

```r
# load the the data set in the workspace
data(CASchools)
```

---

- Use `class()` function to see `CASchools` is `data.frame` object.

```r
class(CASchools)
```

```
## [1] "data.frame"
```

- We take 2 steps for the analysis.
    - Step 1: Look at data (descriptive analysis) 
    - Step 2: Run regression

---
class: title-slide-section, center, middle
name: logistics
# Step 1: Descriptive analysis
---
### Descriptive analysis

- It is always important to grasp your data before running regression. 
- `head()` function give you a first overview of the data.

```r
head(CASchools)
```

```
##   district                          school  county grades students teachers calworks   lunch computer expenditure    income   english  read  math
## 1    75119              Sunol Glen Unified Alameda  KK-08      195    10.90   0.5102  2.0408       67    6384.911 22.690001  0.000000 691.6 690.0
## 2    61499            Manzanita Elementary   Butte  KK-08      240    11.15  15.4167 47.9167      101    5099.381  9.824000  4.583333 660.5 661.9
## 3    61549     Thermalito Union Elementary   Butte  KK-08     1550    82.90  55.0323 76.3226      169    5501.955  8.978000 30.000002 636.3 650.9
## 4    61457 Golden Feather Union Elementary   Butte  KK-08      243    14.00  36.4754 77.0492       85    7101.831  8.978000  0.000000 651.9 643.5
## 5    61523        Palermo Union Elementary   Butte  KK-08     1335    71.50  33.1086 78.4270      171    5235.988  9.080333 13.857677 641.8 639.9
## 6    62042         Burrel Union Elementary  Fresno  KK-08      137     6.40  12.3188 86.9565       25    5580.147 10.415000 12.408759 605.7 605.4
```

- Alternatively, you can use `View()` to see the entire dataset in browser window.

---
## Create variables

- Create several variables that are needed for the analysis.
- We use `dplyr` for this purpose.

```r
CASchools %>% 
  mutate( STR = students / teachers ) %>% 
  mutate( score = (read + math) / 2 ) -> CASchools 
```

---
## Descriptive statistics

- There are several ways to show descriptive statistics
- The standard one is to use `summary()` function

```r
summary(CASchools)
```

---

```
##    district            school                  county      grades       students          teachers          calworks          lunch           computer     
##  Length:420         Length:420         Sonoma     : 29   KK-06: 61   Min.   :   81.0   Min.   :   4.85   Min.   : 0.000   Min.   :  0.00   Min.   :   0.0  
##  Class :character   Class :character   Kern       : 27   KK-08:359   1st Qu.:  379.0   1st Qu.:  19.66   1st Qu.: 4.395   1st Qu.: 23.28   1st Qu.:  46.0  
##  Mode  :character   Mode  :character   Los Angeles: 27               Median :  950.5   Median :  48.56   Median :10.520   Median : 41.75   Median : 117.5  
##                                        Tulare     : 24               Mean   : 2628.8   Mean   : 129.07   Mean   :13.246   Mean   : 44.71   Mean   : 303.4  
##                                        San Diego  : 21               3rd Qu.: 3008.0   3rd Qu.: 146.35   3rd Qu.:18.981   3rd Qu.: 66.86   3rd Qu.: 375.2  
##                                        Santa Clara: 20               Max.   :27176.0   Max.   :1429.00   Max.   :78.994   Max.   :100.00   Max.   :3324.0  
##                                        (Other)    :272                                                                                                     
##   expenditure       income          english            read            math            STR            score      
##  Min.   :3926   Min.   : 5.335   Min.   : 0.000   Min.   :604.5   Min.   :605.4   Min.   :14.00   Min.   :605.5  
##  1st Qu.:4906   1st Qu.:10.639   1st Qu.: 1.941   1st Qu.:640.4   1st Qu.:639.4   1st Qu.:18.58   1st Qu.:640.0  
##  Median :5215   Median :13.728   Median : 8.778   Median :655.8   Median :652.5   Median :19.72   Median :654.5  
##  Mean   :5312   Mean   :15.317   Mean   :15.768   Mean   :655.0   Mean   :653.3   Mean   :19.64   Mean   :654.2  
##  3rd Qu.:5601   3rd Qu.:17.629   3rd Qu.:22.970   3rd Qu.:668.7   3rd Qu.:665.9   3rd Qu.:20.87   3rd Qu.:666.7  
##  Max.   :7712   Max.   :55.328   Max.   :85.540   Max.   :704.0   Max.   :709.5   Max.   :25.80   Max.   :706.8  
## 
```

---
- This returns the desriptive statistics for all the variables in dataframe.
- You can combine this with `dplyr::select`

```r
CASchools %>% 
  select(STR, score) %>%
  summary()
```

```
##       STR            score      
##  Min.   :14.00   Min.   :605.5  
##  1st Qu.:18.58   1st Qu.:640.0  
##  Median :19.72   Median :654.5  
##  Mean   :19.64   Mean   :654.2  
##  3rd Qu.:20.87   3rd Qu.:666.7  
##  Max.   :25.80   Max.   :706.8
```

---
- You can do a bit lengthly thing manually like this.

```r
# compute sample averages of STR and score
avg_STR <- mean(CASchools$STR) 
avg_score <- mean(CASchools$score)

# compute sample standard deviations of STR and score
sd_STR <- sd(CASchools$STR) 
sd_score <- sd(CASchools$score)

# set up a vector of percentiles and compute the quantiles 
quantiles <- c(0.10, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9)
quant_STR <- quantile(CASchools$STR, quantiles)
quant_score <- quantile(CASchools$score, quantiles)

# gather everything in a data.frame 
DistributionSummary <- data.frame(Average = c(avg_STR, avg_score), 
                                  StandardDeviation = c(sd_STR, sd_score), 
                                  quantile = rbind(quant_STR, quant_score))
```

---

```r
DistributionSummary
```

```
##               Average StandardDeviation quantile.10. quantile.25. quantile.40. quantile.50. quantile.60. quantile.75. quantile.90.
## quant_STR    19.64043          1.891812      17.3486     18.58236     19.26618     19.72321      20.0783     20.87181     21.86741
## quant_score 654.15655         19.053347     630.3950    640.05000    649.06999    654.45000     659.4000    666.66249    678.85999
```
---
- My personal favorite is to use `stargazer` function.

```r
stargazer(CASchools, type = "text")
```

```
## 
## ===========================================================================
## Statistic    N    Mean    St. Dev.     Min    Pctl(25)  Pctl(75)     Max   
## ---------------------------------------------------------------------------
## students    420 2,628.793 3,913.105    81        379      3,008    27,176  
## teachers    420  129.067   187.913    4.850    19.662    146.350  1,429.000
## calworks    420  13.246    11.455     0.000     4.395    18.981    78.994  
## lunch       420  44.705    27.123     0.000    23.282    66.865    100.000 
## computer    420  303.383   441.341      0        46       375.2     3,324  
## expenditure 420 5,312.408  633.937  3,926.070 4,906.180 5,601.401 7,711.507
## income      420  15.317     7.226     5.335    10.639    17.629    55.328  
## english     420  15.768    18.286       0        1.9      23.0       86    
## read        420  654.970   20.108    604.500   640.400   668.725   704.000 
## math        420  653.343   18.754      605      639.4     665.8      710   
## STR         420  19.640     1.892    14.000    18.582    20.872    25.800  
## score       420  654.157   19.053    605.550   640.050   666.662   706.750 
## ---------------------------------------------------------------------------
```

---
- You can choose summary statistics you want to report.

```r
CASchools %>% 
  stargazer( type = "text", summary.stat = c("n", "p75", "sd") ) 
```

```
## 
## ===================================
## Statistic    N  Pctl(75)  St. Dev. 
## -----------------------------------
## students    420   3,008   3,913.105
## teachers    420  146.350   187.913 
## calworks    420  18.981    11.455  
## lunch       420  66.865    27.123  
## computer    420   375.2    441.341 
## expenditure 420 5,601.401  633.937 
## income      420  17.629     7.226  
## english     420   23.0     18.286  
## read        420  668.725   20.108  
## math        420   665.8    18.754  
## STR         420  20.872     1.892  
## score       420  666.662   19.053  
## -----------------------------------
```
---
<br/><br/>
- See https://www.jakeruss.com/cheatsheets/stargazer/#the-default-summary-statistics-table for the details.
- `stargazer` can be also used to report regression results.
- But, we will use `texreg` instead.

---
## Scatter plot

- Let's see how test score and student-teacher-ratio is correlated.

```r
plot(score ~ STR, 
     data = CASchools,
     main = "Scatterplot of TestScore and STR", 
     xlab = "STR (X)",
     ylab = "Test Score (Y)")
```

---
.center[
![](Regression2_Implementation_files/figure-html/unnamed-chunk-14-1.png)
]

---
<br/><br/>
- Use `cor()` to compute the correlation between two numeric vectors.

```r
cor(CASchools$STR, CASchools$score)
```

```
## [1] -0.2263627
```
---
class: title-slide-section, center, middle
name: logistics
# Step 2: Run regression
---
## Simple linear regression

- We use `lm()` function to run linear regression

- First, consider the simple linear regression 
`$$score_i = \beta_0 + \beta_1 size_i + \epsilon_i$$`
where `$size_i$` is the class size (student-teacher-ratio).
    - From now on we call student-teacher-ratio (STR) class size.
    
---
- To run this regression, we use `lm`

```r
# First, we rename the variable `STR`
CASchools %>% 
  dplyr::rename( size = STR) -> CASchools

# Run regression and save results in the varaiable `model1_summary`
model1_summary <- lm( score ~ size, data = CASchools)

# See the results
summary(model1_summary)
```

```
## 
## Call:
## lm(formula = score ~ size, data = CASchools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9329     9.4675  73.825  < 2e-16 ***
## size         -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124,	Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
```

---
- Interpretations
    - An increase of one student per teacher leads to 2.2 point decrease in test scores. 
    - p value is very small. The effect of the class size on test score is significant.
        - Note: Be careful. These standard errors are NOT heteroskedasiticity robust. We will come back to this point soon.
    - `$R^2 = 0.051$`, implying that 5.1% of the variance of the dependent variable is explained by the model.
    
- You can add more variable in the regression (will see this soon)

---
## Robust standard error with `lm_robust`

- We use `lm_robust()` in `estimatr` package to run regression with robust standard error.

```r
model1_robust <- lm_robust( score ~ size, data = CASchools, se_type = "HC1")

summary(model1_robust)
```

```
## 
## Call:
## lm_robust(formula = score ~ size, data = CASchools, se_type = "HC1")
## 
## Standard error type:  HC1 
## 
## Coefficients:
##             Estimate Std. Error t value   Pr(>|t|) CI Lower CI Upper  DF
## (Intercept)   698.93    10.3644  67.436 9.487e-227  678.560  719.306 418
## size           -2.28     0.5195  -4.389  1.447e-05   -3.301   -1.259 418
## 
## Multiple R-squared:  0.05124 ,	Adjusted R-squared:  0.04897 
## F-statistic: 19.26 on 1 and 418 DF,  p-value: 1.447e-05
```

- Notice that robust standard errors are larger than the one we obtained from `lm`!

---
## Report by texreg

- `texreg` is useful to show the regression result. 
    - `screenreg` function shows the table on R markdown.
    - You can use `htmlreg` (`texreg`) to get html (latex) format.
- `stargazer` is also used to show regression results, however it does not follow `lm_robust`.

```r
# Create output by `screenreg` function.
screenreg(l=list(model1_summary, model1_robust),
          digits = 3,
          # caption = 'title',
          custom.model.names = c("model1", "model1 Robust"),
          custom.coef.names = NULL,  # add a class, if you want to change the names of variables.
          include.ci = F,
          include.rsquared = FALSE, include.adjrs = TRUE, include.nobs = TRUE, 
          include.pvalues = FALSE, include.df = FALSE, include.rmse = FALSE,
          custom.header = list("score" = 1:2), # you can add header especially to indicate dependent variable 
          stars = numeric(0) # to delete star expression
          )
```

---

```
## 
## ====================================
##                       score         
##              -----------------------
##              model1    model1 Robust
## ------------------------------------
## (Intercept)  698.933   698.933      
##               (9.467)  (10.364)     
## size          -2.280    -2.280      
##               (0.480)   (0.519)     
## ------------------------------------
## Adj. R^2       0.049     0.049      
## Num. obs.    420       420          
## ====================================
```

---
## Full results

Taken from https://www.econometrics-with-r.org/7-6-analysis-of-the-test-score-data-set.html

```r
# estimate different model specifications
spec1 <- lm_robust(score ~ size, data = CASchools, se_type = "HC1")
spec2 <- lm_robust(score ~ size + english, data = CASchools, se_type = "HC1")
spec3 <- lm_robust(score ~ size + english + lunch, data = CASchools, se_type = "HC1")
spec4 <- lm_robust(score ~ size + english + calworks, data = CASchools, se_type = "HC1")
spec5 <- lm_robust(score ~ size + english + lunch + calworks, data = CASchools, se_type = "HC1")

# generate a table using texreg
screenreg(l = list(spec1, spec2, spec3, spec4, spec5),
          digits = 3,
          # caption = 'title',
          custom.model.names = c("(I)", "(II)", "(III)", "(IV)", "(V)"),
          custom.coef.names = NULL,  # add a class, if you want to change the names of variables.
          include.ci = F,
          include.rsquared = FALSE, include.adjrs = TRUE, include.nobs = TRUE, 
          include.pvalues = FALSE, include.df = FALSE, include.rmse = FALSE,
          custom.header = list("score" = 1:2), # you can add header especially to indicate dependent variable 
          stars = numeric(0) # to delete star expression
          )
```
---

```
## 
## =============================================================
##                     score      
##              ------------------
##              (I)       (II)      (III)     (IV)      (V)     
## -------------------------------------------------------------
## (Intercept)  698.933   686.032   700.150   697.999   700.392 
##              (10.364)   (8.728)   (5.568)   (6.920)   (5.537)
## size          -2.280    -1.101    -0.998    -1.308    -1.014 
##               (0.519)   (0.433)   (0.270)   (0.339)   (0.269)
## english                 -0.650    -0.122    -0.488    -0.130 
##                         (0.031)   (0.033)   (0.030)   (0.036)
## lunch                             -0.547              -0.529 
##                                   (0.024)             (0.038)
## calworks                                    -0.790    -0.048 
##                                             (0.068)   (0.059)
## -------------------------------------------------------------
## Adj. R^2       0.049     0.424     0.773     0.626     0.773 
## Num. obs.    420       420       420       420       420     
## =============================================================
```
- The coefficient on the class size decreases as we add more explantory variables. Can you explain why? (Hint: omitted variable bias)