9 Linear Regression 2: Implementation in R

9.1 Implementation in R

9.1.1 Preliminary: packages

  • We use the following packages:
    • AER :
    • dplyr : data manipulation
    • stargazer : output of regression results
# Install package if you have not done so 
# install.packages("AER")
# install.packages("dplyr")
# install.packages("stargazer")
# install.packages("lmtest")

# load packages
library("AER")
##  要求されたパッケージ car をロード中です
##  要求されたパッケージ carData をロード中です
##  要求されたパッケージ lmtest をロード中です
##  要求されたパッケージ zoo をロード中です
## 
##  次のパッケージを付け加えます: 'zoo'
##  以下のオブジェクトは 'package:base' からマスクされています: 
## 
##      as.Date, as.Date.numeric
##  要求されたパッケージ sandwich をロード中です
##  要求されたパッケージ survival をロード中です
library("dplyr")
## 
##  次のパッケージを付け加えます: 'dplyr'
##  以下のオブジェクトは 'package:car' からマスクされています: 
## 
##      recode
##  以下のオブジェクトは 'package:stats' からマスクされています: 
## 
##      filter, lag
##  以下のオブジェクトは 'package:base' からマスクされています: 
## 
##      intersect, setdiff, setequal, union
library("stargazer")
## 
## Please cite as:
##  Hlavac, Marek (2018). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.2. https://CRAN.R-project.org/package=stargazer
library("lmtest")

9.1.2 Empirical setting: Data from California School

# load the the data set in the workspace
data(CASchools)
  • Use class() function to see CASchools is data.frame object.
class(CASchools)
## [1] "data.frame"
  • We take 2 steps for the analysis.
    • Step 1: Look at data (descriptive analysis)
    • Step 2: Run regression

9.1.3 Step 1: Descriptive analysis

  • It is always important to grasp your data before running regression.
  • head() function give you a first overview of the data.
head(CASchools)
##   district                          school  county grades students
## 1    75119              Sunol Glen Unified Alameda  KK-08      195
## 2    61499            Manzanita Elementary   Butte  KK-08      240
## 3    61549     Thermalito Union Elementary   Butte  KK-08     1550
## 4    61457 Golden Feather Union Elementary   Butte  KK-08      243
## 5    61523        Palermo Union Elementary   Butte  KK-08     1335
## 6    62042         Burrel Union Elementary  Fresno  KK-08      137
##   teachers calworks   lunch computer expenditure    income   english  read
## 1    10.90   0.5102  2.0408       67    6384.911 22.690001  0.000000 691.6
## 2    11.15  15.4167 47.9167      101    5099.381  9.824000  4.583333 660.5
## 3    82.90  55.0323 76.3226      169    5501.955  8.978000 30.000002 636.3
## 4    14.00  36.4754 77.0492       85    7101.831  8.978000  0.000000 651.9
## 5    71.50  33.1086 78.4270      171    5235.988  9.080333 13.857677 641.8
## 6     6.40  12.3188 86.9565       25    5580.147 10.415000 12.408759 605.7
##    math
## 1 690.0
## 2 661.9
## 3 650.9
## 4 643.5
## 5 639.9
## 6 605.4
  • Alternatively, you can use browse() to see the entire dataset in browser window.

9.1.3.1 Create variables

  • Create several variables that are needed for the analysis.
  • We use dplyr for this purpose.
CASchools %>% 
  mutate( STR = students / teachers ) %>% 
  mutate( score = (read + math) / 2 ) -> CASchools 

9.1.3.2 Descriptive statistics

  • There are several ways to show descriptive statistics
  • The standard one is to use summary() function
summary(CASchools)
##    district            school                  county      grades   
##  Length:420         Length:420         Sonoma     : 29   KK-06: 61  
##  Class :character   Class :character   Kern       : 27   KK-08:359  
##  Mode  :character   Mode  :character   Los Angeles: 27              
##                                        Tulare     : 24              
##                                        San Diego  : 21              
##                                        Santa Clara: 20              
##                                        (Other)    :272              
##     students          teachers          calworks          lunch       
##  Min.   :   81.0   Min.   :   4.85   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.:  379.0   1st Qu.:  19.66   1st Qu.: 4.395   1st Qu.: 23.28  
##  Median :  950.5   Median :  48.56   Median :10.520   Median : 41.75  
##  Mean   : 2628.8   Mean   : 129.07   Mean   :13.246   Mean   : 44.71  
##  3rd Qu.: 3008.0   3rd Qu.: 146.35   3rd Qu.:18.981   3rd Qu.: 66.86  
##  Max.   :27176.0   Max.   :1429.00   Max.   :78.994   Max.   :100.00  
##                                                                       
##     computer       expenditure       income          english      
##  Min.   :   0.0   Min.   :3926   Min.   : 5.335   Min.   : 0.000  
##  1st Qu.:  46.0   1st Qu.:4906   1st Qu.:10.639   1st Qu.: 1.941  
##  Median : 117.5   Median :5215   Median :13.728   Median : 8.778  
##  Mean   : 303.4   Mean   :5312   Mean   :15.317   Mean   :15.768  
##  3rd Qu.: 375.2   3rd Qu.:5601   3rd Qu.:17.629   3rd Qu.:22.970  
##  Max.   :3324.0   Max.   :7712   Max.   :55.328   Max.   :85.540  
##                                                                   
##       read            math            STR            score      
##  Min.   :604.5   Min.   :605.4   Min.   :14.00   Min.   :605.5  
##  1st Qu.:640.4   1st Qu.:639.4   1st Qu.:18.58   1st Qu.:640.0  
##  Median :655.8   Median :652.5   Median :19.72   Median :654.5  
##  Mean   :655.0   Mean   :653.3   Mean   :19.64   Mean   :654.2  
##  3rd Qu.:668.7   3rd Qu.:665.9   3rd Qu.:20.87   3rd Qu.:666.7  
##  Max.   :704.0   Max.   :709.5   Max.   :25.80   Max.   :706.8  
## 
  • This returns the desriptive statistics for all the variables in dataframe.
  • You can combine this with dplyr::select
CASchools %>% 
  select(STR, score) %>%
  summary()
##       STR            score      
##  Min.   :14.00   Min.   :605.5  
##  1st Qu.:18.58   1st Qu.:640.0  
##  Median :19.72   Median :654.5  
##  Mean   :19.64   Mean   :654.2  
##  3rd Qu.:20.87   3rd Qu.:666.7  
##  Max.   :25.80   Max.   :706.8
  • You can do a bit lengthly thing manually like this.
# compute sample averages of STR and score
avg_STR <- mean(CASchools$STR) 
avg_score <- mean(CASchools$score)

# compute sample standard deviations of STR and score
sd_STR <- sd(CASchools$STR) 
sd_score <- sd(CASchools$score)

# set up a vector of percentiles and compute the quantiles 
quantiles <- c(0.10, 0.25, 0.4, 0.5, 0.6, 0.75, 0.9)
quant_STR <- quantile(CASchools$STR, quantiles)
quant_score <- quantile(CASchools$score, quantiles)

# gather everything in a data.frame 
DistributionSummary <- data.frame(Average = c(avg_STR, avg_score), 
                                  StandardDeviation = c(sd_STR, sd_score), 
                                  quantile = rbind(quant_STR, quant_score))

# print the summary to the console
DistributionSummary
##               Average StandardDeviation quantile.10. quantile.25.
## quant_STR    19.64043          1.891812      17.3486     18.58236
## quant_score 654.15655         19.053347     630.3950    640.05000
##             quantile.40. quantile.50. quantile.60. quantile.75.
## quant_STR       19.26618     19.72321      20.0783     20.87181
## quant_score    649.06999    654.45000     659.4000    666.66249
##             quantile.90.
## quant_STR       21.86741
## quant_score    678.85999
  • My personal favorite is to use stargazer function.
stargazer(CASchools, type = "text")
## 
## ===========================================================================
## Statistic    N    Mean    St. Dev.     Min    Pctl(25)  Pctl(75)     Max   
## ---------------------------------------------------------------------------
## students    420 2,628.793 3,913.105    81        379      3,008    27,176  
## teachers    420  129.067   187.913    4.850    19.662    146.350  1,429.000
## calworks    420  13.246    11.455     0.000     4.395    18.981    78.994  
## lunch       420  44.705    27.123     0.000    23.282    66.865    100.000 
## computer    420  303.383   441.341      0        46       375.2     3,324  
## expenditure 420 5,312.408  633.937  3,926.070 4,906.180 5,601.401 7,711.507
## income      420  15.317     7.226     5.335    10.639    17.629    55.328  
## english     420  15.768    18.286       0        1.9      23.0       86    
## read        420  654.970   20.108    604.500   640.400   668.725   704.000 
## math        420  653.343   18.754      605      639.4     665.8      710   
## STR         420  19.640     1.892    14.000    18.582    20.872    25.800  
## score       420  654.157   19.053    605.550   640.050   666.662   706.750 
## ---------------------------------------------------------------------------
  • You can choose summary statistics you want to report.
CASchools %>% 
  stargazer( type = "text", summary.stat = c("n", "p75", "sd") ) 
## 
## ===================================
## Statistic    N  Pctl(75)  St. Dev. 
## -----------------------------------
## students    420   3,008   3,913.105
## teachers    420  146.350   187.913 
## calworks    420  18.981    11.455  
## lunch       420  66.865    27.123  
## computer    420   375.2    441.341 
## expenditure 420 5,601.401  633.937 
## income      420  17.629     7.226  
## english     420   23.0     18.286  
## read        420  668.725   20.108  
## math        420   665.8    18.754  
## STR         420  20.872     1.892  
## score       420  666.662   19.053  
## -----------------------------------

9.1.3.3 Scatter plot

  • Let’s see how test score and student-teacher-ratio is correlated.
plot(score ~ STR, 
     data = CASchools,
     main = "Scatterplot of TestScore and STR", 
     xlab = "STR (X)",
     ylab = "Test Score (Y)")

  • Use cor() to compute the correlation between two numeric vectors.
cor(CASchools$STR, CASchools$score)
## [1] -0.2263627

9.1.4 Step 2: Run regression

9.1.4.1 Simple linear regression

  • We use lm() function to run linear regression
  • First, consider the simple linear regression \[ score_i = \beta_0 + \beta_1 size_i + \epsilon_i \] where \(size_i\) is the class size (student-teacher-ratio).
    • From now on we call student-teacher-ratio (STR) class size.
  • To run this regression, we use lm
# First, we rename the variable `STR`
CASchools %>% 
  dplyr::rename( size = STR) -> CASchools

# Run regression and save results in the varaiable `model1_summary`
model1_summary <- lm( score ~ size, data = CASchools)

# See the results
summary(model1_summary)
## 
## Call:
## lm(formula = score ~ size, data = CASchools)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 698.9329     9.4675  73.825 < 0.0000000000000002 ***
## size         -2.2798     0.4798  -4.751           0.00000278 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124,    Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 0.000002783
  • Interpretations
    • An increase of one student per teacher leads to 2.2 point decrease in test scores.
    • p value is very small. The effect of the class size on test score is significant. Note: Be careful. These standard errors are NOT heteroskedasiticity robust. We will come back to this point soon.
    • \(R^2 = 0.051\), implying that 5.1% of the variance of the dependent variable is explained by the model.
  • You can add more variable in the regression (will see this soon)

9.1.4.2 Correction of Robust standard error

  • We use vcovHC() function, a partof the package sandwich, to obtain the robust standard errors.
    • The package sandwich is automatically loaded if you load AER package.
# compute heteroskedasticity-robust standard errors
vcov <- vcovHC(model1_summary, type = "HC1")

# get standard error: the square root of the diagonal element in vcov
robust_se <- sqrt(diag(vcov))
robust_se
## (Intercept)        size 
##  10.3643617   0.5194893
  • Notice that robust standard errors are larger than the one we obtained from lm!
  • How to combine the robust standard errors with the original summary? Use coeftest() from the package lmtest
# load `lmtest`
library(lmtest)

# Combine robust standard errors
coeftest(model1_summary, vcov. = vcov)
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value              Pr(>|t|)    
## (Intercept) 698.93295   10.36436 67.4362 < 0.00000000000000022 ***
## size         -2.27981    0.51949 -4.3886            0.00001447 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

9.1.4.3 Report by Stargazer

  • stargazer is useful to show the regression result.
# load
library(stargazer)

# Create output by stargazer
stargazer::stargazer(model1_summary, type ="text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                score           
## -----------------------------------------------
## size                         -2.280***         
##                               (0.480)          
##                                                
## Constant                    698.933***         
##                               (9.467)          
##                                                
## -----------------------------------------------
## Observations                    420            
## R2                             0.051           
## Adjusted R2                    0.049           
## Residual Std. Error      18.581 (df = 418)     
## F Statistic           22.575*** (df = 1; 418)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
  • Use robust standard errors in stargazer output
# Prepare robust standard errors in list
rob_se <- list( sqrt(diag(vcovHC(model1_summary, type = "HC1") ) ) ) 

# generate regression table. 

stargazer( model1_summary, 
           se = rob_se, 
           type = "text")
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                score           
## -----------------------------------------------
## size                         -2.280***         
##                               (0.519)          
##                                                
## Constant                    698.933***         
##                              (10.364)          
##                                                
## -----------------------------------------------
## Observations                    420            
## R2                             0.051           
## Adjusted R2                    0.049           
## Residual Std. Error      18.581 (df = 418)     
## F Statistic           22.575*** (df = 1; 418)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

9.1.4.4 Full results

Taken from https://www.econometrics-with-r.org/7-6-analysis-of-the-test-score-data-set.html

# load the stargazer library

# estimate different model specifications
spec1 <- lm(score ~ size, data = CASchools)
spec2 <- lm(score ~ size + english, data = CASchools)
spec3 <- lm(score ~ size + english + lunch, data = CASchools)
spec4 <- lm(score ~ size + english + calworks, data = CASchools)
spec5 <- lm(score ~ size + english + lunch + calworks, data = CASchools)

# gather robust standard errors in a listh
rob_se <- list(sqrt(diag(vcovHC(spec1, type = "HC1"))),
               sqrt(diag(vcovHC(spec2, type = "HC1"))),
               sqrt(diag(vcovHC(spec3, type = "HC1"))),
               sqrt(diag(vcovHC(spec4, type = "HC1"))),
               sqrt(diag(vcovHC(spec5, type = "HC1"))))

# generate a LaTeX table using stargazer
stargazer(spec1, spec2, spec3, spec4, spec5,
          se = rob_se,
          digits = 3,
          header = F,
          column.labels = c("(I)", "(II)", "(III)", "(IV)", "(V)"),
          type ="text", 
          keep.stat = c("N", "adj.rsq"))
## 
## ===================================================================
##                               Dependent variable:                  
##              ------------------------------------------------------
##                                      score                         
##                 (I)        (II)      (III)       (IV)       (V)    
##                 (1)        (2)        (3)        (4)        (5)    
## -------------------------------------------------------------------
## size         -2.280***   -1.101**  -0.998***  -1.308***  -1.014*** 
##               (0.519)    (0.433)    (0.270)    (0.339)    (0.269)  
##                                                                    
## english                 -0.650***  -0.122***  -0.488***  -0.130*** 
##                          (0.031)    (0.033)    (0.030)    (0.036)  
##                                                                    
## lunch                              -0.547***             -0.529*** 
##                                     (0.024)               (0.038)  
##                                                                    
## calworks                                      -0.790***    -0.048  
##                                                (0.068)    (0.059)  
##                                                                    
## Constant     698.933*** 686.032*** 700.150*** 697.999*** 700.392***
##               (10.364)   (8.728)    (5.568)    (6.920)    (5.537)  
##                                                                    
## -------------------------------------------------------------------
## Observations    420        420        420        420        420    
## Adjusted R2    0.049      0.424      0.773      0.626      0.773   
## ===================================================================
## Note:                                   *p<0.1; **p<0.05; ***p<0.01
  • The coefficient on the class size decreases as we add more explantory variables. Can you explain why? (Hint: omitted variable bias)