class: center, middle, inverse, title-slide # Instrumental Variable Estimation 2: Implementation in R ### Instructor: Yuta Toyama ### Last updated: 2021-07-02 --- class: title-slide-section, center, middle name: logistics # Introduction --- ## Introduction - I cover three examples of instrumental variable regressions. 1. Wage regression 2. Demand curve 3. Effects of Voter Turnout (Hansford and Gomez) --- class: title-slide-section, center, middle name: logistics # Wage regression --- ## Example 1: Wage regression - Use dataset "Mroz", cross-sectional labor force participation data that accompany "Introductory Econometrics" by Wooldridge. - Original data from *"The Sensitivity of an Empirical Model of Married Women's Hours of Work to Economic and Statistical Assumptions"* by Thomas Mroz published in *Econometrica* in 1987. - Detailed description of data: https://www.rdocumentation.org/packages/npsf/versions/0.4.2/topics/mroz ```r library("foreign") # You do not have to worry about a message "cannot read factor labels from Stata 5 files". data <- read.dta("data/MROZ.DTA") ``` ``` ## Warning in read.dta("data/MROZ.DTA"): cannot read factor labels from Stata 5 files ``` --- .pull-left-left[ - Describe data ```r library(stargazer) stargazer(data, type = "text") ``` ] .pull-right-right[ ``` ## ## =================================================================== ## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max ## ------------------------------------------------------------------- ## inlf 753 0.568 0.496 0 0 1 1 ## hours 753 740.576 871.314 0 0 1,516 4,950 ## kidslt6 753 0.238 0.524 0 0 0 3 ## kidsge6 753 1.353 1.320 0 0 2 8 ## age 753 42.538 8.073 30 36 49 60 ## educ 753 12.287 2.280 5 12 13 17 ## wage 428 4.178 3.310 0.128 2.263 4.971 25.000 ## repwage 753 1.850 2.420 0.000 0.000 3.580 9.980 ## hushrs 753 2,267.271 595.567 175 1,928 2,553 5,010 ## husage 753 45.121 8.059 30 38 52 60 ## huseduc 753 12.491 3.021 3 11 15 17 ## huswage 753 7.482 4.231 0.412 4.788 9.167 40.509 ## faminc 753 23,080.600 12,190.200 1,500 15,428 28,200 96,000 ## mtr 753 0.679 0.083 0.442 0.622 0.721 0.942 ## motheduc 753 9.251 3.367 0 7 12 17 ## fatheduc 753 8.809 3.572 0 7 12 17 ## unem 753 8.624 3.115 3 7.5 11 14 ## city 753 0.643 0.480 0 0 1 1 ## exper 753 10.631 8.069 0 4 15 45 ## nwifeinc 753 20.129 11.635 -0.029 13.025 24.466 96.000 ## lwage 428 1.190 0.723 -2.054 0.817 1.604 3.219 ## expersq 753 178.039 249.631 0 16 225 2,025 ## ------------------------------------------------------------------- ``` ] --- <br/><br/> - Consider the wage regression `$$\log(w_i) = \beta_0 + \beta_1 educ_i + \beta_2 exper_i + \beta_3 exper_i^2 + \epsilon_i$$` - We assume that `\(exper_i\)` is exogenous but `\(educ_i\)` is endogenous. - As an instrument for `\(educ_i\)`, we use the years of schooling for his or her father and mother, which we call `\(fathereduc_i\)` and `\(mothereduc_i\)`. - Discussion on these IVs will be later. --- ```r library("AER") library("dplyr") library("texreg") library("estimatr") # data cleaning data %>% select(lwage, educ, exper, expersq, motheduc, fatheduc) %>% filter( is.na(lwage) == 0 ) -> data result_OLS <- lm_robust( lwage ~ educ + exper + expersq, data = data, se_type = "HC1") # IV regression using fathereduc and mothereduc result_IV <- iv_robust(lwage ~ educ + exper + expersq | fatheduc + motheduc + exper + expersq, data = data, se_type = "HC1") # Show result screenreg(l = list(result_OLS, result_IV),digits = 3, # caption = 'title', # custom.model.names = c("(I)", "(II)", "(III)", "(IV)", "(V)"), custom.coef.names = NULL, # add a class, if you want to change the names of variables. include.ci = F,include.rsquared = FALSE, include.adjrs = TRUE, include.nobs = TRUE, include.pvalues = FALSE, include.df = FALSE, include.rmse = FALSE, custom.header = list("lwage" = 1:2), # you can add header especially to indicate dependent variable stars = numeric(0)) ``` --- ``` ## ## =============================== ## lwage ## ------------------ ## Model 1 Model 2 ## ------------------------------- ## (Intercept) -0.522 0.048 ## (0.202) (0.430) ## educ 0.107 0.061 ## (0.013) (0.033) ## exper 0.042 0.044 ## (0.015) (0.016) ## expersq -0.001 -0.001 ## (0.000) (0.000) ## ------------------------------- ## Adj. R^2 0.151 0.130 ## Num. obs. 428 428 ## =============================== ``` --- - How about the first stage? You should always check this!! ```r # First stage regression result_1st <- lm(educ ~ motheduc + fatheduc + exper + expersq, data = data) # F test linearHypothesis(result_1st, c("fatheduc = 0", "motheduc = 0" ), vcov = vcovHC, type = "HC1") ``` ``` ## Linear hypothesis test ## ## Hypothesis: ## fatheduc = 0 ## motheduc = 0 ## ## Model 1: restricted model ## Model 2: educ ~ motheduc + fatheduc + exper + expersq ## ## Note: Coefficient covariance matrix supplied. ## ## Res.Df Df F Pr(>F) ## 1 425 ## 2 423 2 48.644 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- ## Discussion on IV - Labor economists have used family background variables as IVs for education. - **Relevance**: OK from the first stage regression. - **Independence**: A bit suspicious. Parents' education would be correlated with child's ability through quality of nurturing at an early age. - Still, we can see that these IVs can mitigate (though may not eliminate completely) the omitted variable bias. - Discussion on the validity of instruments is crucial in empirical research. --- class: title-slide-section, center, middle name: logistics # Demand curve --- ## Example 2: Estimation of the Demand for Cigaretts - Demand model is a building block in many branches of Economics. - For example, health economics is concerned with the study of how health-affecting behavior of individuals is influenced by the health-care system and regulation policy. - Smoking is a prominent example as it is related to many illnesses and negative externalities. - It is plausible that cigarette consumption can be reduced by taxing cigarettes more heavily. - Question: how much taxes must be increased to reach a certain reduction in cigarette consumption? -> Need to know **price elasticity of demand** for cigaretts. --- <br/><br/> - Use `CigarrettesSW` in the package `AER`. - a panel data set that contains observations on cigarette consumption and several economic indicators for all 48 continental federal states of the U.S. from 1985 to 1995. - What is **panel data**? The data involves both time series and cross-sectional information. - The variable is denoted as `\(y_{it}\)`, which indexed by individual `\(i\)` and time `\(t\)`. - Cross section data `\(y_i\)`: information for a particular individual `\(i\)` (e.g., income for person `\(i\)`). - Time series data `\(y_t\)`: information for a particular time period (e.g., GDP in year `\(y\)`) - Panel data `\(y_{it}\)`: income of person `\(i\)` in year `\(t\)`. - We will see more on panel data later in this course. For now, we use the panel data as just cross-sectional data (**pooled cross-sections**) --- ```r # load the data set and get an overview data("CigarettesSW") summary(CigarettesSW) ``` ``` ## state year cpi population packs ## AL : 2 1985:48 Min. :1.076 Min. : 478447 Min. : 49.27 ## AR : 2 1995:48 1st Qu.:1.076 1st Qu.: 1622606 1st Qu.: 92.45 ## AZ : 2 Median :1.300 Median : 3697472 Median :110.16 ## CA : 2 Mean :1.300 Mean : 5168866 Mean :109.18 ## CO : 2 3rd Qu.:1.524 3rd Qu.: 5901500 3rd Qu.:123.52 ## CT : 2 Max. :1.524 Max. :31493524 Max. :197.99 ## (Other):84 ## income tax price taxs ## Min. : 6887097 Min. :18.00 Min. : 84.97 Min. : 21.27 ## 1st Qu.: 25520384 1st Qu.:31.00 1st Qu.:102.71 1st Qu.: 34.77 ## Median : 61661644 Median :37.00 Median :137.72 Median : 41.05 ## Mean : 99878736 Mean :42.68 Mean :143.45 Mean : 48.33 ## 3rd Qu.:127313964 3rd Qu.:50.88 3rd Qu.:176.15 3rd Qu.: 59.48 ## Max. :771470144 Max. :99.00 Max. :240.85 Max. :112.63 ## ``` --- <br/> - Consider the following model `$$\log (Q_{it}) = \beta_0 + \beta_1 \log (P_{it}) + \beta_2 \log(income_{it}) + u_{it}$$` where - `\(Q_{it}\)` is the number of packs per capita in state `\(i\)` in year `\(t\)`, - `\(P_{it}\)` is the after-tax average real price per pack of cigarettes, and - `\(income_{it}\)` is the real income per capita. This is demand shifter. - As an IV for the price, we use the followings: - `\(SalesTax_{it}\)`: the proportion of taxes on cigarettes arising from the general sales tax. - Relevant as it is included in the after-tax price - Exogenous(indepndent) since the sales tax does not influence demand directly, but indirectly through the price. - `\(CigTax_{it}\)`: the cigarett-specific taxes --- ```r CigarettesSW %>% mutate( rincome = (income / population) / cpi) %>% mutate( rprice = price / cpi ) %>% mutate( salestax = (taxs - tax) / cpi ) %>% mutate( cigtax = tax/cpi ) -> Cigdata ``` - Let's run the regressions ```r cig_ols <- lm_robust(log(packs) ~ log(rprice) + log(rincome) , data = Cigdata,se_type = "HC1") #coeftest(cig_ols, vcov = vcovHC, type = "HC1") cig_ivreg <- iv_robust(log(packs) ~ log(rprice) + log(rincome) | log(rincome) + salestax + cigtax, data = Cigdata, se_type = "HC1") #coeftest(cig_ivreg, vcov = vcovHC, type = "HC1") # Show result screenreg(l = list(cig_ols, cig_ivreg),digits = 3, # caption = 'title', custom.model.names = c("OLS", "IV"),custom.coef.names = NULL, # add a class, if you want to change the names of variables. include.ci = F,include.rsquared = FALSE, include.adjrs = TRUE, include.nobs = TRUE, include.pvalues = FALSE, include.df = FALSE, include.rmse = FALSE, custom.header = list("log(packs)" = 1:2), # you can add header especially to indicate dependent variable stars = numeric(0) ) ``` --- ``` ## ## ============================== ## log(packs) ## ---------------- ## OLS IV ## ------------------------------ ## (Intercept) 10.067 9.736 ## (0.502) (0.514) ## log(rprice) -1.334 -1.229 ## (0.154) (0.155) ## log(rincome) 0.318 0.257 ## (0.154) (0.153) ## ------------------------------ ## Adj. R^2 0.542 0.539 ## Num. obs. 96 96 ## ============================== ``` --- <br/> - The first stage regression ```r # First stage regression result_1st <- lm(log(rprice) ~ log(rincome) + log(rincome) + salestax + cigtax , data= Cigdata) # F test linearHypothesis(result_1st, c("salestax = 0", "cigtax = 0" ), vcov = vcovHC, type = "HC1") ``` --- <br/><br/> ``` ## Linear hypothesis test ## ## Hypothesis: ## salestax = 0 ## cigtax = 0 ## ## Model 1: restricted model ## Model 2: log(rprice) ~ log(rincome) + log(rincome) + salestax + cigtax ## ## Note: Coefficient covariance matrix supplied. ## ## Res.Df Df F Pr(>F) ## 1 94 ## 2 92 2 127.77 < 2.2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- class: title-slide-section, center, middle name: logistics # Voting --- ## Example 3: Effects of Turnout on Partisan Voting - THOMAS G. HANSFORD and BRAD T. GOMEZ "Estimating the Electoral Effects of Voter Turnout" The American Political Science Review Vol. 104, No. 2 (May 2010), pp. 268-288 - Link: https://www.cambridge.org/core/journals/american-political-science-review/article/estimating-the-electoral-effects-of-voter-turnout/8A880C28E79BE770A5CA1A9BB6CF933C - Here, we will see a simplified version of their analysis. - The dataset is [here](HansfordGomez_Data.csv) --- ```r library(readr) HGdata <- read_csv("data/HansfordGomez_Data.csv") stargazer::stargazer(as.data.frame(HGdata) %>% select(-starts_with("Yr")),type="text") ``` --- ``` ## ## ==================================================================================== ## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max ## ------------------------------------------------------------------------------------ ## Year 27,401 1,973.972 16.111 1,948 1,960 1,988 2,000 ## FIPS_County 27,401 29,985.500 13,081.250 4,001 20,013 39,157 56,045 ## Turnout 27,401 65.562 10.514 20.366 58.477 72.613 100.000 ## Closing2 27,401 23.053 13.042 0 11 30 125 ## Literacy 27,401 0.058 0.234 0 0 0 1 ## PollTax 27,401 0.001 0.023 0 0 0 1 ## Motor 27,401 0.211 0.408 0 0 0 1 ## GubElection 27,401 0.434 0.496 0 0 1 1 ## SenElection 27,401 0.680 0.467 0 0 1 1 ## GOP_Inc 27,401 0.501 0.500 0 0 1 1 ## DNormPrcp_KRIG 27,401 0.005 0.208 -0.419 -0.093 0.001 2.627 ## GOPIT 27,401 33.282 34.066 0 0 66.3 100 ## DemVoteShare2_3MA 27,401 44.250 10.606 10.145 37.006 50.996 88.982 ## DemVoteShare2 27,401 43.622 12.415 6.420 34.954 51.858 97.669 ## RainGOPI 27,401 0.007 0.142 -0 -0.03 0 2 ## TO_DVS23MA 27,401 2,886.877 792.530 473.161 2,321.025 3,384.772 8,526.616 ## Rain_DVS23MA 27,401 0.355 10.188 -25.054 -4.019 0.028 144.257 ## dph 27,401 0.021 0.145 0 0 0 1 ## dvph 27,401 0.018 0.133 0 0 0 1 ## rph 27,401 0.025 0.155 0 0 0 1 ## rvph 27,401 0.025 0.155 0 0 0 1 ## state_del 27,401 0.037 0.187 -0.821 -0.090 0.172 0.619 ## ------------------------------------------------------------------------------------ ``` --- ``` ## ## ====================================================================================== ## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max ## -------------------------------------------------------------------------------------- ## dph_StateVAP 27,401 77,525.150 597,474.000 0 0 0 6,150,988 ## dvph_StateVAP 27,401 63,138.400 663,707.600 0 0 0 12,700,000 ## rph_StateVAP 27,401 243,707.900 1,720,659.000 0 0 0 18,300,000 ## rvph_StateVAP 27,401 142,166.500 1,071,445.000 0 0 0 12,800,000 ## State_DVS_lag 27,401 46.896 8.317 22.035 40.767 52.197 80.872 ## State_DVS_lag2 27,401 2,268.381 786.199 485.533 1,661.934 2,724.515 6,540.244 ## -------------------------------------------------------------------------------------- ``` --- - Data description: | Name | Description| |---|-------------------------------------------------------------------| | Year | Election Year| | FIPS_County | FIPS County Code| | Turnout | Turnout as Pcnt VAP| | Closing2 | Days between registration closing date and election| | Literacy | Literacy Test| | PollTax | Poll Tax| | Motor | Motor Voter| | GubElection | Gubernatorial Election in State| | SenElection | U.S. Senate Election in State| | GOP_Inc | Republican Incumbent| --- | Name | Description| |---|-------------------------------------------------------------------| | Yr52 | 1952 Dummy| | Yr56 | 1956 Dummy| | Yr60 | 1960 Dummy| | Yr64 | 1964 Dummy| | Yr68 | 1968 Dummy| | Yr72 | 1972 Dummy| | Yr76 | 1976 Dummy| | Yr80 | 1980 Dummy| | Yr84 | 1984 Dummy| | Yr88 | 1988 Dummy| | Yr92 | 1992 Dummy| | Yr96 | 1996 Dummy| | Yr2000 | 2000 Dummy| --- <br/><br/> | Name | Description| |---|-------------------------------------------------------------------| | DNormPrcp_KRIG | Election day rainfall - differenced from normal rain for the day| | GOPIT | Turnout x Republican Incumbent| | DemVoteShare2_3MA | Partisan composition measure = 3 election moving avg. of Dem Vote Share| | DemVoteShare2 | Democratic Pres Candidate's Vote Share| | RainGOPI | Rainfall measure x Republican Incumbent| | TO_DVS23MA | Turnout x Partisan Composition measure| | Rain_DVS23MA | Rainfall measure x Partisan composition measure| | dph | =1 if home state of Dem pres candidate| | dvph | =1 if home state of Dem vice pres candidate| --- <br/><br/> | Name | Description| |---|-------------------------------------------------------------------| | rph | =1 if home state of Rep pres candidate| | rvph | =1 if home state of Rep vice pres candidate| | state_del | avg common space score for the House delegation| | dph_StateVAP | = dph*State voting age population| | dvph_StateVAP | = dvph*State voting age population| | rph_StateVAP | = rph*State voting age population| | rvph_StateVAP | = rvph*State voting age population| | State_DVS_lag | State-wide Dem vote share, lagged one election| | State_DVS_lag2 | State_DVS_lag squared| --- - Consider the following regression `$$DemoShare_{it} = \beta_0 + \beta_1 Turnout_{it} + u_t + u_{it}$$` where - `\(DemoShare_{it}\)` : Two-party vote share for Democrat candidate in county `\(i\)` in the presidential election in year `\(t\)` - `\(Turnout_{it}\)` : Turnout rate in county `\(i\)` in the presidential election in year `\(t\)` - `\(u_t\)` : **Year fixed effects**. Time dummies for each presidential election year - As an IV, we use the rainfall measure denoted by `DNormPrcp_KRIG` --- ```r # You can do this, but it is tedious. hg_ols <- lm_robust( DemVoteShare2 ~ Turnout + Yr52 + Yr56 + Yr60 + Yr64 + Yr68 + Yr72 + Yr76 + Yr80 + Yr84 + Yr88 + Yr92 + Yr96 + Yr2000, data = HGdata, se_type="HC1") #coeftest(hg_ols, vcov = vcovHC, type = "HC1") # By using "factor(Year)" as an explanatory variable, the regression automatically incorporates the dummies for each value. hg_ols <- lm_robust( DemVoteShare2 ~ Turnout + factor(Year) , data = HGdata, se_type="HC1") #coeftest(hg_ols, vcov = vcovHC, type = "HC1") # Iv regression hg_ivreg <- iv_robust( DemVoteShare2 ~ Turnout + factor(Year) | factor(Year) + DNormPrcp_KRIG, data = HGdata, se_type="HC1") #coeftest(hg_ivreg, vcov = vcovHC, type = "HC1") # Show result screenreg(l = list(hg_ols, hg_ivreg), digits = 3, # caption = 'title', custom.model.names = c("OLS", "IV"), custom.coef.names = NULL, # add a class, if you want to change the names of variables. include.ci = F, include.rsquared = FALSE, include.adjrs = TRUE, include.nobs = TRUE, include.pvalues = FALSE, include.df = FALSE, include.rmse = FALSE, custom.header = list("DemVoteShare2" = 1:2), # you can add header especially to indicate dependent variable stars = numeric(0) ) ``` --- ``` ## ## =================================== ## DemVoteShare2 ## ---------------------- ## OLS IV ## ----------------------------------- ## (Intercept) 59.085 26.910 ## (0.560) (11.024) ## Turnout -0.157 0.363 ## (0.008) (0.178) ## ----------------------------------- ## Adj. R^2 0.280 0.130 ## Num. obs. 27401 27401 ## =================================== ``` --- ```r # First stage regression hg_1st <- lm(Turnout ~ factor(Year) + DNormPrcp_KRIG, data= HGdata) # F test linearHypothesis(hg_1st, c("DNormPrcp_KRIG = 0" ), vcov = vcovHC, type = "HC1") ``` --- ``` ## Linear hypothesis test ## ## Hypothesis: ## DNormPrcp_KRIG = 0 ## ## Model 1: restricted model ## Model 2: Turnout ~ factor(Year) + DNormPrcp_KRIG ## ## Note: Coefficient covariance matrix supplied. ## ## Res.Df Df F Pr(>F) ## 1 27387 ## 2 27386 1 44.029 3.296e-11 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ```