11 Exercise 2 (Problem Set 3)

Due date: June 4th (Tue) 11pm.

11.1 Rules

If you are enrolled in the Japanese class (i.e., Wednesday 2nd), you may write your answers in either Japanese or English.
Submit your solution through CourseN@vi.
Important: Submission format
If you use Rmarkdown, please compile your Rmarkdown file into either an HTML or a PDF file and submit both the compiled file and the Rmarkdown file.
If you do not use Rmarkdown, please submit the document file that contains your answers and the R script file (.R file) separately; that is, submit two files.

11.2 Question 1: Omitted Variable Bias

The goal of this question is to investigate omitted variable bias through Monte Carlo simulations. Consider the following model:

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i \]

You will compare the sampling distribution of the OLS estimates of \(\beta_1\) with and without \(x_2\) included in the regression. Here is the suggested procedure for this exercise.

1. Set the data generating process.
   - Set the parameters \(\beta_0 = 1, \beta_1 = 2, \beta_2 = 1\).
   - The explanatory variables \((x_1, x_2)\) are drawn i.i.d. from the multivariate normal distribution \[ \left(\begin{array}{c} x_{1}\\ x_{2} \end{array}\right)\sim N\left(\left(\begin{array}{c} 3\\ 4 \end{array}\right),\left(\begin{array}{cc} 2 & 1\\ 1 & 2 \end{array}\right)\right) \]
   - The error term \(\epsilon_i\) is drawn i.i.d. from \(N(0, 1)\).
2. Draw the dataset \(\{y_i, x_{i1}, x_{i2} \}_{i=1}^N\) with \(N = 200\).
   - To draw random numbers from the joint normal distribution, use the mvrnorm function from the MASS package.
3. Using the drawn dataset, regress \(y\) on \(x_1\) and \(x_2\) with a constant term. Obtain the OLS estimate of \(\beta_1\). Let’s call this \(\hat\beta_1^{long}\).
4. Regress \(y\) on \(x_1\) with a constant term, omitting \(x_2\), and obtain the OLS estimate of \(\beta_1\). Let’s call this \(\hat\beta_1^{short}\).
5. Repeat steps 2 to 4 \(500\) times and obtain \(\hat\beta_1^{long}\) and \(\hat\beta_1^{short}\) for each drawn sample.
6. Plot the distribution of \(\hat\beta_1^{long}\) and \(\hat\beta_1^{short}\) across samples.
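
The simulation part of the procedure above (everything except the final plot) could be sketched roughly as follows. This is only one possible layout, not a required solution: the seed, the helper name `one_draw`, and the object names `beta`, `mu`, `Sigma`, `R`, and `est` are all illustrative assumptions.

```r
library(MASS)  # provides mvrnorm

set.seed(123)  # hypothetical seed, used only to make the sketch reproducible

# Parameters of the data generating process
beta  <- c(1, 2, 1)                       # beta_0, beta_1, beta_2
mu    <- c(3, 4)                          # mean of (x1, x2)
Sigma <- matrix(c(2, 1, 1, 2), nrow = 2)  # covariance matrix of (x1, x2)
N     <- 200                              # sample size
R     <- 500                              # number of Monte Carlo replications

# Draw one sample and return both OLS estimates of beta_1
one_draw <- function() {
  X   <- mvrnorm(N, mu = mu, Sigma = Sigma)
  x1  <- X[, 1]
  x2  <- X[, 2]
  eps <- rnorm(N, mean = 0, sd = 1)
  y   <- beta[1] + beta[2] * x1 + beta[3] * x2 + eps
  c(long  = unname(coef(lm(y ~ x1 + x2))["x1"]),  # x2 included
    short = unname(coef(lm(y ~ x1))["x1"]))       # x2 omitted
}

# Repeat the draw R times; est is a 2 x R matrix with rows "long" and "short"
est <- replicate(R, one_draw())
```

Each row of `est` then holds the estimates across samples, so `est["long", ]` and `est["short", ]` can be passed to whatever plotting function you prefer for the last step.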

Please answer the following questions using your simulation results.

1. Show the sampling distribution for \(\hat\beta_1^{long}\) and \(\hat\beta_1^{short}\).
2. Are these estimates biased? If biased, is the magnitude of bias consistent with theory?
3. We set \(Cov(x_1, x_2)=1\) above. Repeat the same simulation with \(Cov(x_1, x_2)=0\). How does the result change?
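
As a reminder for the comparison with theory, the standard textbook omitted variable bias result for this model (stated here as a general fact, not as part of the original question) is that, when \(\epsilon_i\) is uncorrelated with \((x_{i1}, x_{i2})\), the short-regression estimate converges to

\[ \hat\beta_1^{short} \overset{p}{\to} \beta_1 + \beta_2 \frac{Cov(x_1, x_2)}{Var(x_1)} \]

The second term is the asymptotic bias. It vanishes when \(\beta_2 = 0\) or \(Cov(x_1, x_2) = 0\).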

11.3 Question 2: Empirical Analysis using Data from Washington (2008, AER)

Acknowledgement: This exercise is based on the material from Econ 281 “Introductory Applied Econometrics” in Winter 2017 taught by Daley Kutzman at Northwestern University

This exercise uses the data from Ebonya Washington’s paper, “Female Socialization: How Daughters Affect Their Legislator Fathers’ Voting on Women’s Issues,” published in American Economic Review in 2008. This paper studies whether having a daughter affects a legislator’s voting on women’s issues.

11.3.1 Preliminary: data cleaning

The file “data_PS3_basic.dta” is available at the journal website. This file is in Stata format, so you can read it with the read.dta function included in the foreign package.

# Example: 
library(foreign)
mydata <- read.dta("c:/mydata.dta")

The original dataset contains data from the 105th to the 108th U.S. Congress. We only use the observations from the 105th Congress. The variable congress indicates which Congress each observation belongs to. Use the filter function in dplyr to extract the observations from the 105th Congress.

The dataset contains many variables, some of which are not used in this exercise. Keep the following variables in the final dataset (Hint: use the select function in dplyr).

Name	Description
aauw	AAUW score
totchi	Total number of children
ngirls	Number of daughters
party	Political party. Democrats if 1, Republicans if 2, and Independent if 3.
famale	Female dummy variable
white	White dummy variable
srvlng	Years of service
age	Age
demvote	State democratic vote share in most recent presidential election
medinc	District median income
perf	Female proportion of district voting age population
perw	White proportion of total district population
perhs	High school graduate proportion of district population age 25
percol	College graduate proportion of district population age 25
perur	Urban proportion of total district population
moredef	State proportion who favor more defense spending
stateabb	State abbreviation
district	id for electoral district
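
The two cleaning steps above (keeping the 105th Congress and keeping only the listed variables) might look like the sketch below. It runs on a small made-up data frame rather than the real file. The object names `toy` and `toy105`, the column `extra`, and the assumption that congress is stored as the number 105 are illustrative and should be checked against the actual data.

```r
library(dplyr)

# Toy stand-in for the real data; all values are made up for illustration
toy <- data.frame(
  congress = c(105, 105, 106, 107),
  aauw     = c(100, 25, 50, 75),
  totchi   = c(2, 3, 0, 1),
  ngirls   = c(1, 0, 0, 1),
  extra    = c("a", "b", "c", "d")  # a column that is not needed
)

# Keep only the 105th Congress, then keep only the variables of interest
toy105 <- toy %>%
  dplyr::filter(congress == 105) %>%
  dplyr::select(aauw, totchi, ngirls)
```

The explicit `dplyr::` prefix is a design choice: the MASS package used in Question 1 also exports a function named `select`, so if MASS is loaded after dplyr, a bare `select()` call would dispatch to the wrong function.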

You can find the detailed description of each variable in the original paper. The main variable in this analysis is AAUW, a score created by the American Association of University Women (AAUW). For each congress, AAUW selects pieces of legislation in the areas of education, equality, and reproductive rights. The AAUW keeps track of how each legislator voted on these pieces of legislation and whether their vote aligned with the AAUW’s position. The legislator’s score is equal to the proportion of these votes made in agreement with the AAUW.

11.3.2 Questions

1. Report summary statistics of the following variables in the dataset: political party, age, race, gender, AAUW score, the number of children, and the number of daughters.
2. Estimate the following linear regression models using the lm command. Do not forget to correct the standard errors! Report your regression results in a table. \[ \begin{aligned} aauw_i = \ & \beta_0 + \beta_1 ngirls_i + \epsilon_i \\ aauw_i = \ & \beta_0 + \beta_1 ngirls_i + \beta_2 totchi_i + \epsilon_i \\ aauw_i = \ & \beta_0 + \beta_1 ngirls_i + \beta_2 totchi_i + \beta_3 famale_i + \beta_4 repub_i + \epsilon_i \end{aligned} \]
   - All the variables used in the above specifications are in the dataset except for \(repub_i\). \(repub_i\) takes 1 if legislator \(i\) is affiliated with the Republican party and 0 otherwise.
   - Important: Never paste the raw output of the lm command from the R console into your answer! Please prepare a table of regression results as you would for a report or a paper. If you copy and paste the raw output of the lm command, you will get 0 points for the empirical exercise part of this problem set.
3. Compare the OLS estimates of \(\beta_1\) across the above three specifications. Discuss what explains the differences (if any) in the estimates across the three specifications.
4. Consider the third specification (with 3 controls in addition to \(ngirls_i\)). Conditional on the number of children and the other variables, do you think \(ngirls_i\) is plausibly exogenous (i.e., uncorrelated with the error term)? Discuss.
5. It is possible that the effect of having daughters differs between female and male legislators. Estimate a regression model that allows for heterogeneous effects of daughters for male and female legislators. Discuss whether the data support this story.
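
Two mechanical steps in the questions above, building \(repub_i\) from party and correcting the standard errors, can be sketched as below. The sketch assumes that "correct" means heteroskedasticity-robust standard errors (here the HC1 variant, computed by hand in base R); check this against what was covered in class. The data are simulated purely for illustration, so none of the numbers relate to the real dataset, and names such as `toy`, `fit`, and `V_hc1` are hypothetical.

```r
set.seed(1)  # hypothetical seed, used only to make the sketch reproducible

# Simulated stand-in data; these are NOT values from the real dataset
n   <- 100
toy <- data.frame(
  ngirls = rpois(n, 1),
  party  = sample(1:3, n, replace = TRUE, prob = c(0.48, 0.50, 0.02))
)
toy$totchi <- toy$ngirls + rpois(n, 1)
toy$repub  <- as.integer(toy$party == 2)  # 1 if Republican, 0 otherwise
toy$aauw   <- 60 - 40 * toy$repub + rnorm(n, sd = 10 + 5 * toy$totchi)

fit <- lm(aauw ~ ngirls + totchi + repub, data = toy)

# HC1 robust variance: (n / (n - k)) * (X'X)^{-1} X' diag(u^2) X (X'X)^{-1}
X     <- model.matrix(fit)
u     <- resid(fit)
k     <- ncol(X)
bread <- solve(crossprod(X))
meat  <- crossprod(X * u)  # row i of X is scaled by u_i, giving X' diag(u^2) X
V_hc1 <- (n / (n - k)) * bread %*% meat %*% bread

robust_se <- sqrt(diag(V_hc1))
round(cbind(estimate = coef(fit), robust_se = robust_se), 3)
```

If the sandwich package is installed, `sandwich::vcovHC(fit, type = "HC1")` should return the same matrix as `V_hc1`. Either way, the estimates and robust standard errors still need to be arranged into a properly formatted table before they go into your answer.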