1 Introduction

1.1 Introduction: Matching Estimator

  • Idea: Compare individuals with the same characteristics \(X\) across treatment and control groups
  • Key assumption: Treatment is as good as randomly assigned once we control for the observed characteristics.
  • Do you remember that we have already learned a similar idea before?

2 Identification

2.1 Matching

  • Let \(X_{i}\) denote the observed characteristics:
    • age, income, education, race, etc.
  • Assumption 1: \[D_{i}\perp(Y_{0i},Y_{1i})\left|X_{i}\right.\]
    • Conditional on \(X_{i}\), no selection bias.
    • Selection on observables assumption / ignorability
  • Assumption 2: Overlap assumption \[P(D_{i}=1|X_{i}=x)\in(0,1)\ \forall x\]
    • Given \(x\), we should be able to observe people from both the treatment and control groups.
    • We call \(P(D_{i}=1|X_{i}=x)\) the propensity score.

2.2 Identification

  • Assumption 1 implies that \[\begin{aligned} E[Y_{1i}|D_{i} & =1,X_{i}]=E[Y_{1i}|D_{i}=0,X_{i}]=E[Y_{1i}|X_{i}]\\ E[Y_{0i}|D_{i} & =1,X_{i}]=E[Y_{0i}|D_{i}=0,X_{i}]=E[Y_{0i}|X_{i}]\end{aligned}\]

  • The \(ATT\) conditional on \(X_{i}\) is given by \[\begin{aligned} E[Y_{1i}-Y_{0i}|D_{i}=1,X_{i}] & =E[Y_{1i}|D_{i}=1,X_{i}]-E[Y_{0i}|D_{i}=1,X_{i}]\\ & =E[Y_{i}|D_{i}=1,X_{i}]-E[Y_{0i}|D_{i}=0,X_{i}]\\ & =\underbrace{E[Y_{i}|D_{i}=1,X_{i}]}_{\text{avg. outcome given }X_{i}\text{ in treatment}}-\underbrace{E[Y_{i}|D_{i}=0,X_{i}]}_{\text{avg. outcome given }X_{i}\text{ in control}}\end{aligned}\] where the second equality uses \(Y_{i}=Y_{1i}\) for the treated and Assumption 1 for the \(Y_{0i}\) term.

  • The components in the last line are identified (can be estimated).

  • Intuition: Comparing the outcome across control and treatment groups after conditioning on \(X_{i}\)

2.3 ATT and ATE

  • ATT is given by \[\begin{aligned} ATT & =E[Y_{1i}-Y_{0i}|D_{i}=1]\\ & =\int E[Y_{1i}-Y_{0i}|D_{i}=1,X_{i}=x]f_{X_{i}}(x|D_{i}=1)dx\\ & =E[Y_{i}|D_{i}=1]-\int E[Y_{i}|D_{i}=0,X_{i}=x]f_{X_{i}}(x|D_{i}=1)dx\end{aligned}\]

  • ATE is \[\begin{aligned} ATE & =E[Y_{1i}-Y_{0i}]\\ & =\int E[Y_{1i}-Y_{0i}|X_{i}=x]f_{X_{i}}(x)dx\\ & =\int E[Y_{i}|D_{i}=1,X_{i}=x]f_{X_{i}}(x)dx-\int E[Y_{i}|D_{i}=0,X_{i}=x]f_{X_{i}}(x)dx\end{aligned}\]

  • Note: ATT averages the conditional effects over \(f_{X_{i}}(x|D_{i}=1)\), while ATE averages over the unconditional density \(f_{X_{i}}(x)\).

3 Estimation

3.1 Estimation Methods

  • We need to estimate \(E[Y_{i}|D_{i}=1,X_{i}=x]\) and \(E[Y_{i}|D_{i}=0,X_{i}=x]\)
  • There are several ways to implement this idea:
    1. Regression: Nonparametric and Parametric
    2. Nearest neighbor matching
    3. Propensity Score Matching

3.2 Approach 1: Regression, or Analogue Approach

  • Let \(\hat{\mu}_{k}(x)\) be an estimator of \(\mu_{k}(x)=E[Y_{i}|D_{i}=k,X_{i}=x]\) for \(k\in\{0,1\}\)
  • The analog estimators are \[\begin{aligned} \hat{ATE} & =\frac{1}{N}\sum_{i=1}^{N}\left(\hat{\mu}_{1}(X_{i})-\hat{\mu}_{0}(X_{i})\right)\\ \hat{ATT} & =\frac{N^{-1}\sum_{i=1}^{N}D_{i}(Y_{i}-\hat{\mu}_{0}(X_{i}))}{N^{-1}\sum_{i=1}^{N}D_{i}}\end{aligned}\] (a code sketch follows this list)
  • How can we estimate \(\mu_{k}(x)=E[Y_{i}|D_{i}=k,X_{i}=x]\)?
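A minimal base-R sketch of these two plug-in formulas, assuming muhat0() and muhat1() are already-fitted functions for \(\mu_{0}(x)\) and \(\mu_{1}(x)\) (hypothetical helper names; the next subsections discuss how to obtain them):

```r
# Plug-in ATE and ATT given fitted regression functions muhat0(), muhat1().
# y: outcomes, d: 0/1 treatment indicator, x: covariate(s).
analog_estimators <- function(y, d, x, muhat0, muhat1) {
  ate <- mean(muhat1(x) - muhat0(x))           # average of muhat1 - muhat0
  att <- sum(d * (y - muhat0(x))) / sum(d)     # avg. of y - muhat0 over treated
  list(ATE = ate, ATT = att)
}
```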

3.3 Nonparametric Estimation

  • Suppose that \(X_{i}\in\{x_{1},\cdots,x_{K}\}\) is discrete with small \(K\)
    • Ex: two binary demographic characteristics (male/female, white/non-white), so \(K=4\)
  • Then, a nonparametric binning estimator is \[\hat{\mu}_{k}(x)=\frac{\sum_{i=1}^{N}\mathbf{1}\{D_{i}=k,X_{i}=x\}Y_{i}}{\sum_{i=1}^{N}\mathbf{1}\{D_{i}=k,X_{i}=x\}}\]
  • Here, I do not impose any parametric assumption on \(\mu_{k}(x)=E[Y_{i}|D_{i}=k,X_{i}=x]\); a minimal sketch follows.
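A minimal sketch of the binning estimator on simulated data (the data-generating process and names are illustrative):

```r
# Binning estimator: the sample mean of Y within the cell {D = k, X = x0}.
muhat_bin <- function(y, d, x, k, x0) {
  cell <- (d == k) & (x == x0)
  mean(y[cell])   # NaN if the cell is empty -- this is where overlap matters
}

set.seed(1)
x <- rbinom(500, 1, 0.5)                 # one binary covariate
d <- rbinom(500, 1, plogis(x - 0.5))     # treatment probability depends on x
y <- 1 + 2 * d + x + rnorm(500)          # true effect tau = 2
muhat_bin(y, d, x, k = 1, x0 = 1) - muhat_bin(y, d, x, k = 0, x0 = 1)
```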

3.4 Curse of dimensionality

  • Issue: Poor performance if \(K\) is large, which happens quickly with many covariates.
  • With so many potential groups, there are too few observations in each group.
  • With \(d\) covariates, each of which takes \(L\) values, there are \(L^{d}\) possible groups (bins) in total; e.g., 10 binary covariates already give \(2^{10}=1024\) bins.
  • This is known as the curse of dimensionality.
  • Relatedly, if \(X\) is a continuous random variable, we can use kernel regression (sketched below).
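A sketch of the kernel route for a single continuous covariate, using base R's ksmooth() (Nadaraya-Watson regression); the bandwidth and the data-generating process are arbitrary illustrative choices:

```r
set.seed(1)
x <- runif(500)
d <- rbinom(500, 1, plogis(x - 0.5))
y <- 1 + 2 * d + sin(2 * pi * x) + rnorm(500)    # true effect tau = 2

# Estimate mu_0 and mu_1 by kernel regression within each group,
# evaluating both fits at the same points x (ksmooth sorts x.points).
fit0 <- ksmooth(x[d == 0], y[d == 0], "normal", bandwidth = 0.5, x.points = x)
fit1 <- ksmooth(x[d == 1], y[d == 1], "normal", bandwidth = 0.5, x.points = x)
mean(fit1$y - fit0$y)                            # plug-in ATE-hat
```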

3.5 Parametric Estimation, or going back to linear regression

  • If you impose a parametric assumption such as \[\begin{aligned} E[Y_{i}|D_{i}=0,X_{i}=x] & =\beta'x\\ E[Y_{i}|D_{i}=1,X_{i}=x] & =\beta'x+\tau\end{aligned}\] then you obtain the familiar linear model \[Y_{i}=\beta'X_{i}+\tau D_{i}+\epsilon_{i}\] (see the sketch after this list)
  • You can think of the matching estimator as controlling for omitted variable bias by adding (many) covariates (control variables) \(X_{i}\).
  • This is one reason why the matching estimator may not be preferred in empirical research.
    • Remember: Controlling for those covariates is of course important, and it can be combined with other empirical strategies (IV, DID, etc.).
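A minimal sketch of the parametric approach on simulated data, assuming the linear specification above holds (names and data-generating process are illustrative):

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
d  <- rbinom(n, 1, plogis(0.5 * x1))     # treatment depends on x1
y  <- 1 + 2 * d + x1 - 0.5 * x2 + rnorm(n)

# Under the linear model, the OLS coefficient on d estimates tau
# (and ATE = ATT = tau).
coef(lm(y ~ d + x1 + x2))["d"]           # should be close to the true tau = 2
```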

3.6 Approach 2: \(M\)-Nearest Neighbor Matching

  • Idea: Find the counterparts in the other group who are close to me.
  • Let \(\hat{y}_{i}(0)\) and \(\hat{y}_{i}(1)\) denote the estimators of the (hypothetical) outcomes when not treated and treated: \[\hat{y}_{i}(0)=\begin{cases} y_{i} & \text{if }D_{i}=0\\ \frac{1}{M}\sum_{j\in L_{M}(i)}y_{j} & \text{if }D_{i}=1 \end{cases}\] and \(\hat{y}_{i}(1)\) is defined analogously.
  • \(L_{M}(i)\) is the set of \(M\) individuals in the opposite group who are “close” to individual \(i\)
    • There are several ways to define the distance between \(X_{i}\) and \(X_{j}\), such as the squared Euclidean distance \[dist(X_{i},X_{j})=\|X_{i}-X_{j}\|^{2}\]
  • Need to choose (1) \(M\) and (2) the measure of distance
    • R has several packages for this (e.g., Matching, MatchIt); a bare-bones version is sketched below.
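A bare-bones base-R sketch imputing \(\hat{y}_{i}(0)\) for treated units with the squared Euclidean distance above (impute_y0 is a hypothetical helper name; \(\hat{y}_{i}(1)\) is symmetric):

```r
# For each treated unit, average the outcomes of its M closest controls.
# y: outcomes, d: 0/1 treatment, X: covariate matrix, M: number of neighbors.
impute_y0 <- function(y, d, X, M = 1) {
  X <- as.matrix(X)
  controls <- which(d == 0)
  y0 <- y                                  # yhat_i(0) = y_i when D_i = 0
  for (i in which(d == 1)) {
    dist2 <- colSums((t(X[controls, , drop = FALSE]) - X[i, ])^2)
    nn <- controls[order(dist2)[1:M]]      # M closest control units
    y0[i] <- mean(y[nn])                   # average their outcomes
  }
  y0
}

# ATT-hat: average of y_i - yhat_i(0) over the treated units
# mean((y - impute_y0(y, d, X, M = 5))[d == 1])
```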

3.7 Approach 3: Propensity Score Matching

  • Use the propensity score \(P(D_{i}=1|X_{i}=x)\) as the distance measure to define who is the closest to me.
  • Implementation:
    1. Estimate the propensity score function by logit or probit using a flexible function of \(X_i\).
    2. Calculate the propensity score for each observation and use it to define the matched pairs.
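A minimal sketch of the two steps on simulated data, matching each treated unit to the control with the closest estimated score (one-to-one matching with replacement; the logit specification here is illustrative, and a flexible version could add polynomials and interactions):

```r
set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- rnorm(n)
d  <- rbinom(n, 1, plogis(0.8 * x1 - 0.5 * x2))
y  <- 1 + 2 * d + x1 + x2 + rnorm(n)              # true effect tau = 2

ps <- fitted(glm(d ~ x1 + x2, family = binomial)) # step 1: logit scores
treated  <- which(d == 1)
controls <- which(d == 0)
match_id <- vapply(treated, function(i)           # step 2: closest score
  controls[which.min(abs(ps[controls] - ps[i]))], integer(1))
mean(y[treated] - y[match_id])                    # ATT-hat
```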