We learned about John Snow's grand experiment in 1850s London.
We used his story to motivate the IV estimator.
You took a quiz about some IV aspects.
We'll look at further IV applications.
We introduce an extension called Two Stage Least Squares.
We will use R to compute the estimates.
Finally we'll talk about weak instruments.
What's the causal impact of schooling on earnings?
Jacob Mincer was interested in this important question.
Here's his model:
$$\log Y_i = \alpha + \rho S_i + \beta_1 X_i + \beta_2 X_i^2 + e_i$$
He found an estimate for ρ of about 0.11, i.e. roughly an 11% earnings advantage for each additional year of education.
Look at the DAG. Is that a good model? Well, why would it not be?
We compare earnings of men with certain schooling and work experience
Is all else equal, after controlling for those?
Given X,
Yes, of course. So, all else is not equal at all.
That's an issue, because for OLS consistency we require the orthogonality assumption $E[e_i \mid S_i, X_i] = 0$.
Let's introduce ability Ai explicitly.
In fact we have two unobservables: e and A.
Of course we can't tell them apart.
So we define a new unobservable factor $u_i = e_i + A_i$.
In terms of an equation:
$$\log Y_i = \alpha + \rho S_i + \beta_1 X_i + \beta_2 X_i^2 + \underbrace{A_i + e_i}_{u_i}$$
Sometimes, this does not matter, and the OLS bias is small.
But sometimes it does and we get it totally wrong! Example.
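A small simulation can make the ability-bias problem concrete. The sketch below does not use Mincer's data: the variable names, coefficients and sample size are all made up for illustration, and experience is left out to keep things short. It compares the OLS slope when ability is omitted with the (infeasible) regression that controls for ability.

```r
# Hypothetical data-generating process, chosen only for illustration
set.seed(1)
n   <- 10000
A   <- rnorm(n)                       # unobserved ability
S   <- 12 + 2 * A + rnorm(n)          # schooling increases with ability
e   <- rnorm(n, sd = 0.5)             # pure noise
rho <- 0.10                           # assumed true return to schooling
logY <- 1 + rho * S + 0.3 * A + e     # ability also raises earnings

# "Short" regression: ability is part of the error term
ols_short <- unname(coef(lm(logY ~ S))["S"])
# "Long" regression: only possible here because we simulated A
ols_long  <- unname(coef(lm(logY ~ S + A))["S"])

c(true = rho, short = ols_short, long = ols_long)
```

Under these made-up parameters the omitted-variable bias in the population is 0.3 · Cov(S, A) / Var(S) = 0.3 · 2 / 5 = 0.12, so the short regression should land well above the assumed ρ while the long one stays close to it.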
Angrist and Krueger (AK91) is an influential study addressing ability bias.
Idea:
Suppose all children who reach the age of 6 by 31 December 2021 are required to enroll in the first grade of school in September 2021.
If born in December 2015 (i.e. just under 6 years prior), a child will be about 5 and 3/4 years old by the time they start school.
If born on the 1st of January 2016, a child will be about 6 and 2/3 years old when they enter school in September 2022.
However, people can legally drop out of school on their 16th birthday!
So, among people who drop out, some got more schooling than others.
AK91 construct an IV: a quarter-of-birth dummy z, which affects schooling but is not related to A!
In particular: whether someone was born in the 4th quarter or not.
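The mechanism behind the instrument is plain date arithmetic, which can be sketched as follows. The helper names `age_at_entry` and `schooling_at_16` are hypothetical, and the sketch assumes school starts on 1 September, births fall on the first day of the month, and time in school is measured in calendar years.

```r
# Age (in years) on 1 September of the year in which the child turns 6.
# birth_month: 1 = January, ..., 12 = December (born on the 1st, by assumption)
age_at_entry <- function(birth_month) {
  6 + (8 - (birth_month - 1)) / 12
}

# Years of schooling completed by the 16th birthday (earliest legal dropout)
schooling_at_16 <- function(birth_month) {
  16 - age_at_entry(birth_month)
}

birth_months <- c(Jan = 1, Apr = 4, Jul = 7, Oct = 10, Dec = 12)
data.frame(
  born            = names(birth_months),
  entry_age       = round(age_at_entry(birth_months), 2),
  schooling_at_16 = round(schooling_at_16(birth_months), 2)
)
```

Under these assumptions, children born late in the year enter school younger and have therefore accumulated more schooling by the time they are allowed to leave.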
AK91 allow us to introduce a widely used variation of our simple IV estimator: 2SLS
We estimate a first stage model which uses only exogenous variables (like z) to explain our endogenous regressor s.
We then use the first stage model to predict values of s, and use these predictions in place of s in what is called the second stage. (The reduced form, by contrast, regresses y directly on z.) Performing this procedure is supposed to take out any impact of A in the correlation we observe in our data between s and y.
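$$\begin{aligned}
\text{1. Stage: } & s_i = \alpha_0 + \alpha_1 z_i + \eta_i \\
\text{2. Stage: } & y_i = \beta_0 + \beta_1 \hat{s}_i + u_i
\end{aligned}$$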
Conditions: the instrument z must be relevant (α1≠0) and exogenous (uncorrelated with u).
Let's load the data and look at a quick summary

    library(modelsummary)  # provides datasummary_skim()

    data("ak91", package = "masteringmetrics")
    # from the modelsummary package
    datasummary_skim(data.frame(ak91), histogram = TRUE)

| | Unique (#) | Missing (%) | Mean | SD | Min | Median | Max |
|---|---|---|---|---|---|---|---|
| lnw | 26732 | 0 | 5.9 | 0.7 | −2.3 | 6.0 | 10.5 |
| s | 21 | 0 | 12.8 | 3.3 | 0.0 | 12.0 | 20.0 |
| yob | 10 | 0 | 1934.6 | 2.9 | 1930.0 | 1935.0 | 1939.0 |
| qob | 4 | 0 | 2.5 | 1.1 | 1.0 | 3.0 | 4.0 |
| sob | 51 | 0 | 30.7 | 14.2 | 1.0 | 34.0 | 56.0 |
| age | 40 | 0 | 45.0 | 2.9 | 40.2 | 45.0 | 50.0 |
We want to create the q4 dummy, which is TRUE if you are born in the 4th quarter.
We also create factor versions of quarter and year of birth.

    library(dplyr)  # mutate(), group_by(), summarise(), %>%

    ak91 <- mutate(ak91,
                   qob_fct = factor(qob),
                   q4 = as.integer(qob == "4"),
                   yob_fct = factor(yob))

    # get mean log wage and mean schooling by year/quarter
    ak91_age <- ak91 %>%
      group_by(qob, yob) %>%
      summarise(lnw = mean(lnw), s = mean(s)) %>%
      mutate(q4 = (qob == 4))

Let's reproduce AK91's first figure now on education as a function of quarter of birth!

    library(ggplot2)

    ggplot(ak91_age, aes(x = yob + (qob - 1) / 4, y = s)) +
      geom_line() +
      geom_label(mapping = aes(label = qob, color = q4)) +
      guides(color = "none") +  # hide the legend for the q4 colour
      scale_x_continuous("Year of birth", breaks = 1930:1940) +
      scale_y_continuous("Years of Education",
                         breaks = seq(12.2, 13.2, by = 0.2),
                         limits = c(12.2, 13.2)) +
      theme_bw()

The numbers label mean education by quarter-of-birth group.
Those born in the 4th quarter did get more education in most years!
There is also a general time trend.
What about earnings for those groups?

    ggplot(ak91_age, aes(x = yob + (qob - 1) / 4, y = lnw)) +
      geom_line() +
      geom_label(mapping = aes(label = qob, color = q4)) +
      scale_x_continuous("Year of birth", breaks = 1930:1940) +
      scale_y_continuous("Log weekly wages") +
      guides(color = "none") +  # hide the legend for the q4 colour
      theme_bw()

Those born in the 4th quarter are among the high earners in each birth year.
In general, weekly wages seem to decline somewhat over time.
In R: several options (like always with R! 😉)
We will use the iv_robust function from the estimatr package.
Robust? It computes standard errors which correct for heteroskedasticity.

    library(estimatr)
    # create a list of models
    mod <- list()

    # standard (biased!) OLS
    mod$ols <- lm(lnw ~ s, data = ak91)

    # IV: born in q4 is TRUE?
    # doing IV manually in 2 stages.
    mod[["1. stage"]] <- lm(s ~ q4, data = ak91)
    ak91$shat <- predict(mod[["1. stage"]])
    mod[["2. stage"]] <- lm(lnw ~ shat, data = ak91)

    # run 2SLS
    # doing IV all in one go
    # notice the formula!
    # formula = y ~ x | z
    mod$`2SLS` <- iv_robust(lnw ~ s | q4, data = ak91, diagnostics = TRUE)

Notice the predict call, which gives us $\hat{s}$ (stored as shat).
| | ols | 1. stage | 2. stage | 2SLS |
|---|---|---|---|---|
| (Intercept) | 4.995*** | 12.747*** | 4.955*** | 4.955*** |
| | (0.004) | (0.007) | (0.381) | (0.358) |
| s | 0.071*** | | | 0.074** |
| | (0.000) | | | (0.028) |
| q4 | | 0.092*** | | |
| | | (0.013) | | |
| shat | | | 0.074* | |
| | | | (0.030) | |
| R2 | 0.117 | 0.000 | 0.000 | 0.117 |
| RMSE | 0.64 | 3.28 | 0.68 | 0.64 |
| 1. Stage F: | | | | 48.990 |

+ p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001
OLS is likely downward biased (measurement error in schooling).
First stage: the IV q4 is statistically significant, but the effect is small: being born in q4 is associated with 0.092 additional years of education. R2 is 0%! But the F-stat is large. 😅
The second stage has the same point estimate as 2SLS but a different standard error (the one from the manual 2. stage is wrong).
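Why the manual second-stage standard error is off can be seen on simulated data. The sketch below uses made-up parameters and base R only. It shows that the manual second stage builds its residuals from the predicted regressor, whereas the classical (homoskedastic) 2SLS formula builds them from the actual regressor. Note that iv_robust reports heteroskedasticity-robust standard errors, so its numbers would differ somewhat from the classical formula used here.

```r
# Hypothetical data: binary instrument z, unobserved confounder a
set.seed(42)
n <- 5000
z <- rbinom(n, 1, 0.25)                 # plays the role of q4
a <- rnorm(n)                           # unobserved confounder
x <- 12 + 0.5 * z + a + rnorm(n)        # endogenous regressor
y <- 1 + 0.1 * x + 0.5 * a + rnorm(n)   # outcome, true slope 0.1

first  <- lm(x ~ z)
xhat   <- fitted(first)
second <- lm(y ~ xhat)                  # manual second stage
b      <- unname(coef(second))

# Naive SE: residuals are y - b0 - b1 * xhat
se_naive <- summary(second)$coefficients["xhat", "Std. Error"]

# Classical 2SLS SE: residuals are y - b0 - b1 * x (the actual regressor)
u_hat   <- y - b[1] - b[2] * x
sigma2  <- sum(u_hat^2) / (n - 2)
se_2sls <- sqrt(sigma2 / sum((xhat - mean(xhat))^2))

c(slope = b[2], se_naive = se_naive, se_2sls = se_2sls)
```

The point estimate is unaffected, since only the residual variance entering the standard error changes.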
We encountered this before: it's useful to test restricted vs unrestricted models against each other.
Here, we are interested whether our instruments are jointly significant. Of course, with only one IV, that's not more informative than the t-stat of that IV.
This F-Stat compares the predictive power of the first stage with and without the IVs. If they have very similar predictive power, the F-stat will be low, and we will not be able to reject the H0 that our IVs are jointly insignificant in the first stage model. 😞
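The comparison of restricted and unrestricted first stages can be sketched with base R's `anova`. The data are simulated and the variable names (`w`, `z1`, `z2`) are hypothetical. This is the classical F-test, which need not coincide exactly with the diagnostic reported by iv_robust.

```r
# Hypothetical first stage with one control and two instruments
set.seed(7)
n  <- 2000
w  <- rnorm(n)                        # exogenous control
z1 <- rnorm(n)                        # instrument 1
z2 <- rnorm(n)                        # instrument 2
x  <- 1 + 0.2 * z1 + 0.1 * z2 + 0.5 * w + rnorm(n)

restricted   <- lm(x ~ w)             # first stage without the IVs
unrestricted <- lm(x ~ w + z1 + z2)   # first stage with the IVs

ftest  <- anova(restricted, unrestricted)
F_stat <- ftest$F[2]
p_val  <- ftest[["Pr(>F)"]][2]

# Same statistic from the textbook formula
rss_r <- sum(resid(restricted)^2)
rss_u <- sum(resid(unrestricted)^2)
q <- 2                                # number of excluded instruments
k <- length(coef(unrestricted))       # parameters in unrestricted model
F_manual <- ((rss_r - rss_u) / q) / (rss_u / (n - k))

c(F_stat = F_stat, F_manual = F_manual, p_value = p_val)
```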
We saw a clear time trend in education earlier.
There are also business-cycle fluctuations in earnings
We should somehow control for different time periods.
Also, we can use more than one IV! Here is how:

    # we keep adding to our `mod` list:
    mod$ols_yr <- update(mod$ols, . ~ . + yob_fct)  # previous OLS model

    # add exogenous vars on both sides of the `|` !
    mod[["2SLS_yr"]] <- estimatr::iv_robust(lnw ~ s + yob_fct | q4 + yob_fct,
                                            data = ak91,
                                            diagnostics = TRUE)

    # use all quarters as IVs
    mod[["2SLS_all"]] <- estimatr::iv_robust(lnw ~ s + yob_fct | qob_fct + yob_fct,
                                             data = ak91,
                                             diagnostics = TRUE)

| | ols | 2SLS | ols_yr | 2SLS_yr | 2SLS_all |
|---|---|---|---|---|---|
| (Intercept) | 4.995 | 4.955 | 5.017 | 4.966 | 4.592 |
| | (0.004) | (0.358) | (0.005) | (0.354) | (0.251) |
| s | 0.071 | 0.074 | 0.071 | 0.075 | 0.105 |
| | (0.000) | (0.028) | (0.000) | (0.028) | (0.020) |
| R2 | 0.117 | 0.117 | 0.118 | 0.117 | 0.091 |
| RMSE | 0.64 | 0.64 | 0.64 | 0.64 | 0.65 |
| 1. Stage F: | | 48.990 | | 47.731 | 32.323 |
| Instruments | none | Q4 | none | Q4 | All Quarters |
| Year of birth | no | no | yes | yes | yes |
Adding year controls...
Using all quarters as IV...
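The rule that exogenous controls go on both sides of the `|` has a direct counterpart in the manual two-step procedure: the control enters both the first and the second stage. The sketch below illustrates this on simulated data; the control `w` stands in for the year-of-birth dummies, and all names and coefficients are made up.

```r
# Hypothetical data with one exogenous control and one instrument
set.seed(3)
n <- 5000
w <- rnorm(n)                                # exogenous control
z <- rnorm(n)                                # instrument
a <- rnorm(n)                                # unobserved confounder
s <- 0.5 * z + 0.8 * w + a + rnorm(n)        # endogenous regressor
y <- 0.1 * s + 0.3 * w + 0.5 * a + rnorm(n)  # outcome, true slope 0.1

# First stage: instrument AND control explain s
s_hat <- fitted(lm(s ~ z + w))

# Second stage: predicted s AND the same control
b_2sls <- unname(coef(lm(y ~ s_hat + w))["s_hat"])

# OLS with the control, for comparison (still biased by a)
b_ols <- unname(coef(lm(y ~ s + w))["s"])

c(true = 0.1, ols = b_ols, tsls = b_2sls)
```

As with the single-instrument case, the standard errors from the manual second stage should not be used; the sketch is only about the point estimate.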
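This will produce consistent estimates if the instrument is relevant, independent of the unobserved determinants of the outcome, and satisfies the exclusion restriction (it affects the outcome only through schooling).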
How does the QOB perform along those lines?
Plot of first stage and high F-stat offer compelling evidence for relevance. ✅
Is QOB independent of, say, maternal characteristics? Birthdays are not really random: there are birth seasons for certain socioeconomic backgrounds. Mothers with the highest schooling tend to give birth in the second quarter (not in the 4th! ✅).
Exclusion: What if the youngest kids (born in Q4!) are the disadvantaged ones early on, which has long-term negative impacts? That would mean E[u|z]≠0! Well, with QOB the youngest ones actually do better (more schooling and higher wage)! ✅
Let's go back to our simple linear model:
y=β0+β1x+u
where we fear that Cov(x,u)≠0, x is endogenous.
Conditions for an IV z:
1. First stage or relevance: Cov(z,x)≠0
2. IV exogeneity: Cov(z,u)=0, i.e. the IV is exogenous in the outcome equation.
How does this identify β1?
(How can we express β1 in terms of population moments to pin its value down?)
$$Cov(z,y) = Cov(z, \beta_0 + \beta_1 x + u) = \beta_1 Cov(z,x) + Cov(z,u)$$
Under condition 2. above (IV exogeneity), we have Cov(z,u)=0, hence
$$Cov(z,y) = \beta_1 Cov(z,x)$$
and under condition 1. (relevance), we have Cov(z,x)≠0, so that we can divide the equation through to obtain
$$\beta_1 = \frac{Cov(z,y)}{Cov(z,x)}.$$
β1 is identified via population moments Cov(z,y) and Cov(z,x).
We can estimate those moments via their sample analogs
Just plugging in for the population moments:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^n (z_i - \bar{z})(x_i - \bar{x})}$$
The intercept estimate is $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$.
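The sample-analog estimator is short enough to compute by hand. The following sketch uses simulated data with made-up coefficients (true slope 1.5, true intercept 2), and contrasts the IV slope with the OLS slope when x is endogenous.

```r
# Hypothetical data: x is correlated with the error u, z is not
set.seed(123)
n <- 10000
z <- rnorm(n)                        # instrument
u <- rnorm(n)                        # structural error
x <- 0.6 * z + 0.8 * u + rnorm(n)    # endogenous regressor
y <- 2 + 1.5 * x + u                 # outcome

# IV slope: ratio of sample covariances
b1_iv <- cov(z, y) / cov(z, x)
# Equivalent "sum" form of the same estimator
b1_iv_sums <- sum((z - mean(z)) * (y - mean(y))) /
              sum((z - mean(z)) * (x - mean(x)))
# IV intercept
b0_iv <- mean(y) - b1_iv * mean(x)

# OLS slope for comparison (inconsistent here)
b1_ols <- cov(x, y) / var(x)

c(b0_iv = b0_iv, b1_iv = b1_iv, b1_ols = b1_ols)
```

With these parameters the OLS slope converges to 1.5 + Cov(x,u)/Var(x) = 1.5 + 0.8/2 = 1.9, while the IV slope targets 1.5.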
Given both assumptions 1. and 2. are satisfied, we say that the IV estimator is consistent for $\beta_1$. We write
$$\operatorname{plim}(\hat{\beta}_1) = \beta_1$$
In words: the probability limit of $\hat{\beta}_1$ is the true $\beta_1$.
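Assuming $E(u^2 \mid z) = \sigma^2$, the (asymptotic) variance of the IV slope estimator is
$$Var(\hat{\beta}_{1,IV}) = \frac{\sigma^2}{n \sigma_x^2 \rho_{x,z}^2}$$
where $\sigma_x^2$ is the population variance of x, $\sigma^2$ that of u, and $\rho_{x,z}$ is the population correlation between x and z.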
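You can see 2 important things here:
1. Without the term $\rho_{x,z}^2$, this is the OLS variance.
2. As the sample size n increases, the variance decreases.

Since $\rho_{x,z}^2$ equals the $R^2_{x,z}$ of a regression of x on z, we can also write
$$Var(\hat{\beta}_{1,IV}) = \frac{\sigma^2}{n \sigma_x^2 R^2_{x,z}}$$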
Given $R^2_{x,z} < 1$ in most real life situations, we have that $Var(\hat{\beta}_{1,IV}) > Var(\hat{\beta}_{1,OLS})$ almost certainly.
The higher the correlation between z and x, the closer their $R^2_{x,z}$ is to 1. With $R^2_{x,z} = 1$ we get back to the OLS variance. This is no surprise, because that implies that z is in fact an exact linear function of x.
So, if you have a valid, exogenous regressor x, you should not perform IV estimation using z to obtain $\hat{\beta}$, since your variance will be unnecessarily large.
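The size of the efficiency loss follows from dividing the two variance expressions. This assumes the usual homoskedastic OLS variance $\sigma^2 / (n \sigma_x^2)$ for an exogenous x; the numerical value of $R^2_{x,z}$ used below is purely illustrative.

```latex
\frac{Var(\hat{\beta}_{1,IV})}{Var(\hat{\beta}_{1,OLS})}
  = \frac{\sigma^2 / (n \sigma_x^2 R^2_{x,z})}{\sigma^2 / (n \sigma_x^2)}
  = \frac{1}{R^2_{x,z}},
\qquad
\text{e.g. } R^2_{x,z} = 0.25
  \;\Rightarrow\;
  Var(\hat{\beta}_{1,IV}) = 4 \, Var(\hat{\beta}_{1,OLS}),
  \quad
  se(\hat{\beta}_{1,IV}) = 2 \, se(\hat{\beta}_{1,OLS}).
```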
Consider the following model for married women's wages:
$$\log(wage) = \beta_0 + \beta_1 educ + u$$
Let's run an OLS on this, and then compare it to an IV estimate using father's education. Keep in mind that this is a valid IV z if it is correlated with educ and uncorrelated with u.

    data(mroz, package = "wooldridge")
    mods <- list()
    mods$OLS <- lm(lwage ~ educ, data = mroz)
    mods[["First Stage"]] <- lm(educ ~ fatheduc, data = subset(mroz, inlf == 1))
    mods$IV <- estimatr::iv_robust(lwage ~ educ | fatheduc, data = mroz)

| | OLS | First Stage | IV |
|---|---|---|---|
| (Intercept) | -0.185 | 10.237 | 0.441 |
| | (0.185) | (0.276) | (0.467) |
| educ | 0.109 | | 0.059 |
| | (0.014) | | (0.037) |
| fatheduc | | 0.269 | |
| | | (0.029) | |
| Num.Obs. | 428 | 428 | 428 |
| R2 | 0.118 | 0.173 | 0.093 |
IV is consistent under given assumptions.
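However, even with only a very small Cor(z,u), we can get wrong-footed:
when the correlation between x and z is small, even a small Cor(z,u) can produce a large inconsistency.
$$\operatorname{plim}(\hat{\beta}_{1,IV}) = \beta_1 + \frac{Cor(z,u)}{Cor(z,x)} \cdot \frac{\sigma_u}{\sigma_x}$$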
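Plugging a few numbers into the inconsistency term shows how quickly a weak first stage magnifies a small violation of exogeneity. The function name `iv_asym_bias` and the input values are hypothetical, chosen only for illustration.

```r
# Asymptotic bias of the IV slope: (Cor(z,u) / Cor(z,x)) * (sd_u / sd_x)
iv_asym_bias <- function(cor_zu, cor_zx, sd_u = 1, sd_x = 1) {
  (cor_zu / cor_zx) * (sd_u / sd_x)
}

# Same tiny correlation between instrument and error in both cases
bias_strong <- iv_asym_bias(cor_zu = 0.02, cor_zx = 0.50)  # strong instrument
bias_weak   <- iv_asym_bias(cor_zu = 0.02, cor_zx = 0.01)  # weak instrument

c(strong = bias_strong, weak = bias_weak)
```

With equal standard deviations of u and x, the strong instrument gives a bias of 0.04, while the weak one gives a bias of 2.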
To illustrate this point, let's assume we want to look at the impact of number of packs of cigarettes smoked per day by pregnant women (packs) on the birthweight of their child (bwght):
log(bwght)=β0+β1packs+u
We are worried that smoking behavior is correlated with a range of other health-related variables which are in u and which could impact the birthweight of the child. So we look for an IV. Suppose we use the price of cigarettes (cigprice), assuming that the price of cigarettes is uncorrelated with factors in u. Let's run the first stage, regressing packs on cigprice, and then show the 2SLS estimates:

    data(bwght, package = "wooldridge")
    mods <- list()
    mods[["First Stage"]] <- lm(packs ~ cigprice, data = bwght)
    mods[["IV"]] <- estimatr::iv_robust(log(bwght) ~ packs | cigprice,
                                        data = bwght,
                                        diagnostics = TRUE)

| | First Stage | IV |
|---|---|---|
| (Intercept) | 0.067 | 4.448 |
| | (0.103) | (0.940) |
| cigprice | 0.000 | |
| | (0.001) | |
| packs | | 2.989 |
| | | (8.996) |
| R2 | 0.000 | -23.230 |
| RMSE | 0.30 | |
| 1. Stage F: | | 0.121 |
The first column shows a very weak first stage: cigprice has zero impact on packs, it seems!
R2 is zero.
What if we use this IV nevertheless?
In the second column: a very large, positive(!) impact of packs smoked on birthweight. 🤔
Huge standard error though.
An R2 of -23?!
F-stat of first stage: 0.121. Corresponds to a p-value of 0.728 : we cannot reject the H0 of an insignificant first stage here at all.
So: invalid approach. ❌
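How erratic the IV estimator becomes with a weak first stage can be seen in a small Monte Carlo sketch. Everything here is simulated: the function `sim_iv`, the first-stage coefficients and the sample sizes are made up, and the true slope is set to 1. The spread is summarized with the interquartile range because the weak-instrument estimates have very heavy tails.

```r
# Simulate the sampling distribution of the simple IV estimator
# first_stage: hypothetical coefficient of z in the first stage
sim_iv <- function(first_stage, n = 500, reps = 500) {
  replicate(reps, {
    z <- rnorm(n)                          # instrument
    a <- rnorm(n)                          # unobserved confounder
    x <- first_stage * z + a + rnorm(n)    # endogenous regressor
    y <- 1 * x + a + rnorm(n)              # true slope is 1
    cov(z, y) / cov(z, x)                  # IV estimate
  })
}

set.seed(2024)
strong <- sim_iv(first_stage = 1)          # strong instrument
weak   <- sim_iv(first_stage = 0.02)       # very weak instrument

rbind(
  strong = c(median = median(strong), IQR = IQR(strong)),
  weak   = c(median = median(weak),   IQR = IQR(weak))
)
```

In this setup the strong-instrument estimates should cluster tightly around 1, whereas the weak-instrument estimates are scattered over a far wider range, which is in line with the huge standard error in the cigarette price example.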