Last time, we refreshed our basic OLS knowledge.
Today we continue and look at more than one explanatory variable, and the associated problems.
But why more than one variable?
Like, how many other variables?
And, above all: which ones? 🤔
We will also revisit what we mean by a model.
Remember what we learned about the STAR Experiment
What is the causal impact of class size on test scores?
$$score_i = \beta_0 + \beta_1 classize_i + u_i$$ ?
We use a model to order our thoughts about how a causal impact is determined.
Let's augment our model with more variables:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$$
Omitted-variable bias (OVB) arises when we omit a variable that
affects our outcome variable $y$
correlates with an explanatory variable $x_j$
As its name suggests, this situation leads to bias in our estimate of $\beta_j$.
Note: OVB is not exclusive to multiple linear regression, but it does require that multiple variables affect $y$.
Example
Let's imagine a simple model for the amount individual $i$ gets paid
$$Pay_i = \beta_0 + \beta_1 School_i + \beta_2 Male_i + u_i$$
where $School_i$ gives years of schooling and $Male_i$ is a binary indicator for whether individual $i$ is male;
thus $\beta_2$ gives the average difference in pay between men and women with the same amount of schooling.
Example, continued
From our population model
$$Pay_i = \beta_0 + \beta_1 School_i + \beta_2 Male_i + u_i$$
a study that focuses only on the relationship between pay and schooling effectively estimates
$$Pay_i = \beta_0 + \beta_1 School_i + (\beta_2 Male_i + u_i) = \beta_0 + \beta_1 School_i + \varepsilon_i$$
where $\varepsilon_i = \beta_2 Male_i + u_i$.
We used our exogeneity assumption to derive OLS's unbiasedness. But even if $E[u|X] = 0$, it is not true that $E[\varepsilon|X] = 0$ so long as $\beta_2 \neq 0$.
Specifically, $E[\varepsilon|Male = 1] = \beta_2 + E[u|Male = 1] \neq 0$. Now OLS is biased.
Example, continued
Let's try to see this result graphically.
The population model:
$$Pay_i = 20 + 0.5 \times School_i + 10 \times Male_i + u_i$$
Our regression model that suffers from omitted-variable bias:
$$Pay_i = \hat\beta_0 + \hat\beta_1 \times School_i + e_i$$
Finally, imagine that women, on average, receive more schooling than men.
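Before turning to the plots, here is a minimal simulation sketch of this setup (the variable names and the exact size of the schooling gap are illustrative assumptions, not taken from the slides):

```r
library(tidyverse)
library(broom)

set.seed(123)
n <- 1e4

# Population: women receive more schooling on average; pay follows the model above
sim_df <- tibble(
  male   = rbinom(n, size = 1, prob = 0.5),
  school = rnorm(n, mean = 14 - 2 * male, sd = 2),
  pay    = 20 + 0.5 * school + 10 * male + rnorm(n, sd = 3)
)

# Short regression (omits male): the schooling coefficient is biased
lm(pay ~ school, data = sim_df) %>% tidy()

# Long regression (includes male): recovers the population parameters
lm(pay ~ school + male, data = sim_df) %>% tidy()
```

With this data-generating process, the short regression's schooling coefficient is pulled well below 0.5 because schooling and the omitted male indicator are negatively correlated.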
Example, continued: $Pay_i = 20 + 0.5 \times School_i + 10 \times Male_i + u_i$
[Figure: the relationship between pay and schooling.]
Biased regression estimate: $\widehat{Pay}_i = 31.3 - 0.9 \times School_i$
Recalling the omitted variable: gender (female and male)
Unbiased regression estimate: $\widehat{Pay}_i = 20.9 + 0.4 \times School_i + 9.1 \times Male_i$
Don't omit variables 😜
Instrumental variables and two-stage least squares (coming soon): if we can find something that affects $x_1$ but not the omitted variable, we can make progress!
Use multiple observations for the same unit $i$: panel data.
Warning: There are situations in which neither solution is possible.
Proceed with caution (sometimes you can sign the bias).
The key is to have a mental map of what should belong in the model.
Consider the relationship
$$Pay_i = \beta_0 + \beta_1 School_i + u_i$$
where $School_i$ gives individual $i$'s years of schooling.
Interpretations: $\beta_0$ gives expected pay with zero years of schooling; $\beta_1$ gives the expected change in pay from one additional year of schooling.
Consider the model
$$y = \beta_0 + \beta_1 x + u$$
Differentiate the model:
$$\frac{dy}{dx} = \beta_1$$
Load the wage1 dataset from the wooldridge package. You may have to install this package first.
Run skimr::skim on the dataset to get an overview. What is the fraction of nonwhite individuals in the data?
Regressing wage on education and tenure, what is the interpretation of the tenure coefficient? You may need to consult ?wage1 here.
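A sketch of how one might tackle this exercise in R (the variable names educ, tenure, and nonwhite follow the wage1 documentation):

```r
# install.packages(c("wooldridge", "skimr"))  # if not yet installed
library(tidyverse)
library(broom)

data("wage1", package = "wooldridge")

# Overview of the data; nonwhite is a 0/1 dummy, so its mean is the fraction nonwhite
skimr::skim(wage1)
mean(wage1$nonwhite)

# Wage regressed on education and tenure
lm(wage ~ educ + tenure, data = wage1) %>% tidy()
```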
Consider the relationship
$$Pay_i = \beta_0 + \beta_1 Female_i + u_i$$
where $Female_i$ is a binary indicator equal to 1 if individual $i$ is female and 0 otherwise.
Interpretations: $\beta_0$ gives expected pay for men; $\beta_1$ gives the difference in expected pay between women and men.
Derivations
$$E[Pay|Male] = E[\beta_0 + \beta_1 \times 0 + u_i] = E[\beta_0 + 0 + u_i] = \beta_0$$
$$E[Pay|Female] = E[\beta_0 + \beta_1 \times 1 + u_i] = E[\beta_0 + \beta_1 + u_i] = \beta_0 + \beta_1$$
Note: If there are no other variables to condition on, then $\hat\beta_1$ equals the difference in group means, e.g., $\bar{x}_{Female} - \bar{x}_{Male}$.
$y_i = \beta_0 + \beta_1 x_i + u_i$ for binary variable $x_i \in \{0, 1\}$
Continue with the wage1 dataset.
Now regress wage on female. What is $E[wage \mid male]$?
Add married to the regression. Now what is $E[wage \mid female, \text{not married}]$?
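A sketch of this exercise in R; it also illustrates the group-means result from the previous slide (female and married are 0/1 dummies in wage1):

```r
library(tidyverse)
library(broom)

data("wage1", package = "wooldridge")

# With only the dummy: intercept = mean wage for men, coefficient = difference in group means
lm(wage ~ female, data = wage1) %>% tidy()
wage1 %>% group_by(female) %>% summarise(mean_wage = mean(wage))

# Adding married: E[wage | female, not married] = intercept + coefficient on female
lm(wage ~ female + married, data = wage1) %>% tidy()
```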
Interactions allow the effect of one variable to change based upon the level of another variable.
Examples
Does the effect of schooling on pay change by gender?
Does the effect of gender on pay change by race?
Does the effect of schooling on pay change by experience?
Previously, we considered a model that allowed women and men to have different wages, but the model assumed the effect of school on pay was the same for everyone:
$$Pay_i = \beta_0 + \beta_1 School_i + \beta_2 Female_i + u_i$$
but we can also allow the effect of school to vary by gender:
$$Pay_i = \beta_0 + \beta_1 School_i + \beta_2 Female_i + \beta_3 School_i \times Female_i + u_i$$
The model where schooling has the same effect for everyone (F and M):
The model where schooling's effect can differ by gender (F and M):
Interpreting coefficients can be a little tricky with interactions, but the key† is to carefully work through the math.
†: As is often the case with econometrics.
$$Pay_i = \beta_0 + \beta_1 School_i + \beta_2 Female_i + \beta_3 School_i \times Female_i + u_i$$
Expected returns for an additional year of schooling for women:
$$E[Pay_i \mid Female \wedge School = \ell + 1] - E[Pay_i \mid Female \wedge School = \ell] = E[\beta_0 + \beta_1(\ell + 1) + \beta_2 + \beta_3(\ell + 1) + u_i] - E[\beta_0 + \beta_1 \ell + \beta_2 + \beta_3 \ell + u_i] = \beta_1 + \beta_3$$
Similarly, $\beta_1$ gives the expected return to an additional year of schooling for men. Thus, $\beta_3$ gives the difference in the returns to schooling for women and men.
Same dataset!
Regress wage on experience, female indicator and their interaction. What is the interpretation of all the coefficients here? Can you distinguish them from zero?
What is the expected wage for a male with 5 years of experience?
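One possible way to run this in R (exper and female are the wage1 variable names; the prediction at the end answers the last question):

```r
library(tidyverse)
library(broom)

data("wage1", package = "wooldridge")

# Allow the effect of experience to differ by gender via an interaction term
int_reg <- lm(wage ~ exper * female, data = wage1)
tidy(int_reg)

# Expected wage for a male (female = 0) with 5 years of experience
predict(int_reg, newdata = data.frame(exper = 5, female = 0))
```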
In economics, you will frequently see logged outcome variables with linear (non-logged) explanatory variables, e.g.,
$$\log(price_i) = \beta_0 + \beta_1 bdrms_i + u_i$$
This specification changes our interpretation of the slope coefficients.
data(hprice1, package = "wooldridge")
lm(log(price) ~ bdrms, data = hprice1) %>% tidy()
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    5.04     0.126      39.9  3.13e-57
#> 2 bdrms          0.167    0.0345      4.85 5.43e- 6
Interpretation
A one-unit increase in our explanatory variable increases the outcome variable by approximately $\beta_1 \times 100$ percent.
Example: An additional bedroom increases the sales price of a house by approximately 16 percent (for $\beta_1 = 0.16$).
Consider the log-linear model
$$\log(y) = \beta_0 + \beta_1 x + u$$
and differentiate
$$\frac{dy}{y} = \beta_1 \, dx$$
So a marginal change in $x$ (i.e., $dx$) leads to a $\beta_1 \, dx$ percentage change in $y$.
What about that approximation part?
An additional bedroom increases the sales price of a house by approximately 16 percent (for $\beta_1 = 0.16$):
$$\%\Delta y \approx 0.16 \times 100 = 16\%$$
This is a good approximation as long as the relative change $\Delta y / y_0$ is not too big, since we approximate $\log\left(\frac{\Delta y}{y_0} + 1\right) \approx \frac{\Delta y}{y_0}$.
The exact formula is $\%\Delta y = 100 \times (\exp(\Delta x \, \beta) - 1)$.
In our case: $\%\Delta y = 100 \times (\exp(0.16) - 1) = 17.3$.
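The approximate and exact calculations are easy to reproduce directly in R, using the slide's $\beta_1 = 0.16$ and the estimated 0.167 from the regression above:

```r
# Approximate percentage change: 100 * beta1
100 * 0.16    # slide example: 16 percent
100 * 0.167   # estimated bdrms coefficient: about 16.7 percent

# Exact percentage change: 100 * (exp(beta1) - 1)
100 * (exp(0.16) - 1)    # about 17.35 percent (17.3 on the slide)
100 * (exp(0.167) - 1)   # about 18.2 percent
```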
Same dataset!
Now regress log wage on education and tenure. How does the interpretation of the coefficient on education change?
Similarly, econometricians frequently employ log-log models, in which the outcome variable is logged and at least one explanatory variable is logged
$$\log(price_i) = \beta_0 + \beta_1 \log(sqrft_i) + u_i$$
Interpretation: $\beta_1$ is an elasticity; a one-percent increase in $x$ is associated with a $\beta_1$-percent increase in $y$.
Consider the log-log model
$$\log(y) = \beta_0 + \beta_1 \log(x) + u$$
and differentiate
$$\frac{dy}{y} = \beta_1 \frac{dx}{x}$$
which says that for a one-percent increase in $x$, we will see a $\beta_1$ percent increase in $y$. As an elasticity:
$$\frac{dy}{dx} \cdot \frac{x}{y} = \beta_1$$
Load the hprice1 dataset from the wooldridge package.
Regress log price on log sqrft. What is the interpretation of the coefficient on log(sqrft)?
What is $E[price \mid sqrft = 115]$? (Caution! Not log price!)
lm(log(price) ~ log(sqrft), data = hprice1) %>% tidy()
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)   -0.975    0.641      -1.52 1.32e- 1
#> 2 log(sqrft)     0.873    0.0846     10.3  1.05e-16
A 1% increase in the square footage of the house leads to a 0.873% increase in the sales price.
Notice the absence of units here: both variables are expressed in percentage terms.
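For the last part of the exercise above, here is a sketch of a naive level-scale prediction. Note that simply exponentiating the fitted log price understates $E[price \mid sqrft]$ because $E[\exp(u)] > 1$; a full treatment would adjust for this (e.g., with a smearing-type correction):

```r
library(tidyverse)
library(broom)

data("hprice1", package = "wooldridge")

loglog_reg <- lm(log(price) ~ log(sqrft), data = hprice1)

# Naive retransformation: exponentiate the fitted value of log(price) at sqrft = 115
exp(predict(loglog_reg, newdata = data.frame(sqrft = 115)))
```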
Note: If you have a log-linear model with a binary indicator variable, the interpretation for the coefficient on that variable changes.
Consider again
$$\log(y_i) = \beta_0 + \beta_1 x_{1i} + u_i$$
for binary variable $x_{1i}$.
The approximate interpretation of $\beta_1$ is as before:
when $x_1$ changes from 0 to 1, $y$ will change by $100 \times \beta_1$ percent.
#> 
#> Call:
#> lm(formula = log(price) ~ log(lotsize) + log(sqrft) + bdrms + 
#>     colonial, data = hprice1)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.69479 -0.09750 -0.01619  0.09151  0.70228 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  -1.34959    0.65104  -2.073   0.0413 *  
#> log(lotsize)  0.16782    0.03818   4.395 3.25e-05 ***
#> log(sqrft)    0.70719    0.09280   7.620 3.69e-11 ***
#> bdrms         0.02683    0.02872   0.934   0.3530    
#> colonial      0.05380    0.04477   1.202   0.2330    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.1841 on 83 degrees of freedom
#> Multiple R-squared:  0.6491, Adjusted R-squared:  0.6322 
#> F-statistic: 38.38 on 4 and 83 DF,  p-value: < 2.2e-16
Approximate: a colonial-style house sells for about $100 \times 0.054 \approx 5.4$ percent more, holding the other regressors fixed.
Exact: $100 \times (\exp(0.0538) - 1) \approx 5.5$ percent.
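Using the colonial coefficient reported in the output above:

```r
b_colonial <- 0.05380  # coefficient on the colonial dummy from the regression above

100 * b_colonial              # approximate effect: about 5.4 percent
100 * (exp(b_colonial) - 1)   # exact effect: about 5.5 percent
```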
Up to this point, we know OLS has some nice properties, and we know how to estimate an intercept and slope coefficient via OLS.
Our current workflow:
But how do we actually learn something from this exercise?
This is related to Intro Course material.
We need to be able to deal with uncertainty. Enter: Inference.
As our previous simulation pointed out, our problem with uncertainty is that we don't know whether our sample estimate is close or far from the unknown population parameter.†
However, all is not lost. We can use the errors ($e_i = y_i - \hat{y}_i$) to get a sense of how well our model explains the observed variation in $y$.
When our model appears to be doing a "nice" job, we might be a little more confident in using it to learn about the relationship between y and x.
Now we just need to formalize what a "nice job" actually means.
†: Except when we run the simulation ourselves—which is why we like simulations.
First off, we will estimate the variance of $u_i$ (recall: $Var(u_i) = \sigma^2$) using our squared errors, i.e.,
$$s^2 = \frac{\sum_i e_i^2}{n - k}$$
where $k$ gives the number of slope terms and intercepts that we estimate (e.g., $\beta_0$ and $\beta_1$ would give $k = 2$).
$s^2$ is an unbiased estimator of $\sigma^2$.
We know that the variance of $\hat\beta_1$ (for simple linear regression) is
$$Var(\hat\beta_1) = \frac{s^2}{\sum_i (x_i - \bar{x})^2}$$
which shows that the variance of our slope estimator shrinks as the variation in $x$ grows and grows with the variance of the disturbance.
More common: the standard error of $\hat\beta_1$,
$$\widehat{SE}(\hat\beta_1) = \sqrt{\frac{s^2}{\sum_i (x_i - \bar{x})^2}}$$
Recall: The standard error of an estimator is the standard deviation of the estimator's distribution.
Standard error output is standard in R's lm():
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    2.53     0.422       6.00 3.38e- 8
#> 2 x              0.567    0.0793      7.15 1.59e-10
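A sketch of computing $s^2$ and $\widehat{SE}(\hat\beta_1)$ by hand and checking them against lm(). The data set pop_df used on these slides is not reproduced here, so this example simulates its own data:

```r
library(tidyverse)
library(broom)

set.seed(12345)
sim_df <- tibble(
  x = runif(100, 0, 10),
  y = 2.5 + 0.5 * x + rnorm(100, sd = 2)
)

fit <- lm(y ~ x, data = sim_df)

# s^2 = sum of squared residuals / (n - k), with k = 2 (intercept and slope)
n  <- nrow(sim_df)
s2 <- sum(residuals(fit)^2) / (n - 2)

# SE(beta1_hat) = sqrt(s^2 / sum((x - xbar)^2))
se_b1 <- sqrt(s2 / sum((sim_df$x - mean(sim_df$x))^2))

se_b1
tidy(fit)  # the std.error reported for x should match se_b1
```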
We use the standard error of $\hat\beta_1$, along with $\hat\beta_1$ itself, to learn about the parameter $\beta_1$.
After deriving the distribution of $\hat\beta_1$,† we have two (related) options for formal statistical inference (learning) about our unknown parameter $\beta_1$:
Confidence intervals: Use the estimate and its standard error to create an interval that, when repeated, will generally†† contain the true parameter.
Hypothesis tests: Determine whether there is statistically significant evidence to reject a hypothesized value or range of values.
†: Hint: it's normal, with the mean and variance we've derived/discussed above.
††: E.g., Similarly constructed 95% confidence intervals will contain the true parameter 95% of the time.
We construct $(1-\alpha)$-level confidence intervals for $\beta_1$:
$$\hat\beta_1 \pm t_{\alpha/2,\,df} \, \widehat{SE}(\hat\beta_1)$$
$t_{\alpha/2,\,df}$ denotes the $\alpha/2$ quantile of a $t$ distribution with $n - k$ degrees of freedom.
For example, 100 observations and two coefficients (i.e., $\hat\beta_0$ and $\hat\beta_1$, so $k = 2$) with $\alpha = 0.05$ (for a 95% confidence interval) give us $t_{0.025,\,98} = -1.98$.
Example:
lm(y ~ x, data = pop_df) %>% tidy(conf.int = TRUE)
#> # A tibble: 2 × 7
#>   term        estimate std.error statistic  p.value conf.low conf.high
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#> 1 (Intercept)    2.53     0.422       6.00 3.38e- 8    1.69      3.37 
#> 2 x              0.567    0.0793      7.15 1.59e-10    0.410     0.724
Our 95% confidence interval is thus $0.567 \pm 1.98 \times 0.0793 = [0.410,\, 0.724]$.
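The same interval can be computed by hand from the reported estimate, its standard error, and the $t$ quantile (using the 100 observations and $k = 2$ from the worked example above):

```r
b1  <- 0.567    # estimate from the output above
se1 <- 0.0793   # its standard error
dof <- 98       # 100 observations minus k = 2 estimated coefficients

t_crit <- qt(0.975, df = dof)  # roughly 1.98
c(lower = b1 - t_crit * se1, upper = b1 + t_crit * se1)
# matches the conf.low and conf.high columns: about 0.410 and 0.724
```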
So we have a confidence interval for $\beta_1$: $[0.410,\, 0.724]$.
What does it mean?
Informally: The confidence interval gives us a region (interval) in which we can place some trust (confidence) for containing the parameter.
More formally: If we repeatedly sample from our population and construct confidence intervals for each of these samples, $(1-\alpha) \times 100$ percent of our intervals (e.g., 95%) will contain the population parameter somewhere in the interval.
Now back to our simulation...
We drew 10,000 samples (each of size $n = 30$) from our population and estimated our regression model for each of these samples:
$$y_i = \hat\beta_0 + \hat\beta_1 x_i + e_i$$
Now, let's estimate 95% confidence intervals for each of these samples...
From our previous simulation: 97.8% of our 95% confidence intervals contain the true parameter value of $\beta_1$.
That's a probabilistic statement: any single interval either contains the true parameter or it does not; the 95% refers to how often the procedure succeeds across repeated samples.
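A sketch of what such a coverage simulation might look like (the population parameters and number of replications here are illustrative; the slides' own simulation used 10,000 samples):

```r
library(tidyverse)
library(broom)

set.seed(2022)

# Draw one sample of size n, fit OLS, and check whether the 95% CI covers the true slope
one_draw <- function(n = 30, b0 = 2.5, b1 = 0.5) {
  sample_df <- tibble(
    x = runif(n, 0, 10),
    y = b0 + b1 * x + rnorm(n, sd = 2)
  )
  ci <- lm(y ~ x, data = sample_df) %>%
    tidy(conf.int = TRUE) %>%
    filter(term == "x")
  ci$conf.low <= b1 & b1 <= ci$conf.high
}

covered <- map_lgl(1:1000, ~ one_draw())
mean(covered)  # coverage rate: should be close to 0.95
```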
In many applications, we want to know more than a point estimate or a range of values. We want to know what our statistical evidence says about existing theories.
We want to test hypotheses posed by officials, politicians, economists, scientists, friends, weird neighbors, etc.
Examples
Hypothesis testing relies upon very similar results and intuition.
While uncertainty certainly exists, we can still build reliable statistical tests (rejecting or failing to reject a posited hypothesis).
OLS t test
Our (null) hypothesis states that $\beta_1$ equals a value $c$, i.e., $H_0: \beta_1 = c$.
From OLS's properties, we can show that the test statistic
$$t_{stat} = \frac{\hat\beta_1 - c}{\widehat{SE}(\hat\beta_1)}$$
follows the $t$ distribution with $n - k$ degrees of freedom.
For an $\alpha$-level, two-sided test, we reject the null hypothesis (and conclude in favor of the alternative hypothesis) when
$$|t_{stat}| > |t_{1-\alpha/2,\,df}|$$
meaning that our test statistic is more extreme than the critical value.
Alternatively, we can calculate the p-value that accompanies our test statistic, which effectively gives us the probability of seeing our test statistic or a more extreme test statistic if the null hypothesis were true.
Very small p-values (generally < 0.05) mean that it would be unlikely to see our results if the null hypothesis were really true; we tend to reject the null for p-values below 0.05.
R's and Stata's defaults test hypotheses against the value zero.
lm(y ~ x, data = pop_df) %>% tidy()
#> # A tibble: 2 × 5
#>   term        estimate std.error statistic  p.value
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 (Intercept)    2.53     0.422       6.00 3.38e- 8
#> 2 x              0.567    0.0793      7.15 1.59e-10
$H_0$: $\beta_1 = 0$ vs. $H_a$: $\beta_1 \neq 0$
$t_{stat} = 7.15$ and $t_{0.975,\,28} = 2.05$, which implies a p-value $< 0.05$.
Therefore, we reject $H_0$.
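The critical value and p-value quoted above are easy to reproduce by hand:

```r
t_stat <- 7.15  # test statistic from the output above
dof    <- 28    # degrees of freedom used on the slide

qt(0.975, df = dof)             # critical value: about 2.05
2 * pt(-abs(t_stat), df = dof)  # two-sided p-value: far below 0.05
```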
You will sometimes see F tests in econometrics.
We use F tests to test hypotheses that involve multiple parameters
(e.g., $\beta_1 = \beta_2$ or $\beta_3 + \beta_4 = 1$),
rather than a single simple hypothesis
(e.g., $\beta_1 = 0$, for which we would just use a t test).
Example
Economists love to say "Money is fungible."
Imagine that we might want to test whether money received as income actually has the same effect on consumption as money received from tax rebates/returns.
$$Consumption_i = \beta_0 + \beta_1 Income_i + \beta_2 Rebate_i + u_i$$
Example, continued
We can write our null hypothesis as
$$H_0: \beta_1 = \beta_2 \iff H_0: \beta_1 - \beta_2 = 0$$
Imposing this null hypothesis gives us the restricted model
$$Consumption_i = \beta_0 + \beta_1 Income_i + \beta_1 Rebate_i + u_i$$
$$Consumption_i = \beta_0 + \beta_1 (Income_i + Rebate_i) + u_i$$
Example, continued
To test the null hypothesis $H_0: \beta_1 = \beta_2$ against $H_a: \beta_1 \neq \beta_2$,
we use the F statistic
$$F_{q,\,n-k-1} = \frac{(SSE_r - SSE_u)/q}{SSE_u/(n-k-1)}$$
which (as its name suggests) follows the F distribution with $q$ numerator degrees of freedom and $n - k - 1$ denominator degrees of freedom.
Here, $q$ is the number of restrictions we impose via $H_0$.
Example, continued
The term $SSE_r$ is the sum of squared errors (SSE) from our restricted model:
$$Consumption_i = \beta_0 + \beta_1 (Income_i + Rebate_i) + u_i$$
and $SSE_u$ is the sum of squared errors (SSE) from our unrestricted model:
$$Consumption_i = \beta_0 + \beta_1 Income_i + \beta_2 Rebate_i + u_i$$
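A sketch of how one might carry out this F test in R. The consumption data here are simulated for illustration (with $\beta_1 = \beta_2$ true by construction); with real data, anova() on the restricted and unrestricted models performs the same computation:

```r
library(tidyverse)
library(broom)

set.seed(101)
n <- 500
consumption_df <- tibble(
  income      = runif(n, 20, 100),
  rebate      = runif(n, 0, 5),
  consumption = 5 + 0.8 * income + 0.8 * rebate + rnorm(n, sd = 4)
)

# Unrestricted model: separate coefficients on income and rebate
unrestricted <- lm(consumption ~ income + rebate, data = consumption_df)
# Restricted model: imposes beta1 = beta2 by combining the two regressors
restricted   <- lm(consumption ~ I(income + rebate), data = consumption_df)

# F statistic by hand (q = 1 restriction)
sse_u  <- sum(residuals(unrestricted)^2)
sse_r  <- sum(residuals(restricted)^2)
q      <- 1
df_u   <- unrestricted$df.residual  # n - k - 1
f_stat <- ((sse_r - sse_u) / q) / (sse_u / df_u)
f_stat
pf(f_stat, df1 = q, df2 = df_u, lower.tail = FALSE)  # p-value

# The same test via anova()
anova(restricted, unrestricted)
```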