
ScPoEconometrics: Advanced

Intro to Statistical Learning

Bluebery Planterose

SciencesPo Paris
2023-04-11

1 / 33

Intro to Statistical Learning: ISLR

  • This set of slides is based on the amazing book An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

  • I'll freely use some of their plots. They say that is ok if I put:

    Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

  • Thanks so much for putting that resource online for free.

  • We will try to look at their material with our econometrics background. It's going to be fun!

2 / 33

What is Statistical Learning?

  • We want to learn the relationship Y ~ X, where X has p components.

  • We assume a general form like Y=f(X)+ϵ

  • f is a fixed function, but we don't know what it looks like.

  • We want an estimate f^ for it.

  • Assume E[ϵ|x]=0!

  • I.e., we assume we have an identified model

  • We have done this 👈 many times before already.

  • But we restricted ourselves to OLS estimation. There are so many ways to estimate f!
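
To make this concrete, here is a minimal R sketch (my own toy example, not from ISLR) that simulates a DGP of the form Y = f(X) + ϵ with a made-up f, and then estimates f in two different ways: once by OLS and once by a flexible local regression.

```r
set.seed(1)

# a (made-up) true f and the DGP Y = f(X) + epsilon with E[eps | x] = 0
f <- function(x) sin(2 * x) + 0.5 * x
n <- 200
x <- runif(n, 0, 5)
y <- f(x) + rnorm(n, mean = 0, sd = 0.5)

# two of the many possible estimators of f:
fit_ols   <- lm(y ~ x)      # linear parametric fit
fit_loess <- loess(y ~ x)   # flexible non-parametric fit

# compare the two f-hats with the truth at a few points
grid <- data.frame(x = c(1, 2.5, 4))
cbind(truth = f(grid$x),
      ols   = predict(fit_ols, grid),
      loess = predict(fit_loess, grid))
```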

3 / 33

An Example of f




  • The blue shape is the true relationship f

  • Red dots are the observed data Y

  • Red dots are off the blue shape because of ϵ

4 / 33

What Do You Want To Do with your f^?

Fundamental Difference: (🚨 slight exaggerations ahead!)

Prediction (Machine Learning, AI)

  • generate Y^=f^(X)

  • f^ is a black box

  • We don't know or care why it works as long as the prediction is good

Inference (ECON)

  • Why does Y respond to X? (Causality)

  • How does Y respond to $X_p$? Interpret parameter estimates

  • f^ is not a black box.

  • (Out-of-sample) prediction is often a secondary concern.

5 / 33

What makes a Good prediction?

Remember the data generating process (DGP): Y=f(X)+ϵ

  • There are two (!) Errors:

    1. Reducible error f^
    2. Irreducible error ϵ
  • We can work to improve the Reducible error

  • The Irreducible error is a feature of the DGP, hence, nature. Life. Karma. Measurement incurs error.

  • The squared error for a given estimate f^ is $E\left[(Y - \hat{Y})^2\right]$: similar to the mean squared residuals!

  • One can easily show that this decomposes as $E\left[(f(X) + \epsilon - \hat{f}(X))^2\right] = \underbrace{\left[f(X) - \hat{f}(X)\right]^2}_{\text{Reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{Irreducible}}$ (the simulation below checks this).
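
A quick numerical check of this decomposition: a sketch with a made-up f and a deliberately poor, fixed stand-in for f^ (all numbers are illustrative only).

```r
set.seed(2)

f     <- function(x) sin(2 * x)   # made-up true f
f_hat <- function(x) 0.8 * x - 1  # some fixed (and bad) stand-in for an estimate of f
x0    <- 2                        # evaluate the error at a fixed point
sigma <- 0.5                      # sd of epsilon, so Var(eps) = 0.25

# simulate many draws of Y at x0 and compute the average squared error
eps <- rnorm(1e6, 0, sigma)
y0  <- f(x0) + eps
mean((y0 - f_hat(x0))^2)

# reducible + irreducible parts, which should match up to simulation noise
(f(x0) - f_hat(x0))^2 + sigma^2
```

Only the first term could be reduced by choosing a better f^; the second is there to stay.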

6 / 33

First Classification of Estimators

In general:

Nonlinear Models

  • More nonlinear models are able to get closer to the data.

  • Hence, they are good predictors

  • But hard to interpret

Linear Models

  • Easy to Interpret

  • Less tight fit to data

  • Worse prediction

7 / 33

How to Estimate an f?

Training Data

  1. n data points $i = 1, \dots, n$

  2. $y_i$ is i's response

  3. $X_i = (x_{i1}, \dots, x_{ip})$ are the predictors

  4. Data: $(X_1, y_1), \dots, (X_n, y_n)$

(Up until now, training data was the only data we have encountered!)

Estimate f^ = Learn f^

There are two broad classes of learning f^:

  1. Parametric Learning

  2. Non-Parametric Learning

8 / 33

Parametric Methods

9 / 33

Parametric Methods

Procedure

  1. We make a parametric assumption, i.e. we write down what we think f looks like, e.g. $Y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$. Here we only have to find p+1 numbers!

  2. We train the model, i.e. we choose the β's. We are pretty good at that -> OLS ✌️

Potential Issues

  • Typically, our model is not the true DGP. That is precisely why we want a model in the first place.

  • If our parametric assumption is a poor model of the true DGP, we will be far away from the truth. Kind of...logical.
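
As a concrete (hypothetical) illustration of the two steps above, here is a sketch in R with p = 3 simulated predictors, so that training means finding p + 1 = 4 numbers:

```r
set.seed(3)

# hypothetical data with p = 3 predictors
n  <- 100
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 1 * df$x2 + 0.5 * df$x3 + rnorm(n)

# step 1: parametric assumption y = b0 + b1*x1 + b2*x2 + b3*x3
# step 2: train, i.e. choose the betas by OLS
fit <- lm(y ~ x1 + x2 + x3, data = df)
coef(fit)   # the p + 1 = 4 estimated numbers
```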

10 / 33

A Parametric Model for f

  • The yellow plane is f^: $y = \beta_0 + \beta_1 \text{educ} + \beta_2 \text{sen}$

  • It's easy to interpret (need only 3 β's to draw this!)

  • Incurs substantial training error because it's a rigid plane (go back to blue shape to check true f).

11 / 33

Non-Parametric Methods

  • We make no explicit assumption about functional form.

  • We try to get as close as possible to the data points.

  • We try to do that under some constraints like:

    • Not too rough
    • Not too wiggly
  • Usually provides a good fit to the training data.

  • But it does not reduce the number of parameters!

  • Quite the contrary. The number of parameters increases so fast that those methods quickly run into feasibility issues (your computer can't run the model!)

12 / 33

A Non-Parametric Model for f


  • The yellow plane is a thin-plate spline

  • This clearly captures the shape of the true f (the blue one) better: Smaller Training Error.

  • But it's harder to interpret. Is income increasing with Seniority?
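
One way to fit such a surface yourself is a thin-plate regression spline via R's mgcv package. The sketch below uses simulated stand-ins for education and seniority, not the book's Income data:

```r
library(mgcv)   # provides gam() with thin-plate regression splines
set.seed(4)

# simulated stand-ins for the education / seniority example
n    <- 300
educ <- runif(n, 10, 22)
sen  <- runif(n, 0, 40)
inc  <- 20 + 3 * sin(educ / 2) + 0.1 * sen^1.5 + rnorm(n, sd = 5)

# thin-plate spline surface in (educ, sen); bs = "tp" is the thin-plate basis
fit_tp <- gam(inc ~ s(educ, sen, bs = "tp"))

# predicted income on a small grid of (educ, sen) combinations
newd <- expand.grid(educ = c(12, 16, 20), sen = c(10, 30))
predict(fit_tp, newd)
```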

13 / 33

Overfitting: Choosing Smoothness

  • We can choose the degree of flexibility or smoothness of our spline surface.

  • Here we increased flexibility so much that there is zero training error: spline goes through all points!

  • But it's a much wigglier surface now than before! Even harder to interpret.
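
The same flexibility choice is easy to see in one dimension with a smoothing spline, where the degrees of freedom play the role of the tuning parameter (hypothetical data):

```r
set.seed(5)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)

fit_smooth <- smooth.spline(x, y, df = 5)    # fairly rigid
fit_wiggly <- smooth.spline(x, y, df = 50)   # very flexible, close to interpolation

# training error drops as flexibility rises
mean((y - predict(fit_smooth, x)$y)^2)
mean((y - predict(fit_wiggly, x)$y)^2)
```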

14 / 33

Overfitting: Choosing Smoothness

Smooth, not wiggly

Smooth but high variance (wiggly!)

15 / 33

Overfitting: Over-doing it

  • You can see that the researcher has an active choice to make here: how smooth?

  • Parameters which guide choices like that are called tuning parameters.

  • As f^ becomes too variable, we say there is overfitting: The model tries too hard to fit patterns in the data, which are not part of the true f!

16 / 33

What Method To Aim For?

Why would we not always want the most flexible method available?

  • That's a reasonable question to ask.

  • The previous slide already gave a partial answer: more flexibility generally leads to more variability.

  • If we want to use our model outside of our training data set, that's an issue.

17 / 33

Classifying Methods 1: flexibility vs interpretability

  • This graph offers a nice classification of statistical learning methods in flexibility vs interpretability space.

  • Sometimes it's obvious what the right choice is for your application.

  • But often it's not. It's a more complicated tradeoff than the picture suggests.

  • (It's a very helpful picture!)

  • We will only be touching upon a small number of those. They are all nicely treated in the ISLR book though!

18 / 33

Classifying Methods 2: Supervised vs Unsupervised Learning

Supervised Learning

  • We have measures of input x and output y

  • We could predict new y's

  • Or infer things about Y ~ X

  • Regression or Classification are typical tasks

Unsupervised Learning

  • We have no measure of output y!

  • Only a bunch of x's

  • We are interested in grouping of those x (cluster analysis)

19 / 33

Clustering Example


  • Sometimes clustering is easy: in the left panel the data fall naturally into groups.

  • When data overlap, it's harder: right panel
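
A minimal sketch of this idea in R, using k-means on two simulated groups (not the data behind the figure):

```r
set.seed(6)

# two simulated groups of points in (x1, x2)
g1 <- cbind(rnorm(50, mean = 0), rnorm(50, mean = 0))
g2 <- cbind(rnorm(50, mean = 4), rnorm(50, mean = 4))
X  <- rbind(g1, g2)          # note: no outcome y anywhere

# ask k-means for 2 clusters
cl <- kmeans(X, centers = 2, nstart = 20)
table(cl$cluster, rep(c("group 1", "group 2"), each = 50))
```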

20 / 33

Assessing Model Accuracy

What is a good model?

21 / 33

Quality of Fit: the Mean Squared Error

  • We know the mean squared error (MSE) already: $MSE = \frac{1}{n}\sum_{i=1}^n \left(y_i - \hat{f}(x_i)\right)^2$
  • We encountered the closely related sum of squared residuals (SSR): $SSR = \sum_{i=1}^n \left(y_i - \hat{f}(x_i)\right)^2$

  • As we know, OLS minimizes the SSR. (minimizing SSR or MSE yields the same OLS estimates.)

  • However, what MSE 👈 really is: it's the training MSE! It's computed using the same data we used to compute f^!

  • Suppose we used data on the last 6 months of stock market prices and we want to predict future prices. We don't really care how well we can predict the past prices.

  • In general, we care about how f^ will perform on unseen data. We call this test data.

22 / 33

Training MSE vs Test MSE

Training

  • We have a training data set $\{(y_1, x_1), \dots, (y_n, x_n)\}$

  • We use those n observations to find the function q that minimizes the training MSE: $\hat{f} = \arg\min_q \frac{1}{n}\sum_{i=1}^n \left(y_i - q(x_i)\right)^2$

Testing

  • We want to know whether f^ will perform well on new data.

  • Suppose $(y_0, x_0)$ is unseen data - in particular, we haven't used it to train our model!

  • We want to know the magnitude of the test MSE: $E\left[(y_0 - \hat{f}(x_0))^2\right]$
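
A sketch of that logic in R: split simulated data into a training and a test part, fit on the training part only, and compare the two MSEs (the flexible degree-10 polynomial is just an illustrative choice):

```r
set.seed(7)

# simulate a data set and split it into training and test parts
n <- 200
x <- runif(n, 0, 5)
y <- sin(2 * x) + rnorm(n, sd = 0.4)
train <- sample(n, 150)                 # indices used for training
d_tr  <- data.frame(x = x[train],  y = y[train])
d_te  <- data.frame(x = x[-train], y = y[-train])

# train a fairly flexible model on the training data only
fit <- lm(y ~ poly(x, 10), data = d_tr)

# training MSE vs test MSE
mean((d_tr$y - predict(fit, d_tr))^2)
mean((d_te$y - predict(fit, d_te))^2)
```

The training MSE will typically come out noticeably smaller than the test MSE.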

23 / 33

A Problem of MSEs

  • In many cases we don't have a true test data set at hand.

  • Most methods therefore try to minimize the training MSE. (OLS does!)

  • At first sight this seems really reasonable.

  • The problem is that test and training MSE are less closely related than one might think!

  • Very small training MSEs might go together with pretty big test MSEs!

  • That is, most methods are really good at fitting the training data, but they fail to generalize outside of that set of points!

24 / 33

Simulation: We know the test data!

  • In an artificial setting we know the test data because we know the true f.

  • Here: the solid black line. 👉

  • Increasing flexibility mechanically reduces the training error (grey curve in the right panel).

  • However, not the test MSE in general (red curve!) - see the sketch below.
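
We can mimic such a simulation in a few lines of R (a sketch with a made-up f, not the figure's): fit polynomials of increasing degree and record both MSEs.

```r
set.seed(8)
f <- function(x) sin(2 * x)              # made-up true f

# one training and one (large) test sample from the same DGP
make_data <- function(n) {
  x <- runif(n, 0, 5)
  data.frame(x = x, y = f(x) + rnorm(n, sd = 0.4))
}
d_tr <- make_data(100)
d_te <- make_data(10000)

# training and test MSE for polynomial fits of increasing flexibility
res <- sapply(1:15, function(deg) {
  fit <- lm(y ~ poly(x, deg), data = d_tr)
  c(train = mean((d_tr$y - predict(fit, d_tr))^2),
    test  = mean((d_te$y - predict(fit, d_te))^2))
})
round(t(res), 3)   # training MSE falls steadily; test MSE is U-shaped
```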

25 / 33

Simulation: App!

  • Let's look at our app online or ScPoApps::launchApp("bias_variance_tradeoff")

26 / 33

So! A Tradeoff at Last!

  • What's going on here?

  • Initially, increasing flexibility provides a better fit to the observed data points, decreasing the training error.

  • That means that the test error also decreases for a while.

  • As soon as we start overfitting the data points, though, the test error starts to increase again!

  • At very high flexibility, our method tries to fit patterns in the data which are not part of the true f (the black line)!

  • To make matters worse, the extent of this phenomenon will depend on the shape of the underlying true f!

27 / 33

Almost linear f

  • In this example, the true f is almost linear.

  • The inflexible method does well!

  • Increasing flexibility incurs a large test MSE.

28 / 33

Highly Non-linear f

  • In this example, the true f is very non-linear.

  • The inflexible method does very poorly in both training and test MSE.

  • The model at 10 degrees of freedom performs best here.

  • 👉 You can see that the best model is not obvious to choose!

29 / 33

Formalizing the Bias-Variance-Tradeoff

  • We can decompose the expected test MSE as follows: $E\left[(y_0 - \hat{f}(x_0))^2\right] = \mathrm{Var}(\hat{f}(x_0)) + \left[\mathrm{Bias}(\hat{f}(x_0))\right]^2 + \mathrm{Var}(\epsilon)$

  • From this we can see that we have to minimize both variance and bias when choosing a suitable method.

  • We have seen before that those are competing forces in some situations.

  • Notice that the best we could ever achieve is $\mathrm{Var}(\epsilon) > 0$, since that is a feature of our DGP. (The sketch below checks this decomposition by simulation.)
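
Here is such a check in R (made-up f, fixed flexibility): repeatedly draw training sets, refit, and compare the simulated test MSE at x0 with the sum of its three components.

```r
set.seed(9)
f     <- function(x) sin(2 * x)    # made-up true f
sigma <- 0.4                       # sd of epsilon
x0    <- 2.5                       # point at which we evaluate the tradeoff

# re-estimate f-hat(x0) on many independent training sets
fhat_x0 <- replicate(2000, {
  x <- runif(100, 0, 5)
  y <- f(x) + rnorm(100, sd = sigma)
  predict(lm(y ~ poly(x, 4)), data.frame(x = x0))
})

# expected test MSE at x0 vs its three components
y0 <- f(x0) + rnorm(2000, sd = sigma)                 # fresh test outcomes at x0
mean((y0 - fhat_x0)^2)                                # left-hand side (approx.)
var(fhat_x0) + (mean(fhat_x0) - f(x0))^2 + sigma^2    # Var + Bias^2 + Var(eps)
```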

30 / 33

Bias-Variance-Tradeoff: What are Bias and Variance?

Variance

  • How much would f^ change if we estimated it using a different data set?

  • Clearly we expect some variation when using different samples (sampling variation), but not too much.

  • Flexible models: moving just a single data point will result in a large change in f^.

Bias

  • The difference between f^ and f (notice the missing ϵ).

  • We approximate a potentially very complex real phenomenon by a simple model, e.g. linear model.

  • If the true f is highly non-linear, a linear model will be biased.

  • In general: more flexible means lower bias but higher variance.

31 / 33

Bias Variance Tradeoff vs Flexibility

  • Here 👉 we illustrate this for the preceding 3 true f's

  • The precise tradeoff depends on f's shape.

  • Bias declines with flexibility.

  • Test MSE is U-shaped, variance is increasing.

32 / 33

END

bluebery.planterose@sciencespo.fr
Original Slides from Florian Oswald
Book
@ScPoEcon
@ScPoEcon
33 / 33
