This set of slides is based on the amazing book An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
I'll freely use some of their plots. They say that is ok if I put:
Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani
Thanks so much for putting that resource online for free.
We will try to look at their material with our econometrics background. It's going to be fun!
We want to learn the relationship $Y \sim X$, where $X$ has $p$ components.
We assume a general form like $Y = f(X) + \epsilon$
$f$ is a fixed function, but we don't know what it looks like.
We want an estimate $\hat{f}$ for it.
Assume $E[\epsilon|x] = 0$!
I.e., we assume an identified model
We have done this 👈 many times before already.
But we restricted ourselves to OLS estimation. There are so many ways to estimate f!
The blue shape is the true relationship $f$
Red dots are observed data: $Y$
Red dots are off the blue shape because of $\epsilon$
Fundamental Difference: (🚨 slight exaggerations ahead!)
generate $\hat{Y} = \hat{f}(X)$
$\hat{f}$ is a black box
We don't know or care why it works as long as the prediction is good
Why does Y respond to X? (Causality)
How does $Y$ respond to $x_p$? Interpret parameter estimates
$\hat{f}$ is not a black box.
(Out of sample) Prediction often secondary concern.
Remember the data generating process (DGP): $Y = f(X) + \epsilon$
There are two (!) Errors:
We can work to improve the Reducible error
The Irreducible error is a feature of the DGP, hence, nature. Life. Karma. Measurement incurs error.
The squared error for a given estimate $\hat{f}$ is $E[(Y - \hat{Y})^2]$: similar to the mean squared residuals!
One can easily show that this factors as $$E\left[\left(f(X) + \epsilon - \hat{f}(X)\right)^2\right] = \underbrace{\left[f(X) - \hat{f}(X)\right]^2}_{\text{Reducible}} + \underbrace{Var(\epsilon)}_{\text{Irreducible}}$$
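A quick way to see this decomposition at work is to simulate it. The sketch below uses a made-up true f and a made-up estimate f_hat (both hypothetical, not from the book) and checks that the simulated squared error matches the reducible part plus $Var(\epsilon)$.

```r
# Hypothetical sketch: check the reducible/irreducible decomposition at a fixed x0
set.seed(1)
f     <- function(x) 2 + 3 * x      # true f (known only because we simulate)
f_hat <- function(x) 1.5 + 3.2 * x  # some imperfect estimate of f
x0    <- 1
sigma <- 0.5                        # sd of epsilon

eps <- rnorm(1e5, mean = 0, sd = sigma)
y0  <- f(x0) + eps                  # many draws of Y at x = x0

mean((y0 - f_hat(x0))^2)            # simulated E[(Y - f_hat(X))^2]
(f(x0) - f_hat(x0))^2 + sigma^2     # reducible^2 + irreducible: (roughly) the same number
```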
In general:
More nonlinear models are able to get closer to the data. Hence, they are good predictors, but hard to interpret.
Less flexible (e.g. linear) models are easy to interpret, but their less tight fit to the data means worse prediction.
$n$ data points $i = 1, \dots, n$
$y_i$ is $i$'s response
$x_i = (x_{i1}, \dots, x_{ip})$ are predictors
Data: $(x_1, y_1), \dots, (x_n, y_n)$
(Up until now, training data was the only data we have encountered!)
There are two broad classes of learning ^f:
Parametric Learning
Non-Parametric Learning
We make a parametric assumption, i.e. we write down what we think $f$ looks like. E.g. $Y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$. Here we only have to find $p+1$ numbers!
We train the model, i.e. we choose the $\beta$'s. We are pretty good at that -> OLS ✌️
Typically, our model is not the true DGP. That is precisely why we want a model in the first place: it's a deliberate simplification.
If our parametric assumption is a poor model of the true DGP, we will be far away from the truth. Kind of...logical.
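As a reminder of what "training" means in the parametric case, here is a minimal sketch in R with simulated data (all variable names and numbers are made up): OLS picks the $\beta$'s that minimize the sum of squared residuals.

```r
# Hypothetical sketch: a linear parametric assumption, trained by OLS
set.seed(2)
n    <- 200
educ <- runif(n, 10, 20)
sen  <- runif(n, 0, 30)
y    <- 10 + 2 * educ + 0.5 * sen + rnorm(n, sd = 5)  # simulated DGP

fit <- lm(y ~ educ + sen)  # "training" = choosing the betas
coef(fit)                  # only p + 1 = 3 numbers describe f_hat
```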
The yellow plane is $\hat{f}$: $y = \beta_0 + \beta_1 educ + \beta_2 sen$
It's easy to interpret (need only 3 $\beta$'s to draw this!)
Incurs substantial training error because it's a rigid plane (go back to blue shape to check true f).
We make no explicit assumption about the functional form.
We try to get as close as possible to the data points.
We try to do that under some constraints, like a certain degree of smoothness.
Usually provides a good fit to the training data.
But it does not reduce the number of parameters!
Quite the contrary. The number of parameters increases so fast that those methods quickly run into feasibility issues (your computer can't run the model!)
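For concreteness, one way to fit a thin-plate spline in R is `mgcv::gam` with a `bs = "tp"` smooth. The sketch below uses simulated data (not the book's Income data) and is only meant to show the mechanics.

```r
# Sketch: a thin-plate spline surface with mgcv (simulated data, not the book's)
library(mgcv)
set.seed(3)
n    <- 300
educ <- runif(n, 10, 20)
sen  <- runif(n, 0, 30)
inc  <- 20 + 3 * sin(educ / 2) + 0.1 * sen^1.5 + rnorm(n, sd = 2)  # non-linear truth

fit_tp <- gam(inc ~ s(educ, sen, bs = "tp"))  # bivariate thin-plate smooth
sum(fit_tp$edf)                               # effective number of parameters: far more than 3
```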
The yellow plane is a thin-plate spline
This clearly captures the shape of the true f (the blue one) better: Smaller Training Error.
But it's harder to interpret. Is income increasing with Seniority?
We can choose the degree of flexibility or smoothness of our spline surface.
Here we increased flexibility so much that there is zero training error: spline goes through all points!
But it's a much wigglier surface now than before! Even harder to interpret.
You can see that the researcher has an active choice to make here: how smooth?
Parameters which guide choices like that are called tuning parameters.
As $\hat{f}$ becomes too variable, we say there is overfitting: the model tries too hard to fit patterns in the data which are not part of the true $f$!
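In one dimension the same choice shows up as, for example, the `df` argument of `smooth.spline`. A hypothetical sketch (simulated data, arbitrary df values):

```r
# Sketch: a tuning parameter controls how wiggly f_hat is
set.seed(4)
x <- sort(runif(60, 0, 10))
y <- sin(x) + rnorm(60, sd = 0.3)

fit_smooth <- smooth.spline(x, y, df = 4)   # rigid: low flexibility
fit_wiggly <- smooth.spline(x, y, df = 40)  # wiggly: close to interpolating the points

plot(x, y)
lines(predict(fit_smooth), col = "blue")
lines(predict(fit_wiggly), col = "red")
```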
Why would we not always want the most flexible method available?
that's a reasonable question to ask.
The previous slide already gave a partial answer: more flexibility generally leads to more variability.
If we want to use our model outside of our training data set, that's an issue.
This graph offers a nice classification of statistical learning methods in flexibility vs interpretability space.
Sometimes it's obvious what the right choice is for your application.
But often it's not. It's a more complicated tradeoff than the picture suggests.
(It's a very helpful picture!)
We will only be touching upon a small number of those. They are all nicely treated in the ISLR book though!
We have measures of input x and output y
We could predict new y's
Or infer things about Y ~ X
Regression or Classification are typical tasks
We have no measure of output y!
Only a bunch of x's
We are interested in grouping of those x (cluster analysis)
Sometimes clustering is easy: in the left panel the data fall naturally into groups.
When data overlap, it's harder: right panel
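As a small illustration of unsupervised learning, here is a hedged sketch of k-means clustering on simulated, unlabeled data (two groups by construction; group membership is never shown to the algorithm):

```r
# Sketch: k-means on simulated, unlabeled data (two well-separated groups)
set.seed(5)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))

cl <- kmeans(x, centers = 2)
table(cl$cluster)                    # size of each recovered group
plot(x, col = cl$cluster, pch = 19)  # note: no y was used anywhere
```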
We encountered the closely related sum of squared residuals (SSR): $SSR = \sum_{i=1}^n \left(y_i - \hat{f}(x_i)\right)^2$
As we know, OLS minimizes the SSR. (Minimizing SSR or MSE yields the same OLS estimates.)
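In R, the training MSE of an OLS fit comes straight out of the residuals. A minimal sketch with made-up data:

```r
# Sketch: SSR and training MSE of an OLS fit (hypothetical data)
set.seed(6)
x   <- runif(100)
y   <- 1 + 2 * x + rnorm(100, sd = 0.5)
fit <- lm(y ~ x)

ssr <- sum(resid(fit)^2)   # sum of squared residuals
mse <- mean(resid(fit)^2)  # training MSE = SSR / n
c(SSR = ssr, trainMSE = mse)
```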
However, what MSE 👈 really is: it's the training MSE! It's computed using the same data we used to compute $\hat{f}$!
Suppose we used data on last 6 months of stock market prices and we want to predict future prices. We don't really care how well we can predict the past prices.
In general, we care about how ^f will perform on unseen data. We call this test data.
We have a training data set $\{(y_1, x_1), \dots, (y_n, x_n)\}$
We use those $n$ observations to find the function $q$ that minimizes the training MSE: $$\hat{f} = \arg\min_q MSE, \quad MSE = \frac{1}{n}\sum_{i=1}^n \left(y_i - q(x_i)\right)^2$$
We want to know whether ^f will perform well on new data.
Suppose $(y_0, x_0)$ is unseen data - in particular, we haven't used it to train our model!
We want to know the magnitude of the test MSE: $E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]$
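A hedged sketch of how one would compute both quantities when a hold-out set is available (simulated data, arbitrary split and polynomial degree):

```r
# Sketch: training MSE vs test MSE with a held-out set
set.seed(7)
n     <- 120
dat   <- data.frame(x = runif(n, 0, 10))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.4)
train <- sample(n, 80)  # 80 observations for training, 40 held out

fit <- lm(y ~ poly(x, 5), data = dat[train, ])
mse_train <- mean((dat$y[train]  - predict(fit))^2)
mse_test  <- mean((dat$y[-train] - predict(fit, newdata = dat[-train, ]))^2)
c(train = mse_train, test = mse_test)
```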
In many cases we don't have a true test data set at hand.
Most methods therefore try to minimize the training MSE. (OLS does!)
At first sight this seems really reasonable.
The problem is that test and training MSE are less closely related than one might think!
Very small training MSEs might go together with pretty big test MSEs!
That is, most methods are really good at fitting the training data, but they fail to generalize outside of that set of points!
In an artificial setting we know the test MSE because we know the true f.
Here: the solid black line. 👉
Increasing flexibility mechanically reduces the training error (grey curve in the right panel).
However, not the test MSE, in general (red curve!)
ScPoApps::launchApp("bias_variance_tradeoff")
What's going on here?
Initially, increasing flexibility provides a better fit to the observed data points, decreasing the training error.
That means that also the test error decreases for a while.
As soon as we start overfitting the data points, though, the test error starts to increase again!
At very high flexibility, our method tries to fit patterns in the data which are not part of the true f (the black line)!
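The pattern is easy to reproduce in a simulation. In the sketch below (hypothetical true f, polynomial degree as the measure of flexibility), the training MSE keeps falling with the degree while the test MSE eventually turns back up:

```r
# Sketch: training MSE falls with flexibility, test MSE is U-shaped
set.seed(8)
f  <- function(x) sin(1.5 * x)
n  <- 100
x  <- runif(n, 0, 6);  y  <- f(x)  + rnorm(n, sd = 0.5)  # training data
x0 <- runif(n, 0, 6);  y0 <- f(x0) + rnorm(n, sd = 0.5)  # test data

degrees <- 1:12  # polynomial degree = flexibility
mse <- sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d))
  c(train = mean((y  - fitted(fit))^2),
    test  = mean((y0 - predict(fit, newdata = data.frame(x = x0)))^2))
})
round(mse, 3)
```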
To make matters worse, the extent of this phenomenon will depend on the shape of the underlying true f!
In this example, the true f is almost linear.
The inflexible method does well!
Increasing flexibility incurs large testing MSE.
In this example, the true f is very non linear.
The inflexible method does very poorly in both training and testing MSE.
The model at 10 degrees of freedom performs best here.
👉 You can see that the best model is not obvious to choose!
We can decompose the expected test MSE as follows: $$E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = Var\left(\hat{f}(x_0)\right) + \left[Bias\left(\hat{f}(x_0)\right)\right]^2 + Var(\epsilon)$$
From this we can see that we have to minimize both variance and bias when choosing a suitable method.
We have seen before that those are competing forces in some situations.
Notice that the best we could achieve is $Var(\epsilon) > 0$ since that is a feature of our DGP.
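Both pieces can be estimated by brute force in a simulation: refit the model on many training samples and look at the spread and the average of $\hat{f}(x_0)$. A sketch under made-up assumptions (a linear fit to a non-linear truth, so the bias term is visible):

```r
# Sketch: variance and squared bias of f_hat(x0) across many training samples
set.seed(9)
f  <- function(x) sin(2 * x)
x0 <- 1  # evaluation point

sims <- replicate(500, {
  x   <- runif(50, 0, 3)
  y   <- f(x) + rnorm(50, sd = 0.3)
  fit <- lm(y ~ x)  # deliberately too rigid: a linear fit to a non-linear f
  predict(fit, newdata = data.frame(x = x0))
})

c(variance    = var(sims),
  bias_sq     = (mean(sims) - f(x0))^2,
  irreducible = 0.3^2)
```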
How much would ^f change if we estimated it using a different data set?
Clearly we expect some variation when using different samples (sampling variation), but not too much.
Flexible models: moving just a single data point will result in a large change in $\hat{f}$.
The difference between $\hat{f}$ and $f$ (notice the missing $\epsilon$).
We approximate a potentially very complex real phenomenon by a simple model, e.g. linear model.
If true model highly non-linear, linear model will be biased.
General: more flexible, lower bias but higher variance.
Here 👉 we illustrate this for the preceding 3 true f's.
The precise tradeoff depends on f's shape.
Bias declines with flexibility.
Test MSE is U-shaped, Var increasing.