
ScPoEconometrics: Advanced

Intro to Statistical Learning

Bluebery Planterose

SciencesPo Paris
2023-04-11

1 / 33

Intro to Statistical Learning: ISLR

  • This set of slides is based on the amazing book An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.

  • I'll freely use some of their plots. They say that is ok if I put:

    Some of the figures in this presentation are taken from "An Introduction to Statistical Learning, with applications in R" (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani

  • Thanks so much for putting that resource online for free.

  • We will try to look at their material with our econometrics background. It's going to be fun!

2 / 33

What is Statistical Learning?

  • We want to learn the relationship Y ~ X, where X has p components.

  • We assume a general form like Y=f(X)+ϵ

  • f is a fixed function, but we don't know what it looks like.

  • We want an estimate f^ for it.

  • Assume E[ϵ|x]=0!

  • I.e., we assume we have an identified model

  • We have done this 👈 many times before already.

  • But we restricted ourselves to OLS estimation. There are so many ways to estimate f!
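
To make this concrete, here is a minimal R sketch (my own toy example, not from ISLR) that simulates a DGP of the form Y = f(X) + ϵ with a made-up f, and then estimates f in two different ways: once by OLS and once by a flexible local regression.

```r
set.seed(1)

# a (made-up) true f and the DGP Y = f(X) + epsilon with E[eps | x] = 0
f <- function(x) sin(2 * x) + 0.5 * x
n <- 200
x <- runif(n, 0, 5)
y <- f(x) + rnorm(n, mean = 0, sd = 0.5)

# two of the many possible estimators of f:
fit_ols   <- lm(y ~ x)      # linear parametric fit
fit_loess <- loess(y ~ x)   # flexible non-parametric fit

# compare the two f-hats with the truth at a few points
grid <- data.frame(x = c(1, 2.5, 4))
cbind(truth = f(grid$x),
      ols   = predict(fit_ols, grid),
      loess = predict(fit_loess, grid))
```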

3 / 33

An Example of f




  • The blue shape is the true relationship f

  • Red dots are the observed data Y

  • Red dots are off the blue shape because of ϵ

4 / 33

What Do You Want To Do with your f^?

Fundamental Difference: (🚨 slight exaggerations ahead!)

Prediction (Machine Learning, AI)

  • generate Y^=f^(X)

  • f^ is a black box

  • We don't know or care why it works as long as the prediction is good

Inference (ECON)

  • Why does Y respond to X? (Causality)

  • How does Y respond to $X_p$? Interpret parameter estimates

  • f^ is not a black box.

  • (Out-of-sample) prediction is often a secondary concern.

5 / 33

What makes a Good prediction?

Remember the data generating process (DGP): Y=f(X)+ϵ

  • There are two (!) Errors:

    1. Reducible error f^
    2. Irreducible error ϵ
  • We can work to improve the Reducible error

  • The Irreducible error is a feature of the DGP, hence, nature. Life. Karma. Measurement incurs error.

  • The squared error for a given estimate f^ is $E\left[(Y - \hat{Y})^2\right]$: similar to the mean squared residuals!

  • One can easily show that this decomposes as $E\left[(f(X) + \epsilon - \hat{f}(X))^2\right] = \underbrace{\left[f(X) - \hat{f}(X)\right]^2}_{\text{Reducible}} + \underbrace{\mathrm{Var}(\epsilon)}_{\text{Irreducible}}$ (the simulation below checks this).
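
A quick numerical check of this decomposition: a sketch with a made-up f and a deliberately poor, fixed stand-in for f^ (all numbers are illustrative only).

```r
set.seed(2)

f     <- function(x) sin(2 * x)   # made-up true f
f_hat <- function(x) 0.8 * x - 1  # some fixed (and bad) stand-in for an estimate of f
x0    <- 2                        # evaluate the error at a fixed point
sigma <- 0.5                      # sd of epsilon, so Var(eps) = 0.25

# simulate many draws of Y at x0 and compute the average squared error
eps <- rnorm(1e6, 0, sigma)
y0  <- f(x0) + eps
mean((y0 - f_hat(x0))^2)

# reducible + irreducible parts, which should match up to simulation noise
(f(x0) - f_hat(x0))^2 + sigma^2
```

Only the first term could be reduced by choosing a better f^; the second is there to stay.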

6 / 33

First Classification of Estimators

In general:

Nonlinear Models

  • More nonlinear models are able to get closer to the data.

  • Hence, they are good predictors

  • But hard to interpret

Linear Models

  • Easy to Interpret

  • Less tight fit to data

  • Worse prediction

7 / 33

How to Estimate an f?

Training Data

  1. n data points $i = 1, \dots, n$

  2. $y_i$ is i's response

  3. $X_i = (x_{i1}, \dots, x_{ip})$ are the predictors

  4. Data: $(X_1, y_1), \dots, (X_n, y_n)$

(Up until now, training data was the only data we have encountered!)

Estimate f^ = Learn f^

There are two broad classes of learning f^:

  1. Parametric Learning

  2. Non-Parametric Learning

8 / 33

Parametric Methods

9 / 33

Parametric Methods

Procedure

  1. We make a parametric assumption, i.e. we write down what we think f looks like, e.g. $Y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$. Here we only have to find p+1 numbers!

  2. We train the model, i.e. we choose the β's. We are pretty good at that -> OLS ✌️

Potential Issues

  • Typically, our model is not the true DGP. That is precisely why we want a model in the first place.

  • If our parametric assumption is a poor model of the true DGP, we will be far away from the truth. Kind of...logical.
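
As a concrete (hypothetical) illustration of the two steps above, here is a sketch in R with p = 3 simulated predictors, so that training means finding p + 1 = 4 numbers:

```r
set.seed(3)

# hypothetical data with p = 3 predictors
n  <- 100
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 1 + 2 * df$x1 - 1 * df$x2 + 0.5 * df$x3 + rnorm(n)

# step 1: parametric assumption y = b0 + b1*x1 + b2*x2 + b3*x3
# step 2: train, i.e. choose the betas by OLS
fit <- lm(y ~ x1 + x2 + x3, data = df)
coef(fit)   # the p + 1 = 4 estimated numbers
```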

10 / 33

A Parametric Model for f

  • The yellow plane is f^: $y = \beta_0 + \beta_1 \text{educ} + \beta_2 \text{sen}$

  • It's easy to interpret (need only 3 β's to draw this!)

  • Incurs substantial training error because it's a rigid plane (go back to blue shape to check true f).

11 / 33

Non-Parametric Methods

  • We make no explicit assumption about functional form.

  • We try to get as close as possible to the data points.

  • We try to do that under some constraints like:

    • Not too rough
    • Not too wiggly
  • Usually provides a good fit to the training data.

  • But it does not reduce the number of parameters!

  • Quite the contrary. The number of parameters increases so fast that those methods quickly run into feasibility issues (your computer can't run the model!)

12 / 33

A Non-Parametric Model for f


  • The yellow plane is a thin-plate spline

  • This clearly captures the shape of the true f (the blue one) better: Smaller Training Error.

  • But it's harder to interpret. Is income increasing with Seniority?
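
One way to fit such a surface yourself is a thin-plate regression spline via R's mgcv package. The sketch below uses simulated stand-ins for education and seniority, not the book's Income data:

```r
library(mgcv)   # provides gam() with thin-plate regression splines
set.seed(4)

# simulated stand-ins for the education / seniority example
n    <- 300
educ <- runif(n, 10, 22)
sen  <- runif(n, 0, 40)
inc  <- 20 + 3 * sin(educ / 2) + 0.1 * sen^1.5 + rnorm(n, sd = 5)

# thin-plate spline surface in (educ, sen); bs = "tp" is the thin-plate basis
fit_tp <- gam(inc ~ s(educ, sen, bs = "tp"))

# predicted income on a small grid of (educ, sen) combinations
newd <- expand.grid(educ = c(12, 16, 20), sen = c(10, 30))
predict(fit_tp, newd)
```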

13 / 33

Overfitting: Choosing Smoothness

  • We can choose the degree of flexibility or smoothness of our spline surface.

  • Here we increased flexibility so much that there is zero training error: spline goes through all points!

  • But it's a much wigglier surface now than before! Even harder to interpret.
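
The same flexibility choice is easy to see in one dimension with a smoothing spline, where the degrees of freedom play the role of the tuning parameter (hypothetical data):

```r
set.seed(5)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)

fit_smooth <- smooth.spline(x, y, df = 5)    # fairly rigid
fit_wiggly <- smooth.spline(x, y, df = 50)   # very flexible, close to interpolation

# training error drops as flexibility rises
mean((y - predict(fit_smooth, x)$y)^2)
mean((y - predict(fit_wiggly, x)$y)^2)
```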

14 / 33

Overfitting: Choosing Smoothness

Smooth, not wiggly

Smooth but high variance (wiggly!)

15 / 33

Overfitting: Over-doing it

  • You can see that the researcher has an active choice to make here: how smooth?

  • Parameters which guide choices like that are called tuning parameters.

  • As f^ becomes too variable, we say there is overfitting: The model tries too hard to fit patterns in the data, which are not part of the true f!

16 / 33

What Method To Aim For?

Why would we not always want the most flexible method available?

  • That's a reasonable question to ask.

  • The previous slide already gave a partial answer: more flexibility generally leads to more variability.

  • If we want to use our model outside of our training data set, that's an issue.

17 / 33

Classifying Methods 1: flexibility vs interpretability

  • This graph offers a nice classification of statistical learning methods in flexibility vs interpretability space.

  • Sometimes it's obvious what the right choice is for your application.

  • But often it's not. It's a more complicated tradeoff than the picture suggests.

  • (It's a very helpful picture!)

  • We will only be touching upon a small number of those. They are all nicely treated in the ISLR book though!

18 / 33

Classifying Methods 2: Supervised vs Unsupervised Learning

Supervised Learning

  • We have measures of input x and output y

  • We could predict new y's

  • Or infer things about Y ~ X

  • Regression or Classification are typical tasks

Unsupervised Learning

  • We have no measure of output y!

  • Only a bunch of x's

  • We are interested in grouping of those x (cluster analysis)

19 / 33

Clustering Example


  • Sometimes clustering is easy: in the left panel the data fall naturally into groups.

  • When data overlap, it's harder: right panel
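
A minimal sketch of this idea in R, using k-means on two simulated groups (not the data behind the figure):

```r
set.seed(6)

# two simulated groups of points in (x1, x2)
g1 <- cbind(rnorm(50, mean = 0), rnorm(50, mean = 0))
g2 <- cbind(rnorm(50, mean = 4), rnorm(50, mean = 4))
X  <- rbind(g1, g2)          # note: no outcome y anywhere

# ask k-means for 2 clusters
cl <- kmeans(X, centers = 2, nstart = 20)
table(cl$cluster, rep(c("group 1", "group 2"), each = 50))
```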

20 / 33

Assessing Model Accuracy

What is a good model?

21 / 33

Quality of Fit: the Mean Squared Error

  • We know the mean squared error (MSE) already: $MSE = \frac{1}{n}\sum_{i=1}^n \left(y_i - \hat{f}(x_i)\right)^2$
  • We encountered the closely related sum of squared residuals (SSR): $SSR = \sum_{i=1}^n \left(y_i - \hat{f}(x_i)\right)^2$

  • As we know, OLS minimizes the SSR. (minimizing SSR or MSE yields the same OLS estimates.)

  • However, what MSE 👈 really is: it's the training MSE! It's computed using the same data we used to compute f^!

  • Suppose we used data on the last 6 months of stock market prices and we want to predict future prices. We don't really care how well we can predict the past prices.

  • In general, we care about how f^ will perform on unseen data. We call this test data.

22 / 33

Training MSE vs Test MSE

Training

  • We have a training data set $\{(y_1, x_1), \dots, (y_n, x_n)\}$

  • We use those n observations to find the function q that minimizes the training MSE: $\hat{f} = \arg\min_q \frac{1}{n}\sum_{i=1}^n \left(y_i - q(x_i)\right)^2$

Testing

  • We want to know whether f^ will perform well on new data.

  • Suppose $(y_0, x_0)$ is unseen data - in particular, we haven't used it to train our model!

  • We want to know the magnitude of the test MSE: $E\left[(y_0 - \hat{f}(x_0))^2\right]$
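
A sketch of that logic in R: split simulated data into a training and a test part, fit on the training part only, and compare the two MSEs (the flexible degree-10 polynomial is just an illustrative choice):

```r
set.seed(7)

# simulate a data set and split it into training and test parts
n <- 200
x <- runif(n, 0, 5)
y <- sin(2 * x) + rnorm(n, sd = 0.4)
train <- sample(n, 150)                 # indices used for training
d_tr  <- data.frame(x = x[train],  y = y[train])
d_te  <- data.frame(x = x[-train], y = y[-train])

# train a fairly flexible model on the training data only
fit <- lm(y ~ poly(x, 10), data = d_tr)

# training MSE vs test MSE
mean((d_tr$y - predict(fit, d_tr))^2)
mean((d_te$y - predict(fit, d_te))^2)
```

The training MSE will typically come out noticeably smaller than the test MSE.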

23 / 33

A Problem of MSEs

  • In many cases we don't have a true test data set at hand.

  • Most methods therefore try to minimize the training MSE. (OLS does!)

  • At first sight this seems really reasonable.

  • The problem is that test and training MSE are less closely related than one might think!

  • Very small training MSEs might go together with pretty big test MSEs!

  • That is, most methods are really good at fitting the training data, but they fail to generalize outside of that set of points!

24 / 33

Simulation: We know the test data!

  • In an artificial setting we know the test data because we know the true f.

  • Here: the solid black line. 👉

  • Increasing flexibility mechanically reduces the training error (grey curve in the right panel).

  • However, not the test MSE in general (red curve!) - see the sketch below.
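
We can mimic such a simulation in a few lines of R (a sketch with a made-up f, not the figure's): fit polynomials of increasing degree and record both MSEs.

```r
set.seed(8)
f <- function(x) sin(2 * x)              # made-up true f

# one training and one (large) test sample from the same DGP
make_data <- function(n) {
  x <- runif(n, 0, 5)
  data.frame(x = x, y = f(x) + rnorm(n, sd = 0.4))
}
d_tr <- make_data(100)
d_te <- make_data(10000)

# training and test MSE for polynomial fits of increasing flexibility
res <- sapply(1:15, function(deg) {
  fit <- lm(y ~ poly(x, deg), data = d_tr)
  c(train = mean((d_tr$y - predict(fit, d_tr))^2),
    test  = mean((d_te$y - predict(fit, d_te))^2))
})
round(t(res), 3)   # training MSE falls steadily; test MSE is U-shaped
```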

25 / 33

Simulation: App!

  • Let's look at our app online or ScPoApps::launchApp("bias_variance_tradeoff")

26 / 33

So! A Tradeoff at Last!

  • What's going on here?

  • Initially, increasing flexibility provides a better fit to the observed data points, decreasing the training error.

  • That means that the test error also decreases for a while.

  • As soon as we start overfitting the data points, though, the test error starts to increase again!

  • At very high flexibility, our method tries to fit patterns in the data which are not part of the true f (the black line)!

  • To make matters worse, the extent of this phenomenon will depend on the shape of the underlying true f!

27 / 33

Almost linear f

  • In this example, the true f is almost linear.

  • The inflexible method does well!

  • Increasing flexibility incurs a large test MSE.

28 / 33

Highly Non-linear f

  • In this example, the true f is very non-linear.

  • The inflexible method does very poorly in both training and test MSE.

  • The model at 10 degrees of freedom performs best here.

  • 👉 You can see that the best model is not obvious to choose!

29 / 33

Formalizing the Bias-Variance-Tradeoff

  • We can decompose the expected test MSE as follows: $E\left[(y_0 - \hat{f}(x_0))^2\right] = \mathrm{Var}(\hat{f}(x_0)) + \left[\mathrm{Bias}(\hat{f}(x_0))\right]^2 + \mathrm{Var}(\epsilon)$

  • From this we can see that we have to minimize both variance and bias when choosing a suitable method.

  • We have seen before that those are competing forces in some situations.

  • Notice that the best we could ever achieve is $\mathrm{Var}(\epsilon) > 0$, since that is a feature of our DGP. (The sketch below checks this decomposition by simulation.)
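
Here is such a check in R (made-up f, fixed flexibility): repeatedly draw training sets, refit, and compare the simulated test MSE at x0 with the sum of its three components.

```r
set.seed(9)
f     <- function(x) sin(2 * x)    # made-up true f
sigma <- 0.4                       # sd of epsilon
x0    <- 2.5                       # point at which we evaluate the tradeoff

# re-estimate f-hat(x0) on many independent training sets
fhat_x0 <- replicate(2000, {
  x <- runif(100, 0, 5)
  y <- f(x) + rnorm(100, sd = sigma)
  predict(lm(y ~ poly(x, 4)), data.frame(x = x0))
})

# expected test MSE at x0 vs its three components
y0 <- f(x0) + rnorm(2000, sd = sigma)                 # fresh test outcomes at x0
mean((y0 - fhat_x0)^2)                                # left-hand side (approx.)
var(fhat_x0) + (mean(fhat_x0) - f(x0))^2 + sigma^2    # Var + Bias^2 + Var(eps)
```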

30 / 33

Bias-Variance-Tradeoff: What are Bias and Variance?

Variance

  • How much would f^ change if we estimated it using a different data set?

  • Clearly we expect some variation when using different samples (sampling variation), but not too much.

  • Flexible models: moving just a single data point will result in a large change in f^.

Bias

  • The difference between f^ and f (notice the missing ϵ).

  • We approximate a potentially very complex real phenomenon by a simple model, e.g. linear model.

  • If the true f is highly non-linear, a linear model will be biased.

  • In general: more flexible means lower bias but higher variance.

31 / 33

Bias Variance Tradeoff vs Flexibility

  • Here 👉 we illustrate this for the preceding 3 true f's

  • The precise tradeoff depends on f's shape.

  • Bias declines with flexibility.

  • Test MSE is U-shaped, variance is increasing.

32 / 33

END

bluebery.planterose@sciencespo.fr
Original Slides from Florian Oswald
Book
@ScPoEcon
@ScPoEcon
33 / 33
