Case Study: Predicting House Prices

A course from the University of Washington

Machine Learning: Regression

From this lesson

Assessing Performance

Having learned about linear regression models and algorithms for estimating the parameters of such models, you are now ready to assess how well your considered method should perform in predicting new data. You are also ready to select amongst possible models to choose the best performing.

This module is all about these important topics of model selection and assessment. You will examine both theoretical and practical aspects of such analyses. You will first explore the concept of measuring the "loss" of your predictions, and use this to define training, test, and generalization error. For these measures of error, you will analyze how they vary with model complexity and how they might be utilized to form a valid assessment of predictive performance. This leads directly to an important conversation about the bias-variance tradeoff, which is fundamental to machine learning. Finally, you will devise a method to first select amongst models and then assess the performance of the selected model.

The concepts described in this module are key to all machine learning problems, well beyond the regression setting addressed in this course.

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

So we've just finished our deep dive into the formal definition of

the three different sources of error that we have.

But now, what we're gonna do is we're gonna turn, again, to

another optional video.

It's gonna be even more technical,

possibly than the video that we just completed.

To derive why specifically these are the three sources of error, and

why they appear as sigma squared plus bias squared plus variance.

Okay, so let's start by recalling our definition of expected prediction error,

which was the expectation over training data sets of our generalization error.

And, here I'm using just a shorthand notation train instead of

training set, just to save a little bit of space.

I don't mean choo-choo trains, I mean training data sets.

Okay, so let's plug in the formal definition of our generalization error.

And remember that our generalization error was our expectation over all possible

input and output pairs, X, Y pairs of our loss.

And so that's what is written here on the second line.

And then let's remember that we talked about specifying things specifically at

a target XT, and under an assumption of using a loss function of squared error.

And so again we're gonna use this to form all of our derivations.

And so when we make these two assumptions, then this

expected prediction error at xt simplifies to the following where there's no longer

an expectation over x because we're fixing our point in the input space to be xt.

And our expectation over y becomes an expectation over yt because we're only

interested in the observations that appear for an input at xt.

So, the other thing that we've done in this equation is we've

plugged in our specific definition of our loss function as our squared error loss.

So, for the remainder of this video, we're gonna start with this equation and

we're gonna derive why we get this specific form,

sigma-squared plus bias squared plus variance.
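Written out, the starting point described above would look roughly as follows. This is a reconstruction from the spoken description, since the slide itself is not part of the transcript; the symbols $\hat{w}(\text{train})$ for the parameters fitted on a training set and $L$ for the loss are assumed notation.

```latex
\begin{align*}
\text{general form:} \quad
  & E_{\text{train}}\Big[\, E_{x,y}\big[\, L\big(y,\; f_{\hat{w}(\text{train})}(x)\big) \big] \Big] \\
\text{at } x_t \text{ with squared error:} \quad
  & E_{\text{train},\, y_t}\Big[ \big(y_t - f_{\hat{w}(\text{train})}(x_t)\big)^2 \Big]
\end{align*}
```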

So this is the definition of expected prediction error at xt

that we had on the previous slide, under our assumption of squared error loss.

What we can do is we can rewrite this equation as follows,

where what we've done is we've simply added and subtracted the true function,

the true relationship between x and y, specifically at xt.

And because we've just simply added and

subtracted these two quantities, nothing in this equation has changed as a result.

But what this allows us to do is complete our derivation where

in particular what we're gonna use, and

maybe let me just switch colors quickly so that I can do a little aside here.

That's gonna be a useful aside to kind of follow what I'm going through here.

And for this little aside, so

if we take the expectation of some

quantity a + b squared then what

I'm gonna get is the expectation of

a squared plus 2ab plus b squared

which is equal to the expectation of a squared plus.

Sorry this is getting sloppy here let me just rewrite this little term.

It's plus two times the expectation of ab

plus the expectation of b squared.

And this is simply using the linearity of expectation after I've gone through and

completed this square a plus b.

Okay, and in our case a.

I'll just write this here as this mapping.

This is gonna be our a term and this here is gonna be our b term.

Okay, so the next line I'm writing is using this little identity,

defining the first term as a and the second term as b.

Now let me switch to the blue color which is specifically in this case

let me do one more thing which I think will be helpful.

I'm going to define some shorthand I'll write

in one other color the shorthand notation.

Just to be very clear here,

I'm gonna say, for shorthand, that yt

I'm just gonna write as y; f sub w true

I'm just gonna write as f; and

f sub w hat of our training data, I'm just gonna write as F hat.

Okay, this will save me a lot of writing and you a lot of watching.

Okay, so now that we've set the stage for

this derivation, let's rewrite this term here.

So we get the expectation over our training data set and

our observation, and remember I'm writing yt just as y, and

I'm going to get the first term squared.

So I'm going to get y minus f, squared.

That's my a squared term, this first term here.

And then I'm gonna get two times the expectation of a times b,

and let me again specify what the expectation is over: the expectation is

over training data set and observation Y.

And when I do A times B, I get Y minus F times F minus F hat.

And then the final term is I'm going to get the expectation,

over my training set and the observation Y,

of B squared, which is F minus F hat squared.
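The add-and-subtract step and the three resulting terms can be written compactly as below. This is a sketch in the shorthand just introduced ($y$ for $y_t$, $f$ for the true function at $x_t$, $\hat f$ for the fit on a training set at $x_t$); the exact layout on the slide is an assumption.

```latex
\begin{align*}
E_{\text{train},\,y}\big[(y - \hat f)^2\big]
  &= E_{\text{train},\,y}\Big[\big(\underbrace{(y - f)}_{a} + \underbrace{(f - \hat f)}_{b}\big)^2\Big] \\
  &= E_{\text{train},\,y}\big[(y - f)^2\big]
   + 2\,E_{\text{train},\,y}\big[(y - f)(f - \hat f)\big]
   + E_{\text{train},\,y}\big[(f - \hat f)^2\big]
\end{align*}
```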

Okay, so now let's simplify this a bit.

Does anything in this first term depend on my training set?

Well y is not a function of the training data,

F is not a function of the training data, that's the true function.

So this expectation over the training set,

that's not relevant for this first term here.

And when I think about the expectation over y, well what is this?

This is the difference between my observation and the true function.

And that's specifically, that's epsilon.

So what this term here is,

this is epsilon squared.

And epsilon has zero mean so if I take the expectation of

epsilon squared that's just my variance from the world.

That's sigma squared.

Okay so this first term results in sigma squared.

Now let's look at this second term, you know what,

I'm going to write this a little bit differently to make it very clear here.

So I'll just say that this first term here is sigma squared by definition.

Okay, now let's look at this second term.

And again what is Y minus F?

Well Y minus F is this epsilon noise term and

our noise is a completely independent variable from F or F hat.

And so what that means is if you take the expectation,

I think I have some room to do it here.

If I take the expectation of A and B, where A and

B are independent random variables, then the expectation of

A times B is equal to the expectation of A times the expectation of B.

So, this is another little aside.

And, so what I'll get here, is I'm going to get that this term

is the expectation of epsilon times the expectation of F minus F hat.

And what's the expectation of epsilon, my noise?

It's zero, remember we said that again and again,

that we're assuming that epsilon is zero-mean noise; any nonzero mean can be incorporated into F.

This term is zero, the result of this whole thing is going to be zero.

We can ignore that second term.
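The two simplifications just made can be summarized as follows. This is a sketch under the lecture's stated assumptions: $\epsilon = y - f$ has zero mean and variance $\sigma^2$, and the noise in the observation at $x_t$ is independent of the training set used to compute $\hat f$.

```latex
\begin{align*}
E_{\text{train},\,y}\big[(y - f)^2\big]
  &= E\big[\epsilon^2\big]
   = \operatorname{Var}(\epsilon) + \big(E[\epsilon]\big)^2
   = \sigma^2 + 0 = \sigma^2 \\
2\,E_{\text{train},\,y}\big[(y - f)(f - \hat f)\big]
  &= 2\,E\big[\epsilon\big]\; E_{\text{train}}\big[f - \hat f\big]
   = 2 \cdot 0 \cdot E_{\text{train}}\big[f - \hat f\big] = 0
\end{align*}
```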

Now let's look at this last term and this term for

this slide, I'm simply gonna call the mean squared error.

This little equals sign with a triangle on top means it's something that I'm

defining here.

I'm defining this to be equal to something called the mean squared error,

let me write that out if you want to look it up later.

Mean squared error of F hat.

Now that I've gone through and done that, I can say that

the result of all this derivation is that I get a quantity sigma squared

plus mean squared error of F hat.

But so far we've said a million times that my expected prediction error at XT

is sigma squared plus bias squared plus variance.

On the next slide what we're gonna do is we're gonna show how our mean

squared error is exactly equal to bias squared plus variance.

What I've done is I've started this slide by writing mean squared error of, remember,

on the previous slide we were calling this F hat; that was our shorthand notation.

And so mean squared error of F hat according to

the definition on the previous slide is it's looking at the expectation

of F minus F hat squared.

And I guess here I can mention that

I take this expectation over training data and my observation Y.

Does the observation Y appear anywhere in here, F minus F hat?

No, so I can get rid of that Y there.

If I look at this I'm repeating it here on this next slide where I have

the expectation over my training data of my true function,

which I had on the last slide just been denoting as simply F.

And the estimated function which I had been denoting, let me be clear it's

inside this square that I'm looking at I'd been denoting this as F hat.

And both of these quantities were evaluated specifically at XT.

Again let's go through expanding this, where in this case, when

we rewrite this quantity in a way that's gonna be useful for this derivation,

we're gonna add and subtract F sub W bar and what F sub W bar,

remember that it was the green dashed line in all those bias variance plots.

What F sub W bar is, is the average over all possible training data sets,

where for each training data set, I get a specific fitted function and

I average all those fitted functions over those different training data sets.

That's what results in F sub W bar.

It's the average fit for

my specific model, which I get by averaging over my training data sets.

And so for simplicity here, I'm gonna refer to F sub W hat,

I mean, sorry, W bar,

as F bar.

This is new notation

on this slide.

I guess I'll call it again, just to be clear, new shorthand notation and

this is just going to make things easier to write in these derivations here.

Using that same trick of taking the expectation of

A plus B squared and completing the square and then passing the expectation through,

I'm going to do the same thing here.

New definition of A plus B, but same idea, again.

I'm gonna get the expectation over my training set of now my

first term squared.

I'm gonna get F minus F bar squared,

and then I'm gonna get two times the expectation

over my training set of A times B, so

A is gonna be F minus F bar,

and B is F bar minus F hat.

And then the final term is the expectation of B squared which in this case is

F bar minus F hat squared, and again this expectation's over the training sets.
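In symbols, this second expansion would read roughly as below. It is a sketch in the lecture's shorthand, with $f$, $\bar f$, and $\hat f$ all evaluated at $x_t$ and $\bar f = E_{\text{train}}[\hat f]$ denoting the average fit; the exact slide layout is an assumption.

```latex
\begin{align*}
\operatorname{MSE}(\hat f)
  &= E_{\text{train}}\big[(f - \hat f)^2\big]
   = E_{\text{train}}\Big[\big(\underbrace{(f - \bar f)}_{a} + \underbrace{(\bar f - \hat f)}_{b}\big)^2\Big] \\
  &= E_{\text{train}}\big[(f - \bar f)^2\big]
   + 2\,E_{\text{train}}\big[(f - \bar f)(\bar f - \hat f)\big]
   + E_{\text{train}}\big[(\bar f - \hat f)^2\big]
\end{align*}
```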

Now let's go through and talk about what each of these quantities is.

And the first thing is

let's just remember that F bar, what was the definition of F bar formally?

It was my expectation over training data sets of F hat

of my fitted function on a specific training data set.

I've already taken expectation over the training set here.

F is a true relationship.

F has nothing to do with the training data.

This is a number.

This is the mean of a random variable, and

it no longer has to do with the training data set either.

I've averaged over training data sets.

Here there's really no expectation over training data sets.

Nothing is random in terms of the training data set for this first quantity.

This first quantity is really simply F

minus F bar squared, and what is that?

That's the difference between the true function and my average, my expected fit.

Specifically at XT, but squared.

That is bias squared.

That's by definition.

So by definition this is equal to bias squared of F hat.

Okay.

Now let's look at this

second term here, and here.

Again, f minus f bar, just like here,

is not a function of training data.

So, this is just like a scalar.

It can just come out of the expectation so for

this second term I can rewrite this as f minus f bar,

well let's keep the two there, times the expectation over my training data

of f bar minus f hat.

Okay.

And now let's re-write this term, and just pass the expectation through.

And the first thing is again f bar is not a function of training data, so

the result of that is just f bar. And then I'm gonna get minus the expectation

over my training data of f hat.

So, what is this?

This is the definition of f bar.

This is taking my specific fit, so

it's the fit on a specific training data set at xt, and

it's taking the expectation over all training data sets.

That's exactly the definition of what f bar is, that average fit.

So, this term here is equal to 0.

Again, by definition.

So, what we end up seeing is this whole second term is gonna disappear.

Because we have some quantity times zero.
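The argument for the vanishing cross term can be sketched in one line. It uses only the facts stated above: $f - \bar f$ does not depend on the training set, and $\bar f$ is by definition $E_{\text{train}}[\hat f]$ (all quantities evaluated at $x_t$).

```latex
\begin{align*}
2\,E_{\text{train}}\big[(f - \bar f)(\bar f - \hat f)\big]
  &= 2\,(f - \bar f)\; E_{\text{train}}\big[\bar f - \hat f\big] \\
  &= 2\,(f - \bar f)\,\big(\bar f - E_{\text{train}}[\hat f]\big)
   = 2\,(f - \bar f)\,(\bar f - \bar f) = 0
\end{align*}
```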

Okay, that just leaves one more quantity to analyze and that's this term here

where what I have is an expectation over a function minus its mean, squared.

So, let me just write this in words.

It's an expectation of, let's just say,

so first, the fact that I can equivalently write this as

F hat minus F bar squared.

I hope that's clear that the negative sign there doesn't matter.

It gets squared.

They're exactly equivalent.

And so what is this?

This is a random function at xt

which is equal to just a random variable.

And this is its mean.

And so the definition of taking the expectation of some random

variable minus its mean squared, that's the definition of variance.

So, this term is the variance of f hat.

Okay, so

now we can make our concluding statement about

this mean squared error where what we see is the first term was equal to

bias squared of f hat, and the second term was 0,

and this third term was variance of f hat.

So, what we've shown on this slide is that mean squared error of

our f hat is equal to bias squared of f hat plus variance of f hat.
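Collecting the three terms gives the statement on this slide. This is a sketch in the lecture's shorthand, with everything evaluated at $x_t$ and $\bar f = E_{\text{train}}[\hat f]$.

```latex
\begin{align*}
\operatorname{MSE}(\hat f)
  &= \underbrace{(f - \bar f)^2}_{\text{bias}^2(\hat f)}
   \;+\; \underbrace{0}_{\text{cross term}}
   \;+\; \underbrace{E_{\text{train}}\big[(\hat f - \bar f)^2\big]}_{\operatorname{var}(\hat f)} \\
  &= \text{bias}^2(\hat f) + \operatorname{var}(\hat f)
\end{align*}
```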

And that's exactly what we're hoping to show

because now we can talk about putting it all together.

Where what we see is that our expected prediction error at XT

we derived to be equal to Sigma squared plus mean squared error.

And then we derived the fact that mean squared error is equal to

bias squared plus variance.

So, we get the end result that our expected prediction error at Xt is

sigma squared plus bias squared plus variance, and

this represents our three sources of error.

And we've now completed our formal derivation of this.
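The decomposition can also be checked numerically. The following is a minimal sketch, not something from the lecture: the true function (`math.sin`), the uniform inputs on [0, 3], the noise level, the straight-line model, and the helper names `true_f`, `fit_line`, and `simulate` are all assumptions chosen for illustration. It draws many training sets, fits a line to each, and compares a Monte Carlo estimate of the expected prediction error at a target input with sigma squared plus bias squared plus variance.

```python
import math
import random
import statistics


def true_f(x):
    # Hypothetical "true" relationship; the lecture does not specify one.
    return math.sin(x)


def fit_line(xs, ys):
    # Closed-form ordinary least squares for y = w0 + w1 * x.
    x_mean = statistics.fmean(xs)
    y_mean = statistics.fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    w1 = sxy / sxx
    w0 = y_mean - w1 * x_mean
    return w0, w1


def simulate(x_t=2.5, sigma=0.5, n_points=30, n_trials=20000, seed=0):
    rng = random.Random(seed)
    f_t = true_f(x_t)   # f: true function at the target input
    fits = []           # f_hat at x_t, one value per training set
    sq_errors = []      # (y_t - f_hat)^2, one value per training set
    for _ in range(n_trials):
        # Draw a fresh training set (assumed: uniform inputs, Gaussian noise).
        xs = [rng.uniform(0.0, 3.0) for _ in range(n_points)]
        ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
        w0, w1 = fit_line(xs, ys)
        f_hat_t = w0 + w1 * x_t
        # Independent noisy observation at the target input.
        y_t = f_t + rng.gauss(0.0, sigma)
        fits.append(f_hat_t)
        sq_errors.append((y_t - f_hat_t) ** 2)

    f_bar_t = statistics.fmean(fits)                    # average fit
    bias_sq = (f_t - f_bar_t) ** 2                      # bias squared
    variance = statistics.pvariance(fits, mu=f_bar_t)   # variance of the fit
    mse = statistics.fmean((f_t - fh) ** 2 for fh in fits)
    epe = statistics.fmean(sq_errors)                   # Monte Carlo estimate
    return {
        "sigma_sq": sigma ** 2,
        "bias_sq": bias_sq,
        "variance": variance,
        "mse": mse,
        "epe": epe,
    }


if __name__ == "__main__":
    r = simulate()
    print("estimated EPE at x_t:        ", r["epe"])
    print("sigma^2 + bias^2 + variance: ",
          r["sigma_sq"] + r["bias_sq"] + r["variance"])
```

Because the average fit here is the sample mean of the simulated fits, the identity "mean squared error equals bias squared plus variance" holds up to floating-point rounding, while the match between the estimated expected prediction error and the three-term sum is only approximate and tightens as `n_trials` grows.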

[MUSIC]