Case Study: Predicting House Prices

A course from University of Washington

Machine Learning: Regression

From this lesson

Assessing Performance

Having learned about linear regression models and algorithms for estimating the parameters of such models, you are now ready to assess how well your considered method should perform in predicting new data. You are also ready to select amongst possible models to choose the best performing.

This module is all about these important topics of model selection and assessment. You will examine both theoretical and practical aspects of such analyses. You will first explore the concept of measuring the "loss" of your predictions, and use this to define training, test, and generalization error. For these measures of error, you will analyze how they vary with model complexity and how they might be utilized to form a valid assessment of predictive performance. This leads directly to an important conversation about the bias-variance tradeoff, which is fundamental to machine learning. Finally, you will devise a method to first select amongst models and then assess the performance of the selected model.

The concepts described in this module are key to all machine learning problems, well beyond the regression setting addressed in this course.

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

So by now we've gone through notions of three sources of error at a level that's

needed to be a practitioner in machine learning.

But we know that some of you are interested in more of the technical

underpinnings of these ideas both in terms of mathematical formalisms and

statistical understanding.

And so what we've done is created this optional video that provides

a much more technical definition of the three sources of error for

those of you that are interested in this material.

But we wanna highlight that this is completely optional

because this will be taught at a more technical level than what we're

assuming in the rest of the specialization.

So we mentioned that the training set is just a random sample of some N

observations.

In this case, some N houses that were sold and recorded, but

what if N other houses had been sold and recorded?

How would our performance change?

So for example, here in this picture we're showing one set of

N observations that are used for training data, those are the blue circles.

And we fit some quadratic function through this data and here we show

some other set of N observations and we see that we get a different fit.

And to assess our performance of each one of these fits we can think about looking

at generalization error.

So in the first case we might get one

generalization error of this specific fit w hat 1.

And in the second case we would get some different

evaluation of generalization error.

Let's call it generalization error of w hat 2.

But one thing that we might be interested in is, how do we perform on average for

a training data set of N observations?

Because imagine I'm trying to develop a tool that's gonna be

used by real estate agents to form these types of predictions.

Well, I'd like to design my tool, package it up and send it out there, and

then a real estate agent might come in and have some set of observations of house

sales from their neighborhood that they're using to make their predictions.

So that might be different than another real estate agent.

And what I'd like to know, is for a given amount of data,

some training set of size N, how well should I expect the performance

of this model to be, regardless of what specific training dataset I'm looking at?

So in these cases what we like to do is average our performance over

all possible fits that we might get.

What I mean by that is all possible training data sets

that might have appeared, and the resulting fits on those data sets.

So formally, we're gonna define this thing called expected prediction error, which is

the expected value of our generalization error, over different training data sets.

So very specifically, for

a given training data set, we get parameters that are fit to that data set.

So I'll call that w hat of training set.

And then for that estimated model, I can evaluate my

generalization error and what the expected prediction error is doing is it's taking

a weighted average over all possible training sets that I might have seen.

Where for each one I get a different set of estimated parameters and

thus a different notion of the generalization error.
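
Written out in symbols, the definition just described might look like the following. This is a transcription of the spoken description rather than a copy of the slide; the subscript "train" on the expectation and the notation $\hat{w}(\text{train})$ for "w hat of training set" are notational assumptions.

```latex
\text{expected prediction error}
  = \mathbb{E}_{\text{train}}\!\left[\,
      \text{generalization error of } \hat{w}(\text{train})
    \,\right]
```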

And to start analyzing this quantity of prediction error,

let's specifically look at some target input xt,

which might be a house with 2,640 square feet.

And let's also take our loss function to be squared error.

So in this case we're talking specifically about a target point xt.

What we can do later, after we do the analysis specifically for xt, is

think about averaging this over all possible xt's, over all square feet.

But in some cases we might actually be interested in one region of our input

space in particular.

And then when we talk about using squared error in particular, this is gonna allow

our analysis to follow through really nicely as we're gonna show not in this

video, but in our next even more in depth video which is also optional.

But under these assumptions of looking specifically at xt and

looking at squared error as our measure of loss.

You can show that the average prediction error

at xt is simply the sum of three terms which we're gonna go through.

Sigma squared plus bias squared plus variance.

So these terms are yet to be defined, and this is what we're gonna walk through

in this video in a much more formal way than we did in the previous set of slides.
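
As a compact reference, the three-term sum stated above can be written as follows. The function-style notation $\text{bias}(\cdot)$ and $\text{var}(\cdot)$ is an assumption about how the slide writes it; each term is defined in the remainder of the video.

```latex
\text{average prediction error at } x_t
  = \sigma^2
  + \left[\text{bias}\big(f_{\hat{w}}(x_t)\big)\right]^2
  + \text{var}\big(f_{\hat{w}}(x_t)\big)
```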

So let's start by talking about this first term, sigma squared and

what this is gonna represent is the noise we talked about in the earlier videos.

So in particular, remember that we're saying that there's

some true relationship between square feet and house value.

That that's just a relationship that exists out there in the world, and

that's captured by f sub w true, but

of course that doesn't fully capture how we think about the value of a house.

There are other factors at play.

And so all those other factors out there in the world are captured by

our noise term, which here we write as just an additive term plus epsilon.

So epsilon is our noise, and we said that this noise term has zero mean, because

if not we can just shove that other component into f sub w true.

But we're just gonna make the assumption that epsilon has zero mean, and then

we can start talking about what is the spread of noise you're likely to see at

any point in the input space.

And that spread is called the variance.

So we denote it by sigma squared and

sigma squared is the variance of this noise epsilon.

And as we talked about before, this noise is just noise that's out there in

the world. We have no control over it. No matter how complicated and

interesting a model we specify, or which algorithm we use for fitting that model,

we can't do anything about the fact that we're using x for our prediction

but there's just inherently some noise in

how our observations are generated in the world.

So for this reason, this is called our irreducible error.

Because it's noise that we can't reduce through any choices that we have

control over.
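
The noise model described in this part of the video can be summarized as below. The symbols $y$ for the observed house value and $x$ for square feet are notational assumptions; $f_{w(\text{true})}$ is the "f sub w true" of the narration.

```latex
y = f_{w(\text{true})}(x) + \epsilon,
\qquad \mathbb{E}[\epsilon] = 0,
\qquad \text{var}(\epsilon) = \sigma^2
```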

So now let's talk about this second term, bias squared.

And remember that when we talked about bias this was a notion of how well our

model could on average fit the true relationship between x and y.

But now let's go through this at a much more formal level.

And in particular let's just remember that there's some relationship between square

feet and house value in our case which is represented by this orange line.

And then from this true world we get some data set and

define a training set, which are these blue circles.

And using this training data we estimate our model parameters.

Well, if we had gotten some other set of N points,

we would have fit some other function.

Now, let's look over all possible data sets of size N that I might have gotten,

where remember that this blue shaded region

here represents the distribution over x and y,

so how likely it is to get different combinations of x and y.

And let's say I draw N points from this joint distribution over x and

y, and over all possible draws I look at the estimated function.

So for example, here are the two estimated functions from the previous slide,

those example data sets that I showed.

But of course there's a whole continuum of estimated functions that I get for

different training sets of size N.

Then when I average these estimated functions, these specific fits over

all my possible training data sets, what I get is my average fit.

So now let's talk about this a little bit more formally.

We had already presented this in our previous video.

This f sub w bar.

But now, let's define this.

This is the expectation of a specific fit on a specific training data set or

let me rephrase that, the fit I get on a specific training data set

averaged over all possible training data sets of size N that I might get.

So that is the formal definition of this f sub w bar,

what we have been calling our average fit.
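
In symbols, the definition of the average fit given above could be written like this. The notation $f_{\hat{w}(\text{train})}$ for the fit obtained from one specific training set is an assumption based on the narration.

```latex
f_{\bar{w}}(x)
  = \mathbb{E}_{\text{train}}\!\left[\, f_{\hat{w}(\text{train})}(x) \,\right]
```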

And what we're talking about when we're talking about bias is,

we're talking about comparing this average fit to the true relationship.

And here remember again, we're focusing specifically on some target xt.

And so the bias at xt is the difference between

the true relationship at xt between xt and y.

So between a given square feet and the house value

whatever the true relationship is between that input and

the observation versus this average relationship

estimated over all possible training data sets.

So that is the formal notion of bias at xt, and let's just remember that

when it comes in as our error term, we're looking at bias squared.

So that's the second term.
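
The bias just described can be sketched in symbols as follows, where $f_{w(\text{true})}$ is the true relationship and $f_{\bar{w}}$ is the average fit over all training sets of size N. The notation is an assumption based on the narration, and the sign convention does not matter once the term is squared.

```latex
\text{bias}\big(f_{\hat{w}}(x_t)\big)
  = f_{w(\text{true})}(x_t) - f_{\bar{w}}(x_t)
```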

Now let's turn to this third term which is variance.

And let's go through this definition where again,

we're interested in this average fit f sub w bar, this green dashed line.

But that really isn't the quantity of interest.

It's gonna be used in our definition here.

But the thing that we're really interested in,

is over all possible fits we might see.

How much do they deviate from this expected fit?

So thinking about again, specifically at our target xt,

how much variation is there in the training dataset

specific fits across all training datasets we might see?

And that's this variance term and now again, let's define it very formally.

Well let me first state what variance is in general.

So variance of some random variable is simply looking at

the expected value of that random variable minus its mean, all squared.

So in this context, when we're looking at the variability

of these functions at xt, we're taking the expectation and

our random quantity is our estimated function for

a specific training data set at xt.

And then what's the mean of that random function?

The mean is this average fit.

This f sub w bar.

So we're looking at the difference between fit on a specific training dataset and

what I expect to get averaged over all possible training datasets.

I look at that quantity squared and what is my expectation taken over?

Sorry, let me just mention that this quantity when I take this squared,

represents a notion of how much deviation

a specific fit has from the expected fit at xt.

And then when I think about what the expectation is taken over,

it's taken over all possible training data sets of size N.

So that's my variance term.
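
Putting the spoken definition into symbols, the variance term might be written as below. Here $f_{\hat{w}(\text{train})}(x_t)$ is the prediction at xt from the fit on one training set, and $f_{\bar{w}}(x_t)$ is the average fit at xt; the notation is an assumption based on the narration.

```latex
\text{var}\big(f_{\hat{w}}(x_t)\big)
  = \mathbb{E}_{\text{train}}\!\left[
      \big( f_{\hat{w}(\text{train})}(x_t) - f_{\bar{w}}(x_t) \big)^2
    \right]
```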

And when we think intuitively about

why it makes sense that we have the sum of these three terms in this specific form.

Well what we're saying is variance is telling us

how much can my specific function that I'm using for prediction.

I'm just gonna use one of these functions for prediction.

I get a training dataset that gives me an f sub w hat,

I'm using that for prediction.

Well, how much can that deviate from my expected fit over all

datasets I might have seen.

So again, going back to our analogy, I'm a real estate agent, I grab my data set,

I fit a specific function to that training data.

And I wanna know well, how wild of a fit could this be relative to what

I might have seen on average over all possible

datasets that all these other realtors are using out there?

And so of course, if the function from one realtor to

another realtor looking at different data sets can vary dramatically,

that can be a source of error in our predictions.

But another source of error, which the bias is capturing,

is over all these possible datasets, all these possible realtors.

If this average function just can never capture anything close to the

true relationship between square feet and house value, then we

can't hope to get good predictions either and that's what our bias is capturing.

And why are we looking at bias squared?

Well, that's putting it on an equal footing with these variance terms

because remember bias was just the difference between the true value and

our expected value.

But these variance terms are looking at these types of quantities but squared.

So that's intuitively why we get bias squared. And then finally,

what's our third source of error?

Well let's say I have no variance in my estimator, or always very low variance.

And the model happens to be a very good fit so neither of these things are sources

of error, I'm doing basically magically perfect on my modeling side,

well, still inherently there's noise in the data.

There are things that just trying to form predictions from square feet alone

can't capture.

And so that's where irreducible error or this sigma squared is coming through.

And so intuitively this is why our prediction errors are a sum of these three

different terms that now we've defined much more formally.
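
For readers who want to see the three terms numerically, here is a minimal simulation sketch, not part of the lecture. Everything specific in it is an assumption made for illustration: the "true" function `f_true`, the noise level `sigma`, the uniform distribution of square feet, the training set size, and the helper name `decompose_error`. It repeatedly draws training sets of size N, fits a polynomial to each, and compares the average squared prediction error at a target xt against the sum of noise, bias squared, and variance.

```python
import numpy as np


def f_true(x):
    """Hypothetical true relationship (x in thousands of square feet)."""
    return 1.0 + 0.5 * x + 0.3 * x ** 2


def decompose_error(x_t=2.64, degree=1, n_train=30, n_datasets=5000,
                    sigma=0.5, seed=0):
    """Estimate noise, bias^2, variance and average prediction error at x_t."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_datasets)      # prediction at x_t from each training set
    sq_errors = np.empty(n_datasets)  # squared error on a fresh observation at x_t

    for i in range(n_datasets):
        # Draw a new training set of size N (assumed input distribution).
        x = rng.uniform(0.5, 5.0, size=n_train)
        y = f_true(x) + rng.normal(0.0, sigma, size=n_train)

        # Least-squares polynomial fit on this specific training set.
        coeffs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coeffs, x_t)

        # Fresh noisy observation at x_t, independent of the training set.
        y_t = f_true(x_t) + rng.normal(0.0, sigma)
        sq_errors[i] = (y_t - preds[i]) ** 2

    average_fit = preds.mean()  # Monte Carlo estimate of the average fit at x_t
    return {
        "noise": sigma ** 2,
        "bias_sq": (f_true(x_t) - average_fit) ** 2,
        "variance": preds.var(),
        "avg_pred_error": sq_errors.mean(),
    }


if __name__ == "__main__":
    for degree in (1, 2):
        parts = decompose_error(degree=degree)
        total = parts["noise"] + parts["bias_sq"] + parts["variance"]
        print(f"degree={degree}", parts, "sum of three terms:", total)
```

Under these assumptions, a degree-1 fit cannot represent the assumed quadratic `f_true`, so its bias squared should be clearly nonzero, while a degree-2 fit should have bias squared close to zero. In both cases the simulated average prediction error should roughly match the sum of the three terms, up to Monte Carlo error.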

[MUSIC]