Case Study: Predicting House Prices

A course from the University of Washington

Machine Learning: Regression

From this lesson

Assessing Performance

Having learned about linear regression models and algorithms for estimating the parameters of such models, you are now ready to assess how well your considered method should perform in predicting new data. You are also ready to select amongst possible models to choose the best performing.

This module is all about these important topics of model selection and assessment. You will examine both theoretical and practical aspects of such analyses. You will first explore the concept of measuring the "loss" of your predictions, and use this to define training, test, and generalization error. For these measures of error, you will analyze how they vary with model complexity and how they might be utilized to form a valid assessment of predictive performance. This leads directly to an important conversation about the bias-variance tradeoff, which is fundamental to machine learning. Finally, you will devise a method to first select amongst models and then assess the performance of the selected model.

The concepts described in this module are key to all machine learning problems, well beyond the regression setting addressed in this course.

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

Okay, so, we've talked about three different measures of error.

And now in this part, we're gonna talk about three different sources of error.

And this is gonna lead us into a conversation about the bias-variance

trade-off.

Okay, so when we're forming our predictions,

there are three different sources of error.

Noise, bias, and variance.

And in this part, we're gonna walk through these three different components,

at a very high level.

At a more intuitive level.

And then following this, there are gonna be two optional sections

that go into much more formalism and detail about this.

But those are optional because we're not requiring that you know

this to get through the course.

But for those that are interested,

we will be providing the formalism behind these notions that I'm presenting now.

Let's look at this first term, this noise term.

And as we've mentioned many times in this specialization, data are inherently noisy.

So the way the world works is that there's some true relationship between

square feet and the value of a house.

Or generically, between x and y.

And we're representing that arbitrary relationship defined by the world,

by f sub w true.

Which is the notation we're using for that functional relationship.

But of course that's not a perfect description between x and y.

The number of square feet and the house value.

There are a lot of other contributing factors, including

other attributes of the house that are not included just in square feet or

how a person feels when they go in and make a purchase of a house or

a personal relationship they might have with the owners.

Or lots and lots of other things that we can't ever perfectly capture with just

some function between square feet and value, and so

that is the noise that's inherent in this process represented by this epsilon term.

So in particular, any observation y_i is the sum of

this relationship between the square feet and

the value, plus this noise term epsilon_i specific to that ith house.

And we've talked before about our assumption that this noise has zero

mean because if it didn't that could be shoved into the f function instead.
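In symbols, the observation model described here can be written roughly as below. The subscript notation is an assumption reconstructed from the spoken "f sub w true"; the slide itself is not part of this transcript.

```latex
y_i = f_{w(\text{true})}(x_i) + \epsilon_i, \qquad \mathbb{E}[\epsilon_i] = 0
```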

But what we haven't talked about is the spread of that noise.

So at any given square footage, what kind of variation in

house price are we likely to see

based on this type of noise that's inherent in our observations?

And so this is referred to as the variance of this noise term epsilon.

And this is something that's just a property of the data.

We don't have control over this.

This has nothing to do with our model nor

our estimation procedure, it's just something that we have to deal with.

And so this is called irreducible error because it's nothing that

we can reduce through choosing a better model or a better estimation procedure.
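To make the idea of irreducible error concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the linear `f_true`, the noise level `noise_sd`, and the range of square footage are hypothetical and not taken from the course. It shows that even predictions made with the true function itself still have a mean squared error of roughly the noise variance.

```python
import random

def f_true(sqft):
    # Hypothetical "true" relationship between square feet and price
    # (an assumption for illustration, not from the lecture).
    return 50_000 + 150 * sqft

def simulate_prices(n, noise_sd, seed=0):
    # Each observation is the true function plus zero-mean Gaussian noise.
    rng = random.Random(seed)
    xs = [rng.uniform(500, 4000) for _ in range(n)]
    ys = [f_true(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def mse_of_true_function(xs, ys):
    # Error we would make even if we knew the true function exactly.
    return sum((y - f_true(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

noise_sd = 20_000
xs, ys = simulate_prices(n=20_000, noise_sd=noise_sd)
irreducible = mse_of_true_function(xs, ys)

# The ratio should be close to 1: the leftover error is about the noise variance.
print(irreducible / noise_sd ** 2)
```

No choice of model or estimation procedure changes this quantity in the simulation, since it depends only on `noise_sd`.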

Okay, so the things that we can control are bias and variance, so

we're gonna focus quite heavily on those two terms.

So let's start by talking about bias.

And this is basically just an assessment of how well my model can fit

the true relationship between x and y.

So to think about this, let's think about how we get data in our data set.

So here these points that we observed they're just a random snapshot of

N houses that were sold and recorded and we tabulated in our data set.

Well, based on that data set, we fit some function and,

thinking about bias, it's intuitive to start with a very, very simple

model of just a constant function, so that's what I'm gonna show here.

But we fit whatever model we're specifying.

But what if another set of N houses had been sold?

Then we would have had a different data set that we were using.

And when we went to fit our model, we would have gotten a different line.

Okay.

And to make this point pretty explicit, I wanna go back and

look a little bit at these points that I drew here.

In the first data set, I tended to draw points that were below the true

relationship, so the houses in our data set

happened to have values less than what the world kind of specifies as typical.

And on the right hand side I drew points that tended to lie above the line.

So these are pretty extremely different data sets, but

what you see is that the fits are pretty similar.

So this is gonna come up later and I wanted to point this out now.

Okay, let's get back to this notion of bias.

So what we are saying is, over all possible data sets of size N that we might

have been presented with of house sales, what do we expect our fit to look like?

So for one data set of size N we get this fit.

Here's another data set.

Here's another data set.

Or the fits associated with those data sets.

And of course there's a continuum of possible fits we might have gotten.

And for all those possible fits, here this dashed green line represents our average

fit, averaged over all those fits weighted by how likely they were to have appeared.
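The average fit can be approximated by simulation: draw many data sets of size N, fit the constant model to each, and average the fits. The sketch below does this under stated assumptions. The function `f_true`, the noise level, and the feature range are all hypothetical, and a plain Monte Carlo average stands in for the likelihood-weighted average described here, since more likely data sets are simply drawn more often.

```python
import random

def f_true(x):
    # Hypothetical true relationship: price in $1000s vs. 1000s of square feet.
    return 100 + 50 * x

def draw_dataset(rng, n, noise_sd):
    # One random "snapshot" of n house sales.
    xs = [rng.uniform(0, 4) for _ in range(n)]
    ys = [f_true(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def fit_constant(ys):
    # The least-squares constant model is the mean of the observed prices.
    return sum(ys) / len(ys)

rng = random.Random(1)
fits = []
for _ in range(2000):
    _, ys = draw_dataset(rng, n=30, noise_sd=40)
    fits.append(fit_constant(ys))

# Individual fits differ from data set to data set; their mean is the average fit.
average_fit = sum(fits) / len(fits)
print(min(fits), average_fit, max(fits))
```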

Okay, so now we can start talking about bias.

What bias is, is it's the difference between this average fit and

the true function, f true.

Okay, so, that's what this equation shows here, and

we're seeing this with this gray shaded region.

That's the difference between the true function and our average fit.
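The equation on the slide is not included in this transcript. A form consistent with the spoken description would be the following, where the symbol for the average fit and the expectation over training sets are assumed notation:

```latex
\text{bias}(x) = f_{w(\text{true})}(x) - f_{\bar{w}}(x),
\qquad
f_{\bar{w}}(x) = \mathbb{E}_{\text{train}}\big[ f_{\hat{w}(\text{train})}(x) \big]
```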

And so intuitively what bias is saying is,

is our model flexible enough to on average be able to

capture the true relationship between square feet and house value.

And what we see is that this very simple constant model,

this low-complexity model, has high bias.

It's not flexible enough to have a good approximation to the true relationship.

And because of these differences, because of this bias,

this leads to errors in our prediction.
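As a rough illustration of the constant model's high bias, the sketch below estimates the bias at a single input by averaging fits over many simulated data sets. It assumes a hypothetical linear `f_true`, and it adds a straight-line fit purely as a point of comparison; that comparison is an addition for illustration and is not something shown in this part of the lecture.

```python
import random

def f_true(x):
    # Hypothetical true relationship: price in $1000s vs. 1000s of square feet.
    return 100 + 50 * x

def draw_dataset(rng, n=30, noise_sd=40):
    xs = [rng.uniform(0, 4) for _ in range(n)]
    ys = [f_true(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def fit_constant(xs, ys):
    # Constant model: predict the mean price everywhere.
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    # Simple least-squares line, used only as a more flexible comparison.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x

def bias_at(fit_fn, x0, n_datasets=2000, seed=2):
    # Bias at x0: true function minus the average fit, both evaluated at x0.
    rng = random.Random(seed)
    preds = [fit_fn(*draw_dataset(rng))(x0) for _ in range(n_datasets)]
    average_fit = sum(preds) / len(preds)
    return f_true(x0) - average_fit

bias_constant = bias_at(fit_constant, x0=3.5)
bias_line = bias_at(fit_line, x0=3.5)
print(bias_constant, bias_line)
```

Under these assumptions the constant model's bias at a large house is far from zero, while the line's bias is near zero because the assumed true function happens to be linear.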

[MUSIC]