Case Study: Predicting Housing Prices


A course from the University of Washington

Machine Learning: Regression

3449 ratings

From the lesson

Ridge Regression

You have examined how the performance of a model varies with increasing model complexity, and can describe the potential pitfall of complex models becoming overfit to the training data. In this module, you will explore a very simple, but extremely effective technique for automatically coping with this issue. This method is called "ridge regression". You start out with a complex model, but now fit the model in a manner that not only incorporates a measure of fit to the training data, but also a term that biases the solution away from overfitted functions. To this end, you will explore symptoms of overfitted functions and use this to define a quantitative measure to use in your revised optimization objective. You will derive both a closed-form and gradient descent algorithm for fitting the ridge regression objective; these forms are small modifications from the original algorithms you derived for multiple regression. To select the strength of the bias away from overfitting, you will explore a general-purpose method called "cross validation". You will implement both cross-validation and gradient descent to fit a ridge regression model and select the regularization constant.
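To make the revised objective concrete, it can be written as follows (the notation here is my shorthand for the standard multiple-regression setup, not necessarily the course's exact symbols: feature matrix H, weight vector w, observations y, penalty strength λ):

```latex
\min_{\mathbf{w}} \;
\underbrace{(\mathbf{y} - \mathbf{H}\mathbf{w})^{\top}(\mathbf{y} - \mathbf{H}\mathbf{w})}_{\text{measure of fit (RSS)}}
\;+\;
\underbrace{\lambda\, \mathbf{w}^{\top}\mathbf{w}}_{\text{bias away from overfitting}}
\qquad\Longrightarrow\qquad
\hat{\mathbf{w}}^{\text{ridge}} = (\mathbf{H}^{\top}\mathbf{H} + \lambda \mathbf{I})^{-1}\mathbf{H}^{\top}\mathbf{y}
```

Setting λ = 0 recovers the least squares solution; as λ grows, the closed form drives the coefficients toward zero.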

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

So here we are back at our polynomial regression demo.

And remember when we were just doing least squares estimation.

Let's just quickly scroll through this.

Remember we had this data generated from a sine function.

And when we fit a degree-2 polynomial, things looked pretty reasonable.

Degree-4 started looking a bit wigglier, with larger estimated coefficients, and

degree-16 looked really wiggly and had these massive, massive coefficients, okay.

And now let's get to our ridge regression, where we're just gonna take our

polynomial regression function and modify it.

And in using GraphLab Create it's really simple to do the ridge regression

modification because, as we mentioned before, there's this l2_penalty input

to .linear_regression.

And before,

when we were doing just least squares we set that l2_penalty equal to zero.

And this is the lambda value that we talked about for trading off

between fit and model complexity.

So here though, we're gonna actually specify a value for this penalty.

And that's the only modification that we have to make in order to implement

ridge regression using GraphLab Create.

But again in the assignments for

this course you're gonna explore implementing these methods yourself.
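As a preview of what implementing it yourself can look like, here is a minimal pure-Python sketch of the gradient descent version for a single feature (simplified from the 16-feature polynomial case; the function name, defaults, and the choice to leave the intercept unpenalized are illustrative assumptions, not the course's exact code):

```python
def ridge_gradient_descent(x, y, l2_penalty, step_size=0.05, iterations=5000):
    """Fit y ~= w0 + w1*x, with an L2 penalty on the slope w1 (intercept unpenalized)."""
    w0, w1 = 0.0, 0.0
    n = len(x)
    for _ in range(iterations):
        errors = [(w0 + w1 * xi) - yi for xi, yi in zip(x, y)]
        grad_w0 = 2.0 * sum(errors) / n
        grad_w1 = 2.0 * sum(e * xi for e, xi in zip(errors, x)) / n
        w0 -= step_size * grad_w0
        # the L2 term contributes 2 * l2_penalty * w1 to the slope's gradient,
        # which amounts to shrinking w1 slightly before the usual update
        w1 = w1 * (1.0 - 2.0 * step_size * l2_penalty) - step_size * grad_w1
    return w0, w1
```

The only change from plain least squares gradient descent is the shrinkage factor on the penalized weight, which mirrors the small modification described in the module overview.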

Okay so let's go and define this polynomial ridge regression function.

And then we're just gonna go through and

explore performing a fit of this really high-order polynomial, this 16th

order polynomial that had a very wiggly fit and crazy coefficients associated with it.

But now, looking at solving the ridge regression objective for

different values of lambda.

So to start with, let's consider a really, really small lambda value.

So a very small penalty on the two norm of the coefficients.

And what we'd expect is that the estimated fit would look very

similar to the standard least squares case.

And if we look at the plot, this figure looks very very similar,

if I scroll up quickly, to the fit we had doing just standard least squares.

So that checks out to what we know should happen, and, likewise,

the coefficients are still these really really massive numbers.

Okay, but what if we increase the strength of our penalty.

So let's consider a very large L2 penalty.

Here we're considering a value of 100, whereas in the case above we were

considering a value of 1e-25, so really really tiny.

Well in this case, we end up with much smaller coefficients.

Actually they look really really small.

So let's look at what the fit looks like.

And we see a really, really smooth curve.

And very flat, actually probably way too simple of a description for

what's really going on in the data.

It doesn't seem to capture this trend of the data.

The values increase and then decrease,

but here we just get a constant fit followed by a decrease.

So, this seems to be under-fit and so

as we expect, what we have is that when lambda is really really small

we get something similar to our least squares solution and when lambda becomes

really really large we start approaching all the coefficients going to 0.

Okay so now what we're gonna do is look at the fit

as a function of a series of different lambda values going from our

1e-25 all the way up to the value of 100.

But looking at some other intermediate values as well to look at what the fit and

coefficients look like as we increase lambda.
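To see the same coefficient shrinkage in a toy setting, a sweep like this one is easy to reason about (this is a simplified sketch with a single feature and no intercept, not the demo's 16th-order model; the one-dimensional closed form below makes the role of lambda explicit):

```python
def ridge_coefficient(x, y, l2_penalty):
    # minimizing sum_i (y_i - w*x_i)^2 + l2_penalty * w^2 in one dimension
    # gives the closed form w = sum(x*y) / (sum(x^2) + l2_penalty)
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + l2_penalty)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x

# sweep lambda from tiny to large, as in the demo
for lam in [1e-25, 1e-6, 1e-3, 1.0, 100.0]:
    print(lam, ridge_coefficient(x, y, lam))
```

As lambda grows, the penalty term in the denominator dominates and the coefficient shrinks monotonically toward zero, matching what the demo shows for the polynomial's coefficients.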

So we're starting with these crazy, crazy large values.

By the time we're at 1e-10 for lambda,

the values have decreased by two orders of magnitude, so they're on the order of 10 to the 4th now.

Then we keep increasing lambda.

1e-6.

And we get values on the order of hundreds for

our coefficients, so in terms of the reasonableness

of these values, I'd say they start looking a little bit more realistic.

And then we keep going and

you see that the value of the coefficients keep decreasing, and when

we get to this value of lambda that's 100 we get these really small coefficients.

But now let's look at what the fits are for these different lambda values.

And here's the plot that we've been showing before for

this really small lambda.

Increasing the lambda a bit, we get a smoother fit, still pretty wiggly and crazy,

especially at these boundary points.

Increase lambda more, things start looking better.

When we get to 1e-3, this looks pretty good.

Especially here, it's hard to tell whether the function should be going up or down.

I want to emphasize that at boundaries where you have few observations,

it's very hard to control the fit, so we trust the fit much more in

intermediate regions of our x range where we have observations.

Okay but then we get to this really large lambda and

we see that clearly we're over smoothing across the data.

So a natural question is, out of all these possible lambda values we might consider,

and all the associated fits, which is the one that we should use for

forming our predictions?

Well, it would be really nice if there were some automatic procedure for

selecting this lambda value instead of me having to go through,

specify a large set of lambdas, look at the coefficients, look at the fit, and

somehow make some judgment call about which one I want to use.

Well, the good news is that there is a way to automatically choose lambda.

And this is something we're gonna discuss later in this module.

So one method that we're gonna talk about is something called

leave one out cross validation.

And what leave one out cross validation does is approximate: minimizing

this leave one out cross-validation error that we're gonna talk about

approximately minimizes the average mean squared error in our predictions.

So, what we're gonna do here is we're gonna

define this leave one out cross-validation function and then apply it to our data.

And this leave one out cross validation function, you're not gonna understand what's going on here yet.

But you will by the end of this module.

You'll be able to implement this method yourself.

But what it's doing is it's looking at

prediction error of different lambda values and then choosing one to minimize.

But of course we're not evaluating that on the, sorry, on the training set or

the test set; we're using a validation set, but

in a very specific way.
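The idea can be sketched before the full derivation: fit on all but one observation, score the held-out point, average over all n choices of held-out point, and pick the lambda with the lowest average error. Here is a minimal pure-Python version (reusing the simplified one-feature, no-intercept ridge closed form for brevity; the assignment's version works with the full polynomial model):

```python
def ridge_coefficient(x, y, l2_penalty):
    # one-feature ridge closed form: w = sum(x*y) / (sum(x^2) + l2_penalty)
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + l2_penalty)

def loocv_error(x, y, l2_penalty):
    # average squared prediction error over n fits,
    # each leaving out exactly one observation
    total = 0.0
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        w = ridge_coefficient(x_train, y_train, l2_penalty)
        total += (y[i] - w * x[i]) ** 2
    return total / len(x)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.2, 3.9, 6.1, 8.3, 9.8]  # roughly y = 2x

lambdas = [1e-25, 1e-6, 1e-3, 1.0, 10.0, 100.0]
best_lambda = min(lambdas, key=lambda lam: loocv_error(x, y, lam))
```

This is exactly the "validation set used in a very specific way" mentioned above: every point takes one turn as the validation set, so no single split decides the winner.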

Okay, so now that we've applied this

leave one out function to our data in some set of specified

penalty values, we can look at what the plot of this leave one

out cross validation error looks like as a function of our considered lambda values.

And in this case, we actually see a curve that's pretty flat

in a bunch of regions.

And what this means is that our fits are not so

sensitive to the choice of lambda in these regions.

But there is some minimum and we can figure out what that minimum is here.

So here we're just selecting the lambda that has the lowest

cross validation error.

And then we're gonna fit our polynomial

ridge regression model using that specific lambda value.

And we're printing our coefficients and

what you see is we have very reasonable numbers.

Things on the order of 1, .2, .5, and let's look at the associated fit.

And things look really nice in this case.

So, there is a really nice trend throughout most of the range of x.

The only place that things look a little bit crazy is out here in the boundary.

But again, at this boundary region we actually don't have any data to really pin

down this function.

So, considering it's a 16th-order polynomial,

we're shrinking coefficients but we don't really

have much information about what the function should do out here.

But what we've seen is that this leave one out cross validation technique

really nicely selects a lambda value that provides a good fit and

automatically does this balance of bias and variance for us.
