Case Study: Predicting Housing Prices

A course from the University of Washington

Machine Learning: Regression

3,861 ratings

From the lesson

Ridge Regression

You have examined how the performance of a model varies with increasing model complexity, and can describe the potential pitfall of complex models becoming overfit to the training data. In this module, you will explore a very simple, but extremely effective technique for automatically coping with this issue. This method is called "ridge regression". You start out with a complex model, but now fit the model in a manner that not only incorporates a measure of fit to the training data, but also a term that biases the solution away from overfitted functions. To this end, you will explore symptoms of overfitted functions and use this to define a quantitative measure to use in your revised optimization objective. You will derive both a closed-form and gradient descent algorithm for fitting the ridge regression objective; these forms are small modifications from the original algorithms you derived for multiple regression. To select the strength of the bias away from overfitting, you will explore a general-purpose method called "cross validation".

You will implement both cross-validation and gradient descent to fit a ridge regression model and select the regularization constant.

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

Okay, so our first approach is just gonna be to set our gradient = 0 and

solve for W.

But before we get there,

let's just do a little linear algebra review of what identity matrices are.

So the identity matrix is just the matrix analog of the number 1, and it can be defined in any dimension. So here we show just a scalar, here we show a 2 by 2 matrix, and what we see is that the identity matrix just places 1's along the diagonal and 0's off the diagonal.

And that's true in any dimension, up to having an N by N matrix: we have N 1's on the diagonal, and every other term in the matrix is 0.

So let's discuss a few fun facts about the identity matrix.

Well, if you take the identity matrix and you multiply it by a vector v, and let's say this identity matrix is some N by N matrix and the vector v is N by 1, you're just gonna get the vector v back.

On the other hand, if you multiply this identity matrix by another matrix A, where A is some N by M matrix, you're just gonna get that matrix A back.

Then we can talk about a matrix inverse.

In this case, we're talking about a square matrix, so A⁻¹ and A are both N by N matrices.

[SOUND] And by definition of the matrix inverse, if we take A⁻¹A, then the result is the identity matrix.

That is to say, when we think about dividing scalars, this inverse is like the matrix equivalent of division: if you divide a scalar a by a, you get the number 1, and this is the matrix analog of that.

And then likewise, for an N by N matrix A, if you multiply A by A⁻¹ you also get the identity.

And you can actually use the last few facts to prove this.

You can simply think about post-multiplying both sides by A. On the left we have A times A⁻¹A, and A⁻¹A we know to be the identity matrix, so we're left with A times the identity.

And I should say, just as the identity times a matrix gives that matrix back, a matrix times the identity likewise gives it back.

So here we end up with A times the identity on one side and the identity times A on the other, that is, A = A, which is a proof that this holds here.
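These identity and inverse facts are easy to check numerically. Here is a minimal sketch using NumPy (my addition, not part of the course materials); the specific matrices are arbitrary examples.

```python
# Check the identity-matrix and inverse facts from the lecture with NumPy.
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2

I = np.eye(n)                    # N-by-N identity: 1's on the diagonal, 0's off it
v = rng.standard_normal(n)       # an N-by-1 vector
A = rng.standard_normal((n, m))  # an N-by-M matrix

# Fact 1: I v = v, and I A = A.
assert np.allclose(I @ v, v)
assert np.allclose(I @ A, A)

# Fact 2: for an invertible square matrix B, both B^-1 B and B B^-1
# give the identity.
B = rng.standard_normal((n, n)) + n * np.eye(n)  # kept well-conditioned
B_inv = np.linalg.inv(B)
assert np.allclose(B_inv @ B, np.eye(n))
assert np.allclose(B @ B_inv, np.eye(n))
```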

Okay.

So there are just some fun facts about the identity matrix, as well as inverses, that are gonna be useful in this module, and probably in other modules we have later on as well.

And what we're gonna do now, now that we understand this identity matrix, is simply rewrite the gradient of the total cost with this identity matrix.

So this is exactly the same: all we've done is we've replaced w, this w vector, by the identity times w.

So these are equivalent.

But this is gonna be helpful in our next derivation.

Okay, so now we can take this equivalent form of the gradient of our total cost,

and set it equal to zero.

So the first thing we can do is just divide both sides by

two to get rid of those twos.

And then when we multiply out, we get -Hᵀy + HᵀHw + λIw = 0.

And when we're setting this equal to zero I'm gonna put the hat on the w,

because that's what we're solving for.

So then I can bring this first term to the other side, and I get HᵀHŵ + λIŵ = Hᵀy.

And then what I see is I have w hat appearing in both of these terms.

So I can factor it out.

And I get (HᵀH + λI) times ŵ.

So this is the step where having that identity matrix was useful.

So I hope it was worth everything on the last slide

to get that one little punch line.

Okay, so this equals Hᵀy, and the end result, if we use our little inverse from the previous slide: if I pre-multiply both sides by (HᵀH + λI)⁻¹, then I get ŵ = (HᵀH + λI)⁻¹ Hᵀy.

Okay.

And in particular, I'm gonna call this w hat ridge to indicate that

it's the ridge regression solution for a specific value of lambda.
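The closed-form solution just derived, ŵ_ridge = (HᵀH + λI)⁻¹ Hᵀy, can be sketched in a few lines of NumPy. The toy data and the value of λ below are my own illustrative choices, not from the course; the check at the end verifies that the gradient of the ridge cost is zero at the solution, which is exactly the condition we set up above.

```python
# Sketch of the closed-form ridge regression solution derived in the lecture.
import numpy as np

rng = np.random.default_rng(1)
N, D = 20, 3
H = rng.standard_normal((N, D))        # feature matrix, one row per observation
w_true = np.array([2.0, -1.0, 0.5])    # arbitrary "true" weights for the toy data
y = H @ w_true + 0.1 * rng.standard_normal(N)

lam = 0.5  # regularization strength lambda, chosen arbitrarily here

# w_ridge = (H^T H + lambda I)^{-1} H^T y; np.linalg.solve is used instead
# of forming the inverse explicitly, for numerical stability.
w_ridge = np.linalg.solve(H.T @ H + lam * np.eye(D), H.T @ y)

# Check: the gradient of the ridge cost, -2 H^T (y - H w) + 2 lambda w,
# vanishes at the solution.
grad = -2 * H.T @ (y - H @ w_ridge) + 2 * lam * w_ridge
assert np.allclose(grad, 0)
```

Note that as λ grows, the λI term dominates and shrinks ŵ_ridge toward zero, which is the "bias away from overfitting" discussed in the module overview.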

[MUSIC]