Case Study: Predicting House Prices

A course from the University of Washington

Machine Learning: Regression

From this lesson

Ridge Regression

You have examined how the performance of a model varies with increasing model complexity, and can describe the potential pitfall of complex models becoming overfit to the training data. In this module, you will explore a very simple, but extremely effective technique for automatically coping with this issue. This method is called "ridge regression". You start out with a complex model, but now fit the model in a manner that not only incorporates a measure of fit to the training data, but also a term that biases the solution away from overfitted functions. To this end, you will explore symptoms of overfitted functions and use this to define a quantitative measure to use in your revised optimization objective. You will derive both a closed-form and gradient descent algorithm for fitting the ridge regression objective; these forms are small modifications from the original algorithms you derived for multiple regression. To select the strength of the bias away from overfitting, you will explore a general-purpose method called "cross validation".

You will implement both cross-validation and gradient descent to fit a ridge regression model and select the regularization constant.

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

For now, let's assume that somebody gives you the lambda value that we wanna use

in our ridge regression objective and

let's discuss algorithmically how we fit this model.

So in particular, in this part,

we're describing this gray box, our machine learning algorithm.

And to start with, just like we've done many times before,

we're first gonna rewrite our objective in matrix notation,

where we assume we have some N observations and

we wanna write jointly what the objective is for all N of these observations.

So let's just recall what we did for our residual sum of squares term.

Where, for our model, we thought about stacking up all N observations and

all the features associated with those N observations in this green matrix.

And that matrix got multiplied by a vector of our regression coefficients,

and then we had this additive noise per observation.

So we wrote this matrix notation as y = H w + epsilon.

Then, when we went to form our residual sum of squares,

we showed that it was equivalent to the following, where we have (y - Hw)^T (y - Hw).
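As a quick check of that equivalence, here is a minimal NumPy sketch. The toy values for H, w, and y are made up for illustration and are not from the lecture; the function names `rss_loop` and `rss_matrix` are likewise hypothetical.

```python
import numpy as np

# Hypothetical toy data: N = 4 observations, 3 features
# (the first column plays the role of the constant feature).
H = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.5],
              [1.0, 3.0, 1.5],
              [1.0, 4.0, 3.0]])
w = np.array([0.5, 1.0, -2.0])
y = np.array([1.0, 2.0, 0.0, -1.0])

def rss_loop(H, w, y):
    """Residual sum of squares, accumulated one observation at a time."""
    total = 0.0
    for i in range(H.shape[0]):
        residual = y[i] - H[i] @ w
        total += residual ** 2
    return total

def rss_matrix(H, w, y):
    """The same quantity in matrix notation: (y - Hw)^T (y - Hw)."""
    residuals = y - H @ w
    return residuals @ residuals

print(rss_loop(H, w, y), rss_matrix(H, w, y))
```

Both functions should print the same number, since the inner product of the residual vector with itself is just the sum of the squared residuals.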

Okay, so this is our matrix notation for our residual sum of squares term,

but now we wanna do a similar term for our model complexity penalty

that we added to our original objective to get the resulting ridge regression objective.

So in particular, we want to write this two norm of w squared in vector notation.

So this two norm of our w vector squared, we said was equal to

w0 squared + w1 squared + w2 squared + all the way up to our Dth feature squared.

And this is equivalent to taking our w vector

transposed, meaning putting it as a row, and

multiplying by the w vector itself.

Because if we think about doing this multiplication,

we're gonna get w0 * w0 + w1 * w1 + w2 * w2, etc., and that's exactly equivalent.

And so we can write this as w, so I'm trying to write a thick w vector.

w transpose w.

Okay, so this is our vector notation for this model complexity term.
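The three ways of writing this term can be compared numerically. This is only an illustrative sketch; the coefficient values are hypothetical.

```python
import numpy as np

# Hypothetical coefficients w0, w1, w2.
w = np.array([0.5, 1.0, -2.0])

sum_of_squares = sum(w_j ** 2 for w_j in w)  # w0^2 + w1^2 + w2^2
inner_product = w @ w                        # w^T w
norm_squared = np.linalg.norm(w, 2) ** 2     # squared two norm of w

print(sum_of_squares, inner_product, norm_squared)
```

All three expressions should agree up to floating-point rounding.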

So putting it all together, our ridge regression total cost, for

some N observations, can be written as follows, where we

have (y-Hw) transpose * (y-Hw) + lambda * w transpose w.
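The total cost can be sketched as a small function. The name `ridge_cost`, the parameter name `lam`, and the toy data are assumptions made for illustration, not something defined in the lecture.

```python
import numpy as np

def ridge_cost(H, w, y, lam):
    """Ridge total cost: (y - Hw)^T (y - Hw) + lam * w^T w."""
    residuals = y - H @ w
    measure_of_fit = residuals @ residuals  # residual sum of squares
    model_complexity = w @ w                # squared two norm of w
    return measure_of_fit + lam * model_complexity

# Hypothetical toy data: N = 4 observations, 3 features.
H = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.5],
              [1.0, 3.0, 1.5],
              [1.0, 4.0, 3.0]])
w = np.array([0.5, 1.0, -2.0])
y = np.array([1.0, 2.0, 0.0, -1.0])

cost_no_penalty = ridge_cost(H, w, y, lam=0.0)    # reduces to plain RSS
cost_with_penalty = ridge_cost(H, w, y, lam=2.0)  # RSS plus 2 * w^T w
print(cost_no_penalty, cost_with_penalty)
```

With lam set to 0 the function returns the plain residual sum of squares, and a larger lam adds a larger charge for the same coefficients.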

Okay, now that we have this, we can start doing what we've done in the past which is

take the gradient and we can think about either setting the gradient to zero to

get a closed form solution, or doing our gradient descent algorithm.

And we're gonna walk through both of these approaches now.

So the first step is computing the gradient of this objective.

So here I'm just writing exactly what we had on the previous slide.

But, now, with these gradient signs in front, and

as we've seen before, the gradient distributes across a sum.

So we have the gradient of this first term plus the gradient of our model complexity

term, where the first term is our measure of fit, that residual sum of squares term.

And we know that the gradient of

the residual sum of squares has the following form.

-2H transpose (y-Hw).

The question is, what's the gradient of this model complexity term?

What we see is that the gradient of this is 2 * w.
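Putting the two pieces together, and assuming the constant lambda simply stays in front of the penalty's gradient, the gradient of the total cost would be -2 H^T (y - Hw) + 2 lambda w. The sketch below compares that formula against a central finite-difference approximation; the function names and toy data are hypothetical.

```python
import numpy as np

def ridge_cost(H, w, y, lam):
    """Ridge total cost: (y - Hw)^T (y - Hw) + lam * w^T w."""
    residuals = y - H @ w
    return residuals @ residuals + lam * (w @ w)

def ridge_gradient(H, w, y, lam):
    """Gradient of the ridge cost: -2 H^T (y - Hw) + 2 * lam * w."""
    return -2.0 * H.T @ (y - H @ w) + 2.0 * lam * w

def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences, perturbing one coefficient at a time."""
    grad = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        grad[j] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad

# Hypothetical toy data: N = 4 observations, 3 features.
H = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 0.5],
              [1.0, 3.0, 1.5],
              [1.0, 4.0, 3.0]])
w = np.array([0.5, 1.0, -2.0])
y = np.array([1.0, 2.0, 0.0, -1.0])
lam = 2.0

analytic = ridge_gradient(H, w, y, lam)
numeric = numerical_gradient(lambda v: ridge_cost(H, v, y, lam), w)
print(analytic)
print(numeric)
```

The two printed vectors should match closely, which is a cheap sanity check before using the gradient in a closed-form derivation or in gradient descent.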

Why is it 2*w?

Well, instead of deriving it, I'll leave that for a little mini challenge for

you guys.

It's fairly straightforward to derive,

just taking partials with respect to each w component.

Just write w transpose w as w0 squared + w1 squared + blah, blah, blah.

All the way up to wD squared, and

then take a derivative just with respect to one of the w's.
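For readers who want to see the mini challenge worked out, one way to write it down is the following, using the index k for the single coefficient being differentiated (the index name is a choice made here, not the lecture's).

```latex
w^\top w = \sum_{j=0}^{D} w_j^2
\qquad\Longrightarrow\qquad
\frac{\partial}{\partial w_k}\left( w^\top w \right)
  = \frac{\partial}{\partial w_k} \sum_{j=0}^{D} w_j^2
  = 2 w_k ,
\]
\[
\nabla_w \left( w^\top w \right)
  = \begin{bmatrix} 2 w_0 \\ 2 w_1 \\ \vdots \\ 2 w_D \end{bmatrix}
  = 2 w .
```

Only the term with j = k depends on w_k, so every other term in the sum drops out of the partial derivative.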

But for now what I'm gonna do is I'm just gonna draw an analogy to

the 1D case, where w transpose w is analogous to just w squared

if w weren't a vector and were instead just a scalar.

And what's the derivative of w squared?

It's 2w.

Okay, so proof by analogy here.

[MUSIC]