案例学习：预测房价

Loading...

来自 华盛顿大学 的课程

机器学习：回归

3446 评分

案例学习：预测房价

从本节课中

Feature Selection & Lasso

A fundamental machine learning task is to select amongst a set of features to include in a model. In this module, you will explore this idea in the context of multiple regression, and describe how such feature selection is important for both interpretability and efficiency of forming predictions. <p> To start, you will examine methods that search over an enumeration of models including different subsets of features. You will analyze both exhaustive search and greedy algorithms. Then, instead of an explicit enumeration, we turn to Lasso regression, which implicitly performs feature selection in a manner akin to ridge regression: A complex model is fit based on a measure of fit to the training data plus a measure of overfitting different than that used in ridge. This lasso method has had impact in numerous applied domains, and the ideas behind the method have fundamentally changed machine learning and statistics. You will also implement a coordinate descent algorithm for fitting a Lasso model. <p>Coordinate descent is another, general, optimization technique, which is useful in many areas of machine learning.

- Emily FoxAmazon Professor of Machine Learning

Statistics - Carlos GuestrinAmazon Professor of Machine Learning

Computer Science and Engineering

As a last aside, before we get to our coordinate descent algorithm for Lasso.

Let's first describe, or

rather derive our coordinate descent algorithm for just least

squares regression using these normalized features that we've just described.

So here's our least squares objective,

which is just minimizing our residual sum of squares.

And when we derive our coordinates to send algorithm,

remember that we're going to look at a specific variable, wj.

We're going to hold all other variables fixed.

And we're going to optimize just with respect to wj.

So what we're going to do, is we're going to take the partial with respect to wj.

And then think about setting that equal to zero, to find the optimal value for

wj, fixing everything else.

So this is our statement of just our 1d optimization.

Coordinate by coordinate.

Okay, so

let's write our partial with respect to wj of our residual sum of squares.

And we've derived this in various forms multiple times in this course so

I'm not gonna re-derive it again.

But, I'm just writing it here and

then we're gonna manipulate this into the form that we need in this specific case.

And I wanna highlight those first before I move on that this underbar notation.

It was actually introduced when we're talking about our normalized features.

So, these are our normalized features.

That's what the underbar represents.

So, everything i'm doing here is in the context of normalized features.

Okay, so we can rewrite the subjective.

What I wanna do is, I specifically wanna pull out this wj term.

That's the term I'm solving for, everything else is assumed fixed.

So when I'm thinking about my coordinate distant algorithm,

I know the values of all the other wjs.

I guess I should mention what this notation means,

this minus j means all wk for k not equal to j.

So that's my shorthand saying I'm just taking my w vector and

subtracting off, just removing from that set that jth element wj.

Okay, I'll remove the circle so that I don't cover it up too much.

Okay, so I'm assuming that W minus j, those numbers are known,

those values are known at any given iteration of my algorithm, so

I wanna separate out wj, which is my variable,

and solve for everything in terms of these other quantities.

So let's do that, so I have minus

two Sum i equals one to N, h bar j of xi.

And I'm just gonna rewrite

this as yi minus sum over k

not equal to j Of wk h-bar k of xi.

So that's everything except for wj and

I'm gonna write the wj term this specific j that I'm looking at here, separately.

So h bar j of xi, and then I'm just gonna multiply through.

And so what I'm gonna get is I'm gonna get minus two sum

I equals one to N H bar J X I

times Y I minus sum K not

equal to J W K H bar K X I.

So that's the first term.

And then this minus sign and this minus sign are going to turn into a plus two,

and then the sum over i.

Well wj doesn't depend on i, so that's going to come out, and then I'm going to

have a sum i=1n, and I have two hj's.

I have hj here multiplying this hj here.

So h-bar, j squared, of xi.

Or just to match the notation we did before,

I'll say hj of xi squared.

My features are normalized, so

let me just switch to another color here, lets do purple, so.

What is the sum of the squared values of my normalized features?

Well by definition of normalization, this is equal to one.

Features, this is equal to one.

So, and the other thing I'm gonna do is I'm gonna define this term here.

I'm just gonna use this notation triangle equal meaning,

I'm defining this notation, I'm gonna call this row j.

And the last thing I want to mention is this here.

What is this term here?

I'm looking at the sum over all features not equal to j

of w k times my feature k.

Well that's my predicted value of the ith observation if I

excluded the jth feature from the model.

So this is my predicted value without including this jth feature, okay.

Well let's just rewrite what we've done.

So the result is we get minus two

row j plus two wj times one okay?

So this is my partial with respect to wj of my residual sum of squares.

And what I'm gonna do like I said is I'm gonna set this partial equal to zero and

I'm going to solve for wj.

So we get, remember when we set it equal to zero,

we're going to start calling things w hat.

So, these twos cancel out and

very straight forwardly I get w hat j equals row j.

Okay, so now I've derived my coordinate descent algorithm.

And i'm showing this coordinate descent algorithm here.

So what we see is we just initialize our w hat to zero, while not converged.

We're just gonna cycle through, we're doing round robin here,

cycling through each one of our coordinates.

We're gonna compute this row j term for a given coordinate j,

and then we're simply gonna set w hat j equal to row j.

So let's discuss this row j term and

give a little intuition behind why we're setting w hat j equal to row j.

Well what row j is doing is it's a measure of the correlation

between our future j and this residual between the predicted value.

Where remember that prediction was formed excluding that jth feature from

the model and the true observation.

So we're looking at our prediction,

the difference relative to the truth, when j is not included in the model.

And then we're seeing how correlated is that with the jth feature.

And if that correlation is really high, so if they tend to agree,

then what that means is that jth feature is important for the model.

And it can be added into the model and reduce the residuals.

So equivalently improve our predictions, so what this is going to mean.

If these things tend to agree, row j is going to be really large,

high correlation.

And we're gonna to set w hat j large,

we're gonna put a weight strongly on this feature being in the model.

But in contrast if they're uncorrelated, if they don't tend to agree across

observations, or if the residuals themselves are already very small,

because the predictions are already very good.

Then what it's saying is that row j is gonna be small and as a result,

we're not gonna put much weight on this feature in the model.

[MUSIC]