Case Study: Predicting Housing Prices


A course from the University of Washington

Machine Learning: Regression



From this lesson

Feature Selection & Lasso

A fundamental machine learning task is to select amongst a set of features to include in a model. In this module, you will explore this idea in the context of multiple regression, and describe how such feature selection is important for both interpretability and efficiency of forming predictions.

To start, you will examine methods that search over an enumeration of models including different subsets of features. You will analyze both exhaustive search and greedy algorithms. Then, instead of an explicit enumeration, we turn to Lasso regression, which implicitly performs feature selection in a manner akin to ridge regression: a complex model is fit based on a measure of fit to the training data plus a measure of overfitting different from that used in ridge. This lasso method has had impact in numerous applied domains, and the ideas behind the method have fundamentally changed machine learning and statistics. You will also implement a coordinate descent algorithm for fitting a Lasso model.

Coordinate descent is another general optimization technique, which is useful in many areas of machine learning.

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

Up to this point, we've derived our coordinate descent algorithm for least squares, and we've discussed what the variant is for lasso. But we didn't explicitly derive why our coordinate descent algorithm for lasso has that soft thresholding form.

And for completeness, we're going to include the derivation but

as an optional video.

And I really want to emphasize that this video is optional.

The material is quite advanced.

It's actually material that's typically taught in a Ph.D.-level or graduate course.

So, watch it at your own risk.

We think it's still very interesting

if you want to see what's under the hood in the lasso solver.

But that's our little word of warning before we delve into this content,

because, again, this is an example where the level of what we're gonna present in this video is much more technical than what we're assuming for the rest of this specialization.

Okay.

So, let's get into optimizing our lasso objective, which we've written here. And we're doing it again one coordinate at a time, to derive our coordinate descent algorithm.

And in this derivation we're gonna use unnormalized features so

we're not doing the normalization.

This is just our typical hj that's appearing here.

And we're doing that because that's the most general case we can derive and

the normalized case follows directly from this derivation.

Okay, so to start with, let's look at the partial of our residual sum of squares term with respect to wj. We did this derivation when we were deriving coordinate descent for least squares, but we did it for normalized features. So, let's do it now for unnormalized features, where, remember, what we're gonna do is just pull out this wj term. So we're gonna get a sum over i equals 1 to n of hj(xi), and I'm gonna skip a step, assuming that you remember what we did in that previous video, where I'm separating out the sum over k not equal to j of wk hk(xi) from the wj hj(xi) term. And so when we do this multiplication, we're gonna get

partial RSS / partial wj = -2 sum_{i=1}^n hj(xi) (yi - sum_{k != j} wk hk(xi)) + 2 wj sum_{i=1}^n hj(xi)^2.

And when we were talking about normalized features,

we said the sum of the hj squared was equal to one.

But now we're looking at unnormalized features.

So let's define some terms.

Again, even though in this case we're looking at unnormalized features, we're still gonna call this first term rho j. And here, now, we have to define what we're calling this normalizer, and we're gonna call it zj, the sum of hj(xi) squared.

Okay, so the result: going back to my blue colors, we get -2 rho j just like before, but now we get +2 wj zj.

So, that completes our residual sum of squares term.
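These two quantities can be sketched in code. A minimal Python sketch, assuming NumPy and my own hypothetical names (`H` for the n-by-D feature matrix, `y` for the targets, `w` for the current coefficients); this is not the course's own implementation:

```python
import numpy as np

def rho_and_z(H, y, w, j):
    """Compute the two quantities from the RSS partial derivative:
    rho_j = sum_i h_j(x_i) * (y_i - prediction made without feature j)
    z_j   = sum_i h_j(x_i)^2  (the normalizer; equals 1 for normalized features)
    """
    # Prediction using every feature except the jth one.
    prediction_without_j = H @ w - H[:, j] * w[j]
    rho_j = H[:, j] @ (y - prediction_without_j)
    z_j = H[:, j] @ H[:, j]
    return rho_j, z_j
```

With two data points and two features, `rho_and_z` reduces to exactly the sums written above.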

Now, let's turn to our L1 penalty.

And this is where things become more complicated.

So, in particular what's the partial with respect to the absolute value of wj?

Remember, we're thinking of all the other terms as held fixed.

So, we're just looking at this Wj component.

Well, this is where the derivative of the absolute value becomes problematic. Remember, here the derivative is equal to minus one, here the derivative equals plus one, and here we have this problem point at wj = 0.

And we mentioned that instead of thinking about gradients, or these partials, we think about subgradients.

Okay, so now let's talk about these subgradients and formalize this a bit. But let's first mention that we can think of gradients as lower bounding convex functions.

So, if I look at my convex function g at some points a and b, and I take the gradient at point a, that's this tangent plane; this is the gradient of g at a. Then what I have, with g(a) and g(b) labeled here, is that

g(b) >= g(a) + grad g(a) (b - a).

And importantly,

gradients are unique at x if the function is differentiable at x.
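To make this lower-bound property concrete, here is a small numeric check using the convex function g(x) = x^2 (my own example, not from the lecture): the tangent line at any point a stays below the function everywhere.

```python
# For the differentiable convex function g(x) = x^2, the tangent at a
# lower-bounds g: g(b) >= g(a) + g'(a) * (b - a) for every a and b.
def g(x):
    return x ** 2

def g_prime(x):
    return 2 * x

a = 1.5
for b in [-3.0, -0.5, 0.0, 2.0, 10.0]:
    assert g(b) >= g(a) + g_prime(a) * (b - a)
```

The same inequality fails for non-convex functions, which is why convexity is doing real work in this argument.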

While subgradients generalize this idea of gradients to non-differentiable points: a subgradient is any plane that lower bounds the function. So, really, we get a set, the set of all the planes that lower bound the function.

So, we're gonna say that v is in our set, which we denote by this curly d of g at a point x. So, v is a subgradient of g at x if we have that g(b) is greater than or equal to g(x) plus, in place of this gradient here, v, times (b - x). So, for a differentiable point, this was the one plane that lower bounded our function, and here this is just one of the planes that lower bounds our function.

So, this is the definition of a sub-gradient.

And let's look at it in the context of this absolute value function.

So what are all planes that lower bound this absolute value function.

Well, all the planes that I'm gonna draw lower bound this absolute value function: there's this plane, there's this plane, there's this plane with positive slope.

And I could fill this space here with all these planes.

And what are the slopes of these planes?

So, here we know that the slope is equal to -1 and here the slope is equal to +1. And we see that anything that has a slope in the range of -1 to 1, any line with slope in [-1, 1], is gonna lower bound this absolute value function.

So, for the absolute value of x at the point x = 0, we have that v is in the interval [-1, 1]. So, [-1, 1] represents our subgradient of the absolute value of x at 0.
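A quick numeric sanity check of this claim (my own illustration): every line v*x with slope v in [-1, 1] sits below |x| everywhere, which is exactly the subgradient inequality at x = 0.

```python
# Any slope v in [-1, 1] gives a line v*x that lower-bounds |x|,
# so the subdifferential of |x| at x = 0 is the whole interval [-1, 1].
for v in [-1.0, -0.5, 0.0, 0.7, 1.0]:
    for x in [-5.0, -1.0, 0.0, 0.3, 4.0]:
        # Subgradient inequality at 0:  |x| >= |0| + v * (x - 0)
        assert abs(x) >= v * x
```

A slope outside [-1, 1] would violate the inequality on one side (e.g. v = 2 fails at x = 1, since |1| < 2).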

So now that we've had our detour into the world of subgradients, instead of the partial of this L1 term we can talk about the subgradient of this L1 term, where we get lambda times the subgradient.

We already know that the subgradient of the absolute value is the range minus one to one.

So, lambda times the subgradient of the absolute value is going to be the interval from minus lambda to lambda when wj is equal to 0. Remember, the subgradient is defined at that problem point wj = 0, but in the cases where the partial exists, it's just lambda times the derivative: where wj is less than 0, that's lambda times -1, so minus lambda; at wj = 0 we have the interval from minus lambda to lambda; and in the positive half, where wj is greater than 0, we get lambda times +1, so lambda.

Okay, so this is our complete lambda times the subgradient of the L1 term with respect to wj. And

now that we have the partial of our residual sum of squares term and

the subgradient of our L1 term, we can put it all together and

get our subgradient of our entire lasso cost, with respect to wj,

and here this part, this is from our residual sum of squares.

Whereas this part is from the L1 penalty.

Or really, lambda times the L1 penalty.

And when we put these things together we get three different cases.

Because of the three cases for the L1 objective.

So, in the first case we get 2 zj wj - 2 rho j - lambda, where zj is that normalizer. I won't read all of the different cases, but this now is the subgradient of our entire lasso cost.

Okay, well, let's not get lost in the weeds here.

Let's pop up a level and say remember before we would take the gradient and

set it equal to 0, or in the coordinate descent algorithm we talked about

taking the partial with respect to one coordinate

then setting that equal to zero to get the update for that dimension.

Here, now, instead of this partial derivative, we're taking the subgradient, and we have to set it equal to zero (more precisely, require that zero is contained in the subgradient) to get the update for this specific jth coordinate.

So, let's do that, so we've taken our subgradient, we're setting it equal to 0.

Again we have three different cases we need to consider.

So, in Case 1, where wj is less than 0, let's solve for w hat j. You get w hat j = (2 rho j + lambda) / (2 zj), which I'm just gonna rewrite as (rho j + lambda/2) / zj. So, I've multiplied the top and bottom by one-half here.

Okay, but to be in this case, to have w hat j less than zero, we need a constraint on rho j. Remember, rho j is that correlation term, and it's something we can compute because it's a function of all the other variables except wj. So, if rho j is less than minus lambda over 2, then we know that w hat j will be less than 0 according to this formula.

Okay, then we get to the second case, which is wj = 0. In that case, there's only one candidate solution, w hat j = 0.

But in order for that to be the optimum, we know that when wj = 0, the subgradient has to contain zero; otherwise, wj = 0 would never be an optimum. So, we need this range to contain zero so that w hat j = 0 is an optimum of our objective.

And for that to be true, we need -2 rho j + lambda to be greater than or equal to 0. So, we need this upper end of the interval to be at least zero, which is equivalent to saying that rho j is less than or equal to lambda over 2. And we need the bottom end of the interval to be at most zero, so -2 rho j - lambda less than or equal to zero, which is equivalent to saying rho j is greater than or equal to minus lambda over 2. And if we put these together, what this is saying is that rho j is between minus lambda over 2 and lambda over 2, inclusive.

Okay, and our final case, let's just quickly work through it: we get w hat j = (rho j - lambda/2) / zj, and in order to have w hat j be greater than 0, we need rho j to be greater than lambda over 2.

So, let me just talk through this in the other direction, now that we've done the derivation: if rho j is less than minus lambda over 2, we'll set w hat j as in the first case; if rho j is in the interval from minus lambda over 2 to lambda over 2, we'll set w hat j equal to 0; and if rho j is greater than lambda over 2, we'll set w hat j as in the third case.

Okay, so this slide just summarizes what I just said. So, this is our optimal 1-D optimization for the lasso objective.
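The three cases above can be sketched as a single soft-thresholding function. A minimal Python sketch; the names `rho_j`, `z_j`, and `lam` mirror the derivation and are my own, not any official course code:

```python
def soft_threshold(rho_j, z_j, lam):
    """One-coordinate lasso update for unnormalized features:
    soft-threshold rho_j at +/- lambda/2, then divide by the normalizer z_j."""
    if rho_j < -lam / 2:
        return (rho_j + lam / 2) / z_j   # Case 1: w_hat_j < 0
    elif rho_j > lam / 2:
        return (rho_j - lam / 2) / z_j   # Case 3: w_hat_j > 0
    else:
        return 0.0                       # Case 2: exactly zero (sparsity!)
```

For example, with `z_j = 2` and `lam = 2`, a correlation of `rho_j = 0.4` falls inside the interval and is thresholded to exactly zero, while `rho_j = 3` gives `(3 - 1) / 2 = 1.0`.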

So, let's talk about this more general form of the soft thresholding rule for

lasso in the case of our unnormalized features.

So, remember, for our normalized features there wasn't this zj here, and what we ended up with for our least squares solution when lambda was equal to 0 was just this line, w hat j equals rho j.

But now what does our least squares line look like?

Well, again, we can just set lambda equal to 0, and we see that this least squares line is w hat j (least squares) = rho j / zj. Remember, zj is that normalizer, the sum of the squares of the jth feature, so that number will be positive, and it's typically gonna be larger than one.

Potentially much larger than one. So, relative to a slope of 1, a 45-degree line, I'm saying that this least squares line is typically flattened toward the horizontal axis in the case of unnormalized features.

And then, when I look at my lasso solution, w hat lasso in this case: in the range from minus lambda over 2 to lambda over 2, I get this thresholding of the coefficients exactly to zero, relative to my least squares solution. And outside that range, the difference between the least squares solution and my lasso solution is that my coefficients are each shrunk by an amount lambda / (2 zj).

Okay.

But remember that rho_j here as compared to when we talked about

normalized features was defined differently.

It was defined in terms of our unnormalized features.

So, for the same value of lambda that you would use with normalized features, you're getting a different relationship here: a different range where things are set to zero.

So, in summary, we've derived this coordinate descent algorithm for lasso in the case of unnormalized features.

And in particular, the key insight we had was that instead of just taking the partial of our objective with respect to wj, we had to take the subgradient of our objective with respect to wj. And that's what leads to these three different cases, because the gradient itself is defined for every value of wj except for this one critical point, wj = 0.

And importantly, we also gained a lot of insight into how this soft thresholding gives us the sparsity in our lasso solutions.
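Putting the whole derivation together, the coordinate descent loop for lasso with unnormalized features might be sketched like this (assuming NumPy; a simplified illustration with a fixed iteration count rather than a convergence check, and my own variable names rather than the course's):

```python
import numpy as np

def lasso_coordinate_descent(H, y, lam, n_iter=100):
    """Cycle over coordinates, applying the soft-thresholding update
    derived in this video. H is the n x D feature matrix, y the targets."""
    n, D = H.shape
    w = np.zeros(D)
    for _ in range(n_iter):
        for j in range(D):
            # rho_j: correlation of feature j with the residual excluding j.
            prediction_without_j = H @ w - H[:, j] * w[j]
            rho_j = H[:, j] @ (y - prediction_without_j)
            # z_j: the normalizer (1 if features were normalized).
            z_j = H[:, j] @ H[:, j]
            # Three-case soft-thresholding update.
            if rho_j < -lam / 2:
                w[j] = (rho_j + lam / 2) / z_j
            elif rho_j > lam / 2:
                w[j] = (rho_j - lam / 2) / z_j
            else:
                w[j] = 0.0
    return w
```

With `lam = 0` this recovers the least squares solution, and as `lam` grows, more coefficients are thresholded exactly to zero, which is the sparsity the video describes.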

[MUSIC]