Case Study: Predicting House Prices

A course from the University of Washington

Machine Learning: Regression

From this lesson

Feature Selection & Lasso

A fundamental machine learning task is to select amongst a set of features to include in a model. In this module, you will explore this idea in the context of multiple regression, and describe how such feature selection is important for both interpretability and efficiency of forming predictions.

To start, you will examine methods that search over an enumeration of models including different subsets of features. You will analyze both exhaustive search and greedy algorithms. Then, instead of an explicit enumeration, we turn to Lasso regression, which implicitly performs feature selection in a manner akin to ridge regression: a complex model is fit based on a measure of fit to the training data plus a measure of overfitting different from that used in ridge. This lasso method has had impact in numerous applied domains, and the ideas behind the method have fundamentally changed machine learning and statistics. You will also implement a coordinate descent algorithm for fitting a Lasso model.

Coordinate descent is another general optimization technique, which is useful in many areas of machine learning.

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

So, now we're gonna describe what the variant of this coordinate descent

algorithm looks like in the case of lasso.

Again, we're gonna be looking at these normalized features.

And just remember this is where we left off with our

coordinate descent algorithm for just least squares un-regularized regression.

And remember the key point was that we set w hat j equal to rho j,

this correlation between our features and the residuals from a prediction,

leaving j out of the model.
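A minimal Python sketch of the rho j quantity described here, assuming NumPy and a feature matrix whose columns have been normalized to unit length. The names `normalize_columns`, `compute_rho`, `H`, `y`, and `w` are illustrative and do not come from the lecture.

```python
import numpy as np

def normalize_columns(H):
    """Scale each feature column to unit 2-norm; also return the norms."""
    norms = np.linalg.norm(H, axis=0)
    return H / norms, norms

def compute_rho(H, y, w, j):
    """Correlation between feature j and the residuals of the
    prediction made with feature j left out of the model."""
    # Full prediction minus the contribution of feature j.
    prediction_without_j = H @ w - H[:, j] * w[j]
    residual = y - prediction_without_j
    return H[:, j] @ residual

# Tiny illustrative data set (made up for this sketch).
H_raw = np.array([[3.0, 0.0],
                  [4.0, 1.0]])
y = np.array([1.0, 2.0])
H, norms = normalize_columns(H_raw)
w = np.array([5.0, 1.0])

rho_0 = compute_rho(H, y, w, 0)
# For un-regularized least squares with normalized features,
# the coordinate update is simply w[0] = rho_0.
w[0] = rho_0
```

Because feature j's own contribution is subtracted out, rho j does not depend on the current value of w j, only on the other weights.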

Well, in the case of lasso,

how we set w hat j is gonna depend on the value of our tuning parameter lambda

and how that relates to this rho j correlation term.

So, in particular, if rho j is small, if it's in this minus lambda over

2 to lambda over 2 range, where again what small means is determined by lambda,

what we're gonna do is set w hat j exactly equal to zero.

And here we see the sparsity of our solutions coming out directly.

But in contrast, if rho j is really large or, on the flip side, very negative,

meaning that the correlation is either very positive or very negative,

then we're gonna include that feature in the model,

just like we did in our least squares solution, but

relative to our least squares solution, we're gonna decrease the weight.

So, in the positive case, if we have a strong correlation rho j,

instead of putting w hat j equal to rho j,

we're gonna set it equal to rho j minus lambda over 2.

And on the negative side, we're going to add lambda over 2.
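The three cases just described can be sketched as a small Python function. This is only an illustration under the lecture's setup (normalized features, tuning parameter lambda); the name `lasso_coordinate_update` is hypothetical.

```python
def lasso_coordinate_update(rho_j, lam):
    """Set w hat j from rho j for lasso with normalized features."""
    half = lam / 2.0
    if rho_j < -half:
        # Strong negative correlation: keep the feature, add lambda/2.
        return rho_j + half
    if rho_j > half:
        # Strong positive correlation: keep the feature, subtract lambda/2.
        return rho_j - half
    # rho j lies in [-lambda/2, lambda/2]: drop the feature entirely.
    return 0.0
```

With `lam = 0.0` the middle range collapses to a single point and the function returns rho j itself, which is the least squares update.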

So let's look at this function of how we're setting w hat j visually.

Okay, well this operation that we are performing here in these lasso updates is

something called soft thresholding.

And so, let's just visualize this.

And to do this we're gonna make a plot of rho j,

that correlation we've been talking about,

versus w hat j, our coefficient that we're setting.

And remember, in the least squares solution,

we set w hat j equal to rho j.

And we can see that by setting lambda equal to zero.

Remember, lambda equals zero returns us to our least squares solution, so

I'll specifically write least squares there.

So that's why we get this line y equals x, this green line appearing here.

So this represents as a function of rho j how we would set w hat j for

least squares.

And in contrast this fuchsia line here we're showing is for lasso.

And what we see is that in the range minus

lambda over 2 to lambda over 2,

if this correlation is within this range,

meaning that there's not much of a relationship between our feature and

the residuals from predictions without feature j in our model,

we're just gonna completely eliminate that feature.

We're gonna set its weight exactly equal to 0.

But if we're outside that range we're still gonna include the feature

in the model.

But we're gonna shrink the weight on

that feature relative to the least squares solution by an amount lambda over 2.

So this is why it's called soft thresholding:

we're shrinking the solution everywhere, but

we're strictly driving it to zero from minus lambda over 2 to lambda over 2.
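To make the shape of the curve concrete, here is a hedged sketch that evaluates the soft thresholding rule over a grid of rho j values, assuming NumPy. The closed form with `np.sign` and `np.maximum` is an equivalent way of writing the three cases; the grid and the value of `lam` are arbitrary choices for illustration.

```python
import numpy as np

def soft_threshold_curve(rho, lam):
    """Vectorized soft thresholding: shrink |rho| by lam/2, floor at 0."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam / 2.0, 0.0)

lam = 2.0                                 # arbitrary illustrative value
rho_grid = np.linspace(-3.0, 3.0, 13)     # horizontal axis of the plot
w_lasso = soft_threshold_curve(rho_grid, lam)

# Vertical gap between the least squares line (w = rho) and the lasso curve.
shrinkage = np.abs(rho_grid - w_lasso)
inside = np.abs(rho_grid) <= lam / 2.0    # the thresholded range
```

Outside the range the gap is a constant lambda over 2, and inside it the gap equals the magnitude of rho j, because the weight has been pulled all the way to zero.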

And I just want to mention to contrast with,

let me choose a color that we don't have here, I guess red will work.

I wanna contrast with the ridge regression solution, where you can show, which

we're not going to do here, that the ridge regression solution

shrinks the coefficients everywhere, but never strictly to zero.

So this is the line w hat ridge.

And let me just write that this is w hat lasso.

Okay, so here we get a very clear visualization of

the difference between least squares, ridge, and lasso.
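The ridge line is not derived in the lecture. As an assumption for this sketch, take the ridge cost to be the residual sum of squares plus lambda times the squared 2-norm of the weights, with features normalized to unit length; setting the partial derivative with respect to w j to zero then gives w hat j = rho j / (1 + lambda). Under that assumption, the three rules can be compared side by side in Python (function names are illustrative):

```python
import numpy as np

def least_squares_update(rho):
    # lambda = 0: the line w = rho.
    return rho

def ridge_update(rho, lam):
    # Assumed ridge cost: RSS + lam * ||w||_2^2, normalized features.
    # Scales every coefficient toward zero but never reaches it.
    return rho / (1.0 + lam)

def lasso_update(rho, lam):
    # Soft thresholding: exactly zero on [-lam/2, lam/2].
    return np.sign(rho) * np.maximum(np.abs(rho) - lam / 2.0, 0.0)

lam = 2.0
rho = np.array([-3.0, -0.5, 0.5, 3.0])
for r, ls, rg, la in zip(rho,
                         least_squares_update(rho),
                         ridge_update(rho, lam),
                         lasso_update(rho, lam)):
    print(f"rho={r:+.2f}  least squares={ls:+.3f}  ridge={rg:+.3f}  lasso={la:+.3f}")
```

For the small correlations the ridge weight is reduced but stays nonzero, while the lasso weight is exactly zero, which matches the picture described above.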

[MUSIC]