Case Study: Predicting House Prices

A course from University of Washington

Machine Learning: Regression

From this lesson

Feature Selection & Lasso

A fundamental machine learning task is to select amongst a set of features to include in a model. In this module, you will explore this idea in the context of multiple regression, and describe how such feature selection is important for both interpretability and efficiency of forming predictions.

To start, you will examine methods that search over an enumeration of models including different subsets of features. You will analyze both exhaustive search and greedy algorithms. Then, instead of an explicit enumeration, we turn to Lasso regression, which implicitly performs feature selection in a manner akin to ridge regression: A complex model is fit based on a measure of fit to the training data plus a measure of overfitting different than that used in ridge. This lasso method has had impact in numerous applied domains, and the ideas behind the method have fundamentally changed machine learning and statistics. You will also implement a coordinate descent algorithm for fitting a Lasso model.

Coordinate descent is another, general, optimization technique, which is useful in many areas of machine learning.

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

Okay, well, in place of our ridge regression objective,

what if we took our measure of the magnitude of our coefficients

to be what's called the L1 norm,

where we're gonna sum over the absolute value of each one of our coefficients?

So, we actually described this as a reasonable measure of the magnitude of

the coefficients when we were discussing ridge regression last module.

Well, the result of this is something that leads to sparse solutions,

for reasons that we're gonna go through in the remainder of this module.

And this objective is referred to as lasso regression,

or L1 regularized regression.
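
As a minimal sketch of the objective just described, the function below adds lambda times the L1 norm of the coefficients to the residual sum of squares. The name `lasso_objective`, the list-of-rows layout for `X`, and the choice to penalize every coefficient are assumptions made for illustration, not something the lecture specifies.

```python
def lasso_objective(X, y, w, lam):
    """Return RSS(w) + lam * ||w||_1 for feature rows X, targets y, weights w."""
    rss = 0.0
    for row, target in zip(X, y):
        # Linear prediction for this observation.
        prediction = sum(w_j * x_j for w_j, x_j in zip(w, row))
        rss += (target - prediction) ** 2
    # L1 norm: sum of the absolute values of the coefficients.
    l1_norm = sum(abs(w_j) for w_j in w)
    return rss + lam * l1_norm


# Hypothetical toy data, just to show the call.
X = [[1.0, 2.0], [3.0, 4.0]]
y = [1.0, 2.0]
w = [0.5, -0.5]
cost_unregularized = lasso_objective(X, y, w, 0.0)  # penalty term vanishes
cost_regularized = lasso_objective(X, y, w, 2.0)
```

With `lam` set to 0 the function returns the plain residual sum of squares; larger values of `lam` charge more for coefficient magnitude.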

So, just like in ridge regression, lasso is governed by a tuning parameter,

lambda, that controls how much we're favoring sparsity of our solutions

relative to the fit on our training data.

And so, just to be clear, here,

we see that when we're doing our feature selection task,

we're searching over a continuous space, this space of lambda values.

Lambda's governing the sparsity of the solution and that's in contrast to,

for example, the all subsets or greedy approaches,

where we talked about those searching over a discrete set of possible solutions.

So, it's really a fundamentally different approach to doing feature selection.

Okay, but let's talk about what happens to our solution as we vary lambda.

And again, just to emphasize,

this lambda is a tuning parameter that, in this case, is balancing fit and sparsity.

Okay, so if lambda is equal to zero, what's gonna happen?

Well, this penalty term is completely going to disappear, and

our objective is simply going to be to minimize residual sum of squares.

That was our old least squares objective.

So, we're going to get that W hat lasso, as I'll call it,

the solution to our lasso problem,

is going to be exactly equal to W hat least squares.

So, this is equal to our unregularized solution.

And in contrast, if we set lambda equal to infinity,

this is where we are completely favoring

this magnitude penalty and completely ignoring the residual sum of squares fit.

In this case, what's the thing that minimizes the L1 norm?

So, what value of our regression coefficients

is gonna have the sum of the absolute values being the smallest?

Well, again, just like in ridge, when lambda's equal to infinity we're gonna

get W hat lasso equal to the zero vector.

And if lambda is in between, we're gonna get that, in this case, the one norm

of our lasso solution

is gonna be less than or equal to the one norm of our least squares solution, and

it's gonna be greater than or equal to this zero vector.

I mean, this zero number.

Sorry.

Here it's just a number once we've taken this norm.
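
The three cases for lambda can be checked by hand in a simplified setting. The sketch below assumes a single feature and no intercept, which is not the lecture's housing model but does have a closed-form lasso solution: with rho the sum of x_i * y_i and z the sum of x_i squared, minimizing RSS plus lambda times |w| gives a soft-thresholded version of the least squares weight rho / z. The function name and the toy data are hypothetical.

```python
def lasso_one_feature(x, y, lam):
    """Closed-form minimizer of sum((y_i - w*x_i)^2) + lam*|w| for one feature."""
    rho = sum(xi * yi for xi, yi in zip(x, y))
    z = sum(xi * xi for xi in x)
    if rho > lam / 2:
        return (rho - lam / 2) / z
    if rho < -lam / 2:
        return (rho + lam / 2) / z
    return 0.0  # the penalty dominates, so the weight is exactly zero


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
w_least_squares = lasso_one_feature(x, y, 0.0)   # lambda = 0: least squares
w_in_between = lasso_one_feature(x, y, 14.0)     # shrunk toward zero
w_huge_lambda = lasso_one_feature(x, y, 1000.0)  # exactly zero
```

Here the weight moves from 2.0 at lambda = 0, to 1.5 at lambda = 14, to exactly 0.0 for the very large lambda, so its absolute value stays between zero and that of the least squares solution. In this one-feature case the weight already reaches zero at a finite lambda.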

Okay.

So, as of yet, it's not clear why this L1 norm is leading to sparsity,

and we're going to get to that, but let's first just explore this visually.

And one way we can see this is from the coefficient path.

But first, let's just remember the coefficient path for

ridge regression, where we saw that even for a large value of lambda

everything was in our model, just with small coefficients.

So, every W hat J has magnitude

greater than zero, but all W hat J

are small for

large values of our tuning parameter lambda.
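
To make this ridge behavior concrete, here is a sketch for the special case of a single feature and no intercept, an assumption made only so that the solution has a closed form. Minimizing RSS plus lambda times w squared gives w = rho / (z + lambda), with rho the sum of x_i * y_i and z the sum of x_i squared. The function name and data are hypothetical.

```python
def ridge_one_feature(x, y, lam):
    """Closed-form minimizer of sum((y_i - w*x_i)^2) + lam*w^2 for one feature."""
    rho = sum(xi * yi for xi, yi in zip(x, y))
    z = sum(xi * xi for xi in x)
    return rho / (z + lam)


x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
w_small_lambda = ridge_one_feature(x, y, 0.0)     # least squares weight
w_large_lambda = ridge_one_feature(x, y, 1000.0)  # small, but not zero
```

As long as rho is nonzero, dividing by z + lambda makes the weight smaller for larger lambda without ever making it exactly zero for any finite lambda.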

In contrast, when we look at the coefficient path for

lasso, we see a very different pattern.

What we see is that at certain critical values of this tuning parameter lambda,

certain ones of our features jump out of our model.

So, for example, here square feet of the lot size disappears from the model.

Here number of bedrooms, almost simultaneously with number of floors and

number of bathrooms,

followed by the year the house was built.

And, so let me just be clear,

that for, let's say, a value of lambda like this,

we have a sparse set of features included in our model.

So, the ones I've circled

are the only features in our model,

and all the other ones have dropped exactly to zero.

And one thing that we see is that when lambda is very large, like the large value

I showed on the previous plot, the only thing in our model is square feet living.

And note that square feet living still has a significantly large weight on it.

So, I'll say: large weight on square feet living when everything else is out of the model,

meaning not included in the model.

So, square feet living is still very valuable to our predictions,

and it would take quite a large lambda value to say that

even square feet living was not relevant.

Eventually, square feet living would be shrunk exactly to 0,

but for a much larger value of lambda.
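
The pattern of features dropping out one at a time can be reproduced on toy data. The sketch below is an assumption-laden illustration, not the housing data or the course's own code: it uses a plain coordinate descent loop with soft thresholding, no intercept, unnormalized features, and a hypothetical two-feature dataset in which feature 0 is strongly predictive and feature 1 is weakly predictive.

```python
def soft_threshold(rho, half_lam):
    """Shrink rho toward zero by half_lam, snapping to exactly zero inside the band."""
    if rho > half_lam:
        return rho - half_lam
    if rho < -half_lam:
        return rho + half_lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, sweeps=50):
    """Minimize RSS(w) + lam * ||w||_1 by cycling over one coordinate at a time."""
    n_features = len(X[0])
    w = [0.0] * n_features
    # z[j] is the sum of squares of feature j.
    z = [sum(row[j] ** 2 for row in X) for j in range(n_features)]
    for _ in range(sweeps):
        for j in range(n_features):
            # rho: feature j against the residual that leaves feature j out.
            rho = 0.0
            for row, target in zip(X, y):
                pred_without_j = sum(
                    w[k] * row[k] for k in range(n_features) if k != j
                )
                rho += row[j] * (target - pred_without_j)
            w[j] = soft_threshold(rho, lam / 2) / z[j]
    return w


# Hypothetical data: y = 3 * feature0 + 0.5 * feature1, exactly.
X = [[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]]
y = [3.5, 2.5, 3.5, 2.5]

# A tiny "coefficient path": the solution at a few increasing lambda values.
path = {lam: lasso_coordinate_descent(X, y, lam) for lam in [0.0, 2.0, 6.0, 30.0]}
for lam, weights in path.items():
    print(lam, weights)
```

On this toy data the weak feature is exactly zero at lambda = 6 while the strong feature still carries a weight of 2.25, and only at the much larger lambda = 30 does the strong feature drop out as well.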

But if I go back to my ridge regression solution,

I see that I had a much smaller weight on square feet living,

because I was distributing weights across many other features in the model.

So, that individual impact of square feet living wasn't as clear.

[MUSIC]