Case Study: Predicting Housing Prices

A course from the University of Washington

Machine Learning: Regression

3517 ratings

From this lesson

Feature Selection & Lasso

A fundamental machine learning task is to select amongst a set of features to include in a model. In this module, you will explore this idea in the context of multiple regression, and describe how such feature selection is important for both interpretability and efficiency of forming predictions.

To start, you will examine methods that search over an enumeration of models including different subsets of features. You will analyze both exhaustive search and greedy algorithms. Then, instead of an explicit enumeration, we turn to Lasso regression, which implicitly performs feature selection in a manner akin to ridge regression: A complex model is fit based on a measure of fit to the training data plus a measure of overfitting different than that used in ridge. This lasso method has had impact in numerous applied domains, and the ideas behind the method have fundamentally changed machine learning and statistics. You will also implement a coordinate descent algorithm for fitting a Lasso model.

Coordinate descent is another, general, optimization technique, which is useful in many areas of machine learning.

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

Okay, well maybe we can just take our ridge regression solution, and

just take all the little coefficients and just say they're 0, just get rid of those.

We're gonna call that thresholding those away.

We're gonna choose some value where, if the magnitude of

the coefficient is below the threshold that we choose,

we're just going to say it's not in the model.
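
As a rough sketch of the thresholding idea just described: the feature names, weight values, and threshold below are made up for illustration and do not come from the lecture. The helper `threshold_features` is likewise a hypothetical name, not part of any library.

```python
import numpy as np

def threshold_features(weights, feature_names, threshold):
    """Return the names of features whose |weight| exceeds the threshold."""
    weights = np.asarray(weights, dtype=float)
    keep = np.abs(weights) > threshold  # compare magnitudes, not signed values
    return [name for name, kept in zip(feature_names, keep) if kept]

# Hypothetical ridge regression weights (illustrative values only).
names = ["sqft_living", "bathrooms", "showers", "year_built"]
weights = [1.2, 0.4, 0.4, -0.9]

selected = threshold_features(weights, names, threshold=0.5)
print(selected)  # ['sqft_living', 'year_built']
```

Note that the comparison is on the magnitude, so a large negative weight such as the one on `year_built` is kept.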

So, let's explore this idea a little bit.

So, here I'm just showing an illustration of a little cartoon of what

the weights might look like on a set of features in our housing application.

And, I'm choosing some threshold which is this dashed black line.

And if the magnitude exceeds that threshold,

then I'm gonna say that feature's in my model.

So here, in pink or fuchsia.

Carlos, what color is this?

>> Fuchsia.

>> Fuchsia.

This is Carlos's color scheme.

He's very attached to it, fuchsia.

So in fuchsia, I'm showing the features that have been selected to be

in my model after doing this thresholding of my ridge regression coefficients.

Might seem like a reasonable approach, but let's dig into this a little bit more.

And in particular, let's look at two very related features.

So if you look at this list of features, you see, in green,

I've highlighted number of bathrooms and number of showers.

So, these numbers tend to be very, very close to one another.

Because lots of bathrooms have showers, and

as the number of showers grows, clearly the number of bathrooms grows

because you're very unlikely to have a shower not in a bathroom.

But what's happened here?

Well, our model has included nothing having to do with bathrooms, or showers,

or anything of this concept.

So that doesn't really make a lot of sense.

To me it seems like something having to do with how many bathrooms are in the house

should be a valuable feature to include when I'm assessing the value of the house.

So what's going wrong?

Well, suppose I hadn't included number of showers.

Let's just for simplicity's sake,

treat the number of showers as exactly equivalent to the number of bathrooms.

It might not be exactly equivalent, but they're very strongly related.

But like I said for simplicity, let's say they're exactly the same.

So if I hadn't included number of showers in my model to begin with,

in the full model, then when I did my ridge regression, it would've placed that

weight that had been on number of showers, on the number of bathrooms.

Because remember, it's a linear model, we're summing over

weight times number of bathrooms plus weight times number of showers.

So if number of bathrooms equals number of showers,

it's equivalent to the sum of those two weights just times number of bathrooms,

excluding number of showers from the model.
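
Written out, the argument looks roughly like this. The symbols are chosen here for illustration and are not notation from the lecture: $w_{\text{bath}}$ and $w_{\text{shower}}$ are the two weights, and $x_{\text{bath}}$ and $x_{\text{shower}}$ are the two feature values.

```latex
\[
w_{\text{bath}}\,x_{\text{bath}} + w_{\text{shower}}\,x_{\text{shower}}
= (w_{\text{bath}} + w_{\text{shower}})\,x_{\text{bath}}
\quad \text{when } x_{\text{shower}} = x_{\text{bath}}.
\]
```

So a model with only number of bathrooms and weight $w_{\text{bath}} + w_{\text{shower}}$ makes the same predictions as the model with both features.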

Okay so, the point here is that if I hadn't included this redundant feature,

number of showers, what I see now visually, is that number of bathrooms

would have been included in my selected model doing the thresholding.

So, the issue that I'm getting at here

is not specific to the number of bathrooms and number of showers.

It's an issue that arises if you have a whole collection, maybe not two,

maybe a whole set, of strongly related features.

More formally, statistically, I will call these strongly correlated features.

In that case, ridge regression is gonna prefer a solution that places a bunch of smaller

weights on all the features, rather than one large weight on one of the features.

Because remember the cost

under the ridge regression model is the size of that weight squared.

And so if you have one really big one, that's really gonna blow up that cost,

that L2 penalty term.

Whereas the fit of the model is gonna be basically about the same.

Whether I distribute the weights over redundant features or

if I put a big one on just one of them and zeros elsewhere.

So what's gonna happen is I'm going to get a bunch of these small weights over

the redundant features.
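
A small numerical sketch of this effect, under simplifying assumptions that are not from the lecture: synthetic data, no intercept, number of showers exactly equal to number of bathrooms, and a hypothetical helper `ridge_fit` that uses the standard closed-form ridge solution. The penalty strength and threshold are arbitrary illustrative values.

```python
import numpy as np

def ridge_fit(X, y, l2_penalty):
    """Closed-form ridge weights: solve (X^T X + lambda * I) w = X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + l2_penalty * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data (assumption): price depends on bathrooms with true weight 3.
rng = np.random.default_rng(0)
bathrooms = rng.integers(1, 5, size=200).astype(float)
price = 3.0 * bathrooms + rng.normal(scale=0.1, size=200)

# Model A: number of bathrooms only.
X_single = bathrooms[:, None]
w_single = ridge_fit(X_single, price, l2_penalty=1.0)

# Model B: bathrooms plus an exact duplicate standing in for showers.
X_dup = np.column_stack([bathrooms, bathrooms])
w_dup = ridge_fit(X_dup, price, l2_penalty=1.0)

# The fit is nearly identical, but the duplicated model splits the weight.
rss_single = np.sum((price - X_single @ w_single) ** 2)
rss_dup = np.sum((price - X_dup @ w_dup) ** 2)
print("single weight:", w_single)      # close to 3
print("duplicated weights:", w_dup)    # each close to 1.5
print("RSS:", rss_single, rss_dup)

# Thresholding at an illustrative value of 2.0 keeps bathrooms in model A
# but discards both bathrooms and showers in model B.
threshold = 2.0
kept_single = np.abs(w_single) > threshold
kept_dup = np.abs(w_dup) > threshold
print(kept_single, kept_dup)
```

The L2 penalty for two weights of about 1.5 is roughly 2 × 1.5² = 4.5, versus about 3² = 9 for a single weight of 3, which is why ridge prefers to spread the weight.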

And if I think about simply thresholding,

I'm gonna discard all of these redundant features.

Whereas one of them, or

potentially the whole set, really were relevant to my prediction task.

So hopefully, it's clear from this illustration that just taking ridge

regression and thresholding out these small weights,

is not a solution to our feature selection problem.

So instead we're left with this question of,

can we use regularization to directly optimize for sparsity?

[MUSIC]