Case Study: Predicting House Prices

A course from the University of Washington

Machine Learning: Regression

From this lesson

Feature Selection & Lasso

A fundamental machine learning task is to select amongst a set of features to include in a model. In this module, you will explore this idea in the context of multiple regression, and describe how such feature selection is important for both interpretability and efficiency of forming predictions.

To start, you will examine methods that search over an enumeration of models including different subsets of features. You will analyze both exhaustive search and greedy algorithms. Then, instead of an explicit enumeration, we turn to Lasso regression, which implicitly performs feature selection in a manner akin to ridge regression: a complex model is fit based on a measure of fit to the training data plus a measure of overfitting different from that used in ridge. This lasso method has had impact in numerous applied domains, and the ideas behind the method have fundamentally changed machine learning and statistics. You will also implement a coordinate descent algorithm for fitting a Lasso model.

Coordinate descent is another general optimization technique, which is useful in many areas of machine learning.

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

Well, for our third option for feature selection,

we're gonna explore a completely different approach, which is using regularized

regression to implicitly perform feature selection for us.

And the algorithm we're gonna explore is called Lasso.

And it's really fundamentally changed the fields of machine learning, statistics, and

engineering.

It's had a lot of impact in a number of applications.

And it's a really interesting approach.

Let's recall regularized regression in the context of ridge regression first.

There, remember, we were balancing between the fit of our model on our training data

and a measure of the magnitude of our coefficients,

where we said that smaller magnitudes of coefficients

indicated that things were not as overfit as if you had crazy large magnitudes.

And we introduced this tuning parameter,

lambda, which balanced between these two competing objectives.

So for our measure of fit, we looked at residual sum of squares.

And in the case of ridge regression,

when we looked at our measure of the magnitude of the coefficients,

we used what's called the L2 norm, so this is just the two norm squared in this case,

which is the sum of each of our feature weights squared.
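
Written out as an equation, a notational sketch of the objective just described (with N houses, RSS denoting the residual sum of squares, and the sum in the penalty running over the feature weights):

```latex
% Ridge regression total cost: measure of fit + lambda * measure of magnitude
\text{total cost}(\mathbf{w}) =
  \underbrace{\sum_{i=1}^{N} \bigl( y_i - \hat{y}_i(\mathbf{w}) \bigr)^2}_{\text{RSS}(\mathbf{w})}
  \;+\; \lambda \underbrace{\sum_{j} w_j^2}_{\|\mathbf{w}\|_2^2}
```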

Okay, this ridge regression penalty we said encouraged our weights to be small.

But one thing I want to emphasize is that it encourages them to be small but

not exactly 0.

We can see this if we look at the coefficient path that we described for

ridge regression, where we see the magnitude of our coefficients

shrinking and shrinking towards 0, as we increase our lambda value.

And we said that, in the limit as lambda goes to infinity,

the coefficients become exactly 0.

But for any finite value of lambda, even a really really large value of lambda,

we're still just going to have very, very, very small coefficients but

they won't be exactly 0.
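
To see this numerically, here's a minimal sketch using scikit-learn's Ridge (an assumed stand-in; it is not the toolkit this course uses, and it calls lambda `alpha`):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data: 5 features, only 3 of which actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=100)

for lam in [0.1, 10.0, 1000.0]:
    w = Ridge(alpha=lam).fit(X, y).coef_
    # Magnitudes shrink as lambda grows, but none land exactly on 0.
    print(lam, np.round(w, 4))
```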

So why does it matter that they're not exactly 0?

Why am I emphasizing so much this concept of the coefficients being 0?

Well, this is this concept of sparsity that we talked about before,

where if we have coefficients that are exactly 0, well then,

for efficiency of our predictions, that's really important because we can just

completely remove all the features whose coefficients are 0 from

our prediction operation and just use the other coefficients and the other features.

And likewise, for interpretability, if we say that one of the coefficients is

exactly 0, what we're saying is that that feature is not in our model.

So that is doing our feature selection.
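
As a toy illustration of that efficiency gain, with a hypothetical sparse coefficient vector in NumPy:

```python
import numpy as np

# Hypothetical coefficients after feature selection: most are exactly 0.
w = np.array([0.0, 1.5, 0.0, 0.0, -0.7])
x = np.array([2.0, 1.0, 3.0, 4.0, 5.0])  # feature values for one house

nz = np.flatnonzero(w)   # indices of the selected (non-zero) features
print(w[nz] @ x[nz])     # the prediction needs 2 multiplies instead of 5
```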

So a question, though, is: can we use regularization to get at this idea

of doing feature selection, instead of what we talked about before?

Before, when we were talking about all subsets, or greedy algorithms, what we

were doing was searching over a discrete set of possible solutions:

the solution that included the first and the fifth feature, or

the second and the seventh, or this entire collection of discrete solutions.

But what we'd like to ask here is whether we can start with, for

example, our full model.

And then just shrink some coefficients not towards 0, but exactly to 0.

Because if we shrink them exactly to 0, then we're knocking out those

coefficients, we're knocking those features out of our model.

And instead, the non-zero coefficients are going to indicate our selected features.
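
As a preview of that behavior, here's a minimal sketch using scikit-learn's Lasso (again an assumption for illustration; this module later builds its own coordinate descent implementation):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Same synthetic setup as before: 5 features, only 3 truly relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=100)

w = Lasso(alpha=0.5).fit(X, y).coef_
print(np.round(w, 4))                          # some coefficients are exactly 0
print("selected features:", np.flatnonzero(w)) # the non-zeros are our selection
```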

[MUSIC]