Case Study: Predicting Housing Prices


A course from the University of Washington

Machine Learning: Regression

4,020 ratings

Case Study: Predicting Housing Prices

From the lesson

Feature Selection & Lasso

A fundamental machine learning task is to select amongst a set of features to include in a model. In this module, you will explore this idea in the context of multiple regression, and describe how such feature selection is important for both interpretability and efficiency of forming predictions.

To start, you will examine methods that search over an enumeration of models including different subsets of features. You will analyze both exhaustive search and greedy algorithms. Then, instead of an explicit enumeration, we turn to Lasso regression, which implicitly performs feature selection in a manner akin to ridge regression: a complex model is fit based on a measure of fit to the training data plus a measure of overfitting different from that used in ridge. This lasso method has had impact in numerous applied domains, and the ideas behind the method have fundamentally changed machine learning and statistics. You will also implement a coordinate descent algorithm for fitting a Lasso model.

Coordinate descent is another general optimization technique, which is useful in many areas of machine learning.

- Emily Fox, Amazon Professor of Machine Learning, Statistics
- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

Okay, so how are we going to go about this feature selection task?

Well, one option we have is the obvious choice: search over

every possible combination of features we might want to include in our model and

look at the performance of each of those models.

And that's exactly what the all subsets algorithm does and

we're going to describe this now.

Okay, well the all subsets algorithm starts by considering a model with

absolutely no features in it.

Okay, we remove all the features we might have for our house and

ask: what's the performance of that model?

So just to be clear, start with no features and

there's still a model for no features.

So the model for no features, remember, is just that our observation is simply noise.

Okay, so we can assess the performance of this model on our training data,

and there is some training error associated with a model with no features.

So, we're going to plot that point.
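As a small concrete sketch of this step (the prices below are made-up numbers, purely for illustration): if we keep only an intercept, the least squares fit of the no-feature model predicts the mean of the training targets, so its training error is the targets' total squared deviation from that mean.

```python
# Training error of the no-feature model: with only an intercept, the
# least squares fit predicts the mean of the observed targets.
# These prices are hypothetical, purely for illustration.
import numpy as np

y = np.array([310.0, 450.0, 295.0, 520.0, 380.0])  # house prices, in $1000s

prediction = y.mean()  # best constant prediction under squared error
rss_no_features = np.sum((y - prediction) ** 2)
print(rss_no_features)  # residual sum of squares of the empty model
```

This single number is the point we plot at zero features; every model we try later gets compared against it.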

And then the next thing we're going to do,

is we're going to search over every possible model of just one feature.

So let's say to start with, we consider a model which is number of bedrooms.

And here we're gonna plot the training

error of model fit just with number

of bedrooms as the feature.

Then we're gonna say, well, what's the training error of a model fit just

with number of bathrooms, square feet,

square feet of the lot, and cycle through each one of our possible features.

And at the end of this we can say out of all models with just one feature,

which one fit the training data the best?

And in this case it happened to be the model that included square feet living.

So we've seen that before, that that's a very relevant feature.

Okay, so we're gonna highlight that this is the best-fitting model with only one feature.

And we're gonna keep track of this model and we can discard all the other ones.

Then we're gonna go and search over all models with 2 features.

So search over all combinations of 2 features.

And we're gonna figure out which one of these has the lowest training error,

keep track of that model.

And that happens to be a model that has number of bedrooms and

number of bathrooms.

And that might make sense, because when we're going to search for a property,

often someone will say, I want a three bedroom house with two bathrooms.

So that might be a reasonable choice for the best model with two features.

And I wanna emphasize that the best model with two features doesn't have to

contain any of the features that were

contained in the best model with one feature.

So here, our best model with two features has number of bedrooms,

number of bathrooms.

Whereas in contrast, our best model with just one feature has square feet living.

Okay, so these aren't necessarily nested.

So maybe I'll write this explicitly.

Best model of size k need not contain features of best model of size k minus 1.

Okay, so hopefully that's clear.

And we're gonna continue our procedure, searching over all models with

3 features, all models with 4 features, 5 features, and at some point

we're gonna get to a model that has capital D features.

That's all of the features that we include, and there is only one such model.
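The full enumeration just described can be sketched as follows. The feature names, synthetic data, and helper function here are illustrative assumptions, not the course's own code: for each subset size k from 0 to D, we fit every subset of that size by least squares and keep the one with the lowest training error.

```python
# All-subsets feature selection, sketched with NumPy least squares.
# The feature names and data are synthetic, purely for illustration.
import itertools
import numpy as np

rng = np.random.default_rng(0)
feature_names = ["bedrooms", "bathrooms", "sqft_living", "sqft_lot"]
N, D = 100, len(feature_names)
X = rng.random((N, D))
# Price is driven mostly by sqft_living in this synthetic example.
y = 3.0 * X[:, 2] + rng.normal(scale=0.1, size=N)

def training_rss(cols):
    """Fit least squares on the chosen columns (plus intercept); return the RSS."""
    A = np.column_stack([np.ones(N)] + [X[:, c] for c in cols])
    coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

best_per_size = {}  # k -> (lowest training RSS, best feature subset of size k)
for k in range(D + 1):
    for subset in itertools.combinations(range(D), k):
        rss = training_rss(subset)
        if k not in best_per_size or rss < best_per_size[k][0]:
            best_per_size[k] = (rss, subset)

for k, (rss, subset) in sorted(best_per_size.items()):
    print(k, [feature_names[c] for c in subset], round(rss, 4))
```

Note the cost: there are 2^D subsets in total, which is exactly why the course also covers greedy alternatives to this exhaustive search.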

So it's just one point here.

Then, what we can do is draw this line

connecting the points which represent the set of all best possible models,

each with a given number of features.

Then the question is, which of these models of these best models with

k features do we want to use for our predictions?

Well hopefully it's clear from this course, as well as from this slide,

that we don't just wanna choose the model with the lowest training error,

because as we know at this point, as we increase model complexity,

our training error is gonna go down, and that's what we're seeing in this plot.

So instead, we face the same type of choice that we've had

previously in this course for choosing between models of various complexity.

One choice is,

if you have enough data, you can assess performance on a validation set

that's separate from your training and test sets.

We also talked about doing cross validation.

And in this case there are many other

metrics we can look at for how to think about penalizing model complexity.

There are things called BIC and a long list of other methods that people have for

choosing amongst these different models.

But we're not going to go through the details of that; for

our course, we're gonna focus on this notion of error on the validation set.
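To make that concrete, here is a sketch of picking the number of features k by validation error rather than training error. The data split, the synthetic data, and the table of best-of-size-k subsets are all hypothetical stand-ins for what an all-subsets search would produce.

```python
# Choosing among the best-of-size-k models by validation error.
# Data, split, and candidate subsets are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((120, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=120)

# Hypothetical split: first 80 rows for training, last 40 for validation.
X_train, y_train = X[:80], y[:80]
X_val, y_val = X[80:], y[80:]

def fit(cols):
    """Least squares fit (with intercept) on the training rows for the given columns."""
    A = np.column_stack([np.ones(len(X_train))] + [X_train[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return cols, coef

def val_rss(model):
    """Residual sum of squares of a fitted model on the held-out validation rows."""
    cols, coef = model
    A = np.column_stack([np.ones(len(X_val))] + [X_val[:, c] for c in cols])
    resid = y_val - A @ coef
    return float(resid @ resid)

# Pretend the all-subsets search already gave us the best subset of each size:
best_subsets = {0: (), 1: (0,), 2: (0, 1), 3: (0, 1, 2)}
errors = {k: val_rss(fit(cols)) for k, cols in best_subsets.items()}
chosen_k = min(errors, key=errors.get)
print(chosen_k, errors[chosen_k])
```

Unlike training error, validation error need not keep falling as k grows, which is what lets it pick out a model of intermediate complexity.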

[MUSIC]