Case Study: Predicting Housing Prices


From the course by University of Washington

Machine Learning: Regression

3449 ratings


From the lesson

Feature Selection & Lasso

A fundamental machine learning task is to select amongst a set of features to include in a model. In this module, you will explore this idea in the context of multiple regression, and describe how such feature selection is important for both interpretability and efficiency of forming predictions.

To start, you will examine methods that search over an enumeration of models including different subsets of features. You will analyze both exhaustive search and greedy algorithms. Then, instead of an explicit enumeration, we turn to lasso regression, which implicitly performs feature selection in a manner akin to ridge regression: a complex model is fit based on a measure of fit to the training data plus a measure of overfitting different from that used in ridge. This lasso method has had impact in numerous applied domains, and the ideas behind the method have fundamentally changed machine learning and statistics. You will also implement a coordinate descent algorithm for fitting a lasso model.

Coordinate descent is another general optimization technique, which is useful in many areas of machine learning.
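The description above mentions implementing coordinate descent for lasso. As a preview, here is a minimal NumPy-only sketch of cyclic coordinate descent for the objective RSS(w) + lambda * ||w||_1; the helper names (`soft_threshold`, `lasso_coordinate_descent`) are illustrative, not from the course materials, and it assumes the feature columns have been normalized.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator: the closed-form minimizer of the
    one-dimensional lasso subproblem for a single coefficient."""
    if rho < -lam:
        return rho + lam
    elif rho > lam:
        return rho - lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Cyclic coordinate descent minimizing ||y - Xw||^2 + lam * ||w||_1.
    Assumes feature columns of X are (roughly) normalized."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        for j in range(d):
            # Residual with feature j's current contribution removed
            r_j = y - X @ w + X[:, j] * w[j]
            rho_j = X[:, j] @ r_j
            z_j = X[:, j] @ X[:, j]
            # Threshold at lam/2 because the RSS term contributes 2*z_j*w_j
            w[j] = soft_threshold(rho_j, lam / 2) / z_j
    return w
```

With lam = 0 this reduces to solving the least squares problem one coordinate at a time; a positive lam sets small-magnitude coordinates exactly to zero, which is where the sparsity comes from.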

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

Well, let's see this sparsity in action.

And for this, we're gonna go back to our polynomial regression example.

But instead of just doing least squares for increasing polynomial orders or

looking at ridge regression, we're now gonna look at our lasso solution.

Again, for different values of our tuning parameter lambda.

Here we are back at our polynomial regression demo and

for this lasso regression example,

all we have to do is take our polynomial regression function and modify it,

because we're using GraphLab Create's linear_regression function.

Remember, there was this l2 penalty and l1 penalty.

Well, now we fully understand what these two penalties are and

we know that for ridge, we were looking at the l2 penalty.

But for lasso, the l2 penalty is gonna be set to zero and instead,

we're gonna focus on this l1 penalty.

And in this demo, we're gonna explore different values of that l1 penalty.

And I also wanna mention that here, we're specifying a solver.

So FISTA is one solver.

We're gonna look at other ways of optimizing the lasso objective in this

module, but we can think of this as just a fancy version of gradient descent.

So this is our polynomial lasso regression function and

now we're just gonna explore a set of l1 penalty values.

So again, these are lambda values going from 1 e to the minus 4,

all the way up to a value of 10.
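The GraphLab Create calls from the demo aren't reproduced here; as a rough stand-in, here's a sketch using scikit-learn's `Lasso` on degree-16 polynomial features over a grid of penalties (note that scikit-learn's `alpha` scales the l1 penalty differently than the lecture's lambda, and the toy data below is made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures

# Made-up 1-D data standing in for the demo's dataset
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(4 * x) + 0.1 * rng.normal(size=30)

# Degree-16 polynomial features: 16 powers of x plus the intercept
# that Lasso fits separately, for 17 coefficients in total
X = PolynomialFeatures(degree=16, include_bias=False).fit_transform(x[:, None])

for l1_penalty in [1e-4, 1e-2, 1.0, 10.0]:
    model = Lasso(alpha=l1_penalty, max_iter=100_000).fit(X, y)
    nnz = np.count_nonzero(model.coef_) + 1  # +1 for the intercept
    print(f"penalty {l1_penalty:g}: {nnz} non-zero coefficients")
```

As the penalty grows, the count of non-zero coefficients drops, mirroring the sparsity pattern described in the lecture.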

And like in our ridge regression demo,

we're starting with our 16th order polynomial,

which was that really crazy fit in the least squares unregularized case.

But now, let's think about what is going to happen in this lasso scenario.

And so what we see is that when the penalty strength is

really small, we don't get any sparsity at all.

So we have a 16th order polynomial, and what we're seeing is that all 17 coefficients are non-zero,

cuz there's 16 powers of x plus this intercept term.

So in total, 17 coefficients, and all 17 of them are non-zero, and

that makes sense, because as lambda becomes really, really small,

remember, we default back to our least squares solution

where we don't have any coefficients set exactly to zero.

But as we increase this lambda value, we get more and

more sparsity in our solution.

So the number of nonzeroes here is 14,

then we get five coefficients being nonzero.

And by the time we have a penalty strength of ten,

we only have two of our coefficients being nonzero.

So you can see very explicitly from this how lasso is leading to sparse solutions,

especially as you're increasing the strength of that l1 penalty term.

But now,

let's just look at the fits associated with these different estimated models.

So this is for our very small penalty value.

This function doesn't look as crazy as the least squares solution,

but it's still fairly wiggly.

But remember that in lasso just like in ridge,

the coefficients are shrunk relative to the least squares solution.

So even in this case where we don't have any sparsities and none of

the features have been knocked out of our model, we still have that the coefficients

are a little bit shrunk relative to those of the least squares solution and

that's providing enough regularization to lead to the smoother fit in this case.

But as we increase this lambda value,

we see that we actually do get smoother and smoother fits.

This starts to look like the fit that we had for

our optimal setting of our ridge regression objective or

the one that minimized our leave-one-out cross validation error.

But again, when we get to really large lambdas just like in ridge regression,

we start to have fits that are over smoothing things.

So this was a case where we only had two nonzero coefficients and we see that it's

really just insufficient for describing what's going on in this case.

So again, to choose our lambda value here, we could do the same

leave-one-out cross validation that we did in our ridge regression demo.
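That selection step could be sketched with scikit-learn's `GridSearchCV` and a `LeaveOneOut` splitter (again a stand-in for the GraphLab Create workflow; the data and penalty grid are illustrative, not from the demo):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.preprocessing import PolynomialFeatures

# Illustrative 1-D data and degree-16 polynomial features
rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 30))
y = np.sin(4 * x) + 0.1 * rng.normal(size=30)
X = PolynomialFeatures(degree=16, include_bias=False).fit_transform(x[:, None])

# Search a log-spaced grid of l1 penalties, scoring each by
# leave-one-out squared error (one held-out point per fold)
search = GridSearchCV(
    Lasso(max_iter=100_000),
    {"alpha": np.logspace(-4, 1, 6)},
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
).fit(X, y)
print(search.best_params_)
```

The penalty with the lowest average held-out error is the one you'd refit on all the data.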

But the point that I wanted to show here is how we get

these sparse solutions where we're knocking out, in this case,

different powers of x in our polynomial regression fit.

And this is in contrast to ridge regression,

which simply shrank the coefficients of each one of these powers of x in our

degree 16 polynomial fit.
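That contrast is easy to check numerically: on synthetic data (made up for illustration), ridge shrinks every coefficient but leaves all of them non-zero, while lasso knocks features out entirely. A scikit-learn sketch:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 17 features, only the first 3 truly matter
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 17))
w_true = np.zeros(17)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# Ridge keeps all 17 coefficients (merely shrunk); lasso zeroes many out
print("lasso non-zeros:", np.count_nonzero(lasso.coef_))
print("ridge non-zeros:", np.count_nonzero(ridge.coef_))
```

The l2 penalty's gradient vanishes as a coefficient approaches zero, so ridge never sets one exactly to zero; the l1 penalty's constant-magnitude pull does.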

[MUSIC]