案例学习：预测房价

Loading...

来自 University of Washington 的课程

机器学习：回归

3589 个评分

案例学习：预测房价

从本节课中

Feature Selection & Lasso

A fundamental machine learning task is to select amongst a set of features to include in a model. In this module, you will explore this idea in the context of multiple regression, and describe how such feature selection is important for both interpretability and efficiency of forming predictions. <p> To start, you will examine methods that search over an enumeration of models including different subsets of features. You will analyze both exhaustive search and greedy algorithms. Then, instead of an explicit enumeration, we turn to Lasso regression, which implicitly performs feature selection in a manner akin to ridge regression: A complex model is fit based on a measure of fit to the training data plus a measure of overfitting different than that used in ridge. This lasso method has had impact in numerous applied domains, and the ideas behind the method have fundamentally changed machine learning and statistics. You will also implement a coordinate descent algorithm for fitting a Lasso model. <p>Coordinate descent is another, general, optimization technique, which is useful in many areas of machine learning.

- Emily FoxAmazon Professor of Machine Learning

Statistics - Carlos GuestrinAmazon Professor of Machine Learning

Computer Science and Engineering

[MUSIC]

But instead of gradient descent or sub-gradient descent,

let's take this as an opportunity to teach a really important alternative

optimization procedure called Coordinate descent.

So, let's just have a little aside on the coordinate decent algorithm, and

then we're gonna describe how to apply coordinate descent to solving our

lasso objective.

So, our goal here is to minimize sub function g.

So, this is the same objective that we have whether we are talking about our

closed form solution, gradient descent, or this coordinate descent algorithm.

But, let me just be very explicit, where.

We're saying we wanna minimize over all possible w some g(w),

where here, we're assuming g(w) is function of multiple variables.

Let's call it g(w0,w1,...,wD).

So this W I'm trying to write in some bold font here.

And often,

minimizing over a large set of variables can be a very challenging problem.

But in contrast, often it's possible to think about

optimizing just a single dimension, keeping all of the other dimensions fixed.

So easy for each coordinate when keeping

others fixed,

because that turns into just a 1D optimization problem.

And so, that's the motivation behind coordinate decent,

where the coordinate descent algorithm, it's really intuitive.

We're just gonna start by initializing our w vector,

which will denote w hat equal to 0.

Or you could use some smarter initialization, if you have it.

And then while the algorithm's not converged,

we're just gonna pick a coordinate out of 0, 1 all the way to d.

And we're gonna set W hat J equal to the W that minimizes.

I'm gonna write this as an omega, so

I'm gonna search over all possible values of

omega that minimizes g of w hat zero dot dot dot,

all the way to W hat j minus one, coma omega,

coma w hat at j plus one all the way up to w hat D.

We're here, these values,

everything with a hat on it are values from previous

iterations of this coordinate to send algorithm.

And our objective here, we're just minimizing over this J coordinate.

Sorry, these J's are pretty sloppy here.

Okay, so whatever omega value minimizes this joint objective,

plugging omega in, is gonna be what we set w hatch a equal to at this iteration,

and then were gonna keep cycling through until conversions.

So, let's look at a little illustration of this just in two dimension.

So, here it would just be W0, W1.

And we're gonna pick a dimension, so let's say,

we choose this optimizing the W1 dimension, keeping W0 fixed.

So I'm just gonna look at this slice, and what's the minimum along this slice?

Well the minimum occurs here,

if I look at the contours this is the minimum along this dimension, and

then what I'm gonna is I'm, in this case just gonna cycle through my coordinates.

I'm next gonna look at keeping W1 fixed and optimizing over W0, and

here was the minimum, I'm gonna drop down to the minimum which is here and

I'm gonna optimize over W1 holding W0 fixed to this value.

And I'm gonna keep taking these steps and

if I look at the path of coordinate descent,

it's gonna look like the following, where I get this axis aligned moves.

So how are we gonna pick which coordinate to optimize next?

Well there are a couple choices.

One is to just choose coordinates randomly.

This is called either random or stochastic coordinate descent.

Another option is round robin, where I just cycle through, zero, one,

two, three, all the way up to D and then cycle through again.

And there are other choices you can make as well.

And, I want to make a few comments about coordinate descent.

The first is, it's really important to note that there is

no stepsize that we're using in this algorithm.

So, that's in contrast to gradient descent,

where we have to choose our stepsize or stepsize schedule.

And hopefully, in the practice that you've had,

you've seen the importance of carefully choosing that step size.

So it's nice not having to choose that parameter.

And this algorithm is really useful in many, many problems.

And you can show that this coordinate descent algorithm converges for

searching objective functions.

One example of this is, if the objective is strongly convexed, and

we talked about strongly convexed functions before.

Another example is, for lasso which is very important and

that's why we're talking about using coordinate descent for lasso.

We're not gonna go through proof of convergence in this case, but

know that it does indeed converge.

[MUSIC]