Case Study: Predicting House Prices

A course from the University of Washington

Machine Learning: Regression

From this lesson

Ridge Regression

You have examined how the performance of a model varies with increasing model complexity, and can describe the potential pitfall of complex models becoming overfit to the training data. In this module, you will explore a very simple, but extremely effective technique for automatically coping with this issue. This method is called "ridge regression". You start out with a complex model, but now fit the model in a manner that not only incorporates a measure of fit to the training data, but also a term that biases the solution away from overfitted functions. To this end, you will explore symptoms of overfitted functions and use this to define a quantitative measure to use in your revised optimization objective. You will derive both a closed-form and gradient descent algorithm for fitting the ridge regression objective; these forms are small modifications of the original algorithms you derived for multiple regression. To select the strength of the bias away from overfitting, you will explore a general-purpose method called "cross validation".

You will implement both cross-validation and gradient descent to fit a ridge regression model and select the regularization constant.

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

So how are we gonna do this?

How are we gonna use all of our data as our validation set?

We're gonna use something called K-fold cross validation.

The first step is just a preprocessing step,

where we're gonna take our data and divide it into K different blocks.

And, we have N total observations, so, every block of data is gonna have N over

K observations, and these observations are randomly assigned to each block.

Okay, so, this is really key that we're taking our tabulated data.

And in this image, even though it looks like it just might be parceling out

a table of data, the data in each one of these blocks is randomly assigned.

And for all steps of the algorithm that I'm gonna describe now,

we're gonna use exactly the same data split.

So, exactly the same assignments of

observations to each one of these different blocks.
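The random assignment described above can be sketched in a few lines. This is only an illustration, not code from the lecture: the helper name `make_folds`, the use of NumPy, and the fixed seed are assumptions made here. The fixed seed is one way to make sure exactly the same split is reused at every step.

```python
import numpy as np

def make_folds(n_obs, k, seed=0):
    """Randomly assign n_obs row indices to k blocks of roughly n_obs / k rows each.

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    rng = np.random.default_rng(seed)   # fixed seed: the same split every time
    shuffled = rng.permutation(n_obs)   # random ordering of the row indices
    return np.array_split(shuffled, k)  # k index arrays whose sizes differ by at most 1

folds = make_folds(n_obs=10, k=5)
for i, block in enumerate(folds):
    print(f"block {i}: rows {sorted(block.tolist())}")
```

Each row index lands in exactly one block, and calling `make_folds` again with the same seed gives back the same assignment.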

Okay, then for each one of these K different blocks,

we're gonna cycle through treating each block as the validation set.

And using all the remaining observations to fit the model for

every value of lambda.

So in particular, we're gonna start with a specific value of lambda,

run a procedure for each of the K blocks, and

at the end cycle through all values of lambda.

So for right now, assume that we're looking at a specific lambda value out of

a set of possible values we might look at.

And now we're gonna cycle through each one of our blocks where

at the first iteration we're gonna fit our model using all the remaining data.

That's gonna produce something that I'm calling w hat lambda, so

indexed by this lambda that we're looking at.

So we're considering the first block as our validation set.

Then we're gonna take that fitted model and

we're gonna assess its performance on this validation set.

That's gonna result in some error which I'm calling error sub one.

Meaning the error on the first block of data for this value of lambda.

Okay, so I'm gonna keep track of the error for this value of lambda for

each block, and then I'm gonna do this for every value of lambda.

Okay, so I'm gonna move on to the next block,

treat that as my validation set, fit the model on all the remaining data,

compute the error of that fitted model on that second block of data.

Do this on a third block, fit the model on all the remaining data, assess

the performance on the third block, and cycle through each of my blocks like this.

And at the end, I've tabulated my error

across each of these K different blocks for this value of lambda.

And what I'm gonna do is I'm gonna compute what's called the cross validation error

of lambda,

which is simply an average of the error that I had on each of the K different blocks.

So now I explicitly see how my measure of error, my summary of error for

this specific value of lambda, uses all of the data.

It's an average across the validation sets in each of the different blocks.

Then, I'm gonna repeat this procedure for every value that I'm considering of lambda

and I'm gonna choose the lambda that minimizes this cross validation error.
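The whole procedure, from fitting on the remaining blocks to choosing lambda, could look roughly like the sketch below. Several things here are assumptions rather than anything stated in the lecture: the function names `fit_ridge` and `cv_error`, the synthetic data, the candidate lambda values, the use of mean squared error as the per-block error, and a closed-form ridge fit with no separate intercept term.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge fit: solve (X^T X + lam * I) w = X^T y (no intercept, an assumption)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, folds):
    """Average of the validation errors over all blocks for one value of lambda."""
    errors = []
    for block in folds:
        train = np.setdiff1d(np.arange(len(y)), block)  # all the remaining observations
        w_hat = fit_ridge(X[train], y[train], lam)      # "w hat lambda" for this block
        resid = y[block] - X[block] @ w_hat
        errors.append(np.mean(resid ** 2))              # error on this validation block
    return np.mean(errors)

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=60)

folds = np.array_split(rng.permutation(60), 5)  # one random split, reused for every lambda
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]         # hypothetical candidate values
cv_errors = [cv_error(X, y, lam, folds) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(cv_errors))]
print("chosen lambda:", best_lambda)
```

Note that `folds` is built once, outside the loop over lambda, so every lambda is scored on exactly the same assignment of observations to blocks.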

So I had to divide my data into K different blocks in order to run

this K-fold cross validation algorithm.

So a natural question is what value of K should I use?

Well you can show that the best approximation to the generalization error

of the model is given when you take K to be equal to N.

And what that means is that every block has just one observation.

So this is called leave-one-out cross validation.

So although it has the best approximation of what you're trying to estimate,

it tends to be very computationally intensive,

because what do we have to do for every value of lambda?

We have to do N fits of our model.

And if N is even reasonably large, and

if it's complicated to fit our model each time, that can be quite intensive.

So, instead what people tend to do is use K = 5 or 10,

this is called 5-fold or 10-fold cross validation.
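To make the cost comparison concrete, here is a small count of model fits. The data set size and the number of candidate lambda values are made-up numbers, and `num_fits` is a hypothetical helper; the only thing taken from the lecture is that each value of lambda needs one fit per block.

```python
def num_fits(k, num_lambdas):
    """Total model fits: one fit per block, repeated for every lambda."""
    return k * num_lambdas

n_obs = 20_000     # hypothetical number of observations N
num_lambdas = 10   # hypothetical number of candidate lambda values

loo_fits = num_fits(n_obs, num_lambdas)    # leave-one-out: K = N
ten_fold_fits = num_fits(10, num_lambdas)  # 10-fold
five_fold_fits = num_fits(5, num_lambdas)  # 5-fold

print(loo_fits, ten_fold_fits, five_fold_fits)  # prints: 200000 100 50
```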

Okay, so this summarizes our cross validation algorithm, which is a really,

really important algorithm for choosing tuning parameters.

And even though we discussed this option of forming a training, validation, and

test set, typically you're in a situation where you don't have enough data

to form each one of those.

Or at least you don't know if you have enough data to have an accurate

approximation of generalization error as well as assessing the difference between

different models, so typically what people do is cross validation.

They hold out some test set and then they do either leave-one-out, 5-fold, or

10-fold cross validation to choose their tuning parameter lambda.

And this is a really critical step in the machine learning workflow:

choosing these tuning parameters in order to select a model and use that for

the predictions or various tasks that you're interested in.
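The workflow just described, holding out a test set and then cross validating on the rest, might be sketched as follows. The data, the 80/20 split, the candidate lambda values, and the helper names `fit_ridge` and `mse` are all assumptions for illustration. Refitting on all non-test rows with the chosen lambda is a common convention, not something the lecture spells out.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge fit without an intercept (an assumption of this sketch)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(X, y, w):
    """Mean squared prediction error, used here as the error measure."""
    return float(np.mean((y - X @ w) ** 2))

# Synthetic data, made up for illustration only.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=100)

# 1. Hold out a test set that is never touched while choosing lambda.
order = rng.permutation(100)
test, rest = order[:20], order[20:]

# 2. 5-fold cross validation on the remaining rows to choose lambda.
folds = np.array_split(rest, 5)
lambdas = [0.01, 0.1, 1.0, 10.0]  # hypothetical candidate values
cv = []
for lam in lambdas:
    errs = []
    for i, block in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        errs.append(mse(X[block], y[block], fit_ridge(X[train], y[train], lam)))
    cv.append(np.mean(errs))
best_lambda = lambdas[int(np.argmin(cv))]

# 3. Refit on all non-test rows with the chosen lambda, then score the test set once.
w_final = fit_ridge(X[rest], y[rest], best_lambda)
test_error = mse(X[test], y[test], w_final)
print("chosen lambda:", best_lambda, "test error:", test_error)
```

Because the test rows play no part in selecting lambda, the final test error serves as the estimate of generalization error for the selected model.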

[MUSIC]