Case Study: Predicting House Prices

A course from University of Washington

Machine Learning: Regression

From this lesson

Nearest Neighbors & Kernel Regression

Up to this point, we have focused on methods that fit parametric functions, like polynomials and hyperplanes, to the entire dataset. In this module, we instead turn our attention to a class of "nonparametric" methods. These methods allow the complexity of the model to increase as more data are observed, and result in fits that adapt locally to the observations.

We start by considering the simple and intuitive example of nonparametric methods, nearest neighbor regression: The prediction for a query point is based on the outputs of the most related observations in the training set. This approach is extremely simple, but can provide excellent predictions, especially for large datasets. You will deploy algorithms to search for the nearest neighbors and form predictions based on the discovered neighbors. Building on this idea, we turn to kernel regression. Instead of forming predictions based on a small set of neighboring observations, kernel regression uses all observations in the dataset, but the impact of these observations on the predicted value is weighted by their similarity to the query point. You will analyze the theoretical performance of these methods in the limit of infinite training data, and explore the scenarios in which these methods work well versus struggle. You will also implement these techniques and observe their practical behavior.

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

At the beginning of this module,

we talked about this idea of fitting globally versus fitting locally.

Now that we've seen k nearest neighbors and kernel regression,

I wanna formalize this idea.

So in particular,

let's look at what happens when we just fit a constant function to our data.

So in that case that's just computing what's called a global average, where we

take all of our observations, add them together, and

divide by the total number of observations.

So that's exactly equivalent to summing over a weighted set of our observations,

where the weights are exactly the same on each of our data points, and

then dividing by the total sum of these weights.
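Written out, the equivalence just described looks like the following. The symbols are assumptions chosen here for illustration: y_i for the observed outputs, N for the number of observations, and c for the shared weight, which can be any positive constant because it cancels.

```latex
\hat{f}(x_0)
  \;=\; \frac{1}{N}\sum_{i=1}^{N} y_i
  \;=\; \frac{\sum_{i=1}^{N} c\, y_i}{\sum_{i=1}^{N} c},
  \qquad c > 0 .
```

The prediction does not depend on the target point x_0 at all, which is what makes this fit global.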

So now that we've put our global average in this form, things start to look

very similar to the kernel regression ideas that we've looked at.

Where here it's almost like kernel regression, but

we're including every observation in our fit, and

we're having exactly the same weights on every observation.

So that's like using this boxcar kernel that puts the same weights on all

observations, and just having a really, really massively large

bandwidth parameter, such that for every point in our input space

all the other observations are gonna be included in the fit.
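A minimal sketch of that observation, assuming one-dimensional inputs and NumPy. The function names, the toy data, and the bandwidth value are all hypothetical and chosen only for illustration: with a boxcar kernel and a bandwidth far larger than the spread of the data, the fit at every target point collapses to the global average.

```python
import numpy as np

def boxcar_kernel(dist, bandwidth):
    """Weight 1 inside the window |dist| <= bandwidth, 0 outside."""
    return (np.abs(dist) <= bandwidth).astype(float)

def kernel_average(x0, x, y, bandwidth):
    """Boxcar-weighted average of the outputs y at the target point x0."""
    w = boxcar_kernel(x - x0, bandwidth)
    return np.sum(w * y) / np.sum(w)

# Toy data (made up for this sketch).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# A bandwidth so large that every observation falls in every window.
huge_bandwidth = 1e6
fits = [kernel_average(x0, x, y, huge_bandwidth) for x0 in (-10.0, 0.0, 2.5, 100.0)]

print(fits, y.mean())  # every entry matches the global mean of y
```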

But now let's contrast that with a more standard version of kernel regression,

which leads to what we're gonna think of as locally constant fits.

Because if we look at the kernel regression equation,

what we see is that it's exactly what we had for

our global average, but now it's gonna be weighted by this kernel.

Where in a lot of cases, what that kernel is doing

is putting a hard limit, so that observations outside of the window

around whatever target point we're looking at are left out of our calculation.
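For reference, the kernel regression equation being described can be written as below. The notation is an assumption made here: x_0 is the target point, K_λ is a kernel with bandwidth λ, and (x_i, y_i) are the N observations.

```latex
\hat{f}(x_0)
  \;=\; \frac{\sum_{i=1}^{N} K_\lambda(x_0, x_i)\, y_i}
             {\sum_{i=1}^{N} K_\lambda(x_0, x_i)} .
```

For a kernel with bounded support, K_λ(x_0, x_i) = 0 whenever x_i lies outside the window around x_0, so those observations drop out of both sums. Setting K_λ to the same constant everywhere recovers the global average.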

So the simplest case we can talk about is this boxcar kernel,

that's gonna put equal weights over all observations, but

just local to our target point x_0.

And so, we're gonna get a constant fit but, just at that one target point,

and then we're going to get a different constant fit at the next target point, and

the next one, and the next one.

And I want to be clear that the resulting output isn't

a staircase kind of function.

It's not a collection of constant segments.

It is a collection of constant fits, but each one is used just at a single point.

So we're taking a single point, doing a constant fit,

keeping its value just at that target, and as we do this over

all our different inputs, that's what defines this green curve.
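Here is a small sketch of that procedure, again assuming one-dimensional inputs and NumPy. The name `local_constant_fit`, the toy data, and the bandwidth of 1.0 are hypothetical. A separate constant is fit at each target point, only its value at that target is kept, and the collected values trace out the curve.

```python
import numpy as np

def local_constant_fit(x0, x, y, bandwidth):
    """Boxcar fit at x0: plain average of the outputs whose inputs
    lie within `bandwidth` of the target point."""
    in_window = np.abs(x - x0) <= bandwidth
    if not in_window.any():
        return np.nan  # no observations fall inside the window
    return y[in_window].mean()

# Toy data (made up for this sketch).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0])

# One constant fit per target point; each contributes a single value.
targets = np.linspace(0.0, 5.0, 11)
curve = np.array([local_constant_fit(x0, x, y, bandwidth=1.0) for x0 in targets])

for x0, value in zip(targets, curve):
    print(f"target {x0:.1f}: fit {value:.3f}")
```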

Okay, but if we look at another kernel,

like our Epanechnikov kernel, which has the weights decaying over this fixed region,

well, it is still doing a constant fit, but how is it

figuring out what the level of that line should be at our target point?

Well, what it's doing is,

it's just downweighting observations that are further from our target point and

emphasizing more heavily the observations closer to our target point.

So this is just a weighted average, but it's no longer global, it's

local, because we're only looking at observations within this defined window.

So we're doing this weighted average locally at each one of our input

points and tracing out this green curve.
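To make the downweighting concrete, here is a sketch using the standard Epanechnikov form, 0.75 (1 - u^2) for |u| <= 1 and 0 otherwise, with u the distance divided by the bandwidth. The function names, the toy data, and the bandwidth of 2.0 are assumptions for illustration.

```python
import numpy as np

def epanechnikov_kernel(dist, bandwidth):
    """Weights that decay with distance and hit zero at the window edge."""
    u = dist / bandwidth
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def kernel_regression(x0, x, y, bandwidth):
    """Locally weighted average of y at the target point x0.
    Assumes at least one observation has nonzero weight."""
    w = epanechnikov_kernel(x - x0, bandwidth)
    return np.sum(w * y) / np.sum(w)

# Toy data (made up): the extreme outputs sit at the edge of the window.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 0.0, 2.0, 0.0, 10.0])

weights = epanechnikov_kernel(x - 2.0, bandwidth=2.0)
fit = kernel_regression(2.0, x, y, bandwidth=2.0)

print(weights)  # largest weight at the target, zero at the window edge
print(fit)
```

In this toy example the two observations at the window edge get zero weight, so their large outputs have no influence on the level of the fit at the target.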

So, this hopefully makes very clear how, before,

in the types of linear regression models we were talking about, we

were doing these global fits, which in the simplest case were just a constant model.

That was the most basic model we could consider, having just the constant feature,

and now what we're talking about is doing exactly the same thing but

locally, and so locally that it's at every single point in our input space.

So this kernel regression method that we've described so far,

we've now motivated as fitting a constant function locally at each observation,

well, more than each observation, at each point in our input space.

And this is referred to as locally weighted averages. But

instead of fitting a constant at each point in our input space,

we could have likewise fit a line or polynomial.

And so what this leads to is something that's called locally

weighted linear regression.

We are not going to go through the details of locally weighted linear regression

in this module.

It's fairly straightforward.

It's a similar idea to these local constant fits,

but now plugging in a line or polynomial.
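As a rough sketch of what "plugging in a line" means, one common way to write the local linear fit at a target point x_0 is as a kernel-weighted least squares problem. The notation (K_λ, β_0, β_1, and centering the line at x_0) is an assumption made here, not something the lecture spells out.

```latex
\bigl(\hat{\beta}_0(x_0),\, \hat{\beta}_1(x_0)\bigr)
  \;=\; \arg\min_{\beta_0,\, \beta_1}
        \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,
        \bigl(y_i - \beta_0 - \beta_1 (x_i - x_0)\bigr)^2,
\qquad
\hat{f}(x_0) \;=\; \hat{\beta}_0(x_0).
```

Dropping the slope term β_1 leaves only a constant, and the minimizer is then the kernel-weighted average, which is the local constant fit.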

But I wanted to leave you with a couple rules of thumb for which fit you

might choose among the different polynomial degrees you have as options.

And one thing that fitting a local line instead of a local constant helps you with

is those boundary effects that we talked about before.

The fact that you get these large biases at the boundary.

So you can show very formally that these local linear fits help with that bias, and

if we talk about local quadratic fits, that helps with the bias that you get

at points of curvature in the interior of the input space.

So, for example,

we see that blue curve we've been trying to fit, and if we go back,

maybe it's worth quickly jumping back to what our fit looks like, we see that

towards the boundary we get large biases, and right at the point of curvature, we

also have a bias where we're underfitting the true curvature of that blue function.

And so the local quadratic fit helps with fitting that curvature.

But what it does is it actually leads to a larger variance so

that can be unattractive.

So in general, a basic recommendation is to use standard local

linear regression, fitting lines at every point in the input space.
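The boundary effect mentioned above can be seen in a small sketch. Everything here is an illustrative assumption: the Epanechnikov kernel, the bandwidth of 0.3, the function names, and a noiseless straight-line dataset picked so the true value at the boundary is known. The local linear fit is computed as a kernel-weighted least squares line centered at the target, with the intercept taken as the prediction.

```python
import numpy as np

def epanechnikov_kernel(dist, bandwidth):
    """Weights that decay with distance and vanish outside the window."""
    u = dist / bandwidth
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def local_constant(x0, x, y, bandwidth):
    """Kernel-weighted average at x0."""
    w = epanechnikov_kernel(x - x0, bandwidth)
    return np.sum(w * y) / np.sum(w)

def local_linear(x0, x, y, bandwidth):
    """Kernel-weighted least squares line centered at x0.
    The intercept of that line is the prediction at x0."""
    w = epanechnikov_kernel(x - x0, bandwidth)
    sqrt_w = np.sqrt(w)
    design = np.column_stack([np.ones_like(x), x - x0])
    # Scaling rows by sqrt(w) turns weighted least squares into ordinary least squares.
    coef, *_ = np.linalg.lstsq(design * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef[0]

# Noiseless straight line (made up): the true value at x = 0 is exactly 1.
x = np.linspace(0.0, 1.0, 21)
y = 2.0 * x + 1.0

boundary = 0.0
const_fit = local_constant(boundary, x, y, bandwidth=0.3)
linear_fit = local_linear(boundary, x, y, bandwidth=0.3)

print("local constant at boundary:", const_fit)
print("local linear at boundary:  ", linear_fit)
```

At the left boundary the window only contains observations to the right of the target, so the weighted average is pulled above the true value, while the local line can follow the trend and recovers it. This sketch only illustrates the bias side; it says nothing about the variance trade-off that applies to higher-degree local fits.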

[MUSIC]