案例学习：预测房价

Loading...

来自 华盛顿大学 的课程

机器学习：回归

3445 评分

案例学习：预测房价

从本节课中

Nearest Neighbors & Kernel Regression

Up to this point, we have focused on methods that fit parametric functions---like polynomials and hyperplanes---to the entire dataset. In this module, we instead turn our attention to a class of "nonparametric" methods. These methods allow the complexity of the model to increase as more data are observed, and result in fits that adapt locally to the observations. <p> We start by considering the simple and intuitive example of nonparametric methods, nearest neighbor regression: The prediction for a query point is based on the outputs of the most related observations in the training set. This approach is extremely simple, but can provide excellent predictions, especially for large datasets. You will deploy algorithms to search for the nearest neighbors and form predictions based on the discovered neighbors. Building on this idea, we turn to kernel regression. Instead of forming predictions based on a small set of neighboring observations, kernel regression uses all observations in the dataset, but the impact of these observations on the predicted value is weighted by their similarity to the query point. You will analyze the theoretical performance of these methods in the limit of infinite training data, and explore the scenarios in which these methods work well versus struggle. You will also implement these techniques and observe their practical behavior.

- Emily FoxAmazon Professor of Machine Learning

Statistics - Carlos GuestrinAmazon Professor of Machine Learning

Computer Science and Engineering

[MUSIC]

So, these ideas lead directly to a related, but

slightly a different approach called kernel regression.

Let's start by just recalling what we did for weighted k-NN.

And what we did was we took our set of k-NN, and we applied some weight to

each nearest neighbor based on how close that neighbor was to our query point.

Well for kernel regression, instead of just weighting some set of k-NN,

we're gonna apply weights to every observation in our training set.

So in particular, our predicted value is gonna sum over every observation, and

we're gonna have a weight C sub qi on each one of these data points.

And this weight is gonna be defined according to the kernels that we talked

about for weighted k-NN.

And in statistics, this is often called Nadarya-Watson kernel weighted averages.

So let's see what effect this kernel regression has in practice.

And now what we're showing is this yellow, curved region,

represents some kernel that we're choosing.

In this case, it's called the Epanechnikov kernel, and

what we see are some set of red, highlighted observations for

our given target point, in this case called (x, 0).

And I want to emphasize that we didn't set a fixed number of observations to

highlight as red like we did in our k-NN.

Here what we did is we chose a kernel with a given, what's called bandwidth,

that's the lambda parameter we discussed earlier, and that defines a region in

which our observations are gonna be included in our waiting.

Because, in this case, when we're looking at this Epanechnikov kernel,

this kernel has bounded support.

And that's in contrast to, for example, the Gaussian kernel.

And what that means is that there's some region in which points will have this

weighting, these decaying sets of weights.

And then outside this region the observations are completely discarded

from our fit at this one point.

So, what we do is at every target point, (x, 0),

in our input space, we weight our observations by this kernel.

So we compute this weighted average, and we say that is our predicted value.

We do that at each point in our input space to carve

out this green line as our predicted fit.

So the result of this kernel regression isn't very different from than what

the fit would look like from weighted k-NN.

It's just in this case instead of specifying k,

we're specifying a region based on this kernel bandwidth for

weighting the observations to form this fit.

But what we see is, in this case for, for this kernel regression fit,

which like I've alluded to, should look fairly similar to our weighted k-NN.

Things look a lot smoother than our standard k-NN's fit.

So there are two important questions when doing kernal regression.

One is which kernel to use, and

the other is, for a given kernel what bandwidth should I use?

But typically the choice of the bandwidth matters much

more then the choice of kernel.

So to motivate this, let's just look again at this Epanechnikov

kernel with a couple different choices of bandwidths.

And what we see is that the fit dramatically changes.

Here for example, we get a much wilder overfitting.

Here for this lambda value, things look pretty reasonable.

But when we choose a bandwidth that's too large, we start to get over smoothing.

So this is an oversmooth fit to the data.

I'm not making very good predictions.

So we can think of this in terms of this bias variance trade-off.

This lambda parameter is controlling how much we're fitting

our observations, which is having low bias but high variance.

Very sensitive to the observations.

If I change the observations, I get a dramatically different fit for

kernel regression, versus over here for a very large bandwidth, it's the opposite.

I have very low variance as I'm changing my data.

I'm not gonna change the fit very much, but high bias.

Significant differences relative to the true function, which is shown in blue.

But just to show how we're fairly insensitive to the choice of kernel,

here in the middle plot I'm just gonna

change the kernel from this Epanechnikov to our boxcar kernel.

And where the boxcar kernel, instead of having decaying weights with distance,

it's gonna have a fixed set of weights.

But we're only gonna include observations within some fixed window,

just like the Epanechnikov kernel.

So this boxcar kernel starts to look very, very similar to our standard k-NN.

But again, instead of fixing what k is,

you're fixing a neighborhood about your observation.

But what you see is that although this boxcar window has these types of

discontinuities we saw with k-NN, because observations are either in or they're out.

The fit looks fairly similar to our Epanechnikov kernel.

So, this is why we're saying that the choice of bandwidth has a much

larger impact than the choice of kernel.

So this leads to the next important question of, how are we gonna choose our

bandwidth parameter lambda that we said matters so much?

Or when we're talking about k-NN, the equivalent parameter we have to choose is

k, the number of nearest neighbors we're gonna look at.

And again, in the k-NN, I just wanna mention this now,

we saw a similar kind of bias variance trade-off, where for

one nearest neighbors we saw these crazy, wildly, overfit functions.

But once we got to k, for some larger k value,

we had much more reasonable and well behaved fits.

So again, we have the same type of bias variance trade-off for

that parameter as well.

And so the question is, how are we choosing these tuning parameters for

these methods that we're looking at?

Well, its the same story as before, so we don't have to go into

the lengthy conversations that we've had in past modules, and

hopefully you know the answer is cross validation.

Or, using some validation set, assuming you have enough data to do that.

[MUSIC]