案例学习：预测房价

Loading...

来自 华盛顿大学 的课程

机器学习：回归

3440 评分

案例学习：预测房价

从本节课中

Nearest Neighbors & Kernel Regression

Up to this point, we have focused on methods that fit parametric functions---like polynomials and hyperplanes---to the entire dataset. In this module, we instead turn our attention to a class of "nonparametric" methods. These methods allow the complexity of the model to increase as more data are observed, and result in fits that adapt locally to the observations. <p> We start by considering the simple and intuitive example of nonparametric methods, nearest neighbor regression: The prediction for a query point is based on the outputs of the most related observations in the training set. This approach is extremely simple, but can provide excellent predictions, especially for large datasets. You will deploy algorithms to search for the nearest neighbors and form predictions based on the discovered neighbors. Building on this idea, we turn to kernel regression. Instead of forming predictions based on a small set of neighboring observations, kernel regression uses all observations in the dataset, but the impact of these observations on the predicted value is weighted by their similarity to the query point. You will analyze the theoretical performance of these methods in the limit of infinite training data, and explore the scenarios in which these methods work well versus struggle. You will also implement these techniques and observe their practical behavior.

- Emily FoxAmazon Professor of Machine Learning

Statistics - Carlos GuestrinAmazon Professor of Machine Learning

Computer Science and Engineering

[MUSIC]

And, what this motivates us to look at is something called k-Nearest neighbors,

which is a very simple extension off of one nearest neighbors.

And, again going back to our housing application,

it's what real estate agents do.

I said they just look at the nearest house, but

that's not quite true, they tend to go out and get a set of what are called comps.

So, instead of just looking for the closest, most similar

house to yours, they're going to collect a set of houses that are similar.

And look at the values for that set of houses, and say, well,

your house value should be somehow related to these set of houses.

Because the agent is understanding that the one most similar house,

there might have been something strange about the value it got.

Maybe a buyer was really drawn to some specific feature of that house, and

that drove up the cost.

Or maybe it sold at a strange time in the year, and

maybe that gave it an artificially low value.

Sso looking at a larger set of houses to assess the value of yours

can give you a much more reliable estimate.

So, formally k nearest neighbors is gonna take the same data set that we had before

and a query point.

And what it's gonna do,

is it's gonna find the k closest observations in our data set.

So specifically let's call it x nearest neighbor 1,

x nearest neighbor 2, all the way to x nearest neighbor k.

I'm gonna assume that I've ordered these nearest neighbors so

that x nearest neighbor k, is the furthest,

the least similar out of this set to the query house.

Well this set is defined such that for

any xi not in my nearest neighbor set.

Then the distance, between xi and

my query point xq is greater than or

equal to the distance between my

kth nearest neighbor and xq.

So what this means is that, every other observation in the data set,

is further away than my kth nearest neighbor.

That's what defines it as my kth nearest neighbor.

Okay.

So then what do I do for my prediction?

Because when we were doing one nearest neighbor regression,

we were just taking our nearest neighbor looking at the output

associated with that nearest neighbor and saying that was my predicted value.

But now I have a set of nearest neighbors.

And so I need to decide what I'm gonna do with these nearest neighbors and

their associated outputs to form my prediction.

And what k nearest neighbors does,

is it says that predicted value associated with my query point

is simply gonna be the average value of all the nearest neighbor outputs.

So 1 / k y nearest neighbor 1 +

+ y nearest neighbor 2 + all the way up to y nearest neighbor k,

or we can write this more simply as 1/k

sum j=1 to k of y nearest neighbor j.

Okay, so k nearest neighbors,

like I said, is a very simple variant or generalization of one nearest neighbors.

Now, let's talk about how we implement our K nearest neighbor search in practice.

And the setup is exactly like it was for one nearest neighbor, but

instead of outputting just the most similar house,

we're gonna output a set of similar houses, specifically k similar houses.

For completeness I've included what the k nearest neighbor algorithm looks like

where the one key insight here is to maintain a sorted queue of the k

nearest neighbors as you're going through the different steps of your search.

So let's just walk through this algorithm here.

What we're gonna do is we're gonna initialize what

I'm calling distance to k nearest neighbors, which is a list of distances,

the current distances to the k nearest neighbors I have at the current iteration.

And so to start with, we can just look at the first k houses in our data set and

initialize those as our k nearest neighbors.

But we're gonna sort these distances for

these k nearest neighbors for these first k houses.

However these distances sort, we're gonna likewise sort the first k houses in

our data set to be our current estimate of our nearest neighbors.

Then for now, we're gonna just cycle through houses k + 1 to capital N.

What we're gonna do is we're gonna compute the distance of this new house that we

haven't seen before to our query house.

And if that distance is less than the distance to my kth nearest neighbor,

the house that's furthest away,

then what I'm going to do is I'm going to go through my sorted list.

Because remember just because this new house is closer then the kth nearest

neighbor, that doesn't mean that this new house becomes my kth nearest neighbor

because it could be closer than my second nearest neighbor or fourth.

So what I need to do is I need to go through and

find which index this house should go into.

Meaning there's some point at which all houses nearest neighbors one,

two, j minus one are actually closer than this house that I'm looking at now.

But this house is closer than all the other nearest neighbors in my set.

And what I'm gonna do, now that we have this index j, we're gonna go and

try and insert this house that we found that's one of our

current estimates of our nearest neighbor into the queue.

And what we're gonna do is we're simply knock off the kth nearest neighbor,

which we know was our worst estimate of our nearest neighbor so far.

We're gonna go and insert this house into this q.

So this step is just shifting the indices for our list of houses,

just knocking off that kth last house.

And then this step is doing exactly the same thing, but for

the distances associated with each of these nearest neighbors.

So finally, here, this is where we're inserting the new nearest neighbor.

Where, in that jth lost, we're putting the distance to this newly found house.

And in the jth slot of our list of houses, we're putting the house itself.

So once we cycle through all our observations in our data set, what

we're gonna end up with is a set of the k most similar houses to the query house.

And like we described before, when we wanna go and

predict the value associated with our query house,

we're simply gonna average over the values associated with these k nearest neighbors.

[MUSIC]