Case Study: Predicting House Prices


A course from the University of Washington

Machine Learning: Regression

3,861 ratings


From this lesson

Nearest Neighbors & Kernel Regression

Up to this point, we have focused on methods that fit parametric functions, like polynomials and hyperplanes, to the entire dataset. In this module, we instead turn our attention to a class of "nonparametric" methods. These methods allow the complexity of the model to increase as more data are observed, and result in fits that adapt locally to the observations.

We start by considering a simple and intuitive example of a nonparametric method, nearest neighbor regression: the prediction for a query point is based on the outputs of the most related observations in the training set. This approach is extremely simple, but can provide excellent predictions, especially for large datasets. You will deploy algorithms to search for the nearest neighbors and form predictions based on the discovered neighbors. Building on this idea, we turn to kernel regression. Instead of forming predictions from a small set of neighboring observations, kernel regression uses all observations in the dataset, but weights the impact of each observation on the predicted value by its similarity to the query point. You will analyze the theoretical performance of these methods in the limit of infinite training data, explore the scenarios in which these methods work well versus struggle, and implement these techniques to observe their practical behavior.
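The weighted-average idea behind kernel regression can be sketched in a few lines. This is a minimal illustration of the Nadaraya-Watson estimator with a Gaussian kernel, not the course's own code; the function name and the `bandwidth` parameter are chosen here for illustration.

```python
import numpy as np

def kernel_regression(x_query, x_train, y_train, bandwidth):
    """Predict at x_query as a similarity-weighted average of ALL
    training outputs (Nadaraya-Watson with a Gaussian kernel)."""
    # Weight each training point by its similarity to the query:
    # nearby points get weights near 1, distant points near 0.
    weights = np.exp(-0.5 * ((x_train - x_query) / bandwidth) ** 2)
    return np.sum(weights * y_train) / np.sum(weights)

x_train = np.array([0.0, 1.0, 2.0, 3.0])
y_train = np.array([0.0, 1.0, 4.0, 9.0])

# The query at 1.5 sits between the points at 1.0 and 2.0, so the
# prediction is dominated by their outputs (1 and 4).
print(kernel_regression(1.5, x_train, y_train, bandwidth=0.5))
```

The bandwidth plays the role of a smoothing knob: a small bandwidth makes the fit behave like nearest neighbors, while a large one averages over the whole dataset.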

- Emily Fox, Amazon Professor of Machine Learning, Statistics
- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering


Well, we've talked about what one Nearest Neighbors is doing intuitively; now let's talk about how to actually implement one Nearest Neighbors in practice.

So to perform our Nearest Neighbor search,

what we have to do is specify a query input, which will be, for example, some house whose value we're interested in assessing.

And then we also need to provide our data set, our set of training houses.

And we also need to specify a distance metric,

which is our way of measuring the similarity between houses.

And then the output of our one Nearest Neighbor search is

gonna be the most similar house.

And we can use the value associated with that house as our prediction for

our query house.

So, more explicitly, in our one nearest neighbor algorithm, we can initialize what I'm calling the distance to the nearest neighbor to be infinity, and initialize our closest house to be the empty set.

Then what we do is we're going to step through every house in our dataset.

And we're going to compute the distance from our query house to the house at our current iteration. And if that distance is less than the current distance to our nearest neighbor, which at first is infinity, then at the first iteration you're definitely gonna have a closer house.

So the first house that you search over you're gonna choose as your

nearest neighbor for that iteration.

But remember, you're gonna continue on.

And so what you're gonna do is,

if the distance is less than the distance to your nearest neighbor,

you're gonna set your current nearest neighbor equal to that house, or that x.

And then you're gonna set your distance to nearest neighbor

equal to the distance that you had to that house.

And then you're just gonna iterate through all your houses.

And in the end what you're gonna return is the house that was most similar.

And then we can use that for prediction to say that the value associated with that

house is the value we're predicting for our query house.
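The loop just described can be sketched in a few lines. This is a minimal illustration, not the course's own code: the function name, the one-feature house representation, and the absolute-difference distance metric are all chosen here for simplicity.

```python
import math

def one_nearest_neighbor(query, houses, values, distance):
    """Return the value of the training house closest to `query`,
    following the lecture's algorithm: start with the distance to the
    nearest neighbor at infinity, scan every house, and keep the
    closest one seen so far."""
    best_dist = math.inf   # distance to nearest neighbor so far
    best_value = None      # value of the closest house so far
    for x, y in zip(houses, values):
        d = distance(query, x)
        if d < best_dist:  # found a closer house: update both
            best_dist = d
            best_value = y
    return best_value

# Tiny example: houses described by square footage only,
# with absolute difference as the distance metric.
houses = [1000, 1500, 2000, 3000]
prices = [200_000, 300_000, 400_000, 550_000]
dist = lambda a, b: abs(a - b)

print(one_nearest_neighbor(1600, houses, prices, dist))  # nearest house: 1500 sq ft
```

In practice the distance metric would combine many features (square footage, number of bedrooms, lot size, and so on), but the scan-and-keep-the-minimum structure is the same.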

So let's look at what this gives us on some actual data.

So, here, we drew some set of observations from a true function that

had this kind of curved shape to it.

And, the blue line indicates the true function

that was used to generate this data.

And what the green represents is our one nearest neighbor fit.

And what you can see is that the fit looks pretty good for

data that's very dense in our input space.

So, dense in x.

We get lots of observation across our whole input space.

But if we just removed some observations in a region of our input space, things start to look not as great, because Nearest Neighbors really struggles to interpolate across regions of the input space where you have few or no observations.

And likewise, if we look at a data set that's much noisier,

we see that our Nearest Neighbor fit is also quite wild.

So this looks exactly like the types of plots we

showed when we talked about models that were overfitting our data.

So what we see is that one Nearest Neighbors is also really sensitive to noise in the data.
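The noise sensitivity above has a simple explanation: at every training input, one Nearest Neighbors returns that point's own noisy observation, so the fit passes through all the noise and the training error is exactly zero. A small self-contained demonstration (the helper name and the sine-plus-noise data are illustrative, not from the course):

```python
import math
import random

def one_nn_predict(xq, xs, ys):
    """Return the output of the training point closest to xq."""
    best_d, best_y = math.inf, None
    for x, y in zip(xs, ys):
        if abs(x - xq) < best_d:
            best_d, best_y = abs(x - xq), y
    return best_y

random.seed(0)
xs = [i / 10 for i in range(30)]
# Noisy observations of a smooth underlying function.
ys = [math.sin(x) + random.gauss(0, 0.3) for x in xs]

# At every training input, 1-NN returns the noisy observation itself:
# zero training error, the signature of an overfit model.
assert all(one_nn_predict(x, xs, ys) == y for x, y in zip(xs, ys))
print("1-NN reproduces every noisy training point exactly")
```

This is why the green 1-NN fit in the plots looks "wild" on noisy data: it chases every noisy observation instead of averaging the noise away.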
