案例学习：预测房价

Loading...

来自 University of Washington 的课程

机器学习：回归

3645 个评分

案例学习：预测房价

从本节课中

Nearest Neighbors & Kernel Regression

Up to this point, we have focused on methods that fit parametric functions---like polynomials and hyperplanes---to the entire dataset. In this module, we instead turn our attention to a class of "nonparametric" methods. These methods allow the complexity of the model to increase as more data are observed, and result in fits that adapt locally to the observations. <p> We start by considering the simple and intuitive example of nonparametric methods, nearest neighbor regression: The prediction for a query point is based on the outputs of the most related observations in the training set. This approach is extremely simple, but can provide excellent predictions, especially for large datasets. You will deploy algorithms to search for the nearest neighbors and form predictions based on the discovered neighbors. Building on this idea, we turn to kernel regression. Instead of forming predictions based on a small set of neighboring observations, kernel regression uses all observations in the dataset, but the impact of these observations on the predicted value is weighted by their similarity to the query point. You will analyze the theoretical performance of these methods in the limit of infinite training data, and explore the scenarios in which these methods work well versus struggle. You will also implement these techniques and observe their practical behavior.

- Emily FoxAmazon Professor of Machine Learning

Statistics - Carlos GuestrinAmazon Professor of Machine Learning

Computer Science and Engineering

[MUSIC]

So, we've talked about the performance of nearest neighbor regression,

as you get tons and tons of observations.

And things look pretty good in that case.

Actually, amazingly good, but the picture isn't quite as nice

when we start increasing the dimension of our input space or

if we just have a limited set of observations.

Because the performance of nearest neighbors regression, as we talked about

before, is very sensitive to how well the data covers the input space.

If you remember we had this plot early on where we just removed observations from

some point of our input space and we got a pretty bad, When you're a [INAUDIBLE].

Well, now imagine you have a really a huge, really,

really high dimensional space that you are looking at.

And, you need your observations to cover all points within this space.

Well, you're gonna need an exponentially large number of observations in

the dimension of the space you're looking at in order to cover this space and

have good nearest neighbor performance.

And that's typically really hard to do,

to have as many observations as you would need in these high-dimensional spaces.

So, the cases where you have limited data relative to the dimension of

the input space you're looking at Is where these parametric models that we've

talked about throughout the rest of this course become so important.

And another thing that's important to talk about is the complexity of our

nearest neighbor search and

the naive type of algorithm that we've talked about so far which is just our

search through our entire data set can be quite computationally intensive.

So, for example, if you have just one query point, xq,

and you're just doing one nearest neighbor search,

you have to scan through every single observation in your data set and

compute the distance to each one of those observations.

So, going from our query house to each one of the different houses

in our data set computing this similarity between these houses

in order to find the nearest neighbor.

So, that has complexity, that's linear in the number of observations that we have.

Or if we wanna do our k nearest neighbor as approach,

we have to maintain a list of the k nearest neighbors and

if you do this in this sorted queue that we've talked about.

Then, you can do this search in O(Nlogk).

But both of these, either for our 1-NN or our k-NN,

have complexity linear and the number of observations we have.

But what if N is huge?

Because this is the situation where we've said these k-NN approaches work so well.

Well, that can be really, really intensive to do, and

especially if you have to do many queries.

So, if you wanna predict the values at many points in your input space,

this can be a really, really intensive procedure.

So, instead, we're gonna talk about more efficient methods for

doing this type of nearest neighbor search in our Clustering and

Retrieval course later in this specialization.

[MUSIC]