Case Study: Predicting House Prices

A course from the University of Washington

Machine Learning: Regression

3,713 ratings

From this lesson

Nearest Neighbors & Kernel Regression

Up to this point, we have focused on methods that fit parametric functions, like polynomials and hyperplanes, to the entire dataset. In this module, we instead turn our attention to a class of "nonparametric" methods. These methods allow the complexity of the model to increase as more data are observed, and result in fits that adapt locally to the observations.

We start by considering a simple and intuitive example of a nonparametric method, nearest neighbor regression: the prediction for a query point is based on the outputs of the most related observations in the training set. This approach is extremely simple, but can provide excellent predictions, especially for large datasets. You will deploy algorithms to search for the nearest neighbors and form predictions based on the discovered neighbors. Building on this idea, we turn to kernel regression. Instead of forming predictions based on a small set of neighboring observations, kernel regression uses all observations in the dataset, but the impact of these observations on the predicted value is weighted by their similarity to the query point. You will analyze the theoretical performance of these methods in the limit of infinite training data, and explore the scenarios in which these methods work well versus struggle. You will also implement these techniques and observe their practical behavior.

- Emily Fox, Amazon Professor of Machine Learning, Statistics

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

[MUSIC]

How are we defining distance?

Well, in 1-d it's really straightforward because our distance

on continuous space is just gonna be Euclidean distance.

where we take our input x_i and

our query x_q and look at the absolute value of their difference.

So, these might represent square feet for two houses and

we just look at the absolute value of their difference.

But when we get to higher dimensions,

there's lots of interesting distance metrics that we can think about.

And let's just go through one that tends to be pretty useful in practice,

where we're going to simply weight the different dimensions differently but

use standard Euclidean distance otherwise.

So, it looks just like Euclidean distance, but

we're going to have different weightings on our different dimensions.

So, just to motivate this, going back to our housing application,

you could imagine that you have some set of different inputs,

which are attributes of the house, like how many bedrooms it has,

how many bathrooms, square feet,

all our standard inputs that we've talked about before.

But when we think about saying which house is most similar to my house,

well, some of these inputs might matter more

than others when I think about this notion of similarity.

So, for example, number of bedrooms, number of bathrooms, square feet of the house

might be very relevant, much more so

than what year the house was renovated when I'm going to assess the similarity.

So, to account for this, what we can do is we can define what's called a scaled

Euclidean distance, where we take the distance between

this vector of inputs, let's call it x_j,

and this vector of inputs associated with our query house, x_q, and

we're gonna look component-wise at their difference squared.

But then we're gonna scale it by some number.

And then we're gonna sum this over all our different dimensions, okay?

So, in particular I'm using this letter a to denote the scaling.

So, a sub d

is the scaling on our dth input, and what this is capturing is

the relative importance of these different inputs in computing this similarity.

And after we take the sum of all these scaled squares, we're gonna take the square root.

And if all these a values were exactly equal to 1, meaning that all our inputs

had the same importance, then this just reduces to standard Euclidean distance.
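The scaled Euclidean distance just described can be written as distance(x_j, x_q) = sqrt(a_1 (x_j[1] - x_q[1])^2 + ... + a_D (x_j[D] - x_q[D])^2). Below is a minimal Python sketch of that formula. The function name, the example houses, and the particular weights are assumptions made for illustration; they do not come from the lecture.

```python
import math

def scaled_euclidean_distance(x_j, x_q, a):
    """Scaled Euclidean distance between input vectors x_j and x_q.

    a[d] is the scaling on the d-th input. With every a[d] equal to 1
    this reduces to standard Euclidean distance.
    """
    if not (len(x_j) == len(x_q) == len(a)):
        raise ValueError("x_j, x_q and a must have the same length")
    return math.sqrt(sum(a_d * (xj_d - xq_d) ** 2
                         for a_d, xj_d, xq_d in zip(a, x_j, x_q)))

# Hypothetical houses described by [bedrooms, bathrooms, year renovated].
house_j = [3, 2, 1990]
house_q = [4, 2, 2010]

# Illustrative weights: year renovated matters far less than the other inputs.
weights = [1.0, 1.0, 0.0001]

unscaled = scaled_euclidean_distance(house_j, house_q, [1.0, 1.0, 1.0])
scaled = scaled_euclidean_distance(house_j, house_q, weights)
print(unscaled, scaled)
```

With all weights equal to 1, the 20-year gap in renovation year dominates the distance. With the small weight on that input, the one-bedroom difference contributes most instead.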

So, this is just one example of a distance metric we can define in multiple

dimensions; there are lots and

lots of other interesting choices we might look at as well. But let's visualize what

impact different distance metrics have on our resulting nearest neighbor fit.

So, if we just use standard Euclidean distance on the data shown here,

we might get this image, which is shown on the right, where the different

colors indicate what the predicted value is in each one of these regions.

Remember, for any point in a given region,

the predicted value is exactly the same because it has the same nearest neighbor.

So, that's why we get these different regions of constant color.
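To make the idea of constant-value regions concrete, here is a small sketch of 1-nearest-neighbor prediction in Python. The training points, the outputs, and the function names are hypothetical and chosen only for illustration.

```python
import math

def euclidean_distance(x, y):
    """Standard Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((xd - yd) ** 2 for xd, yd in zip(x, y)))

def predict_1nn(query, train_inputs, train_outputs, distance=euclidean_distance):
    """Return the output of the training point closest to the query."""
    nearest = min(range(len(train_inputs)),
                  key=lambda i: distance(train_inputs[i], query))
    return train_outputs[nearest]

# Hypothetical 2-D training inputs with one output value each.
train_inputs = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
train_outputs = [100.0, 200.0, 300.0]

# Two queries whose nearest neighbor is (0, 0) get the same prediction.
p1 = predict_1nn((0.5, 0.5), train_inputs, train_outputs)
p2 = predict_1nn((1.0, -1.0), train_inputs, train_outputs)

# A query whose nearest neighbor is (4, 0) gets a different prediction.
p3 = predict_1nn((3.5, 1.0), train_inputs, train_outputs)
print(p1, p2, p3)
```

Every query that shares a nearest neighbor receives that neighbor's output, which is why the plot shows regions of constant color.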

But if we look at the plot on the left hand side, where we're using a different

distance metric, what we see is we're defining different regions, where

again those regions mean that any point within that region is closer to

the one data point lying in that region than any of the other data points

in our training data set. But the way this distance is defined is different, so

the regions look different. So, for example, with this Manhattan distance,

just think of New York and driving along the streets of New York.

It's measuring distance along these axis-aligned directions, so

it's distance along the x direction plus distance along the y direction,

which is a different distance than our standard Euclidean distance.
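As a rough illustration of why the regions change, the Python sketch below compares Manhattan and Euclidean distance on two made-up points. The coordinates and helper names are assumptions, picked so that the two metrics disagree about which point is nearest to the query.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((xd - yd) ** 2 for xd, yd in zip(x, y)))

def manhattan_distance(x, y):
    """Sum of absolute differences along each axis-aligned direction."""
    return sum(abs(xd - yd) for xd, yd in zip(x, y))

def nearest(query, points, distance):
    """Return the point in `points` closest to `query` under `distance`."""
    return min(points, key=lambda p: distance(p, query))

query = (0.0, 0.0)
point_a = (3.0, 3.0)  # Euclidean: sqrt(18), about 4.24; Manhattan: 6
point_b = (0.0, 5.0)  # Euclidean: 5; Manhattan: 5

nearest_euclidean = nearest(query, [point_a, point_b], euclidean_distance)
nearest_manhattan = nearest(query, [point_a, point_b], manhattan_distance)
print(nearest_euclidean, nearest_manhattan)
```

Under Euclidean distance the query's nearest neighbor is point_a, while under Manhattan distance it is point_b, so the same query can fall in a different region depending on the metric.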

[MUSIC]