0:15

Hello. This lesson we're going to introduce one of

the first machine learning algorithms that we'll be discussing in more detail.

In particular, it's the k-nearest neighbors or k-nn algorithm.

This algorithm is important for a couple of reasons.

First of all, it's a very simple algorithm to understand.

And second of all, it's known as a lazy evaluator or non-parametric technique

that simply uses the properties of

the training data to make predictions about new test data.

The key aspect here is that we don't actually build a model with this algorithm.

Instead, it uses the training data during

the prediction phase to quantify or make predictions about that new data.

So, by the end of this lesson,

you should be able to articulate what's going on with the k-nearest neighbor algorithm.

You should also be able to apply a k-nearest neighbor algorithm for

classification or regression tasks by using the Scikit-learn Library.

And you should know where you can apply this algorithm

successfully and where you probably need to use a different algorithm.

Now, the entirety of this lesson is

contained in the introduction to k-nearest neighbors notebook.

So what is this notebook going to do?

Well, we're going to talk about this algorithm,

and we're going to move on to other ideas that are

applicable to other algorithms as well because

the k-nn provides a nice demonstration of some of these.

So, what is this k-nn algorithm?

If you think about the way where data comes from,

you can understand that often,

data points that are near each other are related.

I like to use the example of people.

If you think about people who are similar to you,

you typically live near people that are similar to you either because of income,

educational level or religious or political beliefs.

Thus, if you take a census and you look at people that live in the same neighborhood,

there's often a lot of similarities between those people.

Thus if somebody new moves into that neighborhood,

and you want to understand what are they probably like,

the easiest thing to do is to average over their neighbors.

Now, as you add more neighbors,

you can sometimes get a more accurate prediction.

But on the other hand if you start going too far away,

you may start getting other neighborhoods that

have other properties and you may lose accuracy,

so that number of neighbors or distance that you expose

your algorithm to making predictions can be important.

So, what are some things that we have to understand?

First, we have to understand how to compute distances,

and there's actually different ways to compute distances,

that will be important in trying to find the nearest neighbors.

We'll also have to look at something known as the curse of dimensionality.

And the reason we talk about that is because usually

you would think as you add more and more features or dimensions to your data,

you'll be able to make bigger predictions.

But, sometimes that can actually slow down

the computations because it becomes harder and harder to, in this case,

find nearest neighbors or alternatively,

you have to get more and more training data to adequately

sample the space that your data actually can occupy.

So, we'll do this by talking about some of those basic concepts.

We will introduce the idea of Helper code,

and I put these in a separate file that you can open.

But we're not going to go into too much detail,

they basically do things like load data,

do some of that pre-processing,

maybe they make visualization plots.

Then we'll look at k-nearest neighbors for classification.

For that, we're going to talk about some performance metrics,

we're going to talk about some of the hyper-parameters that might be important.

And then we're going to go on and look at regression.

So we start with our standard import,

and then we move into the actual formalism of this method.

So what I'm doing here is I'm actually generating some data,

and I'm transforming it into a different space and making a plot.

So here, I have our data and I say show me the five nearest neighbor.

So here's my green point.

That's my special point.

Might say this, blue are all the training data and green is now a test data point,

and I say find me the five nearest neighbors.

And you can see here's the five nearest neighbors.

If I had only said four,

then perhaps this point would have been picked.

And if I had said six,

probably this point would have been picked.

And as you see, as the number of neighbors changes,

the radius of a circle centered on

our test point will get bigger and include more and more points.

And we'll be averaging over a greater area.

That can be both good and bad,

and that's one of the reasons the number of

neighbors is a hyper-parameter that we have to tune.

Next step is what are distance measurements?

Typically, you might just say we'll use a Pythagorean theorem: x squared minus x1,

x2 minus x1 quantity square plus y2 minus y1 quantity squared square root.

You've seen that for years,

but that's really only applicable for data that followed the Euclidean Distance Metric.

And so if you've plotted points on a Cartesian plot,

that would be a good example of where you would do that.

But you might want to do something different for other types of data.

If it's currency data or even categorical or text data,

what kind of distances would you use?

For this, we define the idea of a metric

which is how do you measure distance for a particular type of data.

And there's a number of different metrics.

One is the Euclidean that we saw,

that's the, what we know as the L2-norm.

Another is Manhattan, and this is where you restrict distances to be on grid lines.

So if you think about it the idea this came from

taxicab distance on the island of Manhattan where the streets are all laid out in a grid,

one block square, and you had to stay on the streets.

You couldn't drive through buildings and so

the idea of this then is known as the L1-norm.

There's also some other ones I mentioned a few here, the haversine.

This is for distances traveled on the surface of a sphere.

If you've ever seen the the path of

an airline will take traveling on a flat map, it looks curved.

The reason for that is because it's actually

traveling across the surface of a sphere of the earth.

So when you flatten that out,

it actually the shortest distance between New York and

Los Angeles or San Francisco and Beijing,

is actually a curved path,

and we use the haversine metric for that.

Then there's some other ones, the Chebyshev and the Minkowski, as well as others.

And so we demonstrate calculating these in this particular code cell and you

should get a little familiar with this because it may be a little different for you.

So for instance, Euclidean distance, you should understand that.

The Manhattan distance, you're just summing up

the block travel times or travel distances,

the Chebyshev, the Minkowski.

I mentioned briefly here with the Curse of Dimensionality,

as the number of dimensions grows,

the number of points we have to have to maintain the

same sampling density jumps very fast.

And that of course is a problem.

So we can look at this in terms of a visualization and say well,

what's the relative density if we keep the number of points the

same as the number of dimensions increases what's our relative density?

And you can see that it very quickly drops.

So, really if you have a very high dimensional dataset,

you need a lot of training data to sufficiently span

that space to be able to make accurate predictions for new unseen data.

We talk a little bit about the Helper Code,

you can read about this and understand it.

There's a few functions I've created that load

the data and do some pre-processing for it or make plots.

We demonstrate one of them here.

We're importing from our Helper code and calling a function on it here to get the data.

This loads the data and splits it into training and testing,

and makes this plot.

So you can see here's the class 0,1,2,

that's the setosa, versicolor and virginica.

We also have some test data to show

data where the test data comes from and we can analyze this.

So the next thing is k-nearest neighbors again. How do we use this?

We import our module,

we create our estimator,

we get our testing and training data,

and then we fit our model to this training data,

and then we can make predictions on our test data.

So here we go. This is it.

Right? We saw this in

the first lesson notebook where we did the introduction to Machine Learning,

demonstrated classification with k-nearest neighbor classifier.

We specified one hyper-parameter,

the number of neighbors,

we fit the model and we score the model.

That's very simple. We can also try changing a few things.

We can look at the Confusion matrix.

So here we are going to explicitly make predictions for our test data,

and we're going to then see how well do we do on a per class basis.

And this is called a Confusion Matrix.

So each row is a different predicted class,

each column is the actual class.

I've made a convenience function called confusion,

which will plot this.

So it's a little easier to see but it was the same data.

So here you can have the predicted label and

the true label and you can see the relationship between them.

We can also measure other metrics such as precision and recall or F1 score.

And this talks about that these can be generated and displayed on

a per class basis with something called the classification report.

And that's demonstrated here,

where we see how well do our model

perform for the different classes as well as an average score.

The support is the number of test data points in each class,

and you can see those are all roughly the same.

The last thing I want to show you here about

classification is the idea of decision surfaces,

and how they can be used to understand the impact of hyper-parameters.

For the nearest neighbor algorithm,

we typically vary the neighbor hyper-parameter which might be two, three, four, five.

So this code here simply makes different plots of a decision surface.

And here's an example of that.

We've restricted ourselves to two dimensions,

Sepal width and Petal width.

Here's the data that we actually are predicting on.

Underneath it is a grid of points and you can see the colors change as we move around.

This is known as a Decision Surface because it

tells us if we put a desk data point right here,

a test data point, what would it be classified as.

And in this case, you can see it's going to be blue.

If it was over here it would be green and if it's over here it was brown.

So it shows you how these this algorithm is splitting up this underlying space.

Now, this was for three nearest neighbors.

If I change it to five nearest neighbors,

you'd notice that changes dramatically.

Right? Look up here, this is very different than what we had up here.

And it changes a little bit over here as well.

The idea of the decision surface is it shows you the impact of

a hyper parameter on making predictions in the general space that you're looking at.

So it's a good way to try to figure out what's a good value

for a hyper parameter if you don't want to resort to just scoring.

And then here we've increased it to nine,

and again it's changed things just slightly.

Notice now we have these two values here that are predicted right.

There's other things you can look at.

One of them is called weights.

This says for instance you may want to say if a data points close to me,

I want to give it a higher weight.

Data points that are farther away are going to be less so.

So we can do the same thing.

Here we're going to change weighing scheme from

uniform where all data has the same weight.

Alternatively, we could say we're going to weigh points by how far away they are from us.

So if they're farther away they have a lower weight.

And so we can look at the decision surface in this case,

here is uniform weighting.

This is the same plot we saw before for three nearest neighbors,

and then we can look at distance weighting and you could see how it changes.

Notice that this point is now correctly classified, and this one is as well.

This is the only one that's not.

That's interesting in how that all works,

and probably gives you a bit of a feel for why decision surfaces are important.

So the next thing is regression,

k-nearest neighbor for regression,

very similar to classification.

We simply change our estimator, we get our data.

In this case we're just going to make a plot of the values,

and then we're going to show how the regression works.

So here's some data points.

Here's the k-nearest neighbor regressor for five with a uniform weighting,

and here's five with a uniform distance,

just to give you an introduction to regression here in this particular case.

So with that I'm going to go ahead and stop.

If you have any questions let us know.

K-nearest neighbor, very simple algorithm,

should give you a pretty good feel for how things work.

If you have any questions, let us know. And good luck.