Covering the tools and techniques of both multivariate and geographical analysis, this course provides hands-on experience visualizing data that represents multiple variables. This course will use statistical techniques and software to develop and analyze geographical knowledge.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

Huan Liu

Professor of Computer Science and Engineering, School of Computing, Informatics, and Decision Systems Engineering (CASCADE)

Welcome back. This is Huan Liu,

and we will discuss the nearest neighbor classifier.

This is a very intuitive learning method.

Now, we start with another data set.

In this data set, we have

a total of eight instances, but if you notice the last one, it has a question mark.

That means we don't know its class.

What's the class value we'd like to predict?

It's whether to play golf or not.

So, in this case we also have three attributes: outlook, temperature, and humidity.

And the class label is about playing golf, so we have yes or no.

And for the first seven instances,

we know their class labels,

but for the last one we don't.

For outlook, we have three values: sunny, overcast, and rain.

And for temperature we also have three values: hot, mild, and cool.

Humidity has two values: high and normal.

With these values

and the known training data,

we want to know the class label for the last one.

So, a very easy way for this small data set is

to compare the last instance

with all the seven known instances, right?

If we can find the most similar one,

then, because they are the most similar pair, we just take

the known class of the one we found and give the same class to the unknown instance.

How do we find this most similar one?

We need some distance measure over the data set to tell us.

Now, for similarity, if we use distance,

if distance is zero,

that means they're exactly the same.

The larger the distance,

the more dissimilar they are.

We just need to figure out some distance measure then we

can calculate how similar they are.

Now, in this case,

we have six clusters of

data points, and the unknown data point falls near one of them.

Let's assume the shapes are the class labels.

Then if that data point is here, because it's closest to this cluster,

we just say it is a square,

since it's farther away from the other clusters.

Now, in this case,

we call it one-nearest neighbor.

We only need to find the nearest instance,

and then we can give that instance's class label to the unknown one.
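The one-nearest-neighbor idea can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance on 2-D points; the points, labels, and the helper name `one_nn` are made up for this example, not from the lecture's dataset.

```python
import math

def one_nn(train, query):
    """Return the label of the training point closest to the query."""
    return min(train, key=lambda pair: math.dist(pair[0], query))[1]

# Hypothetical clusters: squares near (0, 0), triangles near (10, 10).
train = [((0, 0), "square"), ((1, 0), "square"),
         ((10, 10), "triangle"), ((10, 11), "triangle")]

print(one_nn(train, (0.5, 0.5)))   # the closest cluster is the squares
```

The query point sits next to the square cluster and far from the triangles, so it receives the label square.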

Why do we need the k then?

Now, let's look at the situation here.

If we just use one nearest neighbor,

let's assume this is the nearest one.

Then for the question mark instance,

we will give it class label square.

But if we look beyond,

let's look at the five nearest neighbors,

then the class label becomes triangle,

but we can even go further.

And if we look at the 10 nearest neighbors,

then it becomes square again, you see.

So, one nearest neighbor says square,

five nearest neighbors say triangle,

and 10 nearest neighbors say square again.

Depending on the k, you will get different class labels.

We cannot just use one nearest neighbor because it's too sensitive to noise.

In reality, our data set always contains some noise.

So, k-nearest neighbor is a very intuitive method.

We just need to define what k we would like.

Then we figure out the k nearest instances, the neighbors,

and then we say, "Hey,

the class of these neighbors should be the unknown instance's class."

And with k equals one,

it is not robust to noise.

So, usually, we will choose k to be

an odd number like three, five, or seven.

And after we determine what k is,

we also need to choose a distance measure to calculate the distance.

We're back to this example.

And again, we say we

have seven instances with known class labels,

but for the last one, we would like to predict:

should we go golf playing or not?

This table is organized in terms of k. So if we choose k equals one,

just the one nearest neighbor, what happens?

If we choose k as two,

the two nearest neighbors, what happens?

Usually, we try not to use an even number for k. Okay, why?

Because when it's 50/50,

it is difficult to decide.

You will see some examples later.

Okay? So now, if we just use the one nearest neighbor,

then instance number two

is the most similar.

Here, we use a very simple method.

If the two values are the same, then it's one.

Otherwise, it is zero.

Okay? This is because we try to measure similarity.

If the two values are the same, it is one.
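This per-attribute rule, one for a match and zero for a mismatch, can be written as a tiny helper. A minimal sketch, where the function name `overlap_similarity` is my own; the attribute values follow the golf example:

```python
def overlap_similarity(a, b):
    """Count how many attribute values two instances share (1 per match, 0 otherwise)."""
    return sum(1 for x, y in zip(a, b) if x == y)

known   = ("sunny", "mild", "high")   # a training instance with a known label
unknown = ("sunny", "mild", "high")   # the instance we want to classify
print(overlap_similarity(known, unknown))   # 3, a match on all three attributes
```

The instance with the highest similarity score to the unknown one is its nearest neighbor.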

Now look at instance two.

Instance two's outlook value is sunny, and

for this eighth instance the outlook value is also sunny.

Temperature is mild, and temperature is mild for this one.

Humidity is high, and humidity is high.

So then, because instance two's class label is N,

we give this one N. N means no.

Okay. This is for the case of k equals one, the one nearest neighbor.

If we look at k equals two,

then the two nearest neighbors,

you will find the class label should be N. This is the N here.

And the instances we use to decide the label are

instances one and two.

You see, when k equals two we have these two instances to decide.

And both are N,

so we decide to give the prediction N. But now,

when k is three,

we have the top three.

When k equals three, we take the top three instances.

Then, because N is the majority,

we still give it N. Now,

look at k equals four,

then we have an even number of N and Y,

like two Ns and two Ys.

Now, it's difficult to decide.

That's why we usually try to avoid using

an even number for k. But if it happens, what should we do?

We can always just flip a coin and decide.

Okay? So now, if k equals five,

then you see, we have two Ns and three Ys.

In this case, I just give it Y as the class label.
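The way the prediction flips with k can be reproduced with a small majority-vote helper. A minimal sketch: the list below orders the neighbors' labels by similarity, as in the table (N, N, Y, Y, Y), and the helper name `vote` is my own. Note that this simple helper does not handle the 50/50 tie at k equals four; the full algorithm breaks such ties randomly.

```python
from collections import Counter

def vote(ranked_labels, k):
    """Majority class among the labels of the k nearest neighbors."""
    return Counter(ranked_labels[:k]).most_common(1)[0][0]

ranked = ["N", "N", "Y", "Y", "Y"]   # neighbor labels, nearest first
print(vote(ranked, 1))   # N
print(vote(ranked, 3))   # N  (two Ns versus one Y)
print(vote(ranked, 5))   # Y  (three Ys versus two Ns)
```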

This is the algorithm given in the book.

If you are interested, the book we use for reference

is called Social Media Mining.

And it's available for free download.

There you can get all of this and beyond

what we talk about here. So, k-Nearest Neighbor.

First, we need to have an instance i,

a data set of real-attribute values,

and then we need to decide k. How many neighbors should we consider?

Then we pick a distance measure.

Previously, we chose a very simple one, right?

If they are the same value, then it's one.

Otherwise, it's zero.

So we compute the k nearest neighbors,

then we find the majority class label.

We will call it

L, and we classify this instance

i as class L. So,

here, we say if more than one majority label is available,

we just select one randomly.

There are two majority labels

in the case where we have two Ns and

two Ys, and then we just select randomly.

Otherwise, we always choose the majority.
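The steps above can be put together into one sketch: compute a distance to every known instance, take the k closest, vote, and break a tie between majority labels randomly. This is a hedged illustration, not the book's exact code; the function names and the four training instances below are hypothetical, and mismatch count serves as the distance for categorical attributes (the mirror image of the match-count similarity).

```python
import random
from collections import Counter

def mismatches(a, b):
    """Distance for categorical attributes: number of differing values."""
    return sum(x != y for x, y in zip(a, b))

def knn_classify(train, query, k, dist):
    """train: list of (instance, label) pairs; returns a predicted label for query."""
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    counts = Counter(label for _, label in neighbors)
    top = max(counts.values())
    majority = [label for label, c in counts.items() if c == top]
    return random.choice(majority)   # random pick when majority labels tie

# Hypothetical training instances: (outlook, temperature, humidity) -> label.
train = [(("sunny", "mild", "high"), "N"),
         (("sunny", "hot", "high"), "N"),
         (("rain", "cool", "normal"), "Y"),
         (("overcast", "mild", "normal"), "Y")]

print(knn_classify(train, ("sunny", "mild", "high"), 1, mismatches))   # N
```

With k equal to 3 on this toy data the two nearest labels are N, N and one Y, so the majority vote again yields N.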

Now, we've finished the explanation of k-Nearest Neighbor,

but I would like to mention that

k-Nearest Neighbor is also called a lazy learning algorithm.

Why is it called lazy learning?

Think about it. If we look at the algorithm,

we will find that this algorithm actually does not learn at all, right?

It only starts classifying an instance during the test phase.

Okay? So, it is called lazy learning.

It does not learn, but what does it do then?

It just compares a new instance without a class label with

all the instances in your dataset with known classes.

But in this case,

we would ask: how fast is k-Nearest Neighbor?

You see, because it's a lazy learning algorithm

and does not learn, is it still fast during the testing phase?

As a matter of fact, it is not.

It can be very slow, because the data can be big;

we'd have big data, right?

When the data is big,

it has to go through all the instances and calculate their distances,

because we never know what the new instance is.

So, we have to calculate on the fly.

So we need to figure out a way to speed up this k-Nearest Neighbor.

One way, which I already used with this dataset as an example, is this:

if there are six clusters,

we just use cluster representatives as the instances.

Let's say

each cluster has 10 or 20 data points.

With 20 data points each,

I have 120 data points here.

But if I just use the representatives of these clusters,

I only have six data points.

Let's say one here, one here,

one here, one here. These are the six.

Maybe now I can do it much faster than with 120.

It's six versus 120:

one representative per cluster instead of all the instances in each cluster.

So this is one way, but we can also do it the hierarchical way.

If the data is really big,

we first compare against these representatives.

Then, after we decide which cluster it belongs to,

we zoom in and compare. Here, each cluster is very pure,

but in reality it's not that pure.

So we zoom in and do the k-Nearest Neighbor as we've explained before.
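The hierarchical speed-up can be sketched in two stages, under the assumption that each cluster is summarized by its centroid. The point coordinates, cluster labels, and function names below are hypothetical, chosen only to illustrate the idea:

```python
import math

def centroid(points):
    """Mean point of a cluster, used as its representative."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))

def two_stage_nn(clusters, query):
    """clusters: dict mapping label -> list of points; returns (label, nearest point)."""
    reps = {label: centroid(pts) for label, pts in clusters.items()}
    # Stage 1: compare against a handful of representatives, not every point.
    best = min(reps, key=lambda label: math.dist(reps[label], query))
    # Stage 2: zoom in and search only inside the chosen cluster.
    nearest = min(clusters[best], key=lambda p: math.dist(p, query))
    return best, nearest

clusters = {"square": [(0, 0), (1, 1), (0, 1)],
            "triangle": [(10, 10), (11, 10), (10, 11)]}
print(two_stage_nn(clusters, (0.4, 0.4))[0])   # square
```

With six clusters of 20 points each, stage 1 costs six distance computations instead of 120, and stage 2 only searches the 20 points of the winning cluster.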