Covering the tools and techniques of both multivariate and geographical analysis, this course provides hands-on experience visualizing data that represents multiple variables. This course will use statistical techniques and software to develop and analyze geographical knowledge.

Associate Professor at Arizona State University in the School of Computing, Informatics & Decision Systems Engineering and Director of the Center for Accelerating Operational Efficiency

Huan Liu

Professor of Computer Science and Engineering, School of Computing, Informatics, and Decision Systems Engineering (CASCADE)

Welcome back. This is Huan Liu,

professor of computer science and engineering.

In this segment, we are discussing unsupervised learning.

We will give you an idea of what unsupervised learning is,

and what a typical method of unsupervised learning looks like.

So, unsupervised learning is also called clustering.

Unsupervised because we don't have labels as in supervised learning.

But we still have data.

And how can we make sense of the data?

Now in this example,

if we look at it,

obviously we see some patterns, right?

And we have blue, red, green,

yellow and they probably mean something.

So, if we can figure out some patterns from the data without labels,

that will be very helpful for us to do various tasks.

So, clustering algorithms group together similar items,

in this case similar color points are clustered together.

To group similar data points together,

we need a distance measure or similarity measure.

So here, we just use Euclidean distance to illustrate.

Of course there are other methods,

like Pearson linear correlation as well as others, which we will discuss right after this.

So for these two data points,

we can calculate how close they are.

The closer they are,

the more similar they are,

and then we can put them together.

So distance measure is one way to achieve this similarity calculation.

Now, X and Y are n-dimensional vectors, right?

n can be as small as one or two,

but sometimes, as in this case, n can be very large.

Then we can have different measures,

so Mahalanobis distance is the one for which we need to calculate the covariance matrix.

The Manhattan distance is basically an L1 norm: just calculate the absolute

difference between corresponding elements of these two vectors.

Then there's the Lp norm;

p can be one or two,

or a very large number, even infinity.

So there are various kinds of distance measures.
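The distance measures just mentioned can be sketched in a few lines of Python. This is a minimal illustration, not code from the course; the function names are our own:

```python
import math

def euclidean(x, y):
    # L2 norm of the difference: square root of the sum of squared
    # element-wise differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    # L1 norm: sum of absolute element-wise differences.
    return sum(abs(a - b) for a, b in zip(x, y))

def lp_norm(x, y, p):
    # General Lp norm: p = 1 gives Manhattan, p = 2 gives Euclidean,
    # and as p grows it approaches the maximum absolute difference.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = [0.0, 0.0], [3.0, 4.0]
print(euclidean(x, y))   # 5.0
print(manhattan(x, y))   # 7.0
print(lp_norm(x, y, 2))  # 5.0
```

The smaller the value each measure returns, the more similar the two vectors are.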

Clustering, or unsupervised learning,

tries to find

those clusters within which the instances are similar.

Then, how can we represent a cluster?

We just use cluster centroids,

and the number of cluster centroids can

be way smaller than the number of data points.

So basically we just use these centroids to represent clusters.

So now, one of the most common types of algorithms is the partitional method:

you just partition the data into a set of

predefined clusters. Then, for each instance in the data set,

you see which cluster it is closest to,

and assign this instance to that cluster. Okay, this is one method.

Of course, there's another method that works bottom up,

but we're not going to discuss it here.

We will focus on the partitional method.

Now in this 2D case,

I have six clusters, right?

So if I know before I have six clusters,

then I will be able to cluster all of them accordingly.

But if I set k to two,

then probably I have different ways to cluster, right?

One here and one here, or one here and another one here.

So which one's the best?

And there are so many combinations.

We introduce k-means, an intuitive and common algorithm.

What it does is this:

given the data points, we pick an initial set of k centroids,

m1 to mk,

and t is the iteration index we have to go through, okay?

So in the very beginning it is one.

We need to have k centroids,

then what we do is, for each data point,

we compare it with all of the centroids and

see which one it is closest to,

then we assign this data point to the cluster corresponding to that centroid.

So in this case, you see, at the t-th iteration,

if I can find the m_i,

the centroid i,

that is closest, right,

the one whose distance is smaller than all the others, the remaining ones,

then I will assign this x_p to cluster i.

But this is just for one instance;

if I have N instances,

I have to assign all of them accordingly to

these k clusters, and then the next iteration should be t equals two.

But the centroids should be updated,

because now each cluster

has new instances assigned to it, and I need to

recalculate; then I repeat this process.

This is the algorithm: we have a data set of real-valued attributes,

and K, the number of clusters, and we will return a clustering.

The result of clustering is called a clustering, okay?

A clustering of the data into K clusters.

You may say, how can I determine the very first k centroids?

This is why we usually consider k random instances in the data space

as the initial cluster centroids.

Now, after we determine that,

we will go through the whole dataset.

While centroids have not converged.

What does that mean?

We will define this convergence later.

So, assign each instance to the cluster that has the closest cluster centroid.

When all instances have been assigned,

recalculate the cluster centroids by averaging the instances inside each cluster.
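The loop just described (pick k random instances as centroids, assign each instance to its closest centroid, recalculate each centroid as the average of its cluster, repeat until the centroids stop moving) can be sketched as a small, self-contained Python function. This is our own minimal illustration, not the course's code:

```python
import random

def dist2(a, b):
    # Squared Euclidean distance (no square root needed for comparisons).
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(data, k, max_iter=100, tol=1e-9, seed=0):
    # data: a list of numeric tuples; k: the number of clusters.
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # k random instances as initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Assignment step: each instance goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for point in data:
            i = min(range(k), key=lambda j: dist2(point, centroids[j]))
            clusters[i].append(point)
        # Update step: recalculate each centroid as the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(dim) / len(cluster)
                                           for dim in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        # Stop when no centroid moves more than a small tolerance.
        if all(dist2(a, b) < tol for a, b in zip(centroids, new_centroids)):
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups; with k = 2 the algorithm should recover them.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(data, k=2)
```

Note the tolerance-based stopping rule: the loop ends when centroids between two consecutive iterations differ by less than a small value, which is exactly the convergence criterion discussed below.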

So it's very intuitive. That's why, for any clustering work,

if we know k, we use k-means as a baseline algorithm

to compare with other, more complicated algorithms.

This is intuitive that's why people tend to accept the result more.

But the issue is,

how can I determine the right k, right?

So there are certain issues,

and when do we stop?

So we mentioned about the convergence.

So, convergence can be defined when centroids are no longer changing.

So, like at t iteration,

I have k centroids.

At iteration t plus one,

if the set of centroids is the same as the set before,

then we say it converges.

But there's another way to define convergence.

We can just say the algorithm can be stopped when the Euclidean distance between

centroids in two consecutive steps is less than some small positive value.

Why? Because k-means does not guarantee that the algorithm will converge.

So in this case,

of course, we cannot wait forever.

So we have to use this distance,

like a small distance between two sets of centroids to decide when to stop.

So an alternative way to formulate k-means is

as an objective function that we then try to minimize.

So, what is that?

Again, we try to calculate

the distance between the data points in this cluster, the ith cluster.

You see, we have k clusters.

i goes from one to k. So,

for the ith cluster,

n_i is the number of instances in that ith cluster.

So I just calculate the distance between each instance in

this cluster and the centroid, then square it.

Why? Because the distance must be positive,

but x_ji minus c_i can be positive or negative.

So that's why we have to square it.

Then we add them together.

We calculate the total error.

But now we want to minimize it;

the smaller, the better.
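The objective just described, the sum of squared distances from each instance to its own cluster centroid, can be computed directly. A minimal sketch (the helper name `sse` is our own):

```python
def sse(clusters, centroids):
    # Sum of squared Euclidean distances from each instance to the
    # centroid of its cluster -- the objective k-means tries to minimize.
    total = 0.0
    for cluster, c in zip(clusters, centroids):
        for x in cluster:
            total += sum((xj - cj) ** 2 for xj, cj in zip(x, c))
    return total

# Two clusters with their centroids: each point is distance 1 from its
# centroid, so the total squared error is 1 + 1 + 1 + 1 = 4.
clusters = [[(0, 0), (0, 2)], [(10, 10), (12, 10)]]
centroids = [(0, 1), (11, 10)]
print(sse(clusters, centroids))  # 4.0
```

In practice one would evaluate this after every iteration and stop when the value changes by less than some small threshold between consecutive iterations.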

Okay? So stopping criteria here is when the difference between

objective function values of two consecutive iterations

of k-means algorithms is less than some small value.

Then we stop. Now, more discussion.

So, one is how can we know we have the right k? What should we do?

We can always talk to domain experts if they have a sense of the data,

or if they want you to cluster data into certain number of k clusters.

But another way is to let your algorithm tell you.

So you just run it with, say, k equals three,

five, seven, whatever number,

just let the computer work for you.

But finding the global optimum of a k-partition is computationally expensive.

It's a difficult problem.

And this is equivalent to finding the optimal centroids that minimize the objective function.

Since we cannot wait forever,

we just try to use some heuristics to help us.

Because at the end of the day when we do clustering,

we want to understand the data.

Then, after we gain some understanding,

we can always revise our hypothesis and retry.

Okay? That's why we can run k-means multiple times.

So, running k-means multiple times has several meanings.

One is that I can change k. The other is that, after I fix a k value,

I can also run it multiple times.

Why? Because, remember, at the very beginning, choosing the k centroids

is just a random process.

So we never know which random set is the best.

So, we can just run several times,

and then we will be able to find a good one.
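The idea of running k-means several times from different random initializations and keeping the best run can be sketched as follows. This is a minimal 1-D version for brevity; all names are our own, and "converged" here means the centroids stop changing exactly:

```python
import random

def kmeans_1d(data, k, seed):
    # A tiny 1-D k-means: random initial centroids, then alternate
    # assignment and mean-update until the centroids stop changing.
    rng = random.Random(seed)
    centroids = rng.sample(data, k)
    while True:
        clusters = [[] for _ in range(k)]
        for x in data:
            clusters[min(range(k), key=lambda j: (x - centroids[j]) ** 2)].append(x)
        new = [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new == centroids:
            # Converged: return centroids and the total squared error.
            return new, sum((x - m) ** 2 for c, m in zip(clusters, new) for x in c)
        centroids = new

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
# Run with several random seeds and keep the run with the lowest error.
runs = [kmeans_1d(data, k=3, seed=s) for s in range(5)]
best_centroids, best_sse = min(runs, key=lambda run: run[1])
```

Because each random initialization can land in a different local optimum, comparing the final objective values across runs and keeping the smallest is a simple, common safeguard.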

But now, we also want to talk about

the difference between means and medians, okay? So why?

Because, mean we know is the average,

and median is the middle value.

There are differences.

Okay? So, here is an example.

We have a startup with eight employees,

one CTO and one CEO.

These are their salaries:

five employees with 50k each,

then two employees with 80k,

one employee with 90k,

and the CTO and the CEO each with 150k.

So what's the average?

The average of the salary is if you calculate,

like you add them all up and divide by 10, right?

So, it is 80k. What is the median?

The median is the middle value.

Now, I have 10 values.

So: one, two, three, four, five, this is the fifth value;

six, this is the sixth.

Because it's an even number of values,

I have to add these two up and divide by two.

So it's 65k. Do you see?

Which one represents this group of employees,

this company's salary better?

It seems 65k is more reasonable, right?

But if we change, let's say, the CEO's salary to one million,

and the other employees' salaries remain the same,

now, we have a new average.

You just add all these up: 50k times five,

plus 80k times two, plus 90k,

plus 150k, plus one million, all divided by 10.

It is 165k.

Now suddenly this company's average salary is 165k.

Of course, it's outrageous.

It's not real at all.

Right? Even the CTO does not earn such a high salary.

Okay? So that's why average can be misleading.

But now what is the median?

The median remains the same,

still 65k, because the two middle values are still the same.
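The salary arithmetic above can be checked directly with Python's standard `statistics` module:

```python
from statistics import mean, median

# Five employees at 50k, two at 80k, one at 90k, CTO and CEO at 150k each.
salaries = [50_000] * 5 + [80_000] * 2 + [90_000, 150_000, 150_000]
print(mean(salaries))    # 80000
print(median(salaries))  # 65000.0 (average of the 5th and 6th values: 50k and 80k)

# Change the CEO's salary to one million: the mean jumps, the median does not.
salaries_outlier = [50_000] * 5 + [80_000] * 2 + [90_000, 150_000, 1_000_000]
print(mean(salaries_outlier))    # 165000
print(median(salaries_outlier))  # 65000.0
```

This is exactly the point of the example: a single outlier drags the mean, while the median stays put.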

So, should we use mean or medians?

Right? It depends. If I have a large set of data,

then which one is easier to update?

For k-means, right?

We have to update the means. So you see.

And sometimes we don't want to update too frequently;

then the average is the value that is easy to update.

The median is not, because if you add one person,

let's say another 50k here,

the middle value will shift, right?

So that means it can change,

which means you have to recalculate.

That's the end of unsupervised learning.

We just discussed one intuitive algorithm. It's called k-means.