1:07

This algorithm is very easy to use and fast, which makes it very popular.

It's often one of the first algorithms people get exposed to.

We've seen a similar algorithm with K-nearest neighbors,

where we calculated the distances to the K nearest points.

Here, however, we have K clusters.

And the algorithm is done in an unsupervised manner in that we don't know

whether a data point belongs to a cluster ahead of time.

We simply have their features and the values for those features for

all the different instances in the data set.
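To make the unsupervised point concrete, here is a minimal sketch (my own setup, assuming scikit-learn, not the course notebook's exact code): K-means is fit on a feature matrix alone, with no labels anywhere.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three blobs of unlabeled 2-D points -- only features, no cluster labels.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])

# K-means needs only X; the cluster assignments are learned, not given.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])        # learned cluster assignment of the first points
print(km.cluster_centers_)   # one center per cluster
```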

2:09

The first reading is a blog article that talks about the K-means clustering algorithm and

how you can use it.

There's also an interactive demo for the K-means algorithm.

And I really like this demo a lot so I'm going to walk you through that.

And then, lastly, there is the notebook.

So first, the blog article that talks about K-means clustering.

This is a very accessible, introductory article that presents the idea of K-means clustering.

So it should give you a pretty good feel for how things are working.

One point of emphasis I should make is that K-means

is called K-means because we find the cluster center,

or we define the cluster center, by finding the mean of the data points.

You could also do a median of the data points,

in which case the algorithm is called K-medians.

There's other statistical quantities you might use,

it just changes the name of the algorithm slightly.

It doesn't change the way the algorithm works.

That's just the statistic that's used to determine the cluster centers.

So this article's nice, it talks about this.

It talks a little bit about how you can choose K to try to get the best

results for your algorithm or for your data set.

I wanted to show you this, though.

Because this is a really nice article that actually gives you

a way of picking a different data set.

So here you can see that here's a data set.

And we can just start adding cluster centers.

And we can see how the algorithm starts clustering things, so let's try one more.

And now, we just click GO and Update Centroids.

And you can see how it just moves the centroids around.

And keep clicking on this until there's no more movement.

And you can notice that right now there's only a little bit of movement here

between the red and black clusters there.

As I keep moving around, there's really not much movement, and

we're pretty much done.

This is the end of the algorithm, right?

So you could see that, well, maybe these really were just one cluster here.

Maybe I added too many centroids.

But you can go in and try this with different data sets,

different numbers of centroids, and really see visually what's happening.

So let me go ahead and do that.

Let's choose it randomly, here we go, we've got 3 clear clusters.

So let's place 3 centroids for those, right.

Now, as I click GO, you can see what happens.

Very first thing is that it assigns points to the nearest cluster center.

So these points over here get split to green and blue.

Now, as I click Update Centroids, you notice that, very quickly,

that green center moves over.

So literally within the second iteration of this algorithm,

we've already got our 3 centers on the data points they should be assigned to.

Now I reassign points, notice the points immediately go back.

Update my centroids, right, we're almost done, we could keep doing this.

And, look, now, I'm getting no change in the cluster centers.

So, very quickly, within,

basically, 3 iterations, this algorithm found the 3 clusters.

Now, this was easy, because I could see there were 3 clusters at the beginning.

One way we might've done this in practice is try 2 clusters and

see how well it does, try 3, try 4, try 5.

And we could try to figure out some sort of metric for

indicating whether that number of clusters was the right value or not.

5:17

The last item here is the K-means spatial clustering notebook.

We're going to talk a little about the Iris data and

use the Iris data with the K-means algorithm.

I'm going to show you the Elbow method, which is a way to determine K empirically.

And then we're going to use the handwritten digit data to understand how

the K-means algorithm performs on a larger data set.

It's a higher dimensional data set.

So we'll see that performance.

We'll see how it does with dimension reduction.

We'll also look at it with manifold learning.

So the first thing here: when we're applying our K-means algorithm, there are 2 steps.

In the assignment step, we assign each data point to a cluster.

And then there's the update step, where you compute the new cluster centers,

and then we iterate back.

Now you saw this in the previous website where we

computed our initial cluster centers.

We assigned the points to the clusters and we updated, etc.
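The two steps just described can be sketched from scratch in a few lines of NumPy (a simple illustration with my own variable names and convergence check, not the notebook's code):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centers as k randomly chosen data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        # (a center with no points keeps its old position).
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: centers stopped moving
        centers = new_centers
    return labels, centers
```

This is exactly the iterate-until-no-movement loop from the interactive demo, just automated.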

6:15

Then we apply this and we're going to see here's our 3 clusters.

I said K was 3 earlier.

So now we have 3 clusters, and then we can visualize these clusters.

So when I scroll down, you can see there's the first cluster center,

there's the second and there's the third.

This is a very useful technique to do.
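A hedged sketch of this notebook step, assuming scikit-learn's built-in Iris loader and KMeans (the notebook's exact calls may differ):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data  # 150 flowers, 4 features each

# Fit K-means with K = 3, matching the three Iris species.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one 4-D center per cluster
```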

But let's try it with the Elbow method,

which will let us compute, for

different numbers of clusters, what the score, the total distance in this case, is.

So basically what you do is you say, well, if there's only one cluster,

all points, if I sum up the total distances to the nearest cluster center,

it's going to be really big.

As the number of clusters increases,

then each data point has a cluster center closer to it.

And so the total distance will be smaller.

And the idea of the Elbow method is the optimal value is somewhere here in this

curve break point.

It's where you change from the very steep part, where the distances are changing

very rapidly, to the more shallow part of the curve.

So here we would look at this curve, and

we would go this looks like 3 is really the elbow.

And the idea here is it should look like somebody's arm.

So the arm is hanging down, and then at the elbow it bends.

That's the idea of the elbow method.
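The Elbow method described above can be sketched like this (assuming scikit-learn; `inertia_` is its name for the total within-cluster squared distance):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to nearest center

# Inertia always shrinks as K grows; the "elbow" is where the drop flattens.
for k, v in zip(range(1, 9), inertias):
    print(k, round(v, 1))
```

Plotting `inertias` against K gives the arm-shaped curve; for Iris the bend sits around K = 3.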

7:21

Now we can look at it with the digit data, remember, the handwritten digits.

As I look across the screen, you can see 0 through 9.

We can cluster these images, and when we look at this, first of all,

this is a confusion matrix showing how we're doing.

You look at this, you go this is horrible.

We have digit 0 matched to cluster 5, not 0, etc.

Well, keep in mind, this is an unsupervised technique.

Thus, when something comes in as cluster 0,

that doesn't mean it actually is the handwritten digit labelled 0.

Instead it's just a group of points.

And so when we come in and find them, it actually turns out we did quite well.

Almost all of them were grouped together.

That's the key thing: how well are they grouped together.

And as we look through this, most of the data points are grouped quite nicely.
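One way to make "grouped together" measurable is to relabel each cluster by the true digit it most often contains, then check accuracy. This is a sketch of that idea (the majority-vote mapping is my own illustration, not necessarily what the notebook does):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

digits = load_digits()  # 1797 images, 64 pixel features each
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(digits.data)

# Cluster IDs are arbitrary, so map each cluster to the true digit
# that appears most often inside it.
mapped = np.empty_like(km.labels_)
for c in range(10):
    mask = km.labels_ == c
    mapped[mask] = np.bincount(digits.target[mask]).argmax()

acc = (mapped == digits.target).mean()
print(f"accuracy after relabeling: {acc:.2f}")
```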

Just looking at this confusion matrix,

we see that 6 and 2 seem to be not done quite right.

So, as I scroll down, we look, and

we can see, yup, sure enough, there is the cluster 2, it's a little ambiguous.

And cluster 6, cluster 6 is very ambiguous.

It looks just like a vertical line.

But some of these others, they look pretty clear, right?

We have a 4, we have a 0, we have a 9, we have a 6, we have a 7.

We have a 2, we have a 3 and a 5.

And even this one kind of looks like a 1.

And if we scroll up here and

we see it, you can see that that's sort of spread out, and yeah, it sure is.

It actually is a 1, so that's doing a pretty good job of pulling that out.

But, in general, this technique does quite well.

8:41

As I scroll down, you can sort of see other examples of what we're trying to do.

We can transform the data.

Now we can get the actual typical image values for each cluster.

So if you were to think about each cluster center being an actual

image in this space, this is what the image would correspond to.

And, again, see there's cluster 2 and cluster 6, they're a little ambiguous.
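Since each cluster center lives in the same 64-dimensional space as the images, it can be reshaped into an 8x8 "typical digit" picture, which is what those ambiguous cluster images are. A small sketch of that reshaping (assuming scikit-learn's digits data):

```python
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans

digits = load_digits()
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(digits.data)

# Each 64-D center becomes the 8x8 pixel grid of a "typical" image;
# these arrays can then be displayed with any image-plotting tool.
center_images = km.cluster_centers_.reshape(10, 8, 8)
print(center_images.shape)
```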

We can also say what's the score?

Is there a way, a metric, of determining how well we're doing?

And there is a number of them, and I show some of them here.

We can see better what they do down below.

We actually can compute these scores and show them.

Typically, a value of 1 for most of these metrics is best.

And so, you can see, they generally do quite well.

And this, again, is for the digit data.
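A sketch of such label-based clustering scores, assuming scikit-learn's metrics module (the notebook may use a different selection, but these are common ones where 1.0 is best):

```python
from sklearn.datasets import load_digits
from sklearn.cluster import KMeans
from sklearn import metrics

digits = load_digits()
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(digits.data)

# Each score compares learned clusters to the true digit labels; 1.0 is best.
for name, fn in [("adjusted Rand", metrics.adjusted_rand_score),
                 ("homogeneity", metrics.homogeneity_score),
                 ("completeness", metrics.completeness_score),
                 ("v-measure", metrics.v_measure_score)]:
    print(f"{name}: {fn(digits.target, km.labels_):.2f}")
```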

And then we could say, well, that was a pretty high

dimensionality for our data set.

What if we cut the number of dimensions down?

What if we cut it to, say, 20?

We start with 64, if we cut to 20, how well do we perform?

And you can see that here's a confusion matrix, here's the actual results.

And here's one of our scores, one of our performance metrics for finding clusters.

And you can see it's still quite high.

So it turns out that your cluster finding still does rather well for

the reduced dimensionality of the data set,

even though you've sort of thrown some of the information away.

Enough is still there to be able to do realistic clustering.
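The cut from 64 dimensions down to 20 can be sketched with PCA (an assumption on my part; the notebook may use a different reduction technique), scoring the clusters with the adjusted Rand index:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

digits = load_digits()

# Reduce the 64 pixel features to 20 principal components, then re-cluster.
X20 = PCA(n_components=20, random_state=0).fit_transform(digits.data)
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X20)

score = adjusted_rand_score(digits.target, km.labels_)
print(f"ARI in 20 dimensions: {score:.2f}")
```

The score stays close to the full 64-dimension result, which is the point being made: most of the cluster structure survives the reduction.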

So this is an important point.

Sometimes you'll be faced with a very large dimensional data set.

You can perform dimensional reduction or some sort of feature selection.

Reduce the large amount of features down to a more manageable set, and

still get very good accuracy results.

If you think about it, that's kind of what manifold learning does,

in that we start with this really high-dimensional space.

And we find groups of points that are nearby each other in this

high-dimensional space.

And we map them down to this manifold in two-dimensional, three-dimensional space.

And try and keep those points close together.

So you could ask, well,

what happens when we apply manifold learning together with K-means?

So the first thing we do is compute our manifold representation.

Here's our t-SNE digit data, you can see that there it is.

And then we can actually find clusters in this same data set.

And then say, well, how well did they map down?

And again, you could see that generally these clusters that were marked

outside here are done quite well.

And only really in the middle do we start getting that mixing.

It's the same idea we saw before with some of those clusters

were a little bit confused in terms of what they actually were.
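This last step can be sketched as follows, assuming scikit-learn's t-SNE (a subsample keeps the embedding quick; the notebook likely uses the full data set):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

digits = load_digits()
X, y = digits.data[:500], digits.target[:500]

# Map the 64-D digits down to a 2-D manifold, then cluster in that space.
emb = TSNE(n_components=2, random_state=0).fit_transform(X)
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(emb)

ari = adjusted_rand_score(y, km.labels_)
print(f"ARI after t-SNE embedding: {ari:.2f}")
```

Because t-SNE keeps nearby points together, the well-separated groups on the outside of the embedding cluster cleanly, and the mixing shows up mainly in the middle, just as described above.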