The first reading is a blog article that talks about the K-means clustering
algorithm and how you can use it.
There's also an interactive demo for the K-means algorithm.
And I really like this demo a lot so I'm going to walk you through that.
And then, lastly, there is the notebook.
So first, the blog article that talks about K-means clustering.
This is a very introductory article that covers the idea of K-means clustering.
So it should give you a pretty good feel for how things are working.
One point of emphasis that I should make is that K-means
is called "means" because we define the cluster center
by finding the mean of the data points.
You could also do a median of the data points,
in which case the algorithm is called K-medians.
There are other statistical quantities you might use;
it just changes the name of the algorithm slightly.
It doesn't change the way the algorithm works.
That's just the statistic that's used to determine the cluster centers.
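
To make that point about the statistic concrete, here is a minimal sketch in Python. The helper name `cluster_center` and the four data points are made up for illustration; they don't come from the article. It just shows that swapping the mean for the median changes where the center of one cluster lands.

```python
from statistics import mean, median

def cluster_center(points, statistic=mean):
    """Apply the chosen statistic to each coordinate of the cluster's points."""
    return tuple(statistic(coord) for coord in zip(*points))

# Made-up points for one cluster; the last one is an outlier.
points = [(1.0, 1.0), (2.0, 1.0), (3.0, 2.0), (10.0, 8.0)]

mean_center = cluster_center(points, mean)      # K-means style center
median_center = cluster_center(points, median)  # K-medians style center

print("mean center:  ", mean_center)
print("median center:", median_center)
```

With these points the mean gets pulled toward the outlier more than the median does, which is one reason you might pick a different statistic.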
So this article's nice, it talks about this.
It talks a little bit about how you can choose K to try to get the best
results for your algorithm or for your data set.
I wanted to show you this, though.
Because this is a really nice article that actually gives you
a way of picking a different data set.
So here you can see a data set.
And we can just start adding cluster centers.
And we can see how the algorithm starts clustering things, so let's try one more.
And now, we just click GO and Update Centroids.
And you can see how it just moves the centroids around.
And keep clicking on this until there's no more movement.
And you can notice that right now there's only a little bit of movement here
between the red and black cells there.
As I keep moving around, there's really not much movement, and
we're pretty much done.
This is the end of the algorithm, right?
So you could see that, well, maybe these really were just one cluster here.
Maybe I added too many centroids.
But you can go in and try this with different data sets,
different numbers of centroids, and really see visually what's happening.
So let me go ahead and do that.
Let's choose it randomly. Here we go, we've got 3 clear cells.
So let's put those 3 together, right.
Now, as I click GO, you can see what happens.
The very first thing is that it assigns points to the nearest cluster center.
So these points over here get split to green and blue.
Now, as I click Update Centroids, you notice that, very quickly,
that green center moves over.
So literally within the second iteration of this algorithm,
we've already got our 3 centers on the data that they should be assigned to.
Now I reassign points, notice the points immediately go back.
Update my centroids, right, we're almost done, we could keep doing this.
And, look, now, I'm getting no change in the cluster centers.
So, very quickly, within,
basically, 3 iterations, this algorithm found the 3 clusters.
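
The two buttons in the demo correspond to the two steps that K-means repeats: reassign points, then update centroids, until the centroids stop moving. Here is a rough sketch of that loop in plain Python. The function names (`assign`, `update`, `kmeans`), the nine data points, and the starting centroids are all assumptions made for illustration; this is not the demo's actual code.

```python
import math
from statistics import mean

def assign(points, centroids):
    """Reassign step: label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]

def update(points, labels, centroids):
    """Update step: move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        # Assumption: a centroid with no points simply stays where it is.
        new_centroids.append(tuple(mean(c) for c in zip(*members)) if members else old)
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Alternate the two steps until the centroids no longer move."""
    labels = []
    for iteration in range(1, max_iter + 1):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:  # no movement, so we're done
            return centroids, labels, iteration
        centroids = new_centroids
    return centroids, labels, max_iter

# Made-up data: three well-separated groups of three points each.
points = [
    (0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
    (10.0, 0.0), (11.0, 0.0), (10.0, 1.0),
    (5.0, 9.0), (6.0, 9.0), (5.0, 10.0),
]
start = [(0.0, 0.0), (2.0, 2.0), (10.0, 0.0)]  # hand-placed starting centroids

centroids, labels, iterations = kmeans(points, start)
print(labels, iterations)
```

Just like in the demo, where you place the starting centroids matters; with a different `start` list the loop can take more passes or settle on a different grouping.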
Now, this was easy, because I could see there were 3 clusters at the beginning.
One way we might've done this in practice is to try 2 clusters and
see how well it does, try 3, try 4, try 5.
And we could try to figure out some sort of metric for
indicating whether that number of clusters was the right value or not.
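
As a sketch of what that could look like, the code below runs K-means for several values of K and scores each run. The metric here is an assumption: the within-cluster sum of squared distances, which is one common choice and not necessarily the one the article uses. The data, the function names, and the evenly spaced starting centroids are also made up to keep the example reproducible; real implementations usually start from random or smarter initial centroids.

```python
import math
from statistics import mean

def kmeans(points, k, max_iter=100):
    """Plain K-means; returns (centroids, labels)."""
    n = len(points)
    # Assumption: evenly spaced points as starting centroids, for reproducibility.
    centroids = [points[i * n // k] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Reassign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of its points (or leave it if it has none).
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centroids.append(
                tuple(mean(c) for c in zip(*members)) if members else centroids[j]
            )
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

def within_cluster_ss(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Made-up data: three well-separated groups of three points each.
points = [
    (0.0, 0.0), (1.0, 0.0), (0.0, 1.0),
    (10.0, 0.0), (11.0, 0.0), (10.0, 1.0),
    (5.0, 9.0), (6.0, 9.0), (5.0, 10.0),
]

scores = {}
for k in range(1, 6):
    centroids, labels = kmeans(points, k)
    scores[k] = within_cluster_ss(points, centroids, labels)

for k, score in scores.items():
    print(k, round(score, 3))
```

The usual way to read these numbers is to look for the value of K where the score stops dropping sharply, which is often called the elbow heuristic.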