0:00

So in R, the kmeans function is the function that we use to implement the kmeans algorithm. I can demonstrate it here using a simple data frame with just two dimensions. I called kmeans on the data frame and told it there are three centers, so three centroids. What kmeans returns is a list with a number of different elements in it.
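
A minimal sketch of what such a call might look like. The simulated data, the seed, and the names `dataFrame` and `kmeansObj` are assumptions for illustration; the talk does not spell them out.

```r
# Hypothetical example data: 12 points in two dimensions,
# simulated around three centers (an assumption, not the talk's data)
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x = x, y = y)

# Run k-means and ask for three centroids
kmeansObj <- kmeans(dataFrame, centers = 3)

# The result is a list; names() shows which elements it holds
names(kmeansObj)
```

Because the starting centroids are picked at random, the cluster labels can differ from run to run unless the seed is fixed.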

For example, probably the most important element is the cluster element. When I print out the cluster element, you can see that it's a vector of numbers from one to three. What this shows is, for each data point in the data frame that I passed it, which cluster it's in. So you can see that the first four points are in cluster three, the next four are in cluster one, and the next four are in cluster two.

If you look at the printout of the names, you can see that another element of the object returned from kmeans is called centers. This tells you the location of the centroids in the space.
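
A short sketch of inspecting these two elements, assuming the same kind of simulated two-dimensional data as a stand-in for the data frame on the slide. The names `dataFrame` and `kmeansObj` are hypothetical, and the specific labels printed depend on the random start.

```r
# Hypothetical example data (assumption): 12 points, two dimensions
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x = x, y = y)

kmeansObj <- kmeans(dataFrame, centers = 3)

# One integer label per row of the input, each between 1 and 3
kmeansObj$cluster

# A 3 x 2 matrix: one row per centroid, one column per dimension
kmeansObj$centers
```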

1:08

So, if you want to plot the results from kmeans, the first thing you do is run the kmeans algorithm on your data. Then I plot the data, x against y, and I color the data points according to the cluster they happen to be in. You can see I pass the color argument to be equal to the cluster number. Then I use the points function to add the cluster centroids to the plot, and I plot them using the plus symbol. So here I've plotted the data and the kmeans clustering results.
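
One way the plotting steps could be written, as a sketch only. The data are simulated for illustration, and the sizes and line widths (`cex`, `lwd`) are arbitrary choices, not values from the talk.

```r
# Hypothetical example data (assumption): 12 points, two dimensions
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x = x, y = y)

kmeansObj <- kmeans(dataFrame, centers = 3)

# Scatterplot of the data, colored by cluster label (1, 2, or 3)
plot(x, y, col = kmeansObj$cluster, pch = 19, cex = 2)

# Overlay the centroids; pch = 3 is the plus symbol
points(kmeansObj$centers, col = 1:3, pch = 3, cex = 3, lwd = 3)
```

Row i of `centers` belongs to cluster i, so `col = 1:3` gives each centroid the same color as its points.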

1:41

Finally, another way that you can visualize the clustering information that comes out of an algorithm like kmeans is by looking at heatmaps. Here, using the same data, I've taken a different random sample of the data set. I resampled it and ran kmeans again, again with three centers, and I stored the result in an object called kmeans object two. Now I'm going to make an image plot of the data. The first plot, on the left here, is just an image of the original data.

Â 2:10

Then on the right-hand side I've reordered the rows of the data frame so that the clusters are put together. Here you can see that as you go up and down this matrix, the data points in the same cluster are next to each other.
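
A sketch of the two image plots, under stated assumptions: the data are simulated, the rows are shuffled with `sample()` as a stand-in for the resampling step, and `dataMatrix` and `kmeansObj2` are hypothetical names.

```r
# Hypothetical example data (assumption): 12 points, two dimensions
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x = x, y = y)

# Shuffle the rows so the clusters are no longer contiguous (assumption)
dataMatrix <- as.matrix(dataFrame)[sample(1:12), ]
kmeansObj2 <- kmeans(dataMatrix, centers = 3)

# Two panels side by side
par(mfrow = c(1, 2), mar = c(2, 4, 0.1, 0.1))

# Left: the data as stored, first row drawn at the top
image(t(dataMatrix)[, nrow(dataMatrix):1], yaxt = "n")

# Right: rows reordered so members of the same cluster are adjacent
image(t(dataMatrix)[, order(kmeansObj2$cluster)], yaxt = "n")
```

`image()` draws the rows of its argument along the horizontal axis, which is why the matrix is transposed first.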

So you can use this to look at high-dimensional data, such as image-type or matrix-type data, where you can reorganize the rows and the columns and look at clusters that are closer together or farther apart. That lets you look at your matrix data in an organized way so you can look for patterns. We'll talk a little more about this when we talk about hierarchical clustering, but again, you can use heatmap-type visualizations with other types of clustering algorithms, like kmeans, too.

So, just to summarize, kmeans is a handy algorithm for organizing and looking for patterns in high-dimensional data. A couple of things to keep in mind. It requires that you know the number of clusters, so you have to specify, at least roughly speaking, how many clusters there are. You can play with that a little bit to figure out which pattern looks the best, but there's no easy rule there, so you have to pick the clusters out by eye or through some other mechanism. There are a few algorithms for determining the number of clusters, using cross-validation, information theory, or other types of metrics, and here's a link on determining the number of clusters.
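
As one illustration of "playing with" the number of clusters, the sketch below fits kmeans for several values of k and compares the total within-cluster sum of squares. This heuristic is an assumption added here and is not one of the methods the talk names. The data and the names `ks` and `totWithin` are hypothetical.

```r
# Hypothetical example data (assumption): 12 points, two dimensions
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x = x, y = y)

# Fit k-means for k = 1..5 and record the total within-cluster sum of squares
ks <- 1:5
totWithin <- sapply(ks, function(k) {
  kmeans(dataFrame, centers = k, nstart = 10)$tot.withinss
})

# Look for the point where adding clusters stops helping much
plot(ks, totWithin, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The curve generally keeps falling as k grows, so it only suggests a value; the choice still comes down to judgment.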

Also, the kmeans algorithm is not deterministic. Depending on how it's implemented, the starting points are sometimes chosen at random, so it's often useful to run the kmeans algorithm a couple of times just to make sure you're not getting a very unstable finishing point. For example, if you run it three different times and every time you get a totally different pattern, then the algorithm may not have a very stable view of the data. So kmeans can be problematic in that way for certain types of datasets.
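
A rough sketch of such a stability check: run kmeans three times from different random starts and compare the results. The data, the seeds, and the names `fits`, `withinSS`, and `stableFit` are assumptions for illustration.

```r
# Hypothetical example data (assumption): 12 points, two dimensions
set.seed(1234)
x <- rnorm(12, mean = rep(1:3, each = 4), sd = 0.2)
y <- rnorm(12, mean = rep(c(1, 2, 1), each = 4), sd = 0.2)
dataFrame <- data.frame(x = x, y = y)

# Three runs, each from a different random starting configuration
fits <- lapply(1:3, function(i) {
  set.seed(i)
  kmeans(dataFrame, centers = 3)
})

# Similar totals across runs suggest the runs ended in similar solutions
withinSS <- sapply(fits, function(f) f$tot.withinss)
withinSS

# Cluster numbers are arbitrary, so cross-tabulate rather than compare labels
table(run1 = fits[[1]]$cluster, run2 = fits[[2]]$cluster)

# kmeans() can also try several starts itself and keep the best one
stableFit <- kmeans(dataFrame, centers = 3, nstart = 20)
```

If the two runs agree, each row of the cross-tabulation has a single nonzero entry, even when the cluster numbers differ.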

And here I've got a couple of links to videos and references that provide a lot more information about kmeans.

Â