0:00

[SOUND] In this session, we introduce internal measures for clustering validation.

Remember, we just discussed several kinds of external measures. The fortunate thing with external measures is that we do have [INAUDIBLE]: we have expert knowledge, we have priors. Unfortunately, in many cases we do not have [INAUDIBLE]. That is why we have to rely on internal measures.

Internal measures are based on the concept of clustering itself. We want the points within the same cluster, that is, intra-cluster, to be very compact: they should be close to each other. And we want points in different clusters, that is, the inter-cluster distances, to be as large as possible: we want the clusters well separated. That is why we want to maximize the intra-cluster compactness and also maximize the inter-cluster separation, and we want a tradeoff between these two. So let's look at how we define this.

Suppose we are given a clustering C with k clusters, C sub 1 to C sub k. Each cluster C sub i contains n sub i points. Then we introduce the function W of S, R, which is the sum of the weights of all the edges with one vertex in S and the other in R.

For example, take two points: one point is in one cluster, the other point is in another cluster. You look at the weight of the edge between them. The weight is usually defined as, say, the Euclidean distance, or whatever distance you are using for clustering.
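In symbols, this pairwise-weight function can be sketched as follows (my own shorthand, not necessarily the slide's exact notation; w(x, y) stands for whatever distance the clustering uses, e.g. Euclidean):

```latex
W(S, R) = \sum_{x \in S} \sum_{y \in R} w(x, y)
```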

Then, based on this, the sum of all the intra-cluster weights over all clusters is defined this way. The reason is that here you can see this is C sub i and this is C sub i: both vertices are in the same cluster, so these are intra-cluster edges. But for a pair of points A and B, the weight is actually counted twice, because A to B and B to A are both included. That is why you have to divide by two.

Okay. Summing over all the clusters, i goes from 1 to k. Then we define the sum of all the inter-cluster weights, which is calculated this way.

This is quite natural: one vertex is in the cluster C sub i and the other one is not in C sub i; simply put, it is in one of the other clusters. That is why these are inter-cluster edges. But since each pair of points is counted twice, you divide by 2. Then we sum all of these up, i from 1 to k over all the clusters, and we get W out, the inter-cluster weights summed together.
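Written out, the two quantities just described can be sketched as follows (my own rendering; here the bar over C sub i denotes the points outside C sub i, and the factor 1/2 removes the double counting mentioned above):

```latex
W_{in} = \frac{1}{2} \sum_{i=1}^{k} W(C_i, C_i),
\qquad
W_{out} = \frac{1}{2} \sum_{i=1}^{k} W(C_i, \overline{C_i})
```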

Now we also count the number of distinct intra-cluster edges. For the i-th cluster with n sub i points, the intra-cluster links number n sub i choose two. That is why you get this formula. Then, summing over all the clusters from 1 to k, you get N in.

Then, for the inter-cluster case, N out counts the pairs where one point is in cluster i and the other point is in cluster j; they are never in the same cluster. So you count all the edges between different clusters. That is the number of distinct inter-cluster edges.
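These counts can be sketched as follows (again my own rendering; they depend only on the cluster sizes, and together they cover all n choose 2 distinct pairs):

```latex
N_{in} = \sum_{i=1}^{k} \binom{n_i}{2},
\qquad
N_{out} = \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} n_i \, n_j
```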

Then the BetaCV measure is defined as W in divided by N in, which you can think of as the mean intra-cluster distance, over W out divided by N out, which you can think of as the mean inter-cluster distance. So BetaCV is essentially the ratio of the mean intra-cluster distance to the mean inter-cluster distance. For this measure, the smaller the value, the better the clustering, because a smaller numerator means the clusters are quite compact, and a bigger denominator means they are more separated. That is how the BetaCV measure is defined. Now let's look at some other definitions.
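As a minimal sketch (my own illustration, not code from the lecture), BetaCV can be computed from a pairwise-distance matrix and a set of cluster labels:

```python
import numpy as np

def betacv(D, labels):
    """BetaCV = (mean intra-cluster distance) / (mean inter-cluster distance).

    D is an (n, n) symmetric matrix of pairwise distances and labels is a
    length-n sequence of cluster ids. Smaller values mean better clustering.
    """
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]   # True for intra-cluster pairs
    iu = np.triu_indices(len(labels), k=1)      # each distinct pair counted once
    intra = D[iu][same[iu]]                     # the N_in intra-cluster edges
    inter = D[iu][~same[iu]]                    # the N_out inter-cluster edges
    return intra.mean() / inter.mean()          # (W_in/N_in) / (W_out/N_out)
```

For two tight, well-separated 1-D clusters such as {0, 1} and {10, 11}, this gives 0.1: the mean intra-cluster distance is 1 and the mean inter-cluster distance is 10, indicating a good clustering.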

One popular one is called normalized cut, defined as follows. The normalized cut is defined by summing over all k clusters; for every cluster you compute this ratio. More explicitly, we can interpret it this way: the numerator is the inter-cluster weight, and the denominator is the sum of the intra- and inter-cluster weights. Summing this ratio for i from 1 to k, you get the normalized cut. Based on this, you want the inter-cluster distances to be large, so the clusters are far separated compared to the intra-cluster ones. That is why the bigger the normalized cut value, the better the clustering.
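With the same W notation as before, the per-cluster ratio just described can be sketched as follows (my own rendering; note the edge weights here are distances, which is why a larger value is better):

```latex
NC = \sum_{i=1}^{k}
\frac{W(C_i, \overline{C_i})}{W(C_i, C_i) + W(C_i, \overline{C_i})}
```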

Then there is another definition called modularity. This one is especially popular for graph clustering. It is defined in this way. You can see that this part is the same as in the normalized cut formula, like this, but this part is the square of it. Okay. So, simply put, this term is the observed fraction of the weights within clusters, and this term is the expected one. So modularity actually measures the difference between the observed value and the expected one. The smaller the value, the better the clustering, because then the intra-cluster distances are lower than expected. So this is the modularity definition used for graph clustering.
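As a minimal sketch (my own illustration, following the standard graph formulations of these two measures on a distance-weighted complete graph), normalized cut and modularity can be computed like this:

```python
import numpy as np

def nc_and_modularity(D, labels):
    """Normalized cut and modularity on a distance-weighted graph.

    D is an (n, n) symmetric pairwise-distance matrix; labels holds the
    cluster ids. With distance weights, a larger normalized cut and a
    smaller modularity both indicate a better clustering.
    """
    labels = np.asarray(labels)
    total = D.sum()                          # W(V, V): every edge counted twice
    nc, q = 0.0, 0.0
    for c in np.unique(labels):
        mask = labels == c
        w_i_all = D[mask].sum()              # W(C_i, V), the volume of C_i
        w_in = D[np.ix_(mask, mask)].sum()   # W(C_i, C_i), pairs counted twice
        w_out = w_i_all - w_in               # W(C_i, complement of C_i)
        nc += w_out / w_i_all                # normalized-cut term for C_i
        q += w_in / total - (w_i_all / total) ** 2   # observed minus expected
    return nc, q
```

For the same two tight 1-D clusters {0, 1} and {10, 11}, this gives a normalized cut of about 1.90 (close to its maximum of k = 2) and a negative modularity, both signaling a good clustering.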

[MUSIC]
