0:00

[MUSIC]

Â Hi!

Â In this session we are going to discuss the first group

Â of external measures called matching-based measures.

Â For matching-based measures we will first introduce purity versus maximum matching.

Â We still use a similar notation,

Â suppose we have ground truth we get three categories.

Â T1 actually marked as the brown points.

Â 0:34

T2 marked as a green points.

Â T3 marked as black points.

Â These are the ground truths.

Â Then the clustering we finally got is, cluster C sub 1,

Â actually it's marked as the ellipse of the orange color, okay.

Â 1:05

So what is a purity?

Â Actually quantifies the extent that cluster C sub i contains

Â the points only from one ground truth's partition.

Â Suppose we have R clusters the finer we get.

Â So for purity sub i actually is you first count how many

Â points in this cluster then you try to see the maximum

Â 1:35

The maximum number of ground truths actually is the black points.

Â You will see the max number is one, two, three, four, five, six, seven.

Â So you get this one seven, but the total number is nine,

Â so you get a purity of seven over nine, okay.

Â 1:54

Then what is the total purity of the whole clustering, C?

Â Essentially this total purity is, each one's purity.

Â Suppose you have R clusters.

Â Each one's purity get their corresponding proportion.

Â So you add them together, so that's the total purity.

Â Or you can transform this formula easily into this one, okay.

Â 2:20

Of course, the perfect clustering is in the case when the purity is one,

Â that means everything is very pure, you add them together, it's one.

Â And the number of clusters you obtained is the same

Â as the number of groups in the ground truths.

Â Then we look at these two tables.

Â Let's first look at the green table.

Â The green table, the purity 1.

Â That means the first cluster, you get a maximum of 30.

Â So the purity is 30 over the total number of points.

Â It's 50 so you get 30 / 50.

Â Purity 2, obviously you get 20 / 25.

Â Then purity 3 is 25 / 25, it's really 1.

Â Then the total purity for

Â the whole clustering is, according to this formula, we'll get a maximum y is

Â (30 + 20 + 25) over the total number of points, you get 0.75.

Â 3:20

Interestingly, if you look at the orange table, orange colored table,

Â you will get the exact number of the same thing is 30 over 50, 20 over 25,

Â 25 over 25 and finally to get the purity is exactly the same formula.

Â 3:37

But you may find one undesirable thing for

Â this orange colored table because if you can see,

Â the ground truth T2 total we get 50, the actual

Â was partitioned into two clusters, C1 and C2.

Â That means both C1 and C2 would say, we belong to ground truth T2.

Â 4:02

But on the green table you probably can see, it's different,

Â because C1 more pair with T3 and C2 more pair with T2.

Â So that motivates people proposed maximum matching.

Â That means one cluster can only match to one partition.

Â If we have this rule then we have to find a pairwise matching.

Â 4:28

That means when we look at this we look at the weight, we will allow these

Â element only belong to one cluster.

Â The ij and ij.

Â The ij must be exactly the same then

Â if this element belonged to M then it only belonged to one.

Â So then we need to look at the what is maximum weight matching.

Â The maximum weight matching simply says you look at the whole group and

Â then you look at all over n points

Â you try to find the final matching together should be maximized.

Â 5:17

pairwise with C2, and T1 pairwise with C3, okay.

Â That's why green's maximum matching is equal to purity is 0.75.

Â However, for this orange color table,

Â you probably can see is if you assign T2 to C1.

Â Then you can assign T2 to C2 then you have to assign T3 to C2

Â then finally you get this matching 0.6.

Â However if you assign, T3 to C1, T2 to C2,

Â T1 to C3 you will get matching as 0.65 to

Â that extent this assignment is better because

Â the maximum matching, the value is bigger.

Â So that's the reason for the maximum matching, you do have restriction,

Â you have the pairwise assignment.

Â 6:16

Then, another popular use matching-based measure called F-measure.

Â F-measure has been very popularly used in information retrieval.

Â F-measure is essentially computed by precision and

Â recall, let's look at how the precisioning is defined, okay.

Â 6:35

The precision define is for this particular matching what you want to see

Â is for precision T sub i what you want to find is

Â what's a maximum y matching this particular cluster then divide

Â by the number of total number of points in this cluster so

Â in that sense your precision actually is the same as purity.

Â That means if you get a fraction of C sub i from the majority partition of C sub j.

Â Then the j's partition contains a max number of

Â points from C sub i then you get this formula.

Â For this formula, you get this the precision and the purity,

Â actually this one is the same.

Â You will get 30/50, 20/25 and 25/25.

Â Okay, that's the precision, what about recall?

Â Recall actually is the fraction of the points in the partition

Â shared in common with cluster C sub i.

Â Simply says, if you assign T3 to C1,

Â what you want to see is you've got a 30 but

Â T3 actually got 35 granted you only capture 30.

Â That means you're recall is 30 over 35.

Â That's exactly, you probably can see the formula.

Â Similarly, for the second partition if you assign

Â ground truths T2 to C2, what you can see is,

Â you only get 20/40 so your recall is 20/40.

Â Similarly for the third one when you assign

Â the ground truth of T1 to C3 you get 25/25.

Â That means the fraction of the point in this partition share in

Â common with the cluster C sub i, then you get this recall.

Â 8:41

Now what about an F measure?

Â F measure actually is a harmonic means of precision and recall.

Â So F measure for a cluster C sub i is the harmonic

Â means of precision sub i and recall sub i.

Â If you transform the formula will become like this one.

Â 9:02

Then for F measure for the whole clustering.

Â What you need is you just need to average all the clusters the F measures,

Â that means supposed you get r clusters then you will get F measure from 1 to r.

Â You add them together divided by r, okay.

Â So for this green table what you can see is,

Â where you get F1 is because you got 2 times 30 divided

Â by total is these two add together you get 85.

Â Then for F2 you get 2 times 20 divided by these two adding together, okay.

Â F3 of course is 1, no matter what you calculate it.

Â Then you average all these three.

Â You get F-measure this number.

Â [MUSIC]

Â