[SOUND]. In this session we introduce another group of external measures, called entropy-based measures. Entropy is a very useful concept from information theory, and information theory is used heavily in data mining and machine learning. Essentially, entropy represents the amount of disorder, or uncertainty, of the information in the partitions. We again use this figure to represent the setup conceptually: the ground-truth partitions are shown as points of different colors, and the clusters as different ellipses.

For the entropy of a clustering C, suppose we have r clusters. We compute the entropy contribution of each particular cluster, and adding these contributions over all r clusters gives the entropy of the clustering C. For each cluster, the contribution is based on the probability of cluster C_i, which is the number of points in that cluster divided by the total number of points.

The entropy of the partitioning T is defined in a similar way. Suppose we have k ground-truth groups, indexed by j from 1 to k. For each ground-truth group T_j we get its probability p(T_j) in the same way, and adding the contributions over all k groups gives the entropy of the whole partitioning T.

What we are really interested in is the conditional entropy of T with respect to a cluster C_i. That means we want to see how the ground truth is distributed within each cluster. Here j indexes the ground-truth groups and i indexes the clusters, so from this distribution we can work out the entropy of T with respect to cluster C_i. To get the conditional entropy of T with respect to the whole clustering C, we add these together: the term above is only for cluster C_i, but in total we have r clusters, so we add them up weighted by the cluster probabilities, and that gives the conditional entropy of T with respect to the whole clustering C.

Conceptually, the more a cluster's membership is split across different partitions, the higher the conditional entropy, and the less desirable the clustering. You can see that if the partitions are spread widely across different clusters, that is not good. For a perfect clustering the conditional entropy is 0; the worst possible value is log k. We can also use a transformation formula: the conditional entropy equals the joint entropy minus the clustering's entropy, that is, H(T|C) = H(C,T) - H(C). This is the standard conditional-entropy identity; we will not go into the details here, but you can check it.
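To make these definitions concrete, here is a minimal NumPy sketch (not the lecturer's code; the function names and toy label arrays are illustrative, and base-2 logarithms are assumed) that computes H(T|C) from cluster labels C and ground-truth labels T, and checks the identity H(T|C) = H(C,T) - H(C) mentioned above.

```python
import numpy as np

def contingency(C, T):
    """n[i, j] = number of points assigned to cluster i that carry ground-truth label j."""
    ci = np.unique(C, return_inverse=True)[1]
    tj = np.unique(T, return_inverse=True)[1]
    n = np.zeros((ci.max() + 1, tj.max() + 1))
    np.add.at(n, (ci, tj), 1)
    return n

def entropy(labels):
    """Shannon entropy H(X) = -sum_x p(x) log2 p(x) of a labelling."""
    p = np.unique(labels, return_counts=True)[1] / len(labels)
    return -np.sum(p * np.log2(p))

def conditional_entropy(C, T):
    """H(T|C): how mixed the ground truth is inside each cluster, weighted by cluster size."""
    n = contingency(C, T)
    N = n.sum()
    h = 0.0
    for row in n:                        # one row per cluster C_i
        p_c = row.sum() / N              # p(C_i)
        p_t = row[row > 0] / row.sum()   # p(T_j | C_i), zero entries dropped
        h -= p_c * np.sum(p_t * np.log2(p_t))
    return h

C = np.array([0, 0, 1, 1, 2, 2])   # cluster labels (hypothetical toy data)
T = np.array([0, 0, 1, 1, 2, 2])   # ground-truth labels: a perfect match
print(conditional_entropy(C, T))   # 0.0 for a perfect clustering

# Check the identity H(T|C) = H(C,T) - H(C): joint entropy over the (cluster, truth) pairs minus H(C)
joint = entropy([f"{c}-{t}" for c, t in zip(C, T)])
print(np.isclose(conditional_entropy(C, T), joint - entropy(C)))   # True
```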
Another very useful external measure is Normalized Mutual Information. We use a similar figure to sketch the idea. Mutual information is also defined in information theory, and it is likewise very useful in machine learning and data mining. Essentially, the mutual information quantifies the amount of shared information between the clustering C and the partitioning T. In the formula we have r clusters and k ground-truth groups, and the mutual information sums over all these pairs. The formula looks similar to entropy, but it is different: inside the logarithm we have the joint probability p_ij divided by the product of the clustering's probability and the partitioning's probability. That means we measure the dependency between the observed joint probability p_ij of C and T and the expected joint probability p(C_i) * p(T_j) under the independence assumption. If C and T are truly independent, then p_ij equals p(C_i) * p(T_j); the ratio is 1, and when we take the logarithm it becomes 0. That case is not good, because it implies the ground-truth groups are scattered across the different clusters. However, there is no upper bound on the mutual information, which is less desirable, so we introduce the normalized mutual information. We normalize so that the range is from 0 to 1 at the highest. A value close to 1 indicates a good clustering, while a value close to 0 means the clustering is almost equivalent to a random, independent assignment. The normalized mutual information is essentially the mutual information divided by the square root of the product of the clustering's entropy and the partitioning's entropy; you can also transform it into the equivalent form shown here. This is quite useful, because by this external measure you know your clustering is close to perfect when the value is very close to 1. [MUSIC]
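As a companion sketch (again illustrative, not the lecturer's code, with the same assumptions: NumPy, base-2 logarithms, hypothetical toy labels), this computes the mutual information from the observed joint probabilities and the marginals, then normalizes it by the square root of the product of the two entropies to obtain a score in [0, 1].

```python
import numpy as np

def contingency(C, T):
    """n[i, j] = number of points in cluster i with ground-truth label j."""
    ci = np.unique(C, return_inverse=True)[1]
    tj = np.unique(T, return_inverse=True)[1]
    n = np.zeros((ci.max() + 1, tj.max() + 1))
    np.add.at(n, (ci, tj), 1)
    return n

def entropy(labels):
    """H(X) = -sum_x p(x) log2 p(x)."""
    p = np.unique(labels, return_counts=True)[1] / len(labels)
    return -np.sum(p * np.log2(p))

def mutual_information(C, T):
    """I(C,T) = sum_ij p_ij * log2( p_ij / (p(C_i) * p(T_j)) )."""
    p_ij = contingency(C, T) / len(C)
    p_c = p_ij.sum(axis=1, keepdims=True)    # marginal p(C_i), column vector
    p_t = p_ij.sum(axis=0, keepdims=True)    # marginal p(T_j), row vector
    nz = p_ij > 0                            # treat 0 * log 0 as 0
    return np.sum(p_ij[nz] * np.log2(p_ij[nz] / (p_c * p_t)[nz]))

def nmi(C, T):
    """NMI(C,T) = I(C,T) / sqrt(H(C) * H(T)), which lies in [0, 1]."""
    return mutual_information(C, T) / np.sqrt(entropy(C) * entropy(T))

C = np.array([0, 0, 1, 1, 2, 2])              # cluster labels (toy data)
T = np.array([0, 0, 1, 1, 2, 2])              # ground truth: a perfect clustering
print(nmi(C, T))                              # ~1.0 for a perfect clustering
print(nmi(np.array([0, 1, 0, 1, 0, 1]), T))   # 0.0: this clustering is independent of the ground truth
```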