0:18

In particular, we're going to cover a different kind of approach than generative models, and that is similarity-based approaches.

So the general idea of similarity-based clustering is to explicitly specify a similarity function to measure the similarity between two text objects.
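As a concrete illustration (this particular function is our own example, not one the lecture prescribes), a common similarity function for text objects is cosine similarity over term vectors; a minimal sketch:

```python
import math

def cosine_similarity(x, y):
    """Cosine similarity between two term vectors (lists of term weights)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # treat an empty vector as dissimilar to everything
    return dot / (norm_x * norm_y)
```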

Now this is in contrast with a generative model, where we implicitly define the clustering bias by using a particular objective function, such as a likelihood function.

0:52

The whole process is driven by optimizing the likelihood, but here we explicitly provide a view of what we think is similar.

And this is often very useful, because it allows us to inject any particular view of similarity into the clustering process.

So once we have a similarity function, we can then aim at optimally partitioning the data into clusters, or into different groups, trying to maximize the intra-group similarity and minimize the inter-group similarity.

That is, we ensure that objects put into the same group are similar, but objects put into different groups are not similar. These are the general goals of clustering, and there is often a trade-off between achieving both goals.
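To make these two goals concrete, here is a small sketch (our own illustration, assuming a user-supplied pairwise similarity function `sim`) that measures the average intra-group and inter-group similarity of a partition:

```python
from itertools import combinations

def avg_intra_similarity(groups, sim):
    """Average similarity over all pairs of objects within the same group."""
    pairs = [sim(a, b) for g in groups for a, b in combinations(g, 2)]
    return sum(pairs) / len(pairs) if pairs else 0.0

def avg_inter_similarity(groups, sim):
    """Average similarity over all pairs of objects drawn from different groups."""
    pairs = [sim(a, b)
             for g1, g2 in combinations(groups, 2)
             for a in g1 for b in g2]
    return sum(pairs) / len(pairs) if pairs else 0.0
```

A good clustering drives the first quantity up and the second one down.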

Now there are many different methods for doing similarity-based clustering, and in general I think we can distinguish two strategies at a high level.

One is to progressively construct a hierarchy of clusters, and this often leads to hierarchical clustering.

We can further distinguish two ways to construct such a hierarchy, depending on whether we start with the whole collection and divide it, or start with individual objects and gradually group them together. One is bottom-up, which can be called agglomerative, where we gradually group similar objects into larger and larger clusters until we group everything together. The other is top-down, or divisive; in this case we gradually partition the whole data set into smaller and smaller clusters.

The other general strategy is to start with an initial tentative clustering and then iteratively improve it.

And this often leads to a flat clustering; one example is k-Means. So as I just said, there are many different clustering methods available.

And a full coverage of all the clustering methods would be beyond the scope of this course.

But here we are going to talk about two representative methods in some detail:

3:14

One is Hierarchical Agglomerative Clustering, or HAC; the other is k-Means.

So first, let's look at agglomerative hierarchical clustering. In this case, we're given a similarity function to measure the similarity between two objects.

And then we can gradually group similar objects together in a bottom-up fashion to form larger and larger groups.

This always forms a hierarchy, and then we can stop when some stopping criterion is met.

It could be either that some number of clusters has been achieved or that a threshold for similarity has been reached.

Â 3:52

There are different variations here, and they mainly differ in the ways to compute the group similarity based on the individual objects' similarity. So let's illustrate how we can induce a structure based on just similarity.

So we start with all the text objects, and we can then measure the similarity between them.

Of course, this is based on the provided similarity function, and then we can see which pair has the highest similarity.

We then just group them together, and then we see which pair is the next one to group.
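The merging process just described can be sketched as a greedy loop (an illustrative implementation of our own, assuming single-link group similarity, which is one of the options discussed below, and a user-supplied pairwise `sim` function):

```python
from itertools import combinations

def hac(objects, sim, num_clusters):
    """Greedy bottom-up (agglomerative) clustering.

    sim(a, b) gives pairwise object similarity; single-link (max over
    cross-group pairs) is assumed here as the group similarity.
    """
    clusters = [[o] for o in objects]          # start: each object is its own cluster

    def group_sim(g1, g2):                     # single-link group similarity
        return max(sim(a, b) for a in g1 for b in g2)

    while len(clusters) > num_clusters:
        # find the pair of clusters with the highest similarity ...
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: group_sim(clusters[p[0]], clusters[p[1]]))
        # ... and merge them (j > i, so deleting j is safe)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Here the stopping criterion is a target number of clusters; a similarity threshold could be used instead, as mentioned above.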

4:50

Now, depending on our applications, we can use the whole hierarchy as a structure for browsing, for example.

Or we can choose a cutoff, let's say cut here to get four clusters, or we can use a threshold to cut. Or we can cut at a high level to get just two clusters. So this is the general idea; now, think about how to implement this algorithm.

You'll realize that we have everything specified except for how to compute group similarity.

5:24

We are only given the similarity function on two objects, but as we merge groups together, we also need to assess the similarity between two groups.

There are different ways to do that, and there are three popular methods:

Single-link, complete-link, and average-link. Given two groups, the single-link algorithm defines the group similarity as the similarity of the closest pair across the two groups.

Complete-link defines the similarity of the two groups as the similarity of the farthest pair.

Average-link defines the similarity as the average of the similarities of all the pairs across the two groups.
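These three definitions can be written down directly (an illustrative sketch of our own, assuming a pairwise similarity function `sim`):

```python
def single_link(g1, g2, sim):
    """Similarity of the closest pair across the two groups."""
    return max(sim(a, b) for a in g1 for b in g2)

def complete_link(g1, g2, sim):
    """Similarity of the farthest pair across the two groups."""
    return min(sim(a, b) for a in g1 for b in g2)

def average_link(g1, g2, sim):
    """Average similarity over all cross-group pairs."""
    return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))
```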

So it's much easier to understand the methods by illustrating them, so here are two groups, g1 and g2, with some objects in each group.

And we know how to compute the similarity between two objects, but the question now is, how can we compute the similarity between the two groups?

6:57

Complete-link, on the other hand, is in some sense pessimistic, taking the similarity of the farthest pair as the similarity of the two groups.

So we ensure that if the two groups have a high similarity, then every pair of objects across the two groups will have high similarity.

7:29

Now average-link is in between, so it takes the average over all these pairs.

Now these different ways of computing group similarities will lead to different clustering algorithms.

And they would in general give different results, so it's useful to take a look at their differences and to make a comparison.

8:27

is well connected with the other group. So the two leaders of the two groups can have a good relationship with each other, and then they will bring the two groups together.

In this case, the cluster is loose, because there's no guarantee that other members of the two groups are actually very close to each other.

Sometimes they may be very far away. Now, in this case the method is also based on an individual decision, so it could be sensitive to outliers.

Complete-link is in the opposite situation, where we can expect the clusters to be tight. It's also based on an individual decision, so it can be sensitive to outliers.

Again, to continue the analogy of having a party of people, complete-link would mean that when two groups come together, they want to ensure that even

9:37

going to be insensitive to outliers. Now, in practice, which one is the best?

Well, this would depend on the application; sometimes you need loose clusters and want to aggressively cluster objects together, and then maybe single-link is good.

But other times you might need tight clusters, and complete-link might be better.

But in general, you have to empirically evaluate these methods for your application to know which one is better.

10:07

Now, next let's look at another example of a method for similarity-based clustering.

In this case, which is called k-Means clustering, we will represent each text object as a term vector.

And then we assume a similarity function defined on two objects. Now, we're going to start with some tentative clustering result by just selecting k randomly selected vectors as the centroids of k clusters, and treat them as centers, as if they each represent a cluster.

So this gives us the initial tentative clustering; then we're going to iteratively improve it.

The process goes like this: once we have decided these centroids, we're going to assign each vector to the cluster whose centroid is closest to it.

So basically, we're going to measure the distance between this vector and each of the centroids, and see which one is the closest.

And then we just put this object into that cluster; this gives a tentative assignment of objects into clusters.

And we're going to partition all the objects into k clusters based on our tentative clustering and centroids.

11:28

Then we can re-compute the centroids based on the objects allocated to each cluster.

This is to adjust the centroids, and then we can repeat this process until the similarity-based objective function, in this case the within-cluster sum of squares, converges. Theoretically, we can show that this process actually minimizes the within-cluster sum of squares, which is a well-defined objective function given k clusters. It can also be shown that this process will converge to a local minimum.
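The whole procedure can be sketched as follows (a minimal implementation of our own, assuming squared Euclidean distance on plain Python vectors; a production version would differ in initialization and efficiency):

```python
import random

def kmeans(vectors, k, max_iter=100, seed=0):
    """k-Means: start from k randomly chosen vectors as centroids, then
    alternate (1) assigning each vector to its closest centroid and
    (2) recomputing each centroid as the mean of its cluster, until the
    within-cluster sum of squares (WCSS) stops decreasing."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)        # initial tentative centroids
    prev_wcss = float("inf")
    clusters = []
    for _ in range(max_iter):
        # step 1: tentative assignment of each vector to its nearest centroid
        clusters = [[] for _ in range(k)]
        wcss = 0.0
        for v in vectors:
            dists = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in centroids]
            j = dists.index(min(dists))
            clusters[j].append(v)
            wcss += dists[j]
        # step 2: adjust each centroid to the average of the vectors in its cluster
        for j, cl in enumerate(clusters):
            if cl:
                centroids[j] = [sum(dim) / len(cl) for dim in zip(*cl)]
        if wcss >= prev_wcss:                 # objective converged
            break
        prev_wcss = wcss
    return clusters, centroids
```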

If you think about this process for a moment, it might remind you of the EM algorithm for the mixture model.

12:34

In the EM algorithm, you may recall that we repeat the E-step and M-step to improve our parameter estimation.

In this case, we're going to improve the clustering result iteratively by also doing two steps.

And in fact, the two steps are very similar to the EM algorithm, in that when we allocate a vector to one of the clusters based on our tentative clustering, it's very similar to inferring which distribution has been used to generate the document in the mixture model.

So it is essentially similar to the E-step. So what's the difference? Well, the difference is here:

we don't make a probabilistic allocation as in the case of the E-step; rather, we make a hard choice.

We're going to make a call: if this data point is closest to cluster two, then we're going to say, you are in cluster two.

So there's no splitting; we're not going to say, you are assumed to belong to cluster two with some probability. We're not going to have a probability; we're just going to put one object into precisely one cluster.

In the E-step, however, we do a probabilistic allocation, so we split the counts.

And we're not going to say exactly which distribution has been used to generate a data point.
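The contrast between the hard choice in k-Means and the probabilistic allocation in the EM E-step can be illustrated like this (our own sketch; the exp(-distance) weighting is just an illustrative stand-in for the posterior probabilities a real mixture model would compute):

```python
import math

def hard_assign(dists):
    """k-Means style: put the object entirely into the closest cluster."""
    j = dists.index(min(dists))
    return [1.0 if i == j else 0.0 for i in range(len(dists))]

def soft_assign(dists):
    """EM style (illustrative): split the object's count across clusters
    with weights proportional to exp(-distance), so closer clusters get
    a larger share but no cluster gets everything."""
    weights = [math.exp(-d) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]
```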

Now next, we're going to adjust the centroids, and this is very similar to the M-step, where we re-estimate the parameters.

That's when we'll have a better estimate of the parameters; so here we'll have a better clustering result by adjusting the centroids.

And note that the centroid is based on the average of the vectors in the cluster.

So this is also similar to the M-step, where we pull together counts and then normalize them.

The difference, of course, also comes from the difference in the E-step: we're not going to consider probabilities when we count the points.

In this case, in k-Means, we only count the objects allocated to this cluster.

And this is only a subset of the data points; but in the EM algorithm, we in principle consider all the data points, based on probabilistic allocations.

14:56

But in nature they are very similar, and that's why k-Means is also optimizing a well-defined objective function.

And it's guaranteed to converge to a local minimum. So, to summarize our discussion of clustering methods:

We first discussed model-based approaches, mainly the mixture model.

Here we use an implicit similarity function to define the clustering bias.

There is no explicitly defined similarity function; the model defines the clustering bias, and the clustering structure is built into the generative model.

That's why we can potentially use a different model to recover a different structure.

15:40

Complex generative models can be used to discover complex clustering structures.

We did not talk about this in full, but we can easily design a generative model to generate hierarchical clusters.

We can also use priors to further customize the clustering algorithm, for example to control the topic of one cluster or multiple clusters.

However, one disadvantage of this approach is that there is no easy way to directly control the similarity measure.

16:35

The k-Means algorithm has a clearly defined objective function, and it's also very similar to a model-based approach.

For the hierarchical clustering algorithm, on the other hand, it is harder to specify the objective function, so it's not clear what exactly is being optimized.

17:00

Both approaches can generate term clusters and document clusters. Term clusters can in general be generated by representing each term with some text content.

For example, take the context of each term as a representation of that term, as we have done in semantic relation learning.

And then we can certainly cluster terms based on the actual text representations.

Of course, term clusters can be generated by using generative models as well, as we've seen.

Â [MUSIC]
