Hi, in this session we are going to introduce an interesting extension of hierarchical clustering called CHAMELEON, which performs graph partitioning on the KNN graph, that is, the k-nearest-neighbor graph, of the data. This graph-partitioning-based approach was developed in 1999 by a group of researchers at the University of Minnesota.

CHAMELEON has some interesting features. It measures similarity based on a dynamic model: two clusters are merged only if the interconnectivity and the closeness between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters.

CHAMELEON is a graph-based, two-phase algorithm. In the first phase, it uses a graph partitioning algorithm to cluster the objects into a large number of relatively small sub-clusters, which we also call graphlets. Then, in the second phase, it uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters, or graphlets.

Let's look at the overall framework of CHAMELEON. Suppose we have a large data set. We first use the k-NN graph to turn it into a sparse graph. What is a k-NN graph? Essentially, two points p and q are connected if q is among the top k closest neighbors of p; based on this k-nearest-neighbor relation, we construct the graph. We then partition this graph into a good number of smaller sub-clusters, or graphlets. After this partitioning, we merge the graphlets back into high-quality clusters.

How do we merge those graphlets back into bigger clusters? The merge is based on two measures: one called relative interconnectivity, the other called relative closeness. Relative interconnectivity is the connectivity between C_1 and C_2 over their respective internal connectivity, and relative closeness is the closeness between clusters C_1 and C_2 over their respective internal closeness.

Let's look at these measures in some detail. The first ingredient is the KNN graph. Consider a two-dimensional data set as an example. To build the 1-nearest-neighbor graph, for each point, each object, you find its single closest neighbor and add an edge to it; the result is the 1-nearest-neighbor graph. For the 2-nearest-neighbor graph, you connect each point to its top two nearest neighbors, and for the 3-nearest-neighbor graph, you connect each point to its top three nearest neighbors.

Next, consider the absolute interconnectivity between C_i and C_j. The absolute interconnectivity is essentially the sum of the weights of the edges that connect vertices in C_i to vertices in C_j; that is, you take the weighted sum of the edges crossing between the two clusters. The internal connectivity of a cluster C_i is essentially the size of its min-cut bisector: find the minimum cut that partitions the cluster into two roughly equal parts, then take the weighted sum of the edges crossing that cut.
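To make the k-NN graph construction concrete, here is a minimal sketch in Python. The function name knn_graph and the inverse-distance edge weights are illustrative assumptions, not fixed by CHAMELEON; the algorithm only requires that edge weights reflect the similarity between neighboring points.

```python
import numpy as np

def knn_graph(points, k):
    """Return {(i, j): weight} edges of the k-nearest-neighbor graph.

    An edge connects points i and j if j is among the k closest
    neighbors of i (or vice versa), matching the definition above.
    """
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]      # pairwise differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))      # Euclidean distance matrix
    np.fill_diagonal(dist, np.inf)                # a point is not its own neighbor
    edges = {}
    for i in range(len(pts)):
        for j in np.argsort(dist[i])[:k]:         # the k nearest neighbors of i
            a, b = (i, int(j)) if i < j else (int(j), i)   # undirected edge
            edges[(a, b)] = 1.0 / (1.0 + dist[i, j])       # similarity weight (assumed form)
    return edges

# Example: the 2-NN graph of five 2-D points.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
print(knn_graph(points, k=2))
```

Note that the graph stays sparse by construction: each point contributes at most k edges, which is what makes the subsequent partitioning step tractable on large data sets.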
The relative interconnectivity, RI, between C_i and C_j is then defined as their absolute interconnectivity divided by, or you can say normalized by, the average of their respective internal interconnectivities.

Next, let's look at relative closeness and the merging of sub-clusters. The relative closeness between a pair of clusters C_i and C_j is defined as follows. First, the absolute closeness between C_i and C_j is the average weight of the edges connecting vertices in C_i to vertices in C_j. Second, the internal closeness of C_i, and likewise of C_j, is the average weight of the edges that belong to the min-cut bisector of that cluster: you take the min-cut bisector that cuts the cluster in two and compute the average weight of its edges. You can then see that the relative closeness is the absolute closeness normalized by the internal closenesses, weighted in proportion to the cluster sizes.

Then how do we merge sub-clusters? We only consider merging pairs of clusters whose relative interconnectivity and relative closeness are both above some user-specified thresholds, and among those mergeable pairs we merge the one that maximizes a function combining relative interconnectivity and relative closeness. That means even when several pairs are mergeable, we still pick the merge that maximizes this function.

Based on the implementation and experiments, we can see that CHAMELEON can cluster complex data sets: points forming clusters of different densities and different shapes, with lots of noise points mixed in. Even with different shapes, different densities, and plenty of noise, CHAMELEON still generates pretty high-quality clusters. That simply says that CHAMELEON, using this graph partitioning and merging method, is quite robust and produces high-quality clusters. Even though it is in the spirit of hierarchical clustering, it gains a lot from this graph-based method.
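To tie these definitions together, here is a hedged sketch in Python of the two measures and the combined merge score, assuming the sub-clusters live in a weighted undirected networkx graph and each sub-cluster has at least two nodes. The min-cut bisector is approximated with Kernighan-Lin bisection (the original CHAMELEON uses the hMETIS partitioner), the combination RI * RC**alpha is the scoring function from the CHAMELEON paper, and all helper names are illustrative:

```python
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

def crossing_weights(G, A, B):
    """Weights of edges with one endpoint in A and the other in B."""
    return [d.get("weight", 1.0) for u, v, d in G.edges(data=True)
            if (u in A and v in B) or (u in B and v in A)]

def bisector_weights(G, cluster):
    """Edge weights across an (approximate) min-cut bisector of a cluster."""
    sub = G.subgraph(cluster)
    part_a, part_b = kernighan_lin_bisection(sub, weight="weight", seed=0)
    return crossing_weights(sub, part_a, part_b)

def relative_interconnectivity(G, ci, cj):
    ec_between = sum(crossing_weights(G, ci, cj))       # EC(Ci, Cj)
    ec_internal = 0.5 * (sum(bisector_weights(G, ci))   # average of the
                         + sum(bisector_weights(G, cj)))  # internal EC values
    return ec_between / ec_internal if ec_internal else 0.0

def relative_closeness(G, ci, cj):
    between = crossing_weights(G, ci, cj)
    s_between = sum(between) / len(between) if between else 0.0
    wi, wj = bisector_weights(G, ci), bisector_weights(G, cj)
    s_i = sum(wi) / len(wi) if wi else 0.0
    s_j = sum(wj) / len(wj) if wj else 0.0
    ni, nj = len(ci), len(cj)
    # Internal closeness weighted in proportion to the cluster sizes.
    s_internal = (ni * s_i + nj * s_j) / (ni + nj)
    return s_between / s_internal if s_internal else 0.0

def merge_score(G, ci, cj, alpha=2.0):
    """Combined criterion RI * RC**alpha; the mergeable pair of
    sub-clusters maximizing this score is merged next."""
    return (relative_interconnectivity(G, ci, cj)
            * relative_closeness(G, ci, cj) ** alpha)

# Example: score a candidate merge of two triangle-shaped sub-clusters.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 1.0), (1, 2, 0.9), (2, 0, 0.8),
                           (3, 4, 1.0), (4, 5, 0.9), (5, 3, 0.8),
                           (2, 3, 0.7), (1, 4, 0.6)])
print(merge_score(G, {0, 1, 2}, {3, 4, 5}))
```

The exponent alpha trades the two measures off against each other: alpha above 1 favors pairs that are mutually close, while alpha below 1 favors pairs that are strongly interconnected.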