Data repositories in which cases are related to subcases are identified as hierarchical. This course covers the representation schemes of hierarchies and algorithms that enable analysis of hierarchical data, as well as provides opportunities to apply several methods of analysis.

Associate Professor in the School of Computing, Informatics & Decision Systems Engineering at Arizona State University and Director of the Center for Accelerating Operational Efficiency

K. Selcuk Candan

Professor of Computer Science and Engineering and Director of ASU’s Center for Assured and Scalable Data Engineering (CASCADE)

In previous modules, we talked about different hierarchical data structures,

such as your computer files, organizing genomes and those sorts of things.

And then we began talking about how to visualize those with node-link

diagrams, but we also talked about just taking a generic dataset and

thinking about how to cluster this.

And so again, remember, if we have our library examples, so

we might have book number one.

And book number one, we may have some sort of dataset that looks like

number of pages, for example, on a book.

We might have the average reading level for a book.

We might have the number of unique words in a book.

So it may have some sort of complex data structure, Book 1, Book 2, etc.

And we may want to see if there's any relationship between these

different types of books.

And hierarchical data analysis provides us with yet

another way to group these different things together to start looking to

see if there's any relationships between these sort of elements.

And in this module, we're going to talk more about how to apply methods of

hierarchical data analysis, and go through examples of agglomerative clustering.

In the previous module, we discussed single-link clustering, where

the minimum distance between two elements, if we have multiple elements in a cluster,

becomes the entry in the distance matrix.

Here we're going to talk about other methods for

distance in agglomerative clustering as well.

And so single-link clustering is determined by one pair of points,

essentially by one link in the proximity graph.

So here, let's imagine that we have five books.

Just like I was sort of showing in our first slide, we had book one.

And

book 1 had some properties, like the number of pages, so it may have had 100 pages,

it may have been a fifth grade reading level, and so forth.

And then we have book 2, and maybe book 2 had 10 pages, and

maybe it's a first grade reading level, and so forth, all right?

And we can find the distance from Book 1 to Book 2 using Euclidean

distance or whatever.

And so this entry into our distance matrix is the distance between two books or

we can reverse that, we can think of this,

in this case, this is actually a similarity matrix.

And in this case, similarity is sort of the inverse of distance.

If you're very far away from each other, your similarity is close to 0,

and if you're very close to each other, your similarity is 1.
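One way to make this inverse relationship concrete is with a small formula. The lecture doesn't fix a specific conversion, so the mapping below is just one common convention, shown as an assumption:

```python
# One common convention (an assumption here; the lecture doesn't fix a
# formula) maps a distance d >= 0 to a similarity in (0, 1]: identical
# items (d = 0) get similarity 1, and far-apart items approach 0.
def similarity(d):
    return 1.0 / (1.0 + d)

print(similarity(0.0))   # 1.0  -- identical items
print(similarity(99.0))  # 0.01 -- very distant items
```

Other conversions exist (for example, rescaling distances into 1 minus a normalized distance); the key property is just that distance 0 maps to similarity 1 and large distances map toward 0.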

So here people have calculated the similarity.

So you notice along here, element 1 to 1 has a value of 1; it's completely similar.

Notice again we're mirrored along the diagonal.

And so what we're going to do here is we're going to determine

how to create this proximity graph.

So in previous modules, we talked about choosing the smallest distance,

so if we have a distance matrix, you choose the smallest distance.

This is because we are essentially trying to find which points are the most similar,

so we assume that things that are close together are the most similar.

In a similarity matrix we choose the largest similarity.

The reason for that is because similarity defines how similar things are so

the larger it is the more similar they are so they should be grouped together.

So what we do is we look and we find the largest number that's not one, and

remember we only really have to look at the half below the diagonal.

And so, 1 to 2 becomes the largest similarity.

And so, when 1 and 2 get grouped together, we're then going to go and

start making our next distance matrix like we talked about.

And this is with our single-link clustering example.

So this is the solution to using this distance matrix and

finding their single link clustering.

So let's walk through exactly how they got to this solution.

So remember, this is with a similarity matrix, right?

So I1 and I2 were paired together.

So we have I1, I2 because we're choosing the highest similarity.

I1, I2, I3,

I4 and I5.

And again, remember

our distance matrix is square.

And remember, in this case we're looking at actually the similarity matrix.

So again, along the diagonal, the most similar elements are 1.

So now we want to find the most similar elements.

So we have I1, I2, and how does this cluster with I3?

Well what we want to do is we want to find the similarity between I1 and

I3, and I2 and I3, because this is all the points in the cluster, right?

And since this is similarity and

we want to find the things that are the most similar, we're going to maximize

the similarity, or minimize the distance like we saw in the previous module.

So I1 to I3 was 0.1, and

I2 to I3, was 0.7.

So the most similar group is I2 to I3,

So this becomes 0.7 here.

Okay?

And we do the same thing for I1, I2.

to I4, so we got I1, I4,

I2, I4, so I1 to I4 was 0.65

And I2 to I4 was 0.6.

So since this is dealing with similarity,

we pick the largest one, 0.65, okay?

So now let's walk through again I1, I2 to I5.

How do we do this?

Again we're trying to maximize similarity between our cluster and a point so

we have to look at the similarity between all points between the two clusters.

Since there's only three points, there's only two similarities.

So I1 to I5 was 0.2, I2 to I5 was 0.5, so we're going to wind up with 0.5 here.

And now we can go back and we can say I3 to I4

is 0.4, I3 to I5 is 0.3, and I4 to I5 is 0.8.

So now we're going to cluster again based on the things that are most similar so

we search our data here to find the largest value less than one, so 0.8.

So that means that I4 and

I5 are going to be clustered together in our next distance matrix.

Now a smart question to ask might be, well what if this one had also been 0.8?

If both of those elements are the same,

it just depends on how you programmed up your algorithm to decide which to merge.

Is it going to merge the first one you found?

Or is it going to merge the last one you found that has equal values?

So, it really depends; that choice can make a big difference in what occurs.

You could also try merging everything at the same time, if they're equal.

So, there's lots of different choices.

In this case, we just are choosing the 0.8, so

what's our new distance matrix look like?

Well, we had cluster I1, I2; I3 didn't get merged with anything yet,

and then we had I4 and I5 as a cluster.

So now before we had a four by four distance matrix, now we have a three

by three distance matrix or I'm sorry similarity matrix in this case.

Okay, so again, diagonal is 1, and

we already know the similarity from I1, I2 to I3 was 0.7.

We already calculated that.


So now, we want the distance I1, I2 to I4, I5.

So remember, we have to calculate the similarity between every possible pair of points.

So we get the similarity from I1 to I4,

I2 to I4, I1 to I5, and I2 to I5.

So, I1 to I4 is 0.65, I2 to I4 is 0.6,

I1 to I5 is 0.2, and I2 to I5 is 0.5.

And then we want the maximum similarity here, so

we can get 0.65 here, so that's the maximum similarity.

And then we have I3 to I4 and I5,

so we've got I3 to I4, I5, okay?

So that means we've got I3, I4, I3, I5.

And so we can look here again.

I3 to I4 was 0.4, I3 to I5 was 0.3, so

the largest one here would've been 0.4.

So now, we pick our largest value, which was 0.7, so

that means I3 is going to merge with I1, I2.

And so, we're now left with just a two by two

distance matrix: I1, I2, I3, and I4, I5.

So now let's walk back through what we did.

So this was our first step, after the first step, we merged I1 and I2.

So we have I1 and I2.

So these got merged in our tree.

In our next step, we merged I4 and I5.

And then after that, I3 became merged with I1 and I2.

Now since there's only two clusters left,

of course those last two are going to merge in the final step.

And so our tree winds up like this.

So now let's take a look; the answer we had before is going to show up here,

so let me erase some stuff so it shows up nicely.

And we'll compare this to the answer we had before.

And so we can see 1 and 2 merged.

And then 3 merged with those, and 4 and 5, and so

we get the same sort of cluster that we had in this example.

So this is how we would walk through a single link clustering example with

the similarity matrix.

So remember, with the similarity matrix, we're trying to find the max similarity.

If we had had this as a distance matrix it would be find the minimum distance.

So it just depends if you're defining distance or similarity.
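The whole single-link walkthrough above can be sketched in a few lines of Python. One caveat: the lecture gives every off-diagonal similarity except the I1-to-I2 entry, which it only says is the largest, so the 0.90 below is an assumed placeholder, not a value from the slides:

```python
# A sketch of the single-link walkthrough. Indices 0-4 stand for I1-I5.
# sim(I1, I2) = 0.90 is an ASSUMED value; the lecture only says that
# pair has the largest similarity.
S = [
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
]

def single_link(sim):
    """Repeatedly merge the two most similar clusters until one remains."""
    clusters = [frozenset([i]) for i in range(len(sim))]
    merges = []
    while len(clusters) > 1:
        # Single-link scores a cluster pair by its MOST similar
        # cross-cluster points; we merge the best-scoring pair.
        a, b = max(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda p: max(sim[u][v] for u in p[0] for v in p[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(tuple(sorted(a | b)))
    return merges

print(single_link(S))
# [(0, 1), (3, 4), (0, 1, 2), (0, 1, 2, 3, 4)] -- same merge order as the lecture
```

Notice the merge order matches the walkthrough: I1 with I2 first, then I4 with I5, then I3 joining I1, I2, and finally everything together.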

Now, with single-link clustering,

there's a whole bunch of different ways this can look.

So here's another example; with single-link clustering,

we're going to wind up with these nested sort of structures where 2 and

5 are the closest to each other, 3 and 6, then those are closest.

Then we might end up getting 4, then number 1, so we get this sort of structure

where the closest things merge and you get this sort of nested structure out.

Now this is great if you expect some sort of nested structure in your data set, but

it doesn't always mean that this is the best sort of choice.

So the strengths of single-link clustering are, if these are our original points,

it's going to wind up splitting these into two nice clusters.

So we got a blue cluster and a red cluster.

So single-link clustering will work really well with this if we have sort of this split,

and it can handle non-elliptical shapes.

The limitations though are,

if we have any sort of overlap here, we're going to get these weird,

elongated sorts of clusters, and it becomes sensitive to noise and outliers.

If you look really closely here, these points all the way out here become blue,

and so we have a blue cluster that goes sort of all the way here,

we have a red cluster that goes to here.

So we have a whole bunch of weird overlap that occurs due to this noise and

outliers that can produce these long, elongated clusters.

So it's not good at handling this sort of noisy data structure.

So if you think your data has some clear separations, it will work pretty well.

If you have some noise and overlap,

single-link agglomerative clustering will have problems.

Now we don't have to do single-link though,

we can actually do what's called complete-link.

In complete-link, the distance between clusters is the maximum

distance between any pair of objects, or equivalently the smallest similarity between them.

So distance is defined by the two most dissimilar objects, and so

we can walk through that exact same dataset now with complete-link clustering.

So, where students get confused is in the selection choice.

So with single link clustering, we're always taking the maximum similarity.

In complete-link clustering we're taking the minimum similarity.

That's only to fill out the distance matrix.

Once we fill out the distance matrix,

we are still taking the highest similarity here.

So, the highest similarity is still 1 and 2.

So 1 and 2 get merged first no matter what in this example.

So now, our distance matrix becomes I1, I2,

I3, I4 and I5, but where the choice

between complete and single link comes in,

is going to be in this step on how we fill in the values here.

So I1 and I2 merged because this was the largest

similarity in the similarity matrix, all right?

So now, we want I1, I2 to I3.

So we've got I1, I3, and I2, I3 and

the distance from I1 to I3 is 0.1.

And the distance from I2 to I3 is 0.7, okay?

In single-link, we put 0.7 there,

because we wanted the two most similar objects between clusters.

In complete-link, we take the biggest distance or the most dissimilar ones.

So this becomes 0.1, okay?

So now, let's try this again for I1, I2, to I4.

So we've got I1, I4, I2,

I4, so I1, I4 is 0.65,

I2, I4 is 0.6.

So what do we pick, the most dissimilar, so 0.6 here.

And now we do the same for I5.

So we got I1, I5, I2, I5, so

we look at our matrix, I1 to I5 is 0.2.

I2 to I5 is 0.5, so we've got, 0.2, all right?

But I3 to I4 is still 0.4.

And I3 to I5 is 0.3 and

I4 to I5 is 0.8.

So still no big difference, because the most similar value is 0.8, so

remember, we're always merging on the most similar, so

we fill out the distance matrix differently, but

once we have the distance matrix filled out, we always merge on the most similar.

So now, we're back to something very close to what we had with single-link.

All right, so we know the distance

from I1, I2 to I3 is 0.1.

And now we want I1,

I2 to I4, I5, right?

So that means I1, I4, I1,

I5, I2, I4, I2, I5.

So I1, I4 was 0.65, I1, I5 is 0.2.

I2, I4 is 0.6 and I2,

I5 is 0.5.

Now remember again, in complete-link,

it is the ones that are the farthest away, so the most dissimilar.

So we wind up filling in 0.2 here, all right?

And now we need I3 to I4 and I5.

So I3 to I4 is 0.4, I3 to I5 is 0.3.

So the smallest one is 0.3.

And so now, we see that actually I3 merges to I4 and

I5 which is different than what we had before.

So then we wind up with our final distance matrix: I1, I2,

and I3, I4, I5, and so forth.

So we're going to wind up now with 1 and 2 merged to begin with, then 4 and 5 merged.

Then 3 now merged with 4 and 5.

So we wind up with a tree that looks like this.

So let's check our answer.

And we see 1 and 2 merged, 4 and 5 merged, and 3 merged with 4 and 5.

And so we get a different tree structure here than we had with single-link.
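The complete-link version differs from the single-link sketch in exactly one spot: the score for a pair of clusters uses the least similar cross-cluster points instead of the most similar ones. As before, the I1-to-I2 similarity of 0.90 is an assumed placeholder, since the lecture only says that pair is the largest:

```python
# Complete-link version of the same walkthrough. Indices 0-4 stand for
# I1-I5; sim(I1, I2) = 0.90 is an ASSUMED value.
S = [
    [1.00, 0.90, 0.10, 0.65, 0.20],
    [0.90, 1.00, 0.70, 0.60, 0.50],
    [0.10, 0.70, 1.00, 0.40, 0.30],
    [0.65, 0.60, 0.40, 1.00, 0.80],
    [0.20, 0.50, 0.30, 0.80, 1.00],
]

def complete_link(sim):
    """Merge the pair of clusters whose LEAST similar cross-cluster
    points are still the most similar among all cluster pairs."""
    clusters = [frozenset([i]) for i in range(len(sim))]
    merges = []
    while len(clusters) > 1:
        a, b = max(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            # min = most dissimilar cross-cluster pair; we still merge
            # the cluster pair with the highest such score
            key=lambda p: min(sim[u][v] for u in p[0] for v in p[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(tuple(sorted(a | b)))
    return merges

print(complete_link(S))
# [(0, 1), (3, 4), (2, 3, 4), (0, 1, 2, 3, 4)]
```

The third merge now joins I3 with I4, I5 rather than with I1, I2, reproducing the different tree the lecture arrives at.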

And so, compared to single-link,

remember in single-link we had sort of this nested structure of points.

So remember we had 5 and 2 were next to each other, then 3 and 6, and

we sort of had these circles going out from there.

In complete-link, since we're taking the farthest distance apart,

we get this separation between the points like this.

Now, the strength of complete-link shows up on this noisy example like we

had before.

You get more balanced clusters with equal diameter.

So we get sort of the blue red split here.

If you look at the data, we sort of have red, and we have blue.

So it does pretty well at ignoring noise.

It's less susceptible to it.

But since they have this sort of equal diameter,

it's not very good with this sort of small and large clusters.

It tends to break up large clusters.

So what we're seeing here, is this is our original points.

Complete-link winds up making a cluster like this and a cluster like this.

The clusters tend to have the same diameter, so

small clusters get merged with large ones.

So again, depending on how we think our underlying data structure might be,

we may not want to choose complete-link.

This dataset would've been much better for perhaps single-link, but

this one would have been much better for complete-link.

And so, again, there's limitations to all of these and

we have to be aware of what these are.

And this may require us to do multiple sorts of algorithms on the same data to

begin exploring and understanding exactly what's going on there.

And we don't just have to stick to single-link and complete-link.

There's group average distance.

So the average distance between clusters Ci and

Cj is the average distance between all pairs of objects, one from each cluster.

We can again take that same data structure here.

And so for average link,

the proximity of two clusters is the average of the pairwise proximities.

So this is our similarity matrix again.

So what do we do?

We merge the elements that are the most similar.

So what do we merge?

Well, 1 and 2, because they're the most similar again.

It doesn't matter if you're doing average, single, complete or whatever,

the first merge winds up being the same here, because we're stuck with that

original data matrix, that original similarity or distance matrix.

We have I1, I2, then I3, I4, and I5

along the rows, and the same I1, I2,

I3, I4, and I5 along the columns.

And we're not going to walk through this whole example this time, but

let's go through this first check of how we get

the distance from I1, I2 to I3.

So now it's going to be the similarity from I1 to I3 and I2 to I3.

So I1 to I3 was 0.1, I2 to I3 was 0.7.

So average means we're going to average these two.

So we're going to take 0.1 + 0.7, which is 0.8,

divided by 2, so we get 0.4.

Okay, and let's try this again for I1, I2 to I4 just so we get the hang of this.

So we've got I1 to I4 + I2 to I4.

Since we have only 2 pairs, we divide by 2. So I1 to I4 is 0.65,

I2 to I4 is 0.6, so we'd have 0.65 plus 0.6 divided by 2, and so forth.

And so this becomes 1.25 divided

by 2, which is 0.625.
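The group-average arithmetic above is simple enough to check directly in code; here are the two averages just worked out for the cluster I1, I2:

```python
# Group-average link: the score for a cluster pair is the MEAN of all
# cross-cluster pairwise similarities. These values come straight from
# the lecture's similarity matrix.
s13, s23 = 0.1, 0.7    # similarities I1-I3 and I2-I3
s14, s24 = 0.65, 0.6   # similarities I1-I4 and I2-I4

sim_12_to_3 = (s13 + s23) / 2   # average over the 2 cross-cluster pairs
sim_12_to_4 = (s14 + s24) / 2
print(round(sim_12_to_3, 3))  # 0.4
print(round(sim_12_to_4, 3))  # 0.625
```

With larger clusters the only change is that the sum runs over every cross-cluster pair and the divisor is the number of such pairs.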

And let's take a look at what this tree looks like,

if we were to continue this out.

If we're to continue this out, in this case,

we wind up with the same result as when we did the single link clustering.

Now don't be fooled though, this won't always occur depending on the data set.

This is a relatively toy example.

And so we get a little bit of nested clusters.

But in that data example that we had for single-link and complete-link,

we're now seeing something quite a bit different.

And it is a compromise between single and complete links, so

it's less susceptible to noise and outliers, but

it's still biased towards these sorts of globular clusters.

And so it's a nice compromise, again,

another method that we can use, but there's a whole lot of choices,

we can also use the centroid distance between clusters.

So, for example, if we have multiple points, finding

the centroid of those points and then taking the distance between the centroids.

We can use Ward's distance between clusters,

where the distance is the increase in the total within-cluster

sum of squares that results from merging two clusters.

So we've talked about single link, complete link, average clustering,

centroids, ward's distance, so

five different ones right there off the top of our heads.

And again, there's pros and cons for these.

With Ward's distance, it's going to be similar to the group average and

centroid distance.

It's less susceptible to noise.

It's biased towards globular clusters.

But what's nice is, it's also a hierarchical analogue of k-means.

And so we can even use this Ward's distance for

clusters to initialize our k-means seeds, for example.

And so here, let's just take a look at a comparison between

what these would look like for another toy dataset.

So with single link clustering,

also called min link, we get this sort of nested structure that we're seeing here.

With max, we sort of get these splits.

With group average, we get some splits but also a nested structure.

And with ward, again we get a sort of combination between those.

So again, all of these have pros, cons, and limitations.

And the real trick here is to think about

what ones we want to try between these different methods, how do they compare,

what all do they tell us about the underlying data.

And by doing this, this builds this hierarchical structure

which then we can think about how to visualize and how to explore.

There's also computational requirements for this, and

again like we said it becomes trickier as the dataset gets large.

For a dataset consisting of n points, it requires big O(n squared) space,

because we have to store that distance matrix.

And then for most cases,

the time complexity is big O(n cubed), because there's n steps, and

at each step the size n squared distance matrix has to be updated and searched.

So you're searching this n squared space n times, so for

most of the time you get n cubed.

People have worked to reduce the complexity to big O(n squared

log(n)) for some approaches by using appropriate data structures.

But you can see this is a relatively expensive computation to do.

So while it's interesting and can provide us information,

it may not be appropriate for all sorts of data.

Likewise, distinct clusters are not produced.

So methods for producing distinct clusters involve

specifying a somewhat arbitrary cutoff value.
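That cutoff idea can be sketched by replaying the merge steps from the single-link walkthrough and stopping once the merge similarity drops below a chosen threshold. The merge similarities below are the ones derived earlier (0.9 for the I1-I2 merge is again an assumed value):

```python
# "Cut" the single-link dendrogram at a similarity threshold: apply the
# recorded merges in order and stop when a merge's similarity falls
# below the cutoff. Each entry is (merge similarity, resulting cluster).
merges = [(0.9, {1, 2}), (0.8, {4, 5}), (0.7, {1, 2, 3}), (0.65, {1, 2, 3, 4, 5})]

def cut(merges, n, threshold):
    clusters = [{i} for i in range(1, n + 1)]
    for sim, merged in merges:
        if sim < threshold:
            break
        # replace the sub-clusters that formed this merge with the merged set
        clusters = [c for c in clusters if not c <= merged] + [merged]
    return sorted(clusters, key=min)

print(cut(merges, 5, 0.75))  # [{1, 2}, {3}, {4, 5}]
```

Note how arbitrary the choice is: cutting at 0.75 yields three clusters, while cutting at 0.5 collapses everything into one, even though the dendrogram itself hasn't changed.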

And we don't always know if our data structure has a hierarchical structure

embedded in there, so

maybe hierarchical clustering is not always appropriate.

So again, part of the goal across all of these modules for data exploration and

visualization is to provide you a whole bunch of different tools for

your data detective tool belt.

But just like in fixing things in your house, not every tool is appropriate.

We always say if the only tool you have is a hammer, everything's a nail.