0:00

In this lecture, I will show you how to make a clustergram in MATLAB.

Hierarchical clustering is another way to visualize high-dimensional data. It clusters observations by distance and builds a hierarchical structure on top of that.

It gives more detailed information about the differences among clusters.

For example, it can tell you which genes contributed the most to the difference between two clusters.

Here is an example of a hierarchical clustergram.

It is made up of a heat map in the middle, dendrograms on the left and top, and row and column labels on the right and at the bottom.

There is also a scale bar on the left.

This is the same data set that I used for the PCA plot.

Each column is the gene expression profile of one tumor cell, and each row is a gene.

The color indicates the relative expression value: red indicates high expression and blue indicates low expression.

Looking at the column labels, we find that gene expression profiles of the same subtype are nicely clustered together.

And there are three red clusters in the heat map, corresponding to the three subtypes.

Recall that the colors indicate expression values. We can say that this bunch of genes at the top is highly expressed in cluster one, which is subtype three.

These genes in the middle are highly expressed in subtype two.

And these genes at the bottom are highly expressed in cluster three, which is subtype one.

Here is an example of a clustergram simulated from random numbers.

In this clustergram, no distinct clusters can be observed.

The red and blue colors are just mixed together.

And the column labels of the three subtypes are, as expected, also mixed. You cannot find any order in it.

I always want to present a random figure, because the tumor gene expression data we used is quite good.

You can see clear patterns in it, but many data sets will be noisy and fall somewhere between the nice tumor cell data and the simulated random data.

Though the clustergram may look amazing and complex at first sight, its mechanism is quite simple.

In this and the next few slides, I will explain how it works.

Suppose that we now have six gene expression profiles, a to f.

On the left are their representations in a two-dimensional PCA figure.

The question is, how would we like to cluster them?

Well, by eye, you may want to cluster b and c together, cluster d, e and f together, and leave a alone, but this is quite arbitrary.

So is there a way to rationally and computationally cluster these data points? Hierarchical clustering offers the solution.

4:19

Then how many clusters we get depends on the level at which we set the cutoff.

If we set the cutoff here, we will get only two clusters: cluster a and cluster bcdef.

If we set the cutoff here, we get three clusters: cluster a, cluster bc and cluster def.

And if we set the cutoff at the lowest level, we will have our original six data points.
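To make the cutoff idea concrete, here is a minimal sketch in Python (used here only for illustration, not the MATLAB tool from the demo). It assumes the usual bottom-up procedure: start with every point as its own cluster and repeatedly merge the two closest clusters. The coordinates for a to f are made up, and average linkage is an assumption of the sketch. Stopping when a given number of clusters remain plays the role of cutting the tree at a given level.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D coordinates for the six profiles a to f (not from the lecture).
POINTS = {
    "a": (3.0, 9.0),
    "b": (1.0, 1.0),
    "c": (2.0, 1.0),
    "d": (5.0, 1.0),
    "e": (5.5, 1.5),
    "f": (6.5, 1.0),
}


def average_linkage(c1, c2, points):
    """Mean Euclidean distance over all pairs drawn from the two clusters."""
    pairs = [(p, q) for p in c1 for q in c2]
    return sum(dist(points[p], points[q]) for p, q in pairs) / len(pairs)


def cut_into(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[name] for name in sorted(points)]
    while len(clusters) > n_clusters:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_linkage(clusters[ij[0]], clusters[ij[1]], points),
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


for k in (2, 3, 6):
    print(k, cut_into(POINTS, k))
```

With these made-up coordinates, asking for two clusters gives a and bcdef, asking for three gives a, bc and def, and asking for six gives back the individual points.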

The dendrogram we saw in the clustergram is just a compact representation of this hierarchical tree-like structure, turned upside down.

That is the main idea of hierarchical clustering.

Here are some additional things you may want to consider when making a clustergram.

The first topic is the metric.

The metric defines how to measure the distance between two gene expression profiles.

The most common metric is the Euclidean distance.

Each gene expression profile is a vector of values.

And the Euclidean distance is calculated by the formula below.

5:22

I think most of you are familiar with this formula.
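For reference, here is the standard definition written out. The notation is an assumption of this note: x and y are two gene expression profiles, each a vector of n values.

```latex
d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} \left( x_i - y_i \right)^2}
```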

Besides Euclidean distance, you can choose cosine distance, correlation distance, Hamming distance and so on.

But most of the time Euclidean distance will do the job.

One special case is when your data set is binary. Then you may want to use Hamming distance as your metric, because it is specially designed for binary data.
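As a rough illustration of how these metrics differ, here is a small Python sketch (illustrative only, not the MATLAB implementation). The function names are made up for this example, and Hamming distance is computed here as the fraction of positions that differ, which is one common convention.

```python
from math import sqrt


def euclidean(x, y):
    """Straight-line distance between two vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_distance(x, y):
    """One minus the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return 1 - dot / (norm_x * norm_y)


def hamming(x, y):
    """Fraction of positions at which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


print(euclidean((0, 0), (3, 4)))             # straight-line distance
print(cosine_distance((1, 0), (0, 1)))       # perpendicular vectors
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))   # two of four positions differ
```

Hamming distance only asks whether two values match, which is why it suits binary data, while Euclidean distance also uses how far apart the values are.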

Look at this picture again.

You can see that hierarchical clustering is performed twice, in both directions: column-wise and row-wise.

These two clusterings are independent of each other because the order of the components does not matter when you compute the distance between two vectors.

If this doesn't make sense to you, don't worry.

Just remember that the two clusterings are independent of each other.
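If you want to convince yourself, here is a tiny Python check (illustrative only, with made-up numbers). It reorders the components of two profiles in the same way, as row clustering would reorder the genes, and shows that the Euclidean distance between the profiles stays the same.

```python
from math import dist, isclose


def reorder(vector, order):
    """Return the vector with its components rearranged in the given order."""
    return [vector[i] for i in order]


# Two made-up expression profiles measured over the same four genes.
profile_1 = [2.0, 5.0, 1.0, 7.0]
profile_2 = [3.0, 1.0, 4.0, 6.0]

# A new gene order, such as the one row clustering might produce.
new_order = [2, 0, 3, 1]

before = dist(profile_1, profile_2)
after = dist(reorder(profile_1, new_order), reorder(profile_2, new_order))

# The sum of squared differences has the same terms in a different order.
print(isclose(before, after))
```

So reordering the rows never changes which columns end up close to each other, and the same holds the other way around.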

The result is that similar expression profiles are clustered together, and genes that have similar expression across all profiles are also clustered together.

For example, genes consistently highly expressed in cluster two are clustered together, like here.

The second topic is the linkage function.

You need a linkage function when you want to calculate the distance between clusters.

Here is a simple example.

You want to calculate the distance between the cluster de and the data point f.

6:59

There are a few options. The most common method is called average.

In this method, we calculate the distance between d and f and the distance between e and f.

Then we use the average of the two distances as the distance between the de cluster and f.

For the median method, we use the median of the distances.

For single, we use the shorter of the two distances, and for complete, we use the longer of the two. Here's one more example.

If you now want to calculate the distance between cluster bc and cluster de using the single method, you calculate the distances between b and d, c and d, b and e, and c and e, which gives you four distances.

You will find that the distance between c and d is the shortest, so you use this distance as the distance between these two clusters.
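The two examples above can be sketched in a few lines of Python (illustrative only). The coordinates are made up, chosen so that c and d are the closest pair between clusters bc and de, and the function names are hypothetical. Only the single, complete and average methods are shown.

```python
from math import dist

# Hypothetical 2-D coordinates for points b to f (not from the lecture).
POINTS = {
    "b": (1.0, 1.0),
    "c": (2.0, 1.0),
    "d": (5.0, 1.0),
    "e": (5.5, 1.5),
    "f": (6.5, 1.0),
}


def pairwise_distances(cluster_1, cluster_2):
    """All Euclidean distances between a member of one cluster and a member of the other."""
    return [dist(POINTS[p], POINTS[q]) for p in cluster_1 for q in cluster_2]


def single(cluster_1, cluster_2):
    """Shortest of the pairwise distances."""
    return min(pairwise_distances(cluster_1, cluster_2))


def complete(cluster_1, cluster_2):
    """Longest of the pairwise distances."""
    return max(pairwise_distances(cluster_1, cluster_2))


def average(cluster_1, cluster_2):
    """Mean of the pairwise distances."""
    distances = pairwise_distances(cluster_1, cluster_2)
    return sum(distances) / len(distances)


# Distance between cluster de and point f under each method.
for method in (single, complete, average):
    print(method.__name__, method(["d", "e"], ["f"]))

# Distance between clusters bc and de with the single method:
# four pairwise distances, of which c to d is the shortest.
print(single(["b", "c"], ["d", "e"]))
```

The choice of method only changes how the pairwise distances are summarized. The pairwise distances themselves still come from the metric.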

One more thing to consider is standardization.

Standardization converts data into standardized z-scores.

A z-score tells you how many standard deviations a value is away from the mean.

If a value equals the mean plus 2 standard deviations, its z-score will be 2.

Standardization is a normalization process that forces the values to fall into the range that is most suitable for visualization in a clustergram.

8:24

There are two options: row standardization and column standardization.

Row standardization calculates the z-scores for each row, and column standardization calculates the z-scores for each column.

For gene expression data, we generally use row standardization because we want to see, for each gene, how its expression values change across different conditions.
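Here is a minimal Python sketch of the two options (illustrative only, with made-up numbers). It assumes the sample standard deviation, with n - 1 in the denominator, and that no row or column is constant. Rows stand for genes and columns for conditions.

```python
from statistics import mean, stdev


def zscores(values):
    """Convert numbers to z-scores using the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumes the values are not all identical, so s > 0
    return [(v - m) / s for v in values]


def standardize_rows(matrix):
    """Z-score each row (each gene) separately."""
    return [zscores(row) for row in matrix]


def standardize_columns(matrix):
    """Z-score each column (each profile) separately."""
    columns = [list(col) for col in zip(*matrix)]
    standardized = [zscores(col) for col in columns]
    return [list(row) for row in zip(*standardized)]


# Two made-up genes with very different absolute expression levels.
expression = [
    [2.0, 4.0, 6.0],
    [100.0, 200.0, 300.0],
]

print(standardize_rows(expression))
print(standardize_columns(expression))
```

After row standardization, both genes show the same low, middle, high pattern across the three conditions, even though their raw values differ by a factor of fifty. That is the per-gene view we want in the heat map.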

Okay, now we will begin our clustergram demo in MATLAB.

11:47

This command, however, looks too long and is not easy to write.

Actually, many commonly used properties are already set by default. For example, the default metric is Euclidean and the default linkage is average.

So you can write the command in the shorter form below.

In this command, you do not need to specify RowPDist, ColumnPDist and Linkage, because Euclidean and average are already used by default.

So this command looks nicer and shorter, and it will do the same thing. I paste it here and run it, and we get the same figure.

12:30

After you get this clustergram, you can use this button to get a scale bar, this button to toggle the dendrogram, this button to zoom in, and this button to zoom out. Once you are in zoom-in mode, you can use this button to pan over the figure.

One nice thing about this clustergram is that you can select a subset of the clustergram and copy it to a new clustergram. Then you can examine this part of the clustergram in close detail.

Here I will teach you a trick to export the clustergram in vector format.

First, click Export Setup.

Change Rendering to painters (vector format) and click Export.