An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Statistics for Genomic Data Science

From this lesson

Module 1

This course is structured to hit the key conceptual ideas of normalization, exploratory analysis, linear modeling, testing, and multiple testing that arise over and over in genomic studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

One of the most commonly used tools for exploratory analysis of genomic data is

clustering, so I'm going to talk a little bit about that.

So I'm going to set up my plotting parameters just like I do

in every lecture, and then I'm going to load in the packages we're going to need.

In this case the main one that we're going to be using

is the dendextend package, to make some of the dendrograms prettier.

So now I'm going to load the BodyMap data set.

I'm going to load it from a connection.

Once I've loaded the file, I'm going to define an easy-to-work-with

variable, and I'm going to extract the phenotype data,

the expression data, and the feature data.

So the first thing that I'm going to do, to make it a little bit easier to work

with, is filter out, just like we talked about in the transforms

lecture, the genes whose means are below some value.

So I'm going to say, I'm going to require the mean count to be greater than 5000.

And so if I do that, I get a data set with about 1,100 or so,

or 1,200 or so genes, and then I'm going to take the log2 transform.

And again, I'm going to add 1 so that I don't get those undefined values.
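As a rough sketch of that filtering and transform step: the simulated count matrix below stands in for the real expression data, and the names `edata`, `edata_filt`, and `edata_log` are assumptions made for illustration, not necessarily the ones used in the lecture.

```r
# Simulated count matrix (hypothetical stand-in for the expression data):
# 200 genes x 19 samples, half lowly and half highly expressed.
set.seed(1)
mu <- rep(c(500, 8000), each = 100)   # recycled down the rows (genes)
edata <- matrix(rnbinom(200 * 19, size = 5, mu = mu), nrow = 200,
                dimnames = list(paste0("gene", 1:200),
                                paste0("sample", 1:19)))

# Keep only the genes whose mean count across samples is above 5000.
edata_filt <- edata[rowMeans(edata) > 5000, , drop = FALSE]

# log2 transform; adding 1 avoids log2(0) = -Inf for zero counts.
edata_log <- log2(edata_filt + 1)
dim(edata_log)
```

The threshold of 5000 is the one mentioned above; on real data you would pick it by looking at the distribution of the row means.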

And so the first thing that I can do is I might want to calculate

the distances between either samples or between genes.

And so if I calculate this distance with the dist function,

what it's going to do by default is

calculate a distance between rows.

So if I want the distance between samples, then I have to take

the transpose of that data set so that it basically puts the samples in the rows.

Then when I apply dist, it'll calculate the distance, in this case

the Euclidean distance, between each of the different samples, and so

then I've got this distance matrix:

the distance between every pair of samples in the data set.
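Here is a small sketch of the row-versus-column point, using simulated data. The matrix and the variable names are assumptions for illustration only.

```r
set.seed(1)
# Hypothetical log-scale expression matrix: 50 genes (rows) x 6 samples (columns).
edata <- matrix(rnorm(50 * 6, mean = 10), nrow = 50,
                dimnames = list(paste0("gene", 1:50), paste0("sample", 1:6)))

# dist() computes distances between rows, so transpose to get samples in rows.
dist_samples <- dist(t(edata))   # Euclidean distance is the default method
dist_genes   <- dist(edata)      # without the transpose: distances between genes

# A "dist" object stores only the lower triangle; as.matrix() expands it.
dist_mat <- as.matrix(dist_samples)
dim(dist_mat)

# Check one entry against the Euclidean formula written out by hand.
manual <- sqrt(sum((edata[, 1] - edata[, 2])^2))
all.equal(dist_mat[1, 2], manual)
```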

So once I have that, I can start doing some clustering on that,

so the first thing I'm going to do is I'm going to define a set of colors.

So again I'm going to define this color ramp, where I go from sort of pink to blue

through the intermediate of white and I'm going to define nine colors.

Then I can make a heat map, as I did in the exploratory lecture,

but I'm going to make a heat map here of the distances,

not of the actual data itself.

I'm going to make a heat map of the distances between the samples.

And so by taking the distances between the samples, I can see that, so for example

right here on the diagonal line, that's the distance between a sample and itself.

Obviously that distance is very low.

But you can see some samples are closer to each other, like this white

chunk here, whereas some samples are farther apart. Say, the 18th and

19th are pretty far apart, and you can see that by the blue color there.
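A minimal sketch of the color ramp and the distance heat map follows. The data are simulated, the exact color names are an assumption based on the "pink to blue through white" description, and `scale = "none"` is my own choice so that the distances are shown unscaled.

```r
set.seed(1)
# Hypothetical log-scale expression matrix: 100 genes x 19 samples.
edata <- matrix(rnorm(100 * 19, mean = 10), nrow = 100,
                dimnames = list(paste0("gene", 1:100),
                                paste0("sample", 1:19)))
dist_mat <- as.matrix(dist(t(edata)))   # sample-by-sample distances

# Nine colors running from pink through white to blue.
colramp <- colorRampPalette(c("pink", "white", "blue"))(9)

# Rowv = NA and Colv = NA switch off heatmap()'s own reordering, so the
# samples stay in their original order and the diagonal is sample-vs-itself.
heatmap(dist_mat, col = colramp, Rowv = NA, Colv = NA, scale = "none")
```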

And so, then what I can do is I can perform clustering on this,

so I use the hclust function to do hierarchical clustering like we discussed

in the lecture on clustering, and so I can just apply that to the distance matrix,

and I get hierarchical clustering which I can then plot.

And so this shows which samples are close to each other sort of according to

the distance metric that we calculated.

If I want to make it a little easier to read,

I find it helps to have all the labels at the same level, and

you can do that with hang = -1.

Now you can see all of the samples sort of lined up in the same level.
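The clustering and the two plots can be sketched like this. The data are simulated (with the last nine samples shifted so there is some structure to find), and the name `hclust1` is an assumption.

```r
set.seed(1)
# Hypothetical log-scale expression matrix: 100 genes x 19 samples.
edata <- matrix(rnorm(100 * 19, mean = 10), nrow = 100,
                dimnames = list(paste0("gene", 1:100),
                                paste0("sample", 1:19)))
edata[, 11:19] <- edata[, 11:19] + 2   # shift the last nine samples

# hclust() takes a "dist" object; complete linkage is its default method.
hclust1 <- hclust(dist(t(edata)))

plot(hclust1)              # labels hang just below where each leaf joins
plot(hclust1, hang = -1)   # all labels lined up at the same level
```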

The other thing that you might want to do is you might want to color the samples in

various different ways.

I'm going to talk a little bit about a couple of different ways you can color

these dendrograms.

To do that, I'm going to turn it into a dendrogram object,

the clustering we just did, and

then what we can do is we can color the labels in various different ways.

One way is to say, I just want to have four clusters.

And they're colored with colors one to four.

And then if I plot that dendrogram, you can see that it has now drawn the cutoff

basically at the place along the dendrogram that splits this into four clusters.

And so it turns out that the place that that happens is right here

across this line.

So you end up with, once you do that cut,

you end up with one cluster that's this purple cluster, and

then one cluster that's this cluster here, and then this pink cluster here.

And so, you can tell it the number of clusters to make.

You could, if you, instead, wanted to see three clusters,

you could just change the number of clusters to three,

the number of colors to three, and make a plot.

And so now, it has broken it up into three different clusters,

instead of four different clusters.
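A sketch of coloring by a number of clusters, assuming the dendextend package is installed. The data are simulated and the variable names are assumptions; `cutree` is added only to show the cluster membership that corresponds to the same cut.

```r
library(dendextend)

set.seed(1)
# Hypothetical log-scale expression matrix: 100 genes x 19 samples.
edata <- matrix(rnorm(100 * 19, mean = 10), nrow = 100,
                dimnames = list(paste0("gene", 1:100),
                                paste0("sample", 1:19)))
edata[, 11:19] <- edata[, 11:19] + 2

hclust1 <- hclust(dist(t(edata)))
dend <- as.dendrogram(hclust1)

# Color the labels by cutting the tree into k clusters.
dend4 <- color_labels(dend, k = 4, col = 1:4)
plot(dend4)

# Same tree, cut into three clusters instead.
dend3 <- color_labels(dend, k = 3, col = 1:3)
plot(dend3)

# Base R's cutree() gives the membership for the same number of clusters.
membership <- cutree(hclust1, k = 3)
table(membership)
```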

Okay, so another thing that you might want to do is instead of

labeling the clusters according to some cut off here,

you might actually have a previously defined set of labels for the samples.

So you might know that say, the first ten samples come from one

case, and the last nine come from another different case.

And so what you can do is you can define labels, directly.

So this says: make the first 10 one and the remaining nine two.

And then once you've done that, you can plot it and see, like here,

that it's colored by the two groups for you.

So you can either have it figure out the clusters to color by,

by giving a number of clusters to the color_labels command, or

you can use the labels_colors command to basically define which samples get what.
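Here is a sketch of coloring by known groups, again assuming dendextend is installed and using simulated data. One caveat, which is my understanding of dendextend rather than something stated in the lecture: the assignment form of `labels_colors` fills in colors in the left-to-right order of the leaves, so the group vector is reordered with `order.dendrogram` first.

```r
library(dendextend)

set.seed(1)
# Hypothetical log-scale expression matrix: 100 genes x 19 samples.
edata <- matrix(rnorm(100 * 19, mean = 10), nrow = 100,
                dimnames = list(paste0("gene", 1:100),
                                paste0("sample", 1:19)))
edata[, 11:19] <- edata[, 11:19] + 2

dend <- as.dendrogram(hclust(dist(t(edata))))

# Known labels: first ten samples are group 1, last nine are group 2.
groups <- c(rep(1, 10), rep(2, 9))

# Reorder the groups to match the leaf order before assigning the colors.
labels_colors(dend) <- groups[order.dendrogram(dend)]
plot(dend)
```

Without the reordering, the first ten leaves as drawn would get color one regardless of which samples they are.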

So that's a little bit about hierarchical clustering.

So now on to k-means clustering,

which was the other clustering technique we talked about.

So we can do k-means clustering by using the kmeans function.

So here I'm going to apply it to our filtered data set.

I'm going to tell it, I want three clusters.

So if I do that, it returns an object that tells me which cluster each gene belongs to, what

the centers of those clusters are, some information about how well the clusters

fit, and then some information about the model fitting process.

And so the first thing I might want to do is see what the cluster

centers look like.

And so what I can do is use matplot, which is basically going to

plot each of the columns of a matrix as its own line.

And since I want the columns of the dataset to be plotted,

I am going to take the transpose of that centers matrix and

I'll plot out with the colors one to three since I told it there were three centers.

And what I see is, here are the three different cluster centers.

So you can see that there's one cluster where the values

for a couple of samples in the middle are much higher than for

the other samples.

So that's the cluster centers.

There were three different cluster centers and so each gene is compared to each

of these three cluster centers, and it is assigned to the one that's closest to it.
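A sketch of fitting k-means on the genes and plotting the centers. The data are simulated with three built-in patterns, and `edata` and `kmeans1` are assumed names.

```r
set.seed(1)
# Hypothetical log-scale expression matrix: 300 genes x 19 samples.
edata <- matrix(rnorm(300 * 19, mean = 10), nrow = 300,
                dimnames = list(paste0("gene", 1:300),
                                paste0("sample", 1:19)))
edata[1:100, 8:11] <- edata[1:100, 8:11] + 4   # high in a few middle samples
edata[101:200, ]   <- edata[101:200, ] - 3     # lower across all samples

# Rows (genes) are the observations being clustered.
kmeans1 <- kmeans(edata, centers = 3)
names(kmeans1)          # cluster, centers, sums of squares, size, iter, ...

# centers has one row per cluster and one column per sample.
dim(kmeans1$centers)

# matplot() draws one line per column, so transpose to get one line per center.
matplot(t(kmeans1$centers), col = 1:3, type = "l", lwd = 3)
```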

So the other thing I can do is look at

how many genes belong to each cluster. If I look at the kmeans1$cluster

variable, it basically shows which cluster each gene belongs to.

So if I make a table of that variable, I can see that 396 of

the genes belong to this sort of third cluster.

739 belong to the second cluster and so forth.
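Counting cluster membership can be sketched as below. The data are simulated, so the counts will not match the numbers quoted above, and the variable names are assumptions.

```r
set.seed(1)
# Hypothetical log-scale expression matrix: 300 genes x 19 samples.
edata <- matrix(rnorm(300 * 19, mean = 10), nrow = 300,
                dimnames = list(paste0("gene", 1:300),
                                paste0("sample", 1:19)))
edata[1:100, 8:11] <- edata[1:100, 8:11] + 4
edata[101:200, ]   <- edata[101:200, ] - 3

kmeans1 <- kmeans(edata, centers = 3)

# One cluster label per gene, named by gene.
head(kmeans1$cluster)

# How many genes fall into each cluster.
counts <- table(kmeans1$cluster)
counts
```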

So the other thing that I can do is I can reorder,

if I'm going to make a plot, I can reorder the data according to cluster membership.

So what this command here does is

take the edata matrix and

reorder the rows according to cluster membership.

And now I have this new data matrix, and I can do a heat map of that new data matrix.

And it will be a clustered version of this, where I've clustered it with

k-means clustering as opposed to hierarchical clustering.

So now I can see that genes that are similar to each other have sort of been

clustered together.
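A sketch of the reordering step and the resulting heat map, on simulated data. The names `edata`, `kmeans1`, `newdata`, and `colramp` are assumptions for illustration.

```r
set.seed(1)
# Hypothetical log-scale expression matrix: 300 genes x 19 samples.
edata <- matrix(rnorm(300 * 19, mean = 10), nrow = 300,
                dimnames = list(paste0("gene", 1:300),
                                paste0("sample", 1:19)))
edata[1:100, 8:11] <- edata[1:100, 8:11] + 4
edata[101:200, ]   <- edata[101:200, ] - 3

kmeans1 <- kmeans(edata, centers = 3)

# order() on the labels puts all cluster-1 genes first, then 2, then 3.
gene_order <- order(kmeans1$cluster)
newdata <- edata[gene_order, ]

colramp <- colorRampPalette(c("pink", "white", "blue"))(9)

# Turn off heatmap()'s own dendrogram reordering so the k-means order is kept.
heatmap(newdata, col = colramp, Rowv = NA, Colv = NA)
```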

So the other thing to keep in mind when you're doing this is it's

not actually an algorithm that's deterministic.

It begins with a random start.

And so if I recalculate those k-means clusters and

I make a table of the resultant cluster membership for the two cases,

you can see that in this case the genes that get called cluster two in one run are actually

called cluster one in the other run, the ones called one are called two, and so forth.

So in general, they don't always have to be the same.

You can look at the kmeans documentation to see that, if you want to,

you can set the maximum number of iterations that you'd like it to use, but

more importantly, the number of starts that you'd like it to do,

that is, the number of random sets of cluster centers you'd like it to start with.

If you make that number high enough, then you'll get much

more similar results between multiple runs of the k-means clustering.
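To make the random-start point concrete, here is a sketch that compares two runs and then uses the `nstart` and `iter.max` arguments of `kmeans`. The data are simulated, and the values 25 and 50 are arbitrary choices for illustration, not recommendations from the lecture.

```r
set.seed(1)
# Hypothetical log-scale expression matrix: 300 genes x 19 samples.
edata <- matrix(rnorm(300 * 19, mean = 10), nrow = 300,
                dimnames = list(paste0("gene", 1:300),
                                paste0("sample", 1:19)))
edata[1:100, 8:11] <- edata[1:100, 8:11] + 4
edata[101:200, ]   <- edata[101:200, ] - 3

# Two runs with a single random start each.
kmeans1 <- kmeans(edata, centers = 3)
kmeans2 <- kmeans(edata, centers = 3)

# Cross-tabulate the two labelings. The cluster numbers are arbitrary, so
# the same group of genes can show up under different labels in each run.
cross <- table(kmeans1$cluster, kmeans2$cluster)
cross

# nstart tries several random sets of starting centers and keeps the best
# fit; iter.max raises the cap on iterations per start.
kmeans3 <- kmeans(edata, centers = 3, nstart = 25, iter.max = 50)
kmeans3$tot.withinss
```

Calling `set.seed` before `kmeans` is another way to make a particular run reproducible, although it does not make the algorithm any less dependent on where it starts.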

So that's a little bit about clustering in R.