An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

Loading...

来自 Johns Hopkins University 的课程

Statistics for Genomic Data Science

123 个评分

An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

从本节课中

Module 1

This course is structured to hit the key conceptual ideas of normalization, exploratory analysis, linear modeling, testing, and multiple testing that arise over and over in genomic studies.

- Jeff Leek, PhDAssociate Professor, Biostatistics

Bloomberg School of Public Health

One of the most widely used and also one of the most widely abused techniques for

exploratory data analysis is exploratory data analysis with clustering.

So the idea here behind clustering is, can we identify data points that

are close to each other and then cluster those together into groups somehow?

And so, the first thing that you need to know about this is that it is

incredibly popular.

And so for example, this is a very highly cited paper that

discusses how do you apply hierarchical clustering to a set of data.

Before we can get into clustering though,

we need to be able to define how close two things are to each other.

And so as a simple example,

let's say we wanted to know the distance between Baltimore and Washington, DC.

So if we wanted to do that we could take the longitude measurements for

Washington DC and Baltimore and denote them by Y.

And the latitude measurements and denote them by X.

Then we could take the distance in longitude and

just by taking their difference and the distance in latitude.

And now we have two measures of the distance between Baltimore and DC,

if we wanted to combine them together,

one thing we could do is just add those two things up.

But it turns out sometimes this distance will be negative on the latitude and

positive on the longitude and they'll cancel each other out.

So you can square them.

That means that they'll always be positive distances but

now they're not necessarily on the same scale.

The scale that we care about.

And so you can take the square root to get something like the distance between

Washington DC and Baltimore that you might care of.

This is called the Euclidean distance.

And so Euclidean distance can be generalized even if you have lots and

lots and lots of genes.

You can take every single gene, take the difference between the two samples for

that gene, square them, sum them up, and take the square root.

And you'll get the Euclidean distance.

That's a way to measure distance between points.

Another way, especially when you're dealing with binary data,

is to look at something like the Manhattan or taxicab distance.

So the best way to think of this intuitively is imagine that you have two

places in a city.

You have this building here, and this building here, and

you want to get between them, you have to drive along the blocks.

So what's the distance between them?

Well we can measure the total number of blocks that you have to go

in the east west direction and the total number of blocks that you have to go

in the north south direction and that gives you the distance.

So you can do that by taking the difference in their

east west locations and taking the absolute values and

the difference in their north south locations and taking the absolute value.

One interesting thing about this distance metric is because everything is at a right

angle, as long as you follow any distance along blocks, the blue, red,

and yellow distances will all give you the same distance between the two points.

So now that we have a distance to find,

there's a couple of different ways that we can try to cluster points together.

So the first way is what was called hierarchical clustering.

And the basic idea is you start with the two nearest points, merge them together.

Then find the next two nearest points, merge them together and so forth.

So here's a really simple example.

Here I plotted some points with an X observation and a Y observation and

I want to look and see what are the clusters.

So you can kind of see from looking at the data right off that there's a cluster

down here in this corner, a cluster in this corner and maybe a cluster up here.

So the first thing that you do when doing hierarchical cluster is you find the two

points that are closest together and connect them.

In this case, it's points five and six.

So when we draw a line between five and

six representing the distance between those two points.

The next things that we need to do is find the distance, the next nearest distance.

But now it's a little bit tricky because points five and

six have sort of been merged together here.

So there are different ways that you can merge them together,

but one common way is to just take the average.

So you take the average y value and the average x value and

you get a new data point.

So now when I'm measuring the distance between seven and the cluster of five and

six, I measure the distance between seven and this center point.

So if I do that, it turns out that the next two nearest points are points 10 and

11, which are also very close together and so I draw a connection between them.

And then I continue going along doing this, and

if I find a point that say I want to connect the points 5 and 6 and 10 and

11, then I would draw a connection between these two groups of points.

So if I do this, I get what's called a Cluster Dendogram.

And so again, you can see, remember there were these three clusters we thought we

saw and it turns out they appear to be here in the dendogram.

And you apparently see them in the dendogram and the point eight was kind of

an outlier in the plot and you can see it's kind of an outlier here as well.

So a couple of things to keep in mind about this dendrogram.

One is the distance between two points

is defined by the distance along the line from one to the other.

So you can see that the distance between two and

three is closer than say from two to four.

But it's a little bit hard to read because you have to sort of follow the line

all around to get the distance.

Another reason why that makes it a little bit hard to read is because here we have

this dendrogram that looks like this.

So we have a dendrogram that has three components to it,

you can imagine they are these three clusters.

And one thing that you could do is if you label these one, two, and three, you

could just rotate around the axis and end up with a dendrogram that looks like this.

If you have three here, two here, and one there, it turns out, so

if I flip basically this one and

this one around this axis here, it turns out that distances don't change, and so

the dendrogram would mean the exact same thing.

So this is a choice that the programming language R makes to visualize

the data set.

But it turns out that, say for example this cluster is not necessarily

closer to the middle cluster than this cluster is to the far cluster over here.

It depends on the distance along the dendrogram.

All right, so that's hierarchical clustering.

Another way that you could do it is with what's called k-means clustering.

So in this setting, we again have this same data set, but

imagine we know in advance or we guess that there might be three clusters.

So one thing that we might want to do is say, okay,

where are the centers of those clusters?

And then we can assign points to the closest center.

So one thing you could do, you could just start off by guessing centers,

this isn't a very good guess in this particular case, but

in higher dimensional genomic data, often the best you can do is a guess.

So once you have those centers,

what you can do is you can just assign all the points to the closest centers.

So I calculate the distance, say from point 12 to each of the centers,

and it turns out that 12 is the closest to the purple center, so

I assign it to that cluster.

So eight and four are the closest to the red centers, so

I assign it to that cluster and one, two and

three are closest to the orange cluster so I assign them to that center.

Now that I've assigned them to these new centers,

I can recalculate where the centers should be, and so

it turns out the center between eight and four is right here.

And the center for these points is right here and the center for

these points is right there.

Then I can read you the same analysis again, I can basically recalculate.

Okay, now that the centers are here, here, and here,

what points are closest to which centers?

And so it turns out, if you keep doing this over and over again,

you will eventually end up with clusters that look something like this.

So you end up with a center over here in this cluster,

a center over here in this cluster, and the center over here in this cluster.

So what are we doing?

We're iteratively going between identifying where the cluster center is,

then assigning points to that cluster center,

then re-identifying where the cluster center is.

So this is called k-means clustering, it's another way to cluster the genes together.

And we'll see some examples of it in another lecture.

But, for right now, the thing to keep in mind is that this is a little bit

different than hierarchical clustering because in advance you have to define

the number of clusters that you want.

So, one thing to keep in mind with any kind of clustering technique is that you

need to be careful when exploring multivariate relationships.

So, the scale of the data matters a lot.

You might get different clusterings if you do different scalings of the data.

If you have different outliers, you can definitely drive a lot of the clustering.

For k-means, again, I showed a random start for the way the algorithm

would work, so the starting values can actually give you different clusters.

Finally, in the examples I showed, it was really easy to define the number of

clusters, but in general, this isn't a trivial problem to solve.

And it's better in general to visualize the data as much as you can.

Just like with almost every technique that we're using,

it's better to visualize them and check and see if they're useful.

One thing to keep in mind is that clusters that you identify in any kind of data

can be easily overinterpreted.

If the clusters are highly overlapping, they don't look like they're necessarily

very clear groups, you can really overinterpret them.

So you have to be very careful about looking at the clusters,

making sure that the relationships make sense, and not overinterpreting them.

Here's some resources that you might find useful if you thought

that this incredibly brief introduction to clustering wasn't enough.

I particularly recommend Hector Corrada Bravo's lecture notes.

And Rafa's Distances and Clustering video which are very good.

And if you want a more in-depth look, the Elements of Statistical Learning book has

a large section on clustering that might be useful.