An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the course by Johns Hopkins University

Statistics for Genomic Data Science

From the lesson

Module 1

This course is structured to hit the key conceptual ideas of normalization, exploratory analysis, linear modeling, testing, and multiple testing that arise over and over in genomic studies.

- Jeff Leek, PhD, Associate Professor, Biostatistics

Bloomberg School of Public Health

One of the most widely used, and also one of the most widely abused, techniques for exploratory data analysis is clustering. The idea behind clustering is: can we identify data points that are close to each other and then group them together somehow?

The first thing you need to know about this is that it is incredibly popular. For example, this is a very highly cited paper that discusses how to apply hierarchical clustering to a set of data.

Before we can get into clustering, though, we need to be able to define how close two things are to each other. As a simple example, let's say we wanted to know the distance between Baltimore and Washington, DC. We could take the longitude measurements for Washington, DC and Baltimore and denote them by Y, and take the latitude measurements and denote them by X. Then we could get the distance in longitude just by taking their difference, and do the same for the distance in latitude.

Now we have two measures of the distance between Baltimore and DC. If we wanted to combine them, one thing we could do is just add those two things up. But sometimes the difference will be negative on the latitude and positive on the longitude, and they'll cancel each other out. So you can square them. That means they'll always be positive, but now they're not on the scale that we care about. So you can take the square root of the sum to get something like the distance between Washington, DC and Baltimore that you might care about. This is called the Euclidean distance.

Euclidean distance generalizes even if you have lots and lots of genes. You can take every single gene, take the difference between the two samples for that gene, square those differences, sum them up, and take the square root, and you'll get the Euclidean distance. That's one way to measure distance between points.
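
The recipe just described (difference, square, sum, square root) can be written down in a few lines. This is a minimal Python sketch rather than anything shown in the lecture; the function name `euclidean_distance` and the numbers are made up for illustration.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences, coordinate by coordinate."""
    if len(a) != len(b):
        raise ValueError("both points need the same number of coordinates")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two made-up points in the plane (the sides of a 3-4-5 right triangle).
point_a = (0.0, 0.0)
point_b = (3.0, 4.0)
print(euclidean_distance(point_a, point_b))  # 5.0

# The same function works for two samples measured on several genes
# (hypothetical expression values, one entry per gene).
sample_1 = [2.0, 5.0, 1.0, 7.0]
sample_2 = [4.0, 5.0, 3.0, 6.0]
print(euclidean_distance(sample_1, sample_2))  # 3.0
```

Nothing in the function depends on the number of coordinates, which is why the same idea carries over from two map coordinates to thousands of genes.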

Another way, especially when you're dealing with binary data, is to look at something like the Manhattan or taxicab distance. The best way to think of this intuitively is to imagine that you have two places in a city. You have this building here and this building here, and to get between them you have to drive along the blocks. So what's the distance between them? We can measure the total number of blocks you have to go in the east-west direction and the total number of blocks you have to go in the north-south direction, and adding them gives you the distance. You can do that by taking the difference in their east-west locations and taking the absolute value, taking the difference in their north-south locations and taking the absolute value, and summing the two.

One interesting thing about this distance metric is that, because everything is at a right angle, as long as you follow a path along the blocks without doubling back, the blue, red, and yellow paths will all give you the same distance between the two points.
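
The block-counting idea can be sketched the same way. The Python below is illustrative only; the grid coordinates, the three routes, and the helper names `manhattan_distance` and `path_length` are assumptions, not taken from the lecture slides.

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (taxicab distance)."""
    if len(a) != len(b):
        raise ValueError("both points need the same number of coordinates")
    return sum(abs(x - y) for x, y in zip(a, b))

def path_length(path):
    """Total blocks driven along a route given as a list of street corners."""
    return sum(manhattan_distance(p, q) for p, q in zip(path, path[1:]))

# Two made-up buildings on a city grid: 3 blocks east and 4 blocks north apart.
building_a = (1, 2)
building_b = (4, 6)
print(manhattan_distance(building_a, building_b))  # 7

# Three routes that never double back all cover the same number of blocks.
east_then_north = [(1, 2), (4, 2), (4, 6)]
north_then_east = [(1, 2), (1, 6), (4, 6)]
staircase = [(1, 2), (2, 2), (2, 4), (4, 4), (4, 6)]
print([path_length(p) for p in (east_then_north, north_then_east, staircase)])  # [7, 7, 7]

# For binary (0/1) data the distance is simply the number of mismatches.
print(manhattan_distance([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))  # 2
```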

So now that we have a distance defined, there are a couple of different ways that we can try to cluster points together. The first is what's called hierarchical clustering. The basic idea is that you start with the two nearest points and merge them together, then find the next two nearest points, merge them together, and so forth.

Here's a really simple example. I've plotted some points with an X observation and a Y observation, and I want to see what the clusters are. You can kind of see from looking at the data that there's a cluster down here in this corner, a cluster in this corner, and maybe a cluster up here.

The first thing you do in hierarchical clustering is find the two points that are closest together and connect them. In this case, it's points 5 and 6. So we draw a line between 5 and 6 representing the distance between those two points. The next thing we need to do is find the next smallest distance. But now it's a little bit tricky, because points 5 and 6 have been merged together. There are different ways to represent the merged points, but one common way is to just take the average. You take the average Y value and the average X value and you get a new data point. So now when I'm measuring the distance between point 7 and the cluster of 5 and 6, I measure the distance between 7 and this center point.

If I do that, it turns out that the next two nearest points are points 10 and 11, which are also very close together, so I draw a connection between them. Then I continue doing this, and if at some step the closest pair is two groups, say the group of points 5 and 6 and the group of points 10 and 11, then I draw a connection between those two groups of points.

If I do this, I get what's called a cluster dendrogram. Remember there were three clusters we thought we saw, and they appear to be here in the dendrogram as well. Point 8 was kind of an outlier in the plot, and you can see it's kind of an outlier here too.
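
To make the merge-the-nearest-pair procedure concrete, here is a small Python sketch that follows the averaging rule described above: each merged group is represented by the average of its points. The function names and the five toy points are assumptions for illustration; this is not the code behind the plots in the lecture, and statistical software typically offers several other merging rules.

```python
import math

def centroid(points):
    """Average each coordinate across a group of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def hierarchical_merges(points):
    """Repeatedly merge the two closest groups; return (group_a, group_b, distance) per merge."""
    clusters = [[i] for i in range(len(points))]  # start with every point on its own
    merges = []
    while len(clusters) > 1:
        best = None
        # Compare every pair of current groups by the distance between their averages.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci = centroid([points[k] for k in clusters[i]])
                cj = centroid([points[k] for k in clusters[j]])
                d = math.dist(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Made-up data: two tight pairs and one far-away point (index 4).
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 5.5), (10.0, 0.0)]
for group_a, group_b, d in hierarchical_merges(toy_points):
    print(group_a, "+", group_b, "at distance", round(d, 3))
```

The list of merges, read top to bottom, is the information a dendrogram draws: the tight pairs join first at small distances, and the far-away point joins last. The sketch recomputes every average on every pass, which is fine for a handful of points but far too slow for real data.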

A couple of things to keep in mind about this dendrogram. One is that the distance between two points is defined by the distance along the lines from one to the other. So you can see that the distance between 2 and 3 is smaller than, say, the distance from 2 to 4. But it's a little bit hard to read, because you have to follow the lines all the way around to get the distance.

Another reason it's a little bit hard to read is the layout. Say we have a dendrogram with three components to it; you can imagine they are these three clusters. If you label them 1, 2, and 3, you could rotate around the axis and end up with a dendrogram that has 3 here, 2 here, and 1 there. If I flip this one and this one around this axis, the distances don't change, so the dendrogram would mean the exact same thing. The left-to-right order is a choice that the programming language R makes to visualize the data set. So, for example, this cluster is not necessarily closer to the middle cluster than this cluster is to the far cluster over here. It depends on the distance along the dendrogram.

All right, so that's hierarchical clustering. Another way you could do it is with what's called k-means clustering. In this setting we again have the same data set, but imagine we know in advance, or we guess, that there might be three clusters. So one thing we might want to do is ask where the centers of those clusters are, and then assign points to the closest center.

You could just start off by guessing centers. This isn't a very good guess in this particular case, but in higher-dimensional genomic data, often the best you can do is guess. Once you have those centers, you assign all the points to the closest center. So I calculate the distance from, say, point 12 to each of the centers, and it turns out that 12 is closest to the purple center, so I assign it to that cluster. Points 8 and 4 are closest to the red center, so I assign them to that cluster, and 1, 2, and 3 are closest to the orange center, so I assign them to that cluster.

Now that I've assigned the points, I can recalculate where the centers should be. It turns out the center between 8 and 4 is right here, the center for these points is right here, and the center for these points is right there. Then I can redo the same analysis: now that the centers are here, here, and here, which points are closest to which centers? If you keep doing this over and over again, you eventually end up with clusters that look something like this, with a center over here in this cluster, a center over here in this cluster, and a center over here in this cluster.

So what are we doing? We're iterating between identifying where the cluster centers are, assigning points to the closest cluster center, and then re-identifying where the cluster centers are. This is called k-means clustering, and it's another way to cluster the genes together. We'll see some examples of it in another lecture. For right now, the thing to keep in mind is that this is a little bit different from hierarchical clustering, because you have to define the number of clusters you want in advance.
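
The assign-then-recompute loop can be sketched in plain Python as follows. This is an illustration under assumptions, not the lecture's code: the data, the initial guesses, and the choice to leave an empty cluster's center where it was are all made up for the example.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Alternate between assigning points to the nearest center and re-averaging centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest current center.
        labels = [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the average of its assigned points.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[k])  # assumption: an empty cluster keeps its center
        if new_centers == centers:  # nothing moved, so the assignments are stable
            break
        centers = new_centers
    return labels, centers

# Made-up data with three visible groups, and a rough guess at three centers.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (5.0, 5.0), (5.0, 6.0), (6.0, 5.0),
              (10.0, 0.0), (10.0, 1.0)]
initial_guess = [(1.0, 1.0), (4.0, 4.0), (8.0, 2.0)]

labels, centers = kmeans(toy_points, initial_guess)
print(labels)   # [0, 0, 0, 1, 1, 1, 2, 2]
print(centers)
```

The number of clusters is fixed by how many starting centers you pass in, which is exactly the choice that has to be made in advance.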

One thing to keep in mind with any kind of clustering technique is that you need to be careful when exploring multivariate relationships. The scale of the data matters a lot: you might get different clusterings if you scale the data differently. Outliers can drive a lot of the clustering. For k-means, again, I showed a random start for the way the algorithm works, and different starting values can give you different clusters. Finally, in the examples I showed, it was really easy to decide on the number of clusters, but in general this isn't a trivial problem to solve.

It's better in general to visualize the data as much as you can. Just like with almost every technique we use, it's better to visualize the clusters and check whether they're useful. Clusters that you identify in any kind of data can easily be overinterpreted. If the clusters are highly overlapping and don't look like clear groups, you can really overinterpret them. So you have to be very careful: look at the clusters, make sure the relationships make sense, and don't overinterpret them.
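
The point about scale can be seen even before any clustering is run, because clustering is driven by which points count as close. The Python sketch below uses made-up samples with two features on very different scales; the helper names `nearest_neighbor` and `standardize` are hypothetical, and standardizing each feature to unit standard deviation is just one possible scaling.

```python
import math

def nearest_neighbor(points, index):
    """Index of the point closest (in Euclidean distance) to points[index]."""
    others = [i for i in range(len(points)) if i != index]
    return min(others, key=lambda i: math.dist(points[index], points[i]))

def standardize(points):
    """Center each feature at 0 and rescale it to sample standard deviation 1."""
    columns = list(zip(*points))
    means = [sum(col) / len(col) for col in columns]
    sds = [math.sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1))
           for col, m in zip(columns, means)]
    return [tuple((v - m) / s for v, m, s in zip(p, means, sds)) for p in points]

# Made-up samples: feature 1 ranges over single digits, feature 2 over thousands.
samples = [(1.0, 1000.0), (1.1, 1500.0), (3.0, 1010.0), (5.0, 3000.0)]

print(nearest_neighbor(samples, 0))               # 2: the large-scale feature dominates
print(nearest_neighbor(standardize(samples), 0))  # 1: after rescaling, both features count
```

Sample 0 has a different closest neighbor depending on the scaling, so any method built on these distances can group the samples differently.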

Here are some resources you might find useful if this incredibly brief introduction to clustering wasn't enough. I particularly recommend Hector Corrada Bravo's lecture notes and Rafa's Distances and Clustering video, which are very good. And if you want a more in-depth look, the book The Elements of Statistical Learning has a large section on clustering that might be useful.

Â Coursera provides universal access to the worldâ€™s best education,
partnering with top universities and organizations to offer courses online.