0:09

Well, beta diversity is instead a measure of the similarity or

dissimilarity between two different samples.

And we especially think about beta diversity in an

ecological context as measuring the amount of change between different environments.

0:24

So in other words, the similarity between

the communities in different samples is the inverse of beta diversity.

The higher the beta diversity, the less similar the communities are.

For any two people, beta diversity turns out to be relatively low in their mouths.

In other words, all of us have relatively similar microbes.

But it's relatively high in the gut.

In other words, different people have different microbes from one another.

1:06

You probably remember our friends from the last lecture on alpha diversity.

But when we're talking about entire communities and

their similarity and dissimilarity, it gets a bit unwieldy to try to juggle them all, so

we're going to resort to slides for this.

1:42

Therefore, the dissimilarity, or beta diversity, is higher between sample A and

sample B than between sample C and

sample D, because A and B have fewer things in common.
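
One way to put a number on "fewer things in common" is a presence/absence measure such as Jaccard dissimilarity. The lecture does not name a metric at this point, and the slide contents are not in the transcript, so the organisms listed for each sample below are invented purely for illustration.

```python
def jaccard_dissimilarity(sample_1, sample_2):
    """1 - (organisms shared / organisms seen in either sample)."""
    a, b = set(sample_1), set(sample_2)
    union = a | b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical sample contents (not taken from the lecture slides).
sample_a = {"red", "blue", "green"}
sample_b = {"blue", "yellow", "purple"}
sample_c = {"red", "blue", "green"}
sample_d = {"red", "blue", "yellow"}

d_ab = jaccard_dissimilarity(sample_a, sample_b)  # 1 shared out of 5 seen
d_cd = jaccard_dissimilarity(sample_c, sample_d)  # 2 shared out of 4 seen
print(d_ab, d_cd)
```

With these made-up samples, A and B share fewer organisms than C and D, so their dissimilarity comes out higher.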

We also need to

consider whether organisms that are shared are present in the same abundance.

2:06

However, in sample E the blue organisms make up about 73% of the community,

whereas they are only about 29% abundant in sample F.
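
A measure that takes abundance into account will separate these two samples even when they contain the same organisms. The sketch below uses Bray-Curtis dissimilarity, which is one common choice and an assumption here, since the lecture does not name a metric. The 73% and 29% figures are from the lecture; lumping everything else into a single "other" organism is an assumption.

```python
def bray_curtis(x, y):
    """Sum of absolute abundance differences over the sum of all abundances."""
    organisms = set(x) | set(y)
    numerator = sum(abs(x.get(o, 0.0) - y.get(o, 0.0)) for o in organisms)
    denominator = sum(x.get(o, 0.0) + y.get(o, 0.0) for o in organisms)
    return numerator / denominator if denominator else 0.0


# Blue abundances are from the lecture; "other" is a hypothetical remainder.
sample_e = {"blue": 0.73, "other": 0.27}
sample_f = {"blue": 0.29, "other": 0.71}

bc_ef = bray_curtis(sample_e, sample_f)
print(round(bc_ef, 2))
```

A presence/absence measure would call these two samples identical, because both contain the same organisms. The abundance-aware measure does not.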

We can also look at how closely related the organisms are between two communities.

2:30

In this case we also have the phylogenetic tree that describes

the evolutionary relationships between these organisms.

So in this case what you can see is that all three communities share one organism,

the dark blue one, at about the same abundance of 50%.

However the light blue microbe is very closely related to

that dark blue one according to the phylogenetic tree.

Whereas, for example, if we look at the green microbe in the second sample,

that's much more distantly related.

2:58

Therefore, sample one is more similar to sample three than it is to sample two,

even though they share the same number of species.

It matters whether or not their species are closely related or distantly related.

3:18

This method of using the phylogenetic tree as a measuring stick to tell how

similar or dissimilar different communities are

is a technique called UniFrac, which Cathy Lozupone and I introduced back in 2005.
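
The "tree as a measuring stick" idea can be sketched in its simplest, unweighted form: add up the branch length that leads to organisms in only one of the two samples, and divide by the branch length that leads to organisms in either. The tree shape, branch lengths, and the "mid_blue" organism below are all hypothetical, and production implementations handle many details this sketch ignores.

```python
# Hypothetical rooted tree: child -> (parent, branch_length).
TREE = {
    "blues": ("root", 0.9),
    "green": ("root", 1.0),
    "dark_blue": ("blues", 0.1),
    "mid_blue": ("blues", 0.1),
    "light_blue": ("blues", 0.1),
}


def branches_to_root(organisms):
    """Every branch (named by its child node) on the paths from organisms to root."""
    covered = set()
    for node in organisms:
        while node in TREE:  # the root has no entry, so the walk stops there
            covered.add(node)
            node = TREE[node][0]
    return covered


def unweighted_unifrac(sample_1, sample_2):
    """Fraction of covered branch length that is unique to one sample."""
    b1, b2 = branches_to_root(sample_1), branches_to_root(sample_2)
    unique = sum(TREE[b][1] for b in b1 ^ b2)
    total = sum(TREE[b][1] for b in b1 | b2)
    return unique / total if total else 0.0


# Each sample holds dark blue plus one other organism (an assumed layout).
sample_1 = {"dark_blue", "mid_blue"}
sample_2 = {"dark_blue", "green"}
sample_3 = {"dark_blue", "light_blue"}

u_13 = unweighted_unifrac(sample_1, sample_3)
u_12 = unweighted_unifrac(sample_1, sample_2)
print(u_13 < u_12)
```

Each pair shares exactly one organism, yet sample 1 comes out closer to sample 3 than to sample 2, because the organisms they do not share sit on short neighbouring branches rather than on a long, distant one.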

Cathy completed a really brilliant PhD thesis on this when she was a member of

my lab as a graduate student.

And she's now a faculty member at the Medical Campus, where she

works, among other things, on links between microbes and autism.

And we'll hear from her on that topic in an interview later in the course.

3:50

But right now, we're going to hear from another talented current member from my

lab, Will Van Treuren.

Will has a background in applied math and MCDB, a very unusual combination.

And more recently he's been working on a lot of

fascinating microbial analysis,

including pushing the envelope of how we can tell whether two microbes

are correlated in a particular set of samples, and whether or

not they interact with one another to produce

an interesting effect that neither of them could pull off on their own.

4:21

What he's going to do now, though,

is walk you through some of the techniques that we use to visualize

the similarities and differences in diversity between microbial [INAUDIBLE].

>> Recording data is one of the essential functions of the scientist.

In microbial ecology in particular, and data science in general,

the contingency table is widely used.

A contingency table has rows and columns.

The columns usually record the things found in a given sample.

And the rows record the specific thing or things observed.

The rows of a contingency table are often called the features of the data.

As an example I have taken four samples of some arbitrary environment, and

recorded the data.

5:04

Sample A has three green bugs, two pink bugs and two tan bugs.

So in the table, the first column, the one corresponding to Sample A,

has three in the green-bug row, two in the pink-bug row, and two in the tan-bug row.

A contingency table is really the workhorse of data science in general, so

familiarize yourself with this concept before moving on.
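
A contingency table like this can be sketched as a small data structure with one row per feature and one column per sample. Sample A's counts are the ones just described, and sample C's counts and sample B's pink and tan counts are quoted later in the talk. Sample B's green count and all of sample D are placeholders invented for illustration.

```python
# Column order for every row of the table.
samples = ["A", "B", "C", "D"]

# Rows are features (bug types); each list holds one count per sample.
# B's green count and all of D's counts are hypothetical placeholders.
table = {
    "green": [3, 0, 3, 1],
    "pink": [2, 2, 1, 4],
    "tan": [2, 4, 0, 1],
}


def column(sample):
    """Return the counts recorded for one sample, keyed by feature."""
    i = samples.index(sample)
    return {feature: counts[i] for feature, counts in table.items()}


print(column("A"))
```

Reading down a column gives everything observed in one sample. Reading across a row shows how one feature varies from sample to sample.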

The phrase, a picture is worth a thousand words, is doubly true in science.

To make sense of contingency tables, and

data in general, scientists turn to various visualizations, or plots.

One of the simplest plotting schemes goes like this.

Treat each feature of the contingency table as a dimension or axis and

mark where each sample would go based on its features.

For instance, let's look at sample C.

It has three of the green bug, so

we locate it at the three position on our green bug axis.

It has one of the pink bug, so

we locate it at the one position on the pink bug axis.

Finally, it has zero of the tan bug, so we put it at zero on the tan bug axis.

6:10

If you prefer to think of these axes as X, Y, and Z, we've located the point for

sample C at X equals three, Y equals one and Z equals zero.

The patterns a visualization shows help scientists derive conclusions and

develop applications for their science.

For instance, in our microbiome studies we are frequently concerned with how

close someone is to developing a certain type of disease, like ulcerative colitis.

If we have a known disease sample, say Sample B, and two unknown samples D and

C, our intuition tells us that the closer the unknown sample is to B,

the more likely it is to come from a person who has that disease.

In many cases however, saying that sample x looks closer to sample y,

than to sample z is not rigorous enough for scientific or medical use.

To help us in these cases, we introduce the notion of distance.

7:04

Now, there are a lot of notions of distance that are mathematically complex.

But a familiar one to most of us is Euclidean distance.

The Euclidean distance between two points is just the square root

of the sum of the squared differences in the locations on each axis.

7:40

Both A and B have two pink bugs, so we have 2 minus 2, and

finally B has two more tan bugs than A and we have 2 minus 4.

What's important to note, even if you're not fluent in math,

is that the distance between sample A and sample B, the green line,

looks smaller than the distance between sample B and sample C, the yellow line.

The distance calculation confirms our observation and

shows us that A is 3.61 units from B, while B is 5.09 units from C.
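
The calculation can be sketched in a few lines. Sample A, sample C, and sample B's pink and tan counts come from the lecture. Sample B's green count is not stated in this transcript, so the value 0 is an assumption, chosen because it reproduces the 3.61 quoted for A to B (a green count of 6 would give the same distances).

```python
import math


def euclidean(p, q):
    """Square root of the summed squared differences along each axis."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Coordinates are (green, pink, tan) counts.
sample_a = (3, 2, 2)
sample_b = (0, 2, 4)  # green count of 0 is assumed, not from the lecture
sample_c = (3, 1, 0)

d_ab = euclidean(sample_a, sample_b)  # sqrt(9 + 0 + 4)
d_bc = euclidean(sample_b, sample_c)  # sqrt(9 + 1 + 16)
print(d_ab, d_bc)
```

Under that assumption the A to B distance is the square root of 13, which rounds to 3.61, and the B to C distance is the square root of 26, about 5.099, which matches the quoted 5.09 up to the last digit.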

8:12

When we have three dimensions of data,

the visualization strategy we just outlined works great.

However, when we have more than three things it won't work.

Imagine these four samples are the same samples as before,

we have just looked harder through them to find some new bugs.

Now, instead of three types of bug, or three features, we have six types of bug.

We can't plot these samples in the way we did before because we have no way to

visualize six dimensions.

Although we can't visualize this data, our notion of distance still works just fine.

In fact, it will work with any number of dimensions.

The bottom calculation shows the distance between sample A and sample B again.

But this time with the inclusion of the new features.
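
The same distance can be computed for every pair of samples at once, whatever the number of features. The sketch below uses NumPy and a six-feature table whose values are hypothetical, since the counts on the slide are not in the transcript. It produces a samples-by-samples distance matrix.

```python
import numpy as np

# Rows are samples A-D, columns are six bug types (all values hypothetical).
counts = np.array(
    [
        [3, 2, 2, 1, 0, 5],
        [0, 2, 4, 0, 3, 1],
        [3, 1, 0, 2, 2, 0],
        [1, 4, 1, 0, 1, 2],
    ],
    dtype=float,
)

# Broadcasting builds every pairwise difference: shape (4, 4, 6).
diff = counts[:, None, :] - counts[None, :, :]

# Sum squared differences over the feature axis, then take the square root.
dist = np.sqrt((diff**2).sum(axis=-1))
print(dist.round(2))
```

Nothing in the calculation depends on there being six features, so the same lines work unchanged for six thousand.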

You might think, well, if we have something that works, i.e., we can

compare these samples using distances, why do we need visualizations at all?

The answer is that we might never even develop the intuition that there is

a pattern in the data if we don't visualize it.

Scientists have been faced with the dilemma of how to

visualize high-dimensional data for a long time, and

they've come up with some pretty ingenious methods.

The one I'll discuss now is called dimensionality reduction.

In essence, when we are doing dimensionality reduction,

we are looking to recapture whatever patterns are in the data, but

reduce the number of dimensions we need to see that pattern.

We'll start with an example.

9:58

The orange point is 3 units east and 3.35 units north,

while the green point is negative 1 unit east and negative 4.38 units north.

The contingency table at the bottom left records this data.

Now, if you look on the right I have the same circle with the same points, but

I've recorded their positions differently.

10:30

If I gave you the representation on the right, it would be just as unambiguous as

the one on the left but I only used a single dimension of data, that is degrees,

rather than two dimensions of data, that is north and east.

This is the essence of dimensionality reduction.

Find a new coordinate system or

representation of the data that captures the same patterns in fewer dimensions.
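
The circle example can be sketched as a pair of conversions. The east and north values are the ones quoted for the orange and green points. How the slide measures degrees is not in the transcript, so counterclockwise from due east is an assumption, as is the radius of 4.5, which is roughly what both points imply.

```python
import math

RADIUS = 4.5  # assumed; both quoted points lie about this far from the centre


def to_angle(east, north):
    """Degrees counterclockwise from due east, in the range [0, 360)."""
    return math.degrees(math.atan2(north, east)) % 360.0


def from_angle(degrees, radius=RADIUS):
    """Recover the (east, north) position from the single angle."""
    r = math.radians(degrees)
    return radius * math.cos(r), radius * math.sin(r)


orange = (3.0, 3.35)
green = (-1.0, -4.38)

orange_deg = to_angle(*orange)  # roughly 48 degrees
green_deg = to_angle(*green)    # roughly 257 degrees

print(from_angle(orange_deg), from_angle(green_deg))
```

One number per point is enough here only because every point is known to sit on the same circle. That shared constraint is the pattern the new coordinate system exploits.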

In microbial ecology, we frequently use a specific type of

dimensionality reduction called principal components analysis, PCA,

and a related technique called principal coordinates analysis, or PCoA.

The math required for these techniques is basic linear algebra,

but it's well beyond the scope of this course.

Instead of slogging through that, we're going to give you a high level overview.

11:18

Imagine you have a 2D oval of paper that is tilted at

an angle from the horizontal like in panel A.

Now to describe a point on our oval we need an X, a Y, and a Z coordinate.

If you think about it,

though, the oval is only 2D so it could really lie flat in the plane.

It doesn't need a Z axis to describe it.

If we were to choose a new set of coordinates,

one along the long direction, the major axis of the oval, and

one perpendicular to it along the minor axis of the oval we could

unambiguously represent any point on that oval using only two dimensions.

11:53

Panel B shows our axis in the original space with the new coordinate system, and

panel C shows the oval displayed only in 2D.

The oval actually lay in a lower-dimensional manifold of our

original coordinate system.

This is the essence of PCA and PCoA.

By choosing new coordinate systems, we can eliminate redundant or

useless dimensions and see the pattern in our data visually.

12:17

In reality, the process is not quite this easy.

Usually we can reduce the importance of a dimension but

we can't actually eliminate it.

In the context of this example,

this means that the oval has a bit of thickness to it.
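
The tilted oval, with and without a bit of thickness, can be sketched with NumPy. This is a minimal PCA via the singular value decomposition on synthetic points. The oval's size, the 30 degree tilt, and the noise level are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 500

# Points scattered inside a flat oval: half-length 4 along x, 1 along y, z = 0.
angle = rng.uniform(0.0, 2.0 * np.pi, n_points)
radius = np.sqrt(rng.uniform(0.0, 1.0, n_points))
flat = np.column_stack(
    [4.0 * radius * np.cos(angle), 1.0 * radius * np.sin(angle), np.zeros(n_points)]
)

# Tilt the oval 30 degrees out of the horizontal (rotation about the y axis).
t = np.radians(30.0)
tilt = np.array(
    [
        [np.cos(t), 0.0, -np.sin(t)],
        [0.0, 1.0, 0.0],
        [np.sin(t), 0.0, np.cos(t)],
    ]
)
tilted = flat @ tilt.T

# Give the oval "a bit of thickness" by adding small noise in every direction.
thick = tilted + rng.normal(0.0, 0.05, tilted.shape)


def pca(points):
    """Return the variance fraction per new axis and the re-expressed points."""
    centered = points - points.mean(axis=0)
    _, singular_values, axes = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return explained, centered @ axes.T


explained_flat, coords_flat = pca(tilted)
explained_thick, coords_thick = pca(thick)
print(explained_flat.round(4), explained_thick.round(4))
```

For the perfectly flat oval the third new axis carries essentially no variance and can be dropped. For the thick oval it carries a small but nonzero share, which is the usual situation with real data.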

12:36

To give you an example of how this is useful in the real world,

I've included a plot of some Human Microbiome Project data.

This data initially had thousands of dimensions, so

to accurately graph sample relationships we needed thousands of axes.

By using the magic of PCoA, however, we reduced that down to three dimensions,

which allowed us to see that all of the fecal samples are similar and

cluster on the bottom,

and that all of the oral samples are similar and cluster on the top left.

In contrast, the skin and

vaginal samples, while similar to one another, are more spread out.

Without PCoA, we'd never have seen this pattern, and we'd never have been able to

make some of the exciting discoveries that we'll talk about in the coming weeks.
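
The overall recipe, a distance matrix in and a few plotting axes out, can be sketched as classical PCoA on synthetic data. Everything here is a stand-in: the two groups of samples are randomly generated, and Euclidean distance is used for simplicity, whereas the distance measure behind the plot described above is not stated in the transcript.

```python
import numpy as np


def pcoa(dist, n_axes=3):
    """Classical PCoA: double-center squared distances, then eigendecompose."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]  # largest eigenvalues first
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)


rng = np.random.default_rng(1)

# Synthetic stand-in for two body sites: 10 samples each, 1000 features.
group_1 = rng.poisson(5, size=(10, 1000))
group_2 = rng.poisson(5, size=(10, 1000))
group_2[:, :200] += 10  # the second group is enriched in 200 features
counts = np.vstack([group_1, group_2]).astype(float)

# Pairwise Euclidean distances between all 20 samples.
diff = counts[:, None, :] - counts[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))

coords = pcoa(dist, n_axes=3)
print(coords.shape)
```

Plotting the three columns of `coords` against each other gives one point per sample. In this toy data the two groups fall on opposite sides of the first axis.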