Learn to use tools from the Bioconductor project to perform analysis of genomic data. This is the fifth course in the Genomic Big Data Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Bioconductor for Genomic Data Science

114 ratings

From this lesson

Week Two

In this week we will learn how to represent and compute on biological sequences, both at the whole-genome level and at the level of millions of short reads.

- Kasper Daniel Hansen, PhD, Assistant Professor, Biostatistics and Genetic Medicine

Bloomberg School of Public Health

In this session, we will discuss Rles, or

run-length-encoded vectors, from GenomicRanges.

Run-length encoding is a way of representing very long vectors

where many consecutive elements are the same.

This is a form of compression of the very long vectors.

Whether or

not the compression works depends on the type of data we're looking at.

This type of compression is particularly interesting for

representing signal over the genome where the signal is only

non-zero in a small part of the genome.

That's true, for example, for

RNA sequencing, where we are only supposed to have signal over transcribed regions.

It's true for ChIP sequencing, and it's true for many other applications.

The standard example of an Rle in genomics is a coverage vector.

A coverage vector is something that comes out of high-throughput sequencing,

and it details, for each base in the genome,

how many short reads cover that particular base.

Okay, enough talking.

Let's see some examples.

It's going to be a little bit easier to see in practice.

Let's first load the library.

Now, let's construct an Rle, which we do with the capital-R Rle constructor.

So we take a single vector.

And we get the RLE out.

So let's look at it.

So what it has done now is, it says, this vector starts off with a run of length 6 with the value 1.

So the value 1 is repeated 6 times.

Then the value 2 is repeated 5 times.

The value 4 is repeated 2 times.

And finally, the value 2 is repeated once.

So here we can see, we have something called lengths.

We have something called values.

And we access them using runLength of rl and

runValue of rl.

This is just a way of compressing the vector.

And this is a compression, in this case here,

because we have taken 14 numbers and compressed them into 8 numbers.

So this compression makes sense

if the same number occurs many times in a row in the vector.

We can take our run-length-encoded vector and

then convert it back to a normal vector by as.numeric.

We get the original vector out.
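To make the idea concrete outside of R, here is a minimal sketch in plain Python. It is not the Bioconductor API: the function names rle_encode and rle_decode are made up for illustration. The input is the vector described above (six 1s, five 2s, two 4s, one 2).

```python
from itertools import groupby

def rle_encode(values):
    """Return (run_lengths, run_values) for a sequence (hypothetical helper)."""
    run_lengths, run_values = [], []
    for value, group in groupby(values):
        run_lengths.append(len(list(group)))
        run_values.append(value)
    return run_lengths, run_values

def rle_decode(run_lengths, run_values):
    """Expand the runs back into the original sequence."""
    out = []
    for n, v in zip(run_lengths, run_values):
        out.extend([v] * n)
    return out

# The 14-element vector from the lecture.
vector = [1] * 6 + [2] * 5 + [4] * 2 + [2]
run_lengths, run_values = rle_encode(vector)
print(run_lengths)  # [6, 5, 2, 1]
print(run_values)   # [1, 2, 4, 2]
print(rle_decode(run_lengths, run_values) == vector)  # True
```

The 14 numbers are stored as 4 lengths plus 4 values, which is the 14-to-8 compression mentioned above.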

There's also something to make the confusion total.

There's something in base R called rle, in lowercase, which is

also a run-length-encoded vector.

But it has a completely different API.

So here is how you should think about this.

You have your run-length-encoded vector, which is now

coverage or signal across the entire genome.

And very often, you have one or two operations where you have a set of

genomic regions and you want to do something over that.

So let's simulate that.

You do this with the aggregate function.

But let me show you an example.

We take an IRanges.

Let's say that start is going to be 2 and 8.

And let's say the width is equal to 4.

So this gives me a certain subset of the Rle.

And I can now aggregate.

I take my run-length-encoded vector.

I take my IRanges.

And as my function, I could take the mean.

I get two numbers back and

this is the mean of all the elements that are in the two different ranges.

So, to check that, let's say we take our original vector.

We get that by saying as.numeric of rl.

And we take the mean of elements 2 to 5 of the vector,

and the mean of elements 8 to 11.

So this could, for example, compute average signal

across a set of pre-specified genomic regions of a coverage vector.
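The aggregation step can be sketched in plain Python as follows. This is only an illustration of the idea, not how the aggregate function is implemented; the name aggregate_mean is hypothetical, positions are 1-based and inclusive as in the lecture, and the runs are expanded to a full vector purely for clarity.

```python
def aggregate_mean(run_lengths, run_values, starts, width):
    """Mean of the decoded vector over each 1-based window (hypothetical helper)."""
    # Expand the runs; a real implementation would avoid this for long vectors.
    vector = [v for n, v in zip(run_lengths, run_values) for _ in range(n)]
    means = []
    for start in starts:
        window = vector[start - 1 : start - 1 + width]
        means.append(sum(window) / len(window))
    return means

# Runs of the lecture's vector; windows start at 2 and 8, each of width 4.
means = aggregate_mean([6, 5, 2, 1], [1, 2, 4, 2], starts=[2, 8], width=4)
print(means)  # [1.0, 2.0]
```

Positions 2 to 5 all hold the value 1 and positions 8 to 11 all hold the value 2, so the two means are 1 and 2.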

You can construct a coverage vector out of IRanges, and that can be useful sometimes.

So, let's say you have IRanges.

Let's make a new IRanges.

So I have these overlapping intervals, and I can say what's the coverage of this?

By which I mean: for integer number 1,

how many of these ranges is it a part of? For integer number 2,

how many of these ranges is it a part of?

And I get this run-length-encoded vector that goes like, first,

1, 2, and then it's 3 for a little while, and then 2, 1.

Because of the structure of our IRanges.
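Here is a small plain-Python sketch of what coverage computes. The three ranges are an assumption, since the transcript does not record the ones typed in the session; they were picked so that the result rises 1, 2, 3 and falls back 2, 1 as described. The function name coverage_depth is also made up for illustration.

```python
from itertools import groupby

def coverage_depth(ranges):
    """Per-position count of overlapping ranges; ranges are (start, width), 1-based."""
    last = max(start + width - 1 for start, width in ranges)
    depth = [0] * last
    for start, width in ranges:
        for pos in range(start, start + width):
            depth[pos - 1] += 1
    return depth

# Hypothetical overlapping ranges: [1, 5], [2, 6], [3, 7].
ranges = [(1, 5), (2, 5), (3, 5)]
depth = coverage_depth(ranges)
print(depth)  # [1, 2, 3, 3, 3, 2, 1]

# Run-length encode the depth vector.
runs = [(len(list(group)), value) for value, group in groupby(depth)]
print(runs)  # [(1, 1), (1, 2), (3, 3), (1, 2), (1, 1)]
```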

Of course, I could do this with GRanges and genomic intervals,

and we'll get to that a little bit later on.

Sometimes you have a coverage vector.

So you have an Rle.

And you really want to figure out areas where this vector is big.

And you want to get that back as intervals.

The way you do that is you use the slice function.

So you can think of the coverage function as giving you kind of a function along

[INAUDIBLE], and now we slice the top off.

So let's take our Rle from before.

And let's slice it at a value of 2.

So now, we get a Views object back, but let's ignore the views for a moment.

We get a single interval from 7 to 14.

And these are all the positions.

And in this case, there's a single interval containing all of the positions

where the vector is greater than 2.

Greater than or equal to 2.

We can do the same thing, slicing it at a little bit higher value.

And then we get a much smaller interval.
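Slicing can be sketched directly on the runs, without expanding the vector. This is a plain-Python illustration rather than the slice function itself; the name slice_runs is hypothetical, and the second threshold of 3 is an assumption, since the transcript does not say which higher value was used.

```python
def slice_runs(run_lengths, run_values, lower):
    """Return 1-based inclusive (start, end) intervals where the value is >= lower."""
    intervals = []
    pos = 1               # 1-based position where the current run starts
    current_start = None  # start of the interval being built, if any
    for n, v in zip(run_lengths, run_values):
        if v >= lower:
            if current_start is None:
                current_start = pos
        elif current_start is not None:
            intervals.append((current_start, pos - 1))
            current_start = None
        pos += n
    if current_start is not None:
        intervals.append((current_start, pos - 1))
    return intervals

run_lengths, run_values = [6, 5, 2, 1], [1, 2, 4, 2]
print(slice_runs(run_lengths, run_values, lower=2))  # [(7, 14)]
print(slice_runs(run_lengths, run_values, lower=3))  # [(12, 13)]
```

Adjacent runs that all clear the threshold are merged into one interval, which is why slicing at 2 gives the single interval from 7 to 14.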

So we got back a Views object.

And this is completely the same, in every way exactly the same, as the Views

objects which we have discussed for BSgenome objects.

So we have one potentially very big object, which is a run-length-encoded vector.

Think about vectors across the entire genome.

And we basically have ranges or

intervals which subset this very long vector.

And we do that using this Views constructor.

But now, instead of a view on a genome,

we have a view on a run-length-encoded vector.

We can instantiate our own views, of course.

And, once we have a view, we have a set of views.

Let's make a set of views.

We can compute functions on these views here.

For example, let's say, we just want to take the mean of the view.

And this is basically,

the same as what we did with the aggregate example [INAUDIBLE].

Let's do the same thing with GenomicRanges

instead of IRanges. Now everything is on chromosome 1, and we do a coverage of that.

There was a little bit of quick typing, and we now get back something

called an RleList, because we have one Rle for each chromosome.

In this case, we just have a single chromosome.

But if it was normal human data, we would have many more chromosomes or contigs.

But here it's a single one, and inside of that we have an Rle.

When we have a GRanges like this and an RleList like this,

we can, again, instantiate a view on it.

And now we give it, let's say,

yeah, let's view it on the GRanges we had before.

No, let's view it on a new GRanges.

Chromosome 1 [INAUDIBLE].

So now we get an error back.

It says that the ranges list must be a RangesList object.

This is a type of object we haven't discussed from GenomicRanges, but

aside from GRanges,

there are these things called RangedData and

RangesList.

They were an attempt at providing a GRanges-like object.

It was by some accounts successful, but it hasn't really stuck around.

We still see these objects around here and there.

So the way you really solve this is

you take this GRanges and you just say,

okay, let this be a RangesList.

And now, we have this views list, and let's take what we see on chromosome 1.

And now, we have a Views object, exactly as we wanted, from position 3 to 7.

It's a little weird that GRanges doesn't work directly as input to Views.

I would consider that kind of a bug, and

I'm definitely going to discuss that with the Bioconductor people.

So now, we have learned about run-length-encoded vectors.

We have learned that we can have them both on the integers, but

we also can have them on the genome.

We don't just need run-length encoding of integers.

We can have run-length encoding of logicals, or even characters.

Actually, come to speak of it, if we look at a GRanges,

you can see that for the seqnames, or the chromosome names,

we can now recognize that these are run-length-encoded vectors.

Because we often have all the ranges on chromosome 1 next to each other.

We learned about how to compute on them using aggregate and views.

We have learned how to convert a GRanges into an Rle by coverage.

And we have learned we can slice off the tops.