Learn to use tools from the Bioconductor project to perform analysis of genomic data. This is the fifth course in the Genomic Big Data Specialization from Johns Hopkins University.

A course from Johns Hopkins University

Bioconductor for Genomic Data Science

From this lesson

Week One

The class will cover how to install and use Bioconductor software. We will discuss common data structures, including ExpressionSets, SummarizedExperiment and GRanges used across several types of analyses.

- Kasper Daniel Hansen, PhD, Assistant Professor, Biostatistics and Genetic Medicine

Bloomberg School of Public Health

In this class, we assume some basic familiarity with R.

In this session, however, we'll discuss some basic types of objects in R such

as vectors and matrices that we'll see again and again in Bioconductor and

where it's important to have some basic understanding of how to subset,

how to operate on these objects.

So there's a big class of objects in R called atomic vectors, or vectors.

Atomic vectors are vectors where every single element is of the same type.

For example, a sequence of numbers from one to ten.

All vectors can have names.

And we can see what the class is, in this case an integer.

And we can subset vectors with a one dimensional subsetting.

That means we put a single index vector inside the brackets.

Or, now that it has names, we can index it based on its names.
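A minimal sketch of the vector basics just described. The names taken from `letters` are an assumption for illustration; the lecture does not say which names were used.

```r
# An atomic vector: every element has the same type
x <- 1:10

# Assumed names for illustration (not given in the lecture)
names(x) <- letters[1:10]

# The class of a sequence made with ':' is integer
x_class <- class(x)

# One-dimensional subsetting, by position and by name
by_position <- x[1:3]
by_name <- x[c("a", "b")]

print(x_class)
print(by_position)
print(by_name)
```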

Names do not have to be unique on vectors.

That's a source of frequent problems.

Let's take an example here where we set the names(x) to be c("a", "a","b").

So we have nonunique names here and we will say x,

give us the element of x with the name a.

We only get a single element back, not both of the elements.

We get the first match.

That can be very confusing.

And you can check whether there are duplicated names in a vector using unique

or duplicated, which are base R functions you should learn about.

But there's also a convenient little function called anyDuplicated.

It returns either 0 or the index of the first duplication.

In this case it's the second element in the names vector

where we see the first duplication.

So the good result is getting anyDuplicated equal to zero.

Let's try that with unique names.

And now we get 0 back.
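The duplicated-names problem can be sketched like this. The three-element vector is an assumption chosen to match the three names used above.

```r
# Assumed three-element vector to match the three names
x <- 1:3
names(x) <- c("a", "a", "b")

# Subsetting by a duplicated name returns only the first match
first_a <- x["a"]

# anyDuplicated gives the index of the first duplicate, or 0 if there is none
dup_index <- anyDuplicated(names(x))

# With unique names the result is 0
names(x) <- c("a", "b", "c")
no_dup <- anyDuplicated(names(x))

print(first_a)
print(dup_index)
print(no_dup)
```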

Now, in R, there's a slight difference between integers and numerics.

Numerics are double-precision floating-point numbers, and

integers are standard 32-bit integers.

That's a little confusing and

it's something I would prefer not to spend too much time on.

But we are going to, in Bioconductor sometimes, run into limits associated

with integers, so I feel we have to have a brief discussion about it.

Let's say that we set x equal to 1.

This looks like an integer, but

when I do it like this, I actually get a numeric back.

Unlike if I set x equal to this function here where I get a set of integers.

So how do I represent a single

integer on the command line?

I say x = 1 and then I put a capital L afterwards.

That's something you just have to know that this is how you do it.

So here we call class(x), and we get integer back.
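A small sketch of the difference. Using `1:3` as the function that yields integers is an assumption; the lecture only says "this function here".

```r
# A bare 1 is stored as a numeric (double), even though it looks like an integer
x_numeric <- 1
class_numeric <- class(x_numeric)

# Assumed example of a call that yields integers
x_seq <- 1:3
class_seq <- class(x_seq)

# The capital L suffix makes a single integer
x_integer <- 1L
class_integer <- class(x_integer)

print(c(class_numeric, class_seq, class_integer))
```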

Now there's a limit to how big an integer you can represent in R.

It's given by a machine constant.

You get various machine limits by writing .Machine,

and in this case, .Machine$integer.max.

It's a big number, it's equal to 2 to the 31st minus 1.

You can check whether that's equal to .Machine$integer.max.

It's true.

And what is it roughly equal to?

Let's round it.

Let's divide it by a million and round it.

So it's roughly equal to 2.1 billion.

This number is smaller than

the size of the human genome, which is roughly 3 billion bases.

And this is why we're going to run into limits associated with vectors.

Let me back up a little bit.

This is a limit for how big a number you can have in an integer.

It's also a limit for how long a vector can be in R.

So you can't have more than roughly

2.1 billion elements in a vector, unless it's something called

a long vector, which R introduced recently and where we are still waiting

to get pervasive support through all functions in R.

So you are sometimes going to run into this limit.

It happens, basically, if you add up a lot of bases from the human genome and

you get up to more than 2 billion.

And the solution is usually you convert the integers into numeric using

a function called as.numeric.

And basically this will fix it.
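The integer limit and the as.numeric fix can be sketched as follows. The two base counts are made-up values for illustration, not real chromosome lengths.

```r
# The largest representable integer
int_max <- .Machine$integer.max
is_power <- int_max == 2^31 - 1

# In millions, rounded: about 2.1 billion
in_millions <- round(int_max / 10^6)

# Hypothetical base counts (made-up values) whose sum exceeds the limit
base_counts <- c(2000000000L, 1000000000L)

# Summing integers past the limit gives NA with a warning
overflowed <- suppressWarnings(sum(base_counts))

# Converting to numeric first avoids the overflow
fixed <- sum(as.numeric(base_counts))

print(is_power)
print(in_millions)
print(overflowed)
print(fixed)
```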

Let's now discuss matrices.

Matrices are another very important fundamental class in R.

And matrices are constructed using the matrix function.

They are two-dimensional: we give it a number of rows and

a number of columns, and now we get a 3 by 3 matrix back.

It can have row names and column names, and

like vectors,

the row names and the column names do not have to be unique.

And like with vectors, having non-unique row names or

column names is a frequent source of errors.

A matrix is two-dimensional.

You get the dimension using dim.

And you can also get the number of rows and the number of columns separately.
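A minimal sketch of constructing a matrix and asking for its dimensions. The row and column names are assumptions for illustration.

```r
# A 3 by 3 matrix holding the numbers 1 to 9
m <- matrix(1:9, nrow = 3, ncol = 3)

# Assumed row and column names (not given in the lecture)
rownames(m) <- c("a", "b", "c")
colnames(m) <- c("x", "y", "z")

# Both dimensions at once, then each separately
m_dim <- dim(m)
n_rows <- nrow(m)
n_cols <- ncol(m)

print(m)
print(m_dim)
```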

You subset matrices using two-dimensional subsetting, because it has two dimensions.

The first dimension is rows.

The second dimension is columns.

This here gives us the first two rows. This here

will give us the first two columns, and we can do both of them at the same time.

We can also subset using characters.

So here we get the first row back.

And note, we don't get the first row back as a matrix.

We get it back as a vector.

Sometimes you really want to have

a matrix with a single row and three columns back in this case here.

And you do that by using this hidden argument, I would call it a hidden

argument, to the subsetting operation where I can say drop=FALSE.

And now we get a 1 by 3 matrix back.
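Two-dimensional subsetting and the drop argument can be sketched like this. The row names are assumed for illustration.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)
rownames(m) <- c("a", "b", "c")  # assumed names

# Rows go before the comma, columns after it
first_two_rows <- m[1:2, ]
first_two_cols <- m[, 1:2]
both <- m[1:2, 1:2]

# Subsetting a single row by name drops to a plain vector
row_as_vector <- m["a", ]

# drop = FALSE keeps the result as a 1 by 3 matrix
row_as_matrix <- m["a", , drop = FALSE]

print(row_as_vector)
print(row_as_matrix)
```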

You can also subset matrices using matrices.

And a typical case is, I have a matrix and

I want to subset it using some logical operator.

So for example I can say give me all the elements in the matrix

where the matrix is greater than 5.

And I get a vector back.

So matrix subsetting like this loses the dimensions of the matrix and gives a vector

back, which kind of makes sense if you reflect a little bit about what happens.
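A short sketch of subsetting a matrix with a logical matrix, using the same 1 to 9 matrix as before.

```r
m <- matrix(1:9, nrow = 3, ncol = 3)

# The comparison gives a logical matrix of the same shape
mask <- m > 5

# Subsetting with it returns a plain vector, in column order
big_values <- m[mask]

print(mask)
print(big_values)
```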

Matrices are, behind the scenes, just really long vectors.

So, unless you use long vectors,

there's also a limit to how big matrices can be.

Matrices are column first, so that means when we created the matrix,

let's go back up here, we got 1 to 9: we fill up the first column,

then we fill up the second column, then we fill up the third column.

We can do the same thing and then we can say byrow=TRUE.

So now we fill it up by row.
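The two fill orders side by side, as a minimal sketch.

```r
# Default: fill column by column
by_col <- matrix(1:9, nrow = 3, ncol = 3)

# byrow = TRUE: fill row by row
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)

print(by_col)
print(by_row)
```

The row-filled matrix is the transpose of the column-filled one.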

The next basic object in R is something called a list.

And a list is just a list of different objects.

There's no requirement that the objects are even of the same kind, so

we construct it using the list function. Let's let it be a named list.

The first thing is going to be some random numbers.

The next thing is going to be some letters, pick out five of those.

And then the last thing could be a function.

Let's say, mean.

So now we get a list of three objects: we have three numbers,

we have five letters, or five characters, and we have a function.

So there is no requirement that they make sense in any way.

Since they can have names, we can refer to these with the names.

Let's back up for a moment.

Subsetting lists works in many ways like subsetting a vector.

You can think of a list, in a way, like a vector.

We can get the first two elements.

We can even get the first element.

And notice here that when we ask for

the first element we get back a list with one element.

In order to get the element itself we have to use double brackets.

So that gives us directly the vector,

whereas x with one bracket gives us a list with one element in it.

There's a crucial difference between these two things and

it pays to get your head wrapped around this.
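The single versus double bracket difference can be sketched as follows. The component names `a`, `b` and `c` are assumptions for illustration.

```r
# A named list of three unrelated objects (names are assumed)
x <- list(a = rnorm(3), b = letters[1:5], c = mean)

# Single brackets always return a list
first_two <- x[1:2]
one_as_list <- x[1]

# Double brackets return the element itself
one_as_vector <- x[[1]]

print(class(one_as_list))
print(class(one_as_vector))
```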

When it has names, you can also refer to the components

in the list by the names.

Of course we can do it in the classic way like we do with vectors, but

you can also use the dollar operator.

One thing that's a little bit surprising is that you can,

you have partial matching when you do these things here.

So let's say here names(x), let's call them c("a", "letters", "c").

So I can write x$letters but I can also write x$let.

Nothing is called let, but it does partial matching.

And that can be a little bit dangerous, especially if

we have two names that are almost the same:

I write x$let, and I get NULL back.

This is very, very, confusing at first,

and it's also another possible source of errors.

One way to get around this is to use the single bracket notation.

If you use the single bracket notation, you don't have partial matching.
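A sketch of partial matching with the dollar operator. The second list with the name `letters2` is a hypothetical example of "two names that are almost the same"; the lecture does not give the actual names.

```r
x <- list(a = 1:3, letters = letters[1:5], c = mean)

# Full name and partial name give the same component
full <- x$letters
partial <- x$let

# Hypothetical list with two almost identical names
x2 <- list(letters = letters[1:5], letters2 = LETTERS[1:5])

# The partial name is now ambiguous, so the result is NULL
ambiguous <- x2$let

# Bracket subsetting matches names exactly, so "let" finds nothing
exact_single <- x["let"]
exact_double <- x[["let"]]

print(partial)
print(ambiguous)
```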

Sometimes you want a list where each element of the list is a single number.

And you do that using the as.list function.

You give it a vector of three numbers and

back we get a list with one element in each component of the list.
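As a small sketch, as.list on a vector of three numbers:

```r
# Each number becomes its own component of the list
x <- as.list(1:3)

print(x)
```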

Very often we have lists

where each element in the list is of the same type.

Let's make an example.

rnorm(3), so here we have a list where

each element in the list is a vector of numbers.

And we want to use the same function on each of the elements.

This is something we do all of the time in Bioconductor.

The basic function for that is called lapply.

We take the list and we apply some function to it, and we get a list back,

where each element of the list is the function applied on each element.

Quite often in this particular case, where the function returns a single number for

each case, we often want it really as a vector instead of a list.

One way of doing that is by unlisting the list.

Or, more frequently, you use sapply, which stands for

simplify apply, and it really just simplifies the output.

It simplifies when the output of the function on each element of the list is

either a single number or a vector of the same length.
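A minimal sketch of lapply, unlist and sapply on a list of numeric vectors. The component names and the seed are assumptions added so the example is reproducible.

```r
set.seed(1)  # assumed seed, only for reproducibility

# A list where every element is a vector of numbers (names are assumed)
x <- list(a = rnorm(3), b = rnorm(3))

# lapply returns a list with one result per element
means_list <- lapply(x, mean)

# Two ways to get a plain vector instead
means_unlisted <- unlist(means_list)
means_vector <- sapply(x, mean)

print(means_list)
print(means_vector)
```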

The final topic we are going to discuss is data frames.

Data frames are fundamental for data analysis, and

they allow us to hold observations of different types.

So let's say we have a variable called sex.

I'll just execute it from over here.

We have a data frame that has two variables, sex and age.

Let's look at it.

We have sex and we have age.

So this is a type where

the components of the object can have different types.

This would not be possible with a matrix.

Data frames are naturally kind of column oriented.

We can get their columns by using x$sex.

But, say there were row names on the data frame,

you can't use the dollar notation to get the rows.

This is because frequently what we want from a data frame is a single column.

The way we think of data frames is that the columns are variables and

the rows are observations or different samples.

We can use single brackets and double brackets here to get

a specific column, but we can also subset

the data frame in exactly the same way we have used for matrices.

One specific thing with data frames is that

the row names are required to be unique.

And that's actually kind of nice.

It also is something that slows down data frames when you have very big data frames.

But it's pretty nice.

They have to have row names and they have to be unique.

But the row names can be 1, 2, 3, as in the example we have here,

where you can kind of think of them as not really having row names.

Underneath it all, a data frame is really a list where each component in the list is

of the same length.

And that means we can do a couple of tricks, like we can

use sapply and lapply on data frames and we apply them to different columns.

So something I use a lot is sapply with class over the data frame in

order to figure out what class each column is.

And I get back that sex is a factor (a character vector in R 4.0 and later) and age is a numeric.
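A sketch of the data frame operations above. The values in `sex` and `age` are made up for illustration; the lecture does not show them.

```r
# Hypothetical values; only the column names come from the lecture
df <- data.frame(sex = c("M", "M", "F"), age = c(32, 34, 29))

# Columns by dollar, double bracket, or matrix-style subsetting
sex_column <- df$sex
age_column <- df[["age"]]
first_age <- df[1, "age"]

# Row names are always present and unique; here they are just 1, 2, 3
row_names <- rownames(df)

# A data frame is a list of columns, so sapply works column by column.
# sex is "character" in R >= 4.0 and "factor" in earlier versions.
column_classes <- sapply(df, class)

print(df)
print(column_classes)
```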

The final subject of this session is going to be conversion.

We are frequently in the case where we have an object of a specific type and

we want to convert it into another type.

So let's take the data frame here.

Let's say we wanted a matrix out of the data frame.

We can use as.matrix, which casts the data frame as a matrix.

Notice it's a matrix now, but

it's a character matrix, because it cannot convert it into a numeric matrix

without losing the first column, or the first column would be NA.

In the same way, I could convert the matrix into a list.

I get a list that looks like this.

And in this way, there are a lot of as.something functions,

which are part of base R.

These are useful and used again and again.
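A sketch of the as.something conversions, applied to a data frame with made-up values. Converting the data frame itself to a list is an assumption about what was shown on screen.

```r
# Hypothetical values; only the column names come from the lecture
df <- data.frame(sex = c("M", "M", "F"), age = c(32, 34, 29))

# Mixed column types force a character matrix
m <- as.matrix(df)

# A data frame converts to a list with one component per column
l <- as.list(df)

print(m)
print(names(l))
```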

Now, in Bioconductor, we often have very complicated objects and

we kind of want to do the same thing but for very complicated objects.

And for that, we don't have an as.something function.

But we have a general function that's inside the methods package.

And the function is just called as.

And the way you use it is as(), with the object and the class you want to cast it as.

So this here is very similar to the as.matrix, but it works for

very general types of objects.

So instead of matrix, it could be one of the many new types of objects we'll

learn about in Bioconductor, and we can cast between them using the as function.
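A minimal sketch of the general as function from the methods package. It uses a base R class as the target, since no Bioconductor classes have been introduced yet; the same call pattern is assumed to carry over to those classes when a conversion is defined.

```r
library(methods)

x <- 1:3

# as(object, "Class") casts the object to the named class
x_char <- as(x, "character")

# For this simple case it agrees with the as.something function
same_as_base <- identical(x_char, as.character(x))

print(x_char)
print(same_as_base)
```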