An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

Loading...

From the course by Johns Hopkins University

Statistics for Genomic Data Science

110 ratings

An introduction to the statistics behind the most popular genomic data science projects. This is the sixth course in the Genomic Big Data Science Specialization from Johns Hopkins University.

From the lesson

Module 4

In this week we will cover a lot of the general pipelines people use to analyze specific data types like RNA-seq, GWAS, ChIP-Seq, and DNA Methylation studies.

- Jeff Leek, PhDAssociate Professor, Biostatistics

Bloomberg School of Public Health

The most common [INAUDIBLE] analysis is to do an expressive quantitative trade locust

Â analysis, or analysis, where you're combining gene expression

Â information with genetic information to try to identify association.

Â So I'm going to show you an example of that using the matrix package.

Â So here I'm going to set my parameters like I usually do and

Â then I'm going to load the packages that we're going to need,

Â in this case the primary driver of most of this analysis is the MatrixEQTL package.

Â That means you can see there, but they're sort of strange caps,

Â you need to all caps for EQTL and caps for Matrix.

Â So then what I can do is I can basically load some data from this.

Â I'm going to load snp data, expression data, and covariant data.

Â The snp data is the genotype information, the expression data the gene

Â expression information, and the covariants are anything you might want to adjust for.

Â I'm going to do that by copying these lines here.

Â These are basically just finding out the location of the package.

Â So it's the base directory.

Â And then I'm going to paste together that directory to

Â basically different text files to basically pull them out.

Â So if I look at the base directory it's basically finds the base directory on my

Â computer and then the snip file is going to basically paste that together

Â with some downstream information that we might need.

Â Just make that txt.

Â And then it's going to say paste those together with no separation between them.

Â And so you can see it creates like a file name that goes straight to the snip.txt.

Â So then I can read those in with read.table.

Â So I'm basically going to read in the expression file with read.table

Â expression.

Â File name, I'm going to tell it that the separation is a tab separator.

Â It's got a header.

Â And it's has a real name that is equal to one.

Â I knew that because I'd already looked at the files in advance, but in general,

Â you might have to check and see if they had a header, or if they have a real name.

Â So, here you go.

Â So now we see the expression data.

Â We can do the same thing with the snips data, we can read it in.

Â Read it in from basically the R package itself, and

Â again this one has a header and it had real names.

Â So, I can do the same thing, so

Â you can see the genotype information is coded as zero one two.

Â The expression measurements are quantitive.

Â I can do the same thing for the covariants here.

Â And so I can look at the covariants which has information on the gender and

Â other information you might care about for the people, for example, their age.

Â Okay, so then, basically the idea behind an eQTL analysis is to do a linear

Â regression relating gene expression information to genotype information.

Â So we could do that one by one.

Â So here's how you'd do that.

Â You could basically take the first genes expression levels

Â by extracting them out from the first row of the expression matrix.

Â And you can extract out the first set of snps from the snps matrix.

Â So now add e1, s1, and now I just want to relate those two things together,

Â you could use a linear model.

Â And so, if you look at that You see that there might be a relationship here between

Â this gene's expression levels and SNIP because it has a relatively

Â reasonable effect size but the p-values too big.

Â So it's probably not highly associated.

Â So then what we can do is we can actually plot the data to see what this looks like.

Â So what I'm going to do is I'm actually just going to plot the expression levels

Â vs the genotype, so here I'm plotting expression level vs a jittered version of

Â the genotype so just so the points don't all land on top of each other and

Â then I'm going to color it by the genotype.

Â So you can see here they're colored by genotype and then

Â I'm going to add an access label so you can see which genotypes they are, and

Â then you can plot the fitted values versus the observed genotypes, and in

Â this case I'm going to color that in dark gray, and see here's the fitted values.

Â You can see that this is the average in the homozygous major the heterozygote,

Â and the homozygous minor In this case I've assumed an additive model

Â since I've just indirectly included the snip.

Â So you can see it's the same difference between the homozygous major and

Â the het, as the het and the homozygous minor allele.

Â So now we can set it up to do this on every single case.

Â because if you wanted to use LM, in this case the data sets are really quite small.

Â The expression data has only got ten genes in it, but in general,

Â you might have tens of thousands or hundreds of thousands of genes or

Â axons that you're testing, and equivalently, millions of snips, and so

Â you're doing billions and billions of regression models.

Â If you do that with LM, it will take a long time to do it, so

Â instead, we can use this matrix eQTL package.

Â So, the first thing we need to do is, we need to set up some parameters.

Â And so the P value threshold is at how statistically significant do you need it

Â a P value to be before it actually saves the output.

Â So it's basically going to throw away everything that just,

Â throw away the calculations for everything that are above the threshold and

Â that'll save a lot of computational time and space.

Â So you should choose this to be as liberal as you need to catch everything that

Â you think might potentially be interesting.

Â Even if It doesn't have to be just the ones that are at the most extreme

Â significance levels if you want to look at the broader distribution.

Â The error covariance, if you just set it to the numeric, that's basically saying

Â that they're after I kind of of the genotypes and the covariance.

Â There is essentially an independence error model for

Â the gene expression variation which is usually the most common assumption.

Â And then it tells you which model,

Â you have to tell it which model to use, in this case.

Â It's going to say, use the linear model, so just use the standard

Â additive linear model rather than trying to do a dominant and recessive model.

Â Okay, so then once you've got those general parameters set up,

Â you have to set up the files, matrix CTTO Because it's, it is very, very fast.

Â But to do that it has to sort of be organized about how it's going to

Â analyze the data.

Â And so here we have to create a new sliced data object, both the snips and

Â the gene expression information.

Â And so to do that we have to tell it what the file delimiter is.

Â So this is basically the file you're reading in.

Â If it's got Tab separated.

Â Then you give it a file delimiter of tab.

Â What to miss like if it's at what's the definition of a missing value?

Â How many rows and values to skip.

Â So in this case the data set had both a column

Â set of column names and a set of row names, so we want to skip one line each

Â because it's just going to test the values that are numeric.

Â And then the file slice size tells you something about,

Â it's going to basically break the files up into chunks so that it can be computed

Â on more easily in R, and so this tells you about how big the chunks are going to be.

Â Sizes of 2000.

Â The bigger the chunks, probably, the faster I can compute, but

Â the slower it is because it has to load more data in, so there's a balance there.

Â 2000 seems like a reasonable compromise here.

Â In this case, there's no need to chunk it up, because the data sets are so

Â small, and so then I just load the file name in.

Â Using all those parameters I just defined.

Â I do something very similar for the D expression data,

Â to create another sliced data object.

Â So here it's 15 snips and the ten genes and in the case I'm not going adjust for

Â any covariance, you can do the same thing with the covariance but I'm just

Â going to say It's an empty object, so we're not going to do any adjustment.

Â So then here's the main command for running a matrix_EQTL, so matrix_EQTL has,

Â gotta be careful about the capitalizing, capital M And lower case e capital QTL.

Â Then you pass it the snips slice data object, the gene slice data object,

Â the covariance slice data object.

Â If you want to,

Â you can have it directly output the output to a text file rather than output into r.

Â This can be useful because sometimes these computations,

Â especially if you have many genes or many snips can take a long time to run.

Â And so you might just want to get this running in the background and

Â then come back to it and have it output the results.

Â The p value threshold, what model to use, the error code variance.

Â In this case we want to be able to look at the p value distribution for

Â all the different snips.

Â This will slow it down a little bit and require a little more memory,

Â but it allows you to do a better analysis,

Â deeper analysis, so I suggest always lead P value equals true, and

Â then you can tell it whether to calculate the false discovery rate or not.

Â So this case, it ran very quickly, there's hardly any steps in genes, but

Â it's still incredibly fast even if you do many analysis.

Â And so the first thing you can do is you can sort of look at the object

Â that gets created.

Â And it saved the p-values for all 150 tests.

Â So there's 150 tests because there are ten genes and 15 snips.

Â And so, it's basically testing every possible combination of snips and genes.

Â So that's 150.

Â 10 times 15.

Â And these are the p-values from all of them.

Â Here, they're relatively flat so you see that there's probably not much

Â statistical significance, but also maybe there's no artifacts.

Â The ME object that comes out has several components to it.

Â It tells you how long it took to run and it tells you something about

Â the parameters that you use to run the analysis, so you can save those as well.

Â And then the all component tells you the number of EQTLs it calculated.

Â How many tests it did.

Â So, if I look at the number of tests,

Â it calculated 150 tests just like we talked about.

Â And then you can look at the number of EQTLs it found.

Â So, that's the number that passed that significance threshold.

Â And then, the next step is looking at the eQTLs themselves.

Â Here it is. It tells you which SNP, which gene,

Â what the statistic was, what the p value was, the false discovery rate, and

Â the estimate of the association statistic.

Â So the next step with this is to go back and check.

Â In general you gotta check to make sure that there are not any artifacts, to make

Â sure that the plots look reasonable so you can go back and plot the ETTLs after

Â you've done this and make sure that they really do appear to be associated.

Â But this is the first step in doing ETTL analysis is actually just doing

Â all the possible comparisons between all SNIPs and all genes.

Â Coursera provides universal access to the worldâ€™s best education,
partnering with top universities and organizations to offer courses online.