So first, we need to get data.
The data's already available in a data library inside Galaxy.
And so what we're going to do is go to Galaxy and find a data library for
the Illumina iDEA datasets.
This is some RNA-seq data that was released by Illumina a couple of years ago.
And I want to get the BT20 paired-end
RNA-seq subsampled data set, and we're just going to use one end of the data.
So, if we go over to Galaxy, I'm already logged in,
so you should go ahead and get logged in and get a new history.
In Galaxy, remember you can get a new history by clicking
the gear icon at the top of the History panel, and saying Create New.
And then we want to go to our data library.
So, go to Shared Data > Data Libraries.
We can use the search here.
We'll just search for Illumina.
And we have a data library here, Illumina iDEA datasets sub-sampled.
That's the one we want.
And the very first one here is the one we want, BT20 paired-end
RNA-seq subsampled, end one.
And so, if you click on the button, you'll get a menu, and
you can say Import This Data Set.
And it should select your current history by default.
And so go ahead and say Import.
Now you can click on Analyze Data, and
you'll be back at the analysis interface, but now,
in your history, we have this paired end FASTQ data set.
And so, if you want to actually look at this data, you can start to get a feel for
what this format is like.
So, if you click the eye icon for the data set,
the first megabyte of that data set will load inside the main panel.
And so you can see that this is a data format that is based on having
records of four lines each.
Each record begins with an @ symbol and then an identifier, and then that's
followed by sequence data, then the plus sign and an encoded quality score.
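Just to make that four-line layout concrete, here's a minimal sketch in Python of how you might read records like these. The function name read_fastq and the example record are made up for illustration; this isn't how Galaxy itself parses the file, and it assumes every record is exactly four lines with no wrapped sequence lines.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (identifier, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input (a blank line also stops this simple reader)
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        # Line 1 must start with "@" and line 3 with "+".
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line FASTQ record: " + header)
        # There is one quality character per base.
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, quality


# Hypothetical record, not taken from the BT20 data set.
example = "@read_1\nGATTACA\n+\nIIIIHHG\n"
records = list(read_fastq(StringIO(example)))
print(records)
```

The key thing to notice is that the quality line always has exactly as many characters as the sequence line, one encoded score per base.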
So, take another look over here at our key for the FASTQ format.
So, this is a record similar to the one we were just looking at.
You can see the header here is very different.
The header is essentially arbitrary.
There's some information about the sequencing run, like
what kind of instrument was used for sequencing, but
then you have your sequence data and your encoded quality score.
And there are keys available for
how the quality score is actually mapped into these ASCII characters.
There are actually a few different variations of this format,
depending on how they actually encode a range of qualities onto
the range of available characters.
These days, almost all FASTQ files that are being generated are going to
use this Sanger encoding.
But you may encounter other encodings if you're looking at older FASTQ data.
And so you want to be aware of how these qualities are encoded because it matters.
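So here's a small sketch, in Python, of why the encoding matters. With the Sanger encoding, each quality character is the Phred score plus 33 as an ASCII code, while some older Illumina files (roughly pipeline versions 1.3 through 1.7) used an offset of 64 instead. The function names here are just illustrative, and the sketch doesn't handle the even older Solexa scoring scheme.

```python
def decode_qualities(quality_string, offset=33):
    """Turn a FASTQ quality string into a list of Phred scores.

    offset=33 is the Sanger encoding; offset=64 is an older Illumina variant.
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-q / 10)


print(decode_qualities("II5+!"))          # Sanger: [40, 40, 20, 10, 0]
print(decode_qualities("h", offset=64))   # older Illumina offset: [40]
print(error_probability(40))              # about 1 error in 10,000 base calls
```

If you decode a file with the wrong offset, every score is off by 31, which is why you need to know which variant you're looking at before you trim or filter.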
Okay. So now what we want to do is actually do
some assessment of the quality of this sequence data.
And the tool we're going to use to do this is something called FastQC.
This is a fairly complete package that will take a FASTQ file and will give you
a number of different quality metrics and plots for the data there.
And so if you click on FastQC, it'll load up in the main panel here, and
we just want to run it on this single data set.
So make sure that BT20 paired-end is selected.
And there are other options available, but
we don't need them at this point, so we can just go ahead and say Execute.
And now we have two new datasets.
We have a report, which is going to be an HTML page that we can read through and
we have some additional data.
It's going to take a few seconds.
Okay, so that should take about a minute, and
once it completes, data set two here is an HTML-formatted report.
So again, we can just click the eye icon to view it in the main window.
And so what we get is broken down into a number of different categories,
starting with basic statistics and then various other metrics that we can look at.
And for each of these,
we have some summary information and potentially, plots.
And so, the first one here is our per base sequence quality.
And so this is looking along the reads from beginning to end and
asking what the quality of each position is on average,
and showing you a box plot, summarizing the quality.
These reads are quite good.
One thing that you'll see, since this is Illumina data, is that there's a tendency for
the quality to decrease slightly as you go farther along the read.
But here, we have pretty much a median around Q40 which is quite good.
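To give you a rough idea of what's behind that plot, here's a sketch in Python that computes the median quality at each position along the reads. This is only an illustration under the assumption of Sanger-encoded qualities; FastQC's own box plot shows more than the median, and the function name and the three quality strings are made up.

```python
from statistics import median


def per_position_medians(quality_strings, offset=33):
    """Median Phred score at each read position, across all reads."""
    longest = max(len(q) for q in quality_strings)
    medians = []
    for pos in range(longest):
        # Only reads that are long enough contribute to this position.
        scores = [ord(q[pos]) - offset for q in quality_strings if len(q) > pos]
        medians.append(median(scores))
    return medians


# Hypothetical quality strings where quality drops toward the end of the read.
quals = ["IIIH5", "IIIII", "IIH5+"]
medians = per_position_medians(quals)
print(medians)  # [40, 40, 40, 39, 20]
```

You can see the same pattern in the toy data that we talked about for Illumina reads: the median stays around Q40 and then falls off at the last position.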
So, the data looks good.
There are some breakdowns based on where the sequenced data came from on the flow cell.
We have a summary of overall quality per read.
Now, there are some things here.
One, this is the per base sequence content.
And you see here, there's some variability at the beginning of the read.
Now, one of the things we have to remember is that we're working with sample data here that's
been sub-sampled, and so the overall complexity is lower.
And we don't have any overrepresented
sequences in this sample, so that's good.
However, if we look at the k-mer content,
there are some k-mers that are more highly represented at the beginning of the reads.
But overall, this FastQC report gives us a way that we can look for
any problems that we have in these reads.
What do we do once we've actually inspected some reads, if we see problems?
Galaxy is going to give you a lot of ways that you can filter or
trim reads to deal with regions of low quality.
There are ways that you can trim a fixed amount from just the ends of the reads.
There are ways that you can look for a window with a certain level of quality.
We can filter entire reads out.
So, just to give you an example of some of these,
we have our Trim Sequences Tool.
So, this allows us to say okay, the first base I want to keep is, say one,
the last base I want to keep is say, 21.
And this will just extract the first 21 base pairs of every read.
So if you had a data set for example, where quality dropped off
a lot at the ends of the reads, you might want to trim it.
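As a sketch of what that kind of fixed trimming amounts to, here's a few lines of Python. The function name trim_record is hypothetical and this isn't the Galaxy tool's code; it just assumes the first and last positions are 1-based and inclusive, like in the example we just set up.

```python
def trim_record(sequence, quality, first=1, last=21):
    """Keep bases first..last (1-based, inclusive) of a read and its qualities."""
    # The sequence and quality strings must be cut identically
    # so they stay the same length.
    return sequence[first - 1:last], quality[first - 1:last]


# Hypothetical 40-base read.
seq = "ACGT" * 10
qual = "I" * 30 + "+" * 10
trimmed_seq, trimmed_qual = trim_record(seq, qual, first=1, last=21)
print(len(trimmed_seq), trimmed_seq)
```

Every read comes out the same length here, which is one of the reasons you might prefer this over the more adaptive approaches.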
And we also have things like Select High Quality Segments.
This will actually scan along each read and look for
a region of a certain length that has a quality score of a certain level.
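Here's one way you could sketch that idea in Python. To be clear, this is not the algorithm the Galaxy tool uses; it's a simplified stand-in that finds the longest run of consecutive bases whose score is at or above a threshold, and only reports it if it's long enough. The function name, the defaults, and the example string are all assumptions, and it assumes Sanger-encoded qualities.

```python
def longest_good_segment(quality, min_q=20, min_len=5, offset=33):
    """Return (start, end) of the longest run with every score >= min_q.

    Coordinates are 0-based and half-open. Returns None if the longest
    run is shorter than min_len.
    """
    scores = [ord(ch) - offset for ch in quality]
    best = (0, 0)
    start = 0  # start of the current run of good bases
    for i, q in enumerate(scores):
        if q < min_q:
            start = i + 1  # a low-quality base breaks the run
        elif i + 1 - start > best[1] - best[0]:
            best = (start, i + 1)
    if best[1] - best[0] < min_len:
        return None
    return best


# Hypothetical read: two poor bases, six good ones, one poor, two good.
segment = longest_good_segment("++IIIIII+II")
print(segment)  # (2, 8)
```

Notice that the segment you get back can be a different length for every read, so the output of this kind of approach is variable-length reads.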
We can perform various other manipulations.
So for example, this is the Manipulate FASTQ tool, which allows you to look at
various attributes of your read (sequence content, quality score, et cetera)
to determine if a read is going to be kept.
There's the quality trimmer, again for FASTQ data, and so on.
So, the question of course is, what should you actually use for your data?
And the answer is simply that it's going to depend
on the downstream analysis that you want to do.
And so, here are a couple of different options.
So, one, we can do column trimming.
We can also do filtering where we're keeping or discarding whole reads.
This could potentially be costly if you have to filter out a large number of
reads entirely.
But, if you have an analysis that's going to be very sensitive
to having low quality regions, this may be the best option.
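Just as an illustration of whole-read filtering, here's a Python sketch that keeps a read only if a large enough fraction of its bases meet a quality cutoff. The name keep_read, the cutoff of Q20, and the 90 percent fraction are assumptions picked for the example, not recommended settings, and it assumes Sanger-encoded qualities.

```python
def keep_read(quality, min_q=20, min_fraction=0.9, offset=33):
    """True if at least min_fraction of the bases have a score >= min_q."""
    scores = [ord(ch) - offset for ch in quality]
    if not scores:
        return False  # an empty read is discarded
    good = sum(1 for q in scores if q >= min_q)
    return good / len(scores) >= min_fraction


# Hypothetical quality strings.
reads = {"all_good": "IIIIIIIIII", "half_bad": "IIIII+++++"}
kept = [name for name, qual in reads.items() if keep_read(qual)]
print(kept)  # ['all_good']
```

The read that fails is dropped entirely, including its good bases, and that's the cost we were just talking about.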
And again, there's the sliding window approach.
The problem with this is that it produces variable-length reads, and
there are some downstream analyses where you don't want
to have reads that are of variable length.
We really need to think about the downstream analysis
that we're going to do.
And so, this is a case where taking advantage of
the extensive documentation for
different analyses and different pipelines is going to be very helpful to you.
And so, we can suggest sources like SEQanswers and Biostars as a way to
get more information on how exactly you want to process the data.
All right, so, to summarize,
there are many factors that can affect the quality of DNA sequencing data.
FastQC is one tool that allows us to evaluate different quality metrics for
FASTQ data sets from different sequencing platforms.
Galaxy includes a large variety of FASTQ manipulation tools.
And so, these can help you if you have low quality sequence data to salvage as much
of your data as possible.
But when filtering and trimming, you really need to think about the downstream
analysis that you're going to do and use that in selecting the right choices.