[MUSIC] Hello, everyone. My name is Pimlapas Leekitcharoenphon. I'm a postdoc in the Research Group for Genomic Epidemiology at DTU Food. Before we actually start analyzing next-generation sequencing data, I would like to give you an introduction to next-generation sequencing, or NGS.

Right now, sequencing technology can be divided into next-generation sequencing and third-generation sequencing. The difference between them is that in next-generation sequencing you have to amplify your fragments before you start sequencing, while in third-generation sequencing you sequence single molecules directly, which is why it is also called single-molecule sequencing. These are the machines, or sequencing platforms, that you can use for each technology. But I would like to focus only on next-generation sequencing today.

So, this is an overview of how you do NGS. First of all, you have to do library preparation. The first step is to fragment your DNA. Then you ligate adapters to the DNA fragments. And what is actually inside the adapters? Adapters contain an amplification primer, which is used to amplify the fragments, and a sequencing primer, which is used to sequence them. And if you have more than one sample in one run, you also have to add a barcode. The barcode is a sequence that is specific to each sample, so it allows you to put several samples into one sequencing run. After you have ligated the adapters to your fragments, you amplify the fragments, load them into the sequencing machine, and start sequencing.

Before we go and see what data you get from the sequencer, let me briefly tell you about the standard format for storing DNA information. The standard format is called a FASTA file.
A FASTA record has two parts. The first line is a header that starts with a > sign, like this, followed by any ID you like. After the header comes the sequence data, which can be on one line or on multiple lines. One FASTA file can hold more than one sequence: you start another sequence by starting a new line with a >, then the ID, then the sequence, and so on. This is the standard format for storing sequence data, but it is not the format that you get directly from the sequencer.

So what do you get from the sequencer? You have your DNA, you fragment it during library preparation, you sequence it, and what you get are short sequences that we call reads, or raw reads. That is what you get from the sequencing machine with NGS technology. The raw reads are not stored in FASTA format but in FASTQ format. So, what is FASTQ? It is basically FASTA plus quality scores. Instead of two lines as in FASTA, in FASTQ you have four lines for one read, or one sequence. The first line is the header, or ID, of the read. The second line is the DNA sequence. The third line is a separator line that starts with a + sign and can optionally carry extra text. And the last line holds the sequencing quality scores. That's why it is called FASTQ: it contains the DNA information and the quality scores. So, FASTA has two lines per sequence, while FASTQ has four lines including the quality scores, and FASTQ is the format that you get from the sequencing machine.

With NGS, you can sequence a fragment in one direction, which gives what we call single-end reads, or in two directions, which gives paired-end reads. With paired-end reads, you can have a gap in the middle of the fragment that is not covered by either read. And there is another term for paired-end reads that we call insert size.
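Before turning to insert size, the four-line FASTQ layout described above can be sketched in Python. This is only a minimal illustration and not part of the lecture: the function name `read_fastq` and the example read are made up, and the sketch assumes that header lines start with `@` and that quality scores use the Phred+33 encoding (each character's ASCII code minus 33), which is common but not universal.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality_scores) from a 4-line-per-read FASTQ stream.

    Assumption: qualities are Phred+33 encoded (ASCII code minus 33).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        if not header.startswith("@"):
            raise ValueError(f"expected a header starting with '@', got: {header!r}")
        sequence = handle.readline().rstrip("\n")  # line 2: DNA sequence
        handle.readline()                          # line 3: '+' separator line, ignored
        quality = handle.readline().rstrip("\n")   # line 4: one quality character per base
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


# Hypothetical single read: four good bases followed by an N with a low score.
example = io.StringIO("@read1\nACGTN\n+\nIIII#\n")
records = list(read_fastq(example))
for read_id, sequence, scores in records:
    print(read_id, sequence, scores)
```

In practice an established parser from a bioinformatics library would normally be used; the sketch is only meant to make the four lines concrete.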
The insert size is the size of the two reads together with the gap between them, not including the adapters. So again, insert size means the whole fragment without the adapters: the reads plus the gap. The insert size varies with the sequencing technology; some produce very small insert sizes and some produce very large ones.

So, now you know what the data from the sequencing machine look like. What do you do next with the FASTQ data, the raw reads? As I told you, you normally put more than one sample into one run, so the output reads contain barcodes that are specific to each sample. So the first step is de-barcoding: you split the raw reads by barcode. Normally this has already been done by the sequencing machine.

The next step is to trim the reads, because some reads have low quality or contain N, which is an undetermined base. You trim away the reads, or the parts of reads, that have low quality or contain N, like this, so that only good-quality reads go into the next analysis.

So, how do you know if you have good-quality raw reads? There are some parameters you can look at. The first parameter is called coverage. Coverage is the average number of times the raw reads cover each position in your genome. Remember, you amplify the fragments before sequencing, so if your reads have been sequenced with good quality, you're going to have high coverage. How do you calculate the coverage? If you know the number of reads, the read length, and the genome size, you can calculate it: the number of reads times the read length, divided by the genome size. For example, my raw data contain 5 million reads, the read length is 100 bp, and the size of my genome is 5 Mbp. In this case, the coverage is 100X.
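The coverage arithmetic from this example can be written out as a short Python sketch. The function name `coverage` is only illustrative; the formula is the one described above, number of reads times read length divided by genome size.

```python
def coverage(num_reads, read_length_bp, genome_size_bp):
    """Average coverage: total sequenced bases divided by the genome size."""
    return num_reads * read_length_bp / genome_size_bp


# Numbers from the example: 5 million reads of 100 bp on a 5 Mbp genome.
example_coverage = coverage(num_reads=5_000_000,
                            read_length_bp=100,
                            genome_size_bp=5_000_000)
print(f"{example_coverage:.0f}X")
```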
A coverage of 100X means that, on average, 100 reads cover each position in your genome, which is very high.

Another parameter is similar to coverage, but we call it breadth-of-coverage. You divide the assembly size by the target genome size. I will tell you more about assembly later. The idea is that you have raw reads and you try to assemble them into something we call contigs. Then you sum up all the base pairs in your assembly, and that gives you the assembly size. For example, when I assembled my genome I got 4.9 million base pairs, but my target genome is 5 million base pairs. In this case, the breadth-of-coverage is 0.98, nearly one. One is the perfect case, meaning that you have sequenced all of your genome.

Another parameter is called depth. The depth is the number of reads that cover a particular nucleotide position in the genome. You count the reads that cover that particular site, and that gives you the depth.

Okay, so after you have split the barcodes and trimmed the low-quality reads, there are two ways to go for the next step, depending on your objective. The first is de novo assembly: because the raw reads are very short, you try to merge them to make longer sequences. If you do that without a reference genome, we call it de novo assembly. If you do it with a reference genome, we call it mapping to a reference genome. Which one you choose depends on your objective. You normally make longer DNA sequences by de novo assembly when you want to find genes or any genetic markers that you're interested in. But you normally map the reads to a reference genome when you want to identify or detect point mutations, like SNPs for example. To do either of these things, assembly or mapping the reads to a reference genome, we don't normally work by hand. Of course, we use programs, or software.
Most of the software tools for assembly or for mapping reads to a reference genome run on Unix or Linux, where you need to use the command line. Only a few of them are available for Windows with a graphical user interface. But as I told you, a lot of the tools are on Unix, and it looks like this. Of course, for a biologist or microbiologist, or anyone without much experience with computers, the Unix command line is complicated, and any tool that you have to run through Unix is going to be difficult to use. That's why we in our group make web-based tools for you to use. Underneath, most of our tools still run on something like this, but we put a web interface on top, like a mask, to cover all of it. So you're not going to see a screen like that; you're going to see a web-based tool where you just click, upload your data, and get the output.

The good thing about web-based tools is, of course, that they are platform independent. You can run them on Windows, on Mac, on whatever platform you have on your computer. They also require very few computer resources, because you only send your data to the web tool; all the computational tasks run on the server, and only the output is sent back to your computer. So you don't need to use your own computer's resources for the analysis. And it can be done anywhere: at home, at work, in any country, as long as you have an Internet connection. The disadvantage is that it requires a little bit of patience. It depends on how busy the server is; if the server is busy, you will probably have to wait some time until you get the output. You will learn more about the tools in the next sessions.
But just to give you an idea, these are examples of the tools that we have at genomicepidemiology.org, for MLST, resistance genes, SNP calling, and species identification.

Okay, let me come back to de novo assembly, where you try to merge the raw reads into longer sequences without any reference genome. That's why we call it de novo assembly. As for the tools, different sequencers require different assemblers. For example, if you work with a 454 or Ion Torrent machine, you will probably use Newbler as the assembler. If you work with Illumina or SOLiD, you will probably use Velvet. So it differs by sequencing technology. What an assembler such as Velvet actually does is use de Bruijn graph theory to find the overlaps between reads and merge the reads into longer DNA sequences that we call contigs. So when you merge the small reads into larger sequences, we call those contigs, okay? And if you remember, the format for the raw reads is FASTQ. Once you do de novo assembly and assemble them into contigs, the format of the contigs is not going to be FASTQ; it is going to be FASTA, okay? Raw reads, FASTQ; contigs, FASTA. And for mapping, you just map your reads to the reference genome and then you get the consensus sequence.

Once you have the contigs, what are you going to do with them? As I told you, you want to identify genes or any genetic markers that you're looking for, for example housekeeping genes or whatever markers you are interested in. And that brings us to the question: how do I know if I have good contigs or not? I would say good contigs are contigs that are large enough to contain the genes or genetic markers that you're looking for. So, how do I know if my contigs are large enough?
There are several parameters, but two of them are easy to look at. The first parameter is the number of contigs. When you do de novo assembly, you get a lot of contigs, and you can count how many you have. If you have a large number of contigs, your contigs are likely to be short, and that's not good for finding genes. What you want is a small number of contigs. Fewer contigs is better, because it means that you are likely to have longer contigs, and that's good for finding the genes you're looking for.

The other parameter is called N50. Your contigs have different sizes, right? So you sort the contigs by size, from the largest to the smallest, and add up their lengths until you reach half of the total assembly size. The size of the contig at that halfway point is the N50. If you have a high N50, you are likely to have longer contigs. These two parameters are correlated: if you have fewer contigs, your N50 is likely to be high, meaning that you have longer contigs. And that's good for identifying any genetic markers that you want to look for.

Okay, just to sum up what I have told you. First of all, you split the reads by barcode and trim away the low-quality parts of the sequences; that's what we call trimming. Then you do assembly. If you do it without a reference, we call it de novo assembly, and you use it to identify the genes or genetic markers that you're looking for. If you have a reference genome, you map the raw reads to the reference genome, because you want to see the differences between your sample and the reference genome, or you want to identify SNPs. And that's all for the introduction to NGS. Thank you so much. [MUSIC]
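As a follow-up to the N50 discussion, here is a minimal Python sketch of how N50 is commonly computed: sort the contigs from longest to shortest and add up their lengths until at least half of the total assembly size is reached; the length of the contig at that point is the N50. The function name `n50` and the contig lengths are hypothetical and chosen only for illustration.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the assembly size."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length


# Hypothetical assembly of seven contigs (lengths in kbp), 300 kbp in total.
contigs = [80, 70, 50, 40, 30, 20, 10]
example_n50 = n50(contigs)
print("number of contigs:", len(contigs))
print("N50:", example_n50)
```

Here the two longest contigs (80 + 70 = 150) already make up half of the 300 kbp assembly, so the N50 is 70, even though the contig in the middle of the sorted list is only 40.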