0:01

Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods".

I am Ge Gao from the Center for Bioinformatics, Peking University.

Let's continue this week's topic.

In the last unit, we introduced the deep sequencing technology,

its highlights of performance,and typical applications in biology studies.

In this unit, we will start with reads mapping and introduce the common bioinformatics methods

applied in deep sequencing data analysis.

As the name suggests, reads mapping refers to mapping DNA fragments

produced by sequencing (i.e. reads) to the genome.

Reads mapping has overcome the technical difficulties

resulting from the exceedingly short reads produced by deep sequencing.

In addition, it facilitates the use of genome coordinates as a bridge to integrate the sequencing data

with annotations produced from previous studies.

Reads mapping is often the first step of deep sequencing data analysis.

Its quality and speed will directly affect the following-up analyses.

Essentially, reads mapping is still a pairwise sequence alignment.

But the features of deep sequencing technology make it very different from the classical sequence alignment

which we’ve introduced before.

The first difference lies in the sequence length.

In the Needleman-Wunsch algorithm and Smith-Waterman algorithm, we assume that the two sequences are of similar lengths.

However, in reads mapping, the lengths of two sequences have a difference by orders of magnitude.

The length of reads is often no more than 100bp,

while the length of a reference genome is often hundreds of Mb.

The second difference is the data size.

Reads generated by deep sequencing can have a size of hundreds of Gb.

This will amount to the size of dozens of individual human genomes.

The third difference is the data quality.

In the pairwise sequence alignment, we usually assume that the input sequence has no errors.

However, the quality of the reads produced by deep sequencing is unevenly distributed,

and the reads are subject to a higher error rate.

Compared with the pairwise sequence alignment, the two sequences are different in the sense of their “roles”.

Reads can be considered completely embedded into the genome.

In fact, we need a hybrid alignment mixed with global alignment and local alignment.

For reads, it is global. For genome, it is local.

Therefore, we come up with a straightforward solution.

We just need to use the Hidden Markov Models mentioned with some modifications.

For example, we can reduce the probability of the transition to the gap state on short sequences (i.e. reads).

We can even remove such states.

This method can avoid a very big penalty on short reads,

However, the quality of the final alignment will also be affected

Besides, the computational overhead is very big.

We can consider the fact that, despite those reads coming from genes with multiple copies,

a given read often comes from (few) specific locations of the genome.

This implies that most attempts from other locations will be futile,wasting a lot of time

Therefore, we need to refer to the basic ideas behind BLAST.

We will use the seedling-and-extending strategy.

Specifically, we first locate the reads quickly by building indices for the genome.

Then we use standard dynamic programming to construct the final alignment.

The “index” is essentially a grouping of data.

The original data can be divided into several smaller groups

by applying index function to the keys in the original data.

The original data can be divided into several smaller groups by applying index function to the keys in the original data.

Restricting the search of a certain key within only one group can reduce the search space and thus the search time.

A good index function should try to balance the sizes of the data groups.

Such function can reduce the average search time better.

Hash is a very common indexing method.

The index address is set as the value of hash function for the corresponding key in the original data.

The expected time complexity for searching is constant, i.e. O(1).

For example, given a genome sequence, we can divide it into several seed fragments as BLAST does.

Then we compute for each seed the value of hash function as its address in the index table.

We will record in the address table this sequence and its locations in the genome.

In this way, we can quickly locate any seed in the genome

when it is found in the read.

To improve the sensitivity of reads mapping, it is often allowed to have some mismatches.

We can take advantage of the pigeon-hole principle

to split the read into non-overlapping blocks.

For example, we split a read into three non-overlapping blocks.

We can ensure that, with two mismatches,

there is at least one block that can be used as a seed matching perfectly to the genome.

This block can thus be used to search the index table to improve the speed of reads mapping considerably.

That's why many early reads mapping algorithms (e.g. ELAND, SOAP1, and MAQ) have adopted this idea.

However, it is obvious that allowing more mismatches when searching will lead to more and smaller blocks.

This will reduce the performance considerably.

Therefore, since 2009 the prefix tree and the suffix tree

commonly employed in data compression have been used to build indices by reads mapping softwares.

These data structures reduce the memory cost by combining shared sub-strings.

In this way, the search for the longest common sub-string becomes easier to implement.

For example, the commonly used BOWTIE, BWA, SOAP3 and other tools

have adopted Burrows-Wheeler transform based on suffix trees.

This transformation can extend the aligned sequence by one base each time.

The efficiency of memory usage and speed of alignment are thus improved considerably.

After the seed hits are found using indices,

we can extend outward from the two endpoints of each seed hit as BLAST does.

Then we can use dynamic programming to get the final alignment.

However, please note that the NGS [reads mapping] differs from searching databases with BLAST

in that the NGS suffers from a high error rate.

We need to consider whether a mismatch is in fact an artefact made by sequencing error.

Therefore, Li Heng and Richard Durbin proposed in 2008 the concept of Mapping Quality to handle this problem.

Specifically, let's first assume that all the mismatches observed are sequencing error artifacts.

The probability of this assumption being true will thus be

the multiplication of the probability of each single mismatch being made by sequencing error.

We define SQ as the logarithm transformation of this probability.

Then we can get for each alignment position an E value, the probability of erroneously mapping to that position

Please note that this E probability has taken into account both how the sequences are similar

and the quality of sequencing results.

Therefore, in practice the Mapping Quality, rather than the sequence alignment score,

issued to screen out expected locations of reads mapping.

After mapping the reads correctly to the reference genome, we can call the genetic variants.

Judging by the size of genetic variants, we can classify them into SNVs

(single nucleotide variants) which affect single base,

and SVs (structural variants) involving multiple bases.

The SNV can be further divided into substitution (which is just the SNP) and indel.

Structural variants can also be further divided into large scale insertions/deletions,

inversions, translocations, and CNVs(copy number variations), etc.

SNP is the most studied genetic variants in current genetic analyses.

Thus we will also introduce the basic ideas and methods

of calling genetic variants by the following example of calling SNPs

We will make some clarifications here to make it easier to interpret later.

Although the two are done together in practice,

SNP calling differs (in the strict sense) from genotyping,

as SNP calling will only infer which loci are occupied with variants,

while genotyping will start from SNP calling to further identify the genotypes and homozygosity of these loci.

So,

First, assuming the reads have been correctly mapped, we can check reads mapped to specific loci.

We can use the rule of thumb to do SNP calling and genotyping simultaneously by counting reads.

For example, after removing bases with unreliable qualities,

we can roughly treat the loci with alternative allele frequency less than 20% or more than 80% as homozygous loci.

Other loci with alternative allele frequency between 20% and 80% will be treated as heterozygous loci.

This rule of thumb performs, in fact, very good at loci that are deeply sequenced.

However the performance is poor at heterozygous loci and loci that are not deeply sequenced.

Also, this naive method strongly depends on the parameters

(e.g. the 20% and 80% mentioned before) chosen subjectively.

It is impossible to give objective criteria for the reliability of the result.

Therefore, in practice SNP calling and genotyping models that are probabilistic are used more.

We will illustrate that with a probabilistic model that is the most simple.

For a diploid genome, there can only be three possible genotypes for a locus: AA, Aa, and aa.

Assume that we observed at this locus k reads supporting the big A and (n-k) reads supporting the small a.

If the true genotype is AA, then the (n-k) reads supporting the small a must arise from sequencing error.

The corresponding probability is the multiplication of the probability of

a read being sequenced with error at this loci over all these (n-k) reads.

Likewise, if the true genotype is aa, then the corresponding probability is the multiplication of

the probability of a read being sequenced with error at this loci over all the k reads supproting the big A.

Finally, we can deduce from the property of probability that

the probability of genotyping Aa being true

is 1 subtracted by the sum of the previous two probabilities.

We can even improve the accuracy of genotyping further by using Bayes Formula

to calculate those probabilities as posterior probabilities

if we can obtain the prior probability of each genotype occurring in populations by some background knowledge.

Of course, genotyping models used in practice need to consider much more factors than this toy model needs.

Nevertheless the basic ideas behind them are the same.

Our TAs have filmed a special video for the commonly used

reads mapping tool BWA and the genotyping software GATK.

This video will explain the basic ideas and usages of these tools,

and is uploaded as a supplementary video.

Thank you, and that's all for this week.See you next week!