A big welcome to “Bioinformatics: Introduction and Methods” from Peking University! In this MOOC you will become familiar with the concepts and computational methods of the exciting interdisciplinary field of bioinformatics and their applications in biology. The knowledge and skills in bioinformatics you acquire will help you in your future study and research.
Course materials are available under the CC BY-NC-SA License.

From the lesson

Sequence Alignment

Upon completion of this module, you will be able to: describe dynamic programming based sequence alignment algorithms; differentiate between the Needleman-Wunsch algorithm for global alignment and the Smith-Waterman algorithm for local alignment; examine the principles behind gap penalties and time complexity calculation, which are crucial for applying current bioinformatic tools in your research; and experience the discovery of the Smith-Waterman algorithm with Dr. Michael Waterman himself.

Assistant Professor, Principal Investigator, Center for Bioinformatics, School of Life Sciences

Liping Wei 魏丽萍, Ph.D.

Professor, Director, Center for Bioinformatics, School of Life Sciences

Hello everyone. This is the supplemental video.

The main purpose of this video is to clarify some concepts and to offer some advice for applications.

First, I will clarify the concepts of homology and similarity.

Then, I will talk about similarity matrix, which is often applied as the scoring matrix in sequence alignment.

And in the end, I will introduce dot matrix, which is a visualized sketch for sequence alignment.

In biology, homology means two or more things have a common ancestor.

At the level of gene or sequence, especially in phylogeny studies, homology is sometimes classified into two types: orthology and paralogy.

Orthology indicates that two sequences in different species were actually the same sequence in their common ancestor in the past.

The separation occurred due to a speciation event.

Paralogy usually indicates that two sequences in one species came from the same ancestral sequence;

sometimes paralogy also involves two or more species. The two sequences arose from a (sequence) duplication event.

This is a schematic diagram for orthology and paralogy concepts.

Orthology comes from speciation events,

such as the animal-fungi speciation and human-worm speciation in the diagram.

If the gene is not lost during evolution, these speciation events will leave each descendant with at least one copy of the ancestral gene. For example, HB and WB in the diagram are orthologous.


Paralogy comes from duplication events, especially the duplication in the genome within a species.

For instance, in the diagram, the A-B gene subfamily comes from a duplication event in its ancestral species.

And this gene family underwent another two duplication events in human and one in worm,

which produced paralogs such as HA1-HA3 and WA1-WA2.

Now let's talk about similarity and identity. They are two different concepts.

Let's take amino acids as an example.

They can be divided into several groups based on their chemical properties, such as alkalinity/acidity, hydrophobicity, or aromaticity.

The amino acids within the same group can be regarded as being similar to each other.

On the contrary, identity means a totally identical relationship: A is identical to A, and G is identical to G.
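
Similarity versus identity can be made concrete with a small sketch. The grouping below is one common textbook classification by chemical property (exact group boundaries vary between sources), and the function names are ours:

```python
# Rough amino-acid property groups (one common textbook grouping;
# exact group boundaries vary between sources).
GROUPS = {
    "basic": set("KRH"),
    "acidic": set("DE"),
    "aromatic": set("FWY"),
    "hydrophobic": set("AVLIM"),
}

def identical(a, b):
    """Identity: the two residues are exactly the same letter."""
    return a == b

def similar(a, b):
    """Similarity: the residues are identical or share a property group."""
    return identical(a, b) or any(a in g and b in g for g in GROUPS.values())

print(identical("K", "R"))  # False: K and R are different residues
print(similar("K", "R"))    # True: both are basic
```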

Homology and similarity are closely related in biology.

We call two sequences homologs if they share a common ancestral sequence.

If the evolution time is short, the difference will be small, and the two homologous sequences often show similarity.

On the other hand, if the evolution time is long, there will be more difference, and their similarity will be low, sometimes even very hard to detect.

Homology often brings similarity.

Also, it is easy to obtain sequences and measure their similarity. Therefore, we usually try to infer homology from sequence similarity.

This idea is usually effective and feasible,

but it does not ensure 100% accuracy, as convergent evolution may sometimes bring about similarity as well.

Nevertheless, convergent evolution seems to be rarely observed at the sequence level.

We have clarified the concepts about homology and similarity.

The next question for bioinformatics is how to make a computer search for homology or similarity.

First, it needs a quantitative measure for similarity,

which turns out to be the similarity matrix, or the so-called scoring matrix for alignment.

For nucleotides, there are only 4 types of nucleotides for DNA or RNA.

Therefore, we often use a simple identity matrix as the scoring matrix,

which has the same positive score only on the diagonal [and 0 for the other elements].
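
A minimal sketch of such an identity scoring matrix for DNA (the +1 match score is an arbitrary illustrative choice; real tools often use other values):

```python
# Identity scoring matrix for DNA: one positive score on the diagonal,
# 0 everywhere else (+1 for a match is an arbitrary illustrative choice).
BASES = "ACGT"
identity_matrix = {(x, y): 1 if x == y else 0 for x in BASES for y in BASES}

print(identity_matrix[("A", "A")])  # 1: a match scores positively
print(identity_matrix[("A", "G")])  # 0: any mismatch scores 0
```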

In phylogeny reconstruction, researchers usually use more complicated substitution models

to better describe how the bases change during evolution.

For instance, we generally think the changes within pyrimidines or within purines are more frequent than changes between them.

In other words, transition happens more easily than transversion.
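
This distinction can be sketched as a small classifier (the function name is ours):

```python
# Transitions stay within the purines (A, G) or within the pyrimidines (C, T);
# transversions cross between the two classes.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(x, y):
    if x == y:
        return "no change"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"

print(substitution_type("A", "G"))  # transition (purine -> purine)
print(substitution_type("A", "C"))  # transversion (purine -> pyrimidine)
```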

For amino acids, although their biochemical properties provide some qualitative similarity,

the universally accepted, quantitative similarity matrix comes from the evolutionary idea,

which uses real substitution frequencies observed in known multiple sequence alignments.

In 1978, Dr. Margaret Dayhoff published the PAM matrix.

She manually aligned several groups of sequences with less than 1% sequence difference,

counted the frequency of each type of amino acid change, and constructed a matrix called PAM 1.

Then, she applied ideas and assumptions from Markov chains (which will be introduced in later lectures),

regarded PAM 1 as one transition step in evolution, self-multiplied the PAM 1 matrix, and obtained several scoring matrices,

such as the commonly used PAM 30 and PAM 70, for alignments between sequences with more differences.

Later, in 1992,

Dr. Steven Henikoff and Dr. Jorja Henikoff did similar work based on a much larger set of multiple sequence alignments.

They focused on some conserved segments in the sequences,

and worked out several BLOSUM matrices.

The most commonly used scoring matrix for protein alignment nowadays may be the BLOSUM 62 matrix.

For the PAM matrices, why can we self-multiply PAM 1 to get scoring matrices for aligning sequences with more differences?

This is related to the ideas and assumptions of Markov chains:

we self-multiply the transition probability matrix. The idea is briefly introduced here and may be covered in detail in later lectures.

In this example on the slide, we only consider three types of amino acids, denoted A, B, and C.

If the transition probabilities for one evolutionary step (1% sequence difference) are as shown in the table,

how do we calculate the transition probabilities for two steps?

Think about the transition probability for two steps starting from A and ending at A.

It is actually equal to the summation of the probabilities of starting from A, taking one step to A, B, or C,

and multiplying by the corresponding probabilities from A, B, or C one step back to A, respectively.

Similarly, the transition probability for two steps from A to B

is equal to the summation of the probabilities of starting from A, taking one step to A, B, or C,

and multiplying by the corresponding probabilities from A, B, or C one step to B, respectively.

Students who are familiar with linear algebra may have spotted the rule:

this process of multiplying and summing probabilities is exactly the definition of matrix multiplication.

That's why we say the square of PAM 1 is PAM 2.
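
The two-step rule above can be sketched as ordinary matrix multiplication. The 3x3 numbers below are invented for illustration and are not actual PAM 1 entries:

```python
# Toy transition probability matrix for three states A, B, C
# (each row sums to 1; values are illustrative, not real PAM 1 entries).
P = [
    [0.98, 0.01, 0.01],
    [0.02, 0.97, 0.01],
    [0.01, 0.02, 0.97],
]

def matmul(M, N):
    """Sum over the intermediate state k: exactly the probability
    summation-and-multiplication rule described in the lecture."""
    n = len(M)
    return [[sum(M[i][k] * N[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

P2 = matmul(P, P)            # two evolutionary steps
print(round(P2[0][0], 4))    # 0.9607: A -> (A, B, or C) -> A
print(round(sum(P2[0]), 6))  # 1.0: rows still sum to 1
```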

Also, please note that PAM 2 does not mean 2% sequence difference.

Because reverse mutations can restore a residue, the corresponding sequence difference is less than 2%.

Similarly, we can calculate PAM 30, PAM 70, or PAM 250.

Here is PAM 1 from the original paper, and the approach to calculating PAM 250 using matrix self-multiplication.

The scoring matrix is obtained by taking the log odds of each element of the final transition probability matrix from self-multiplication.
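
As a sketch of that log-odds step (the frequencies below are invented for illustration; actual PAM matrices take base-10 logarithms and then scale and round the result):

```python
import math

# Hypothetical frequencies for one amino-acid pair (not real PAM values):
q_ij = 0.006           # observed probability of i aligned with j in homologs
p_i, p_j = 0.05, 0.06  # background frequencies; chance expects p_i * p_j

# Log-odds score: positive when the pair occurs more often than chance.
score = math.log2(q_ij / (p_i * p_j))
print(round(score, 2))  # 1.0: the pair is twice as frequent as chance
```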

The BLOSUM matrices were constructed in the 1990s.

BLOSUM 62 matrix is constructed from multiple sequence alignments

with identity no more than 62%.

Similarly, BLOSUM 80 is based on alignments with identity no more than 80%,

and BLOSUM 45 is worked out from alignments with no more than 45% identity.

So in real applications, if your sequences are very different, you could try BLOSUM 45.

On the other hand, for the sequences with fewer differences, BLOSUM 80 could be used.

The most commonly used matrix is BLOSUM 62 matrix.

Given the similarity matrix, the next question is how to find out the optimal or suboptimal sequence alignments.

Here, I will mention an intuitive display method called the dot matrix.

Dynamic programming was introduced in the week 2 lecture. In my opinion, its nature is essentially enumeration.

It searches for the global optimal solution recursively by storing the current optimal result at each step.

BLAST will be introduced in the next class. Actually, BLAST also takes advantage of some ideas from the dot matrix.

You can also get a dot matrix in NCBI BLAST's result.

The most basic dot matrix puts the two sequences on the left and the top of the matrix, respectively.

Then the cells corresponding to matched residues are marked with 1; other cells are marked with 0.

Two identical sequences will fill all the main diagonal cells of the matrix with 1.

Two sequences sharing reversed subsequences will leave in the matrix a segment parallel to the counterdiagonal of the matrix.

Gaps and mismatches will make the segments discontinuous.

Real bioinformatics software, such as dottup from the EMBOSS package, allows the user to set parameters such as word size,

making the plotting more meaningful in the biological sense.

The "word size" defines the minimum number of consecutive matching residues needed to be regarded as a match.

The example in the previous slide uses a "word size" of 1.
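
A minimal dot-matrix sketch with a word-size parameter (the function and its details are our illustration, not dottup's actual implementation):

```python
def dot_matrix(seq1, seq2, word_size=1):
    """Mark cell (i, j) with 1 when the words of length `word_size`
    starting at seq1[i] and seq2[j] are identical."""
    m = [[0] * len(seq2) for _ in range(len(seq1))]
    for i in range(len(seq1) - word_size + 1):
        for j in range(len(seq2) - word_size + 1):
            if seq1[i:i + word_size] == seq2[j:j + word_size]:
                m[i][j] = 1
    return m

# Two identical sequences fill the main diagonal (up to the last full word).
m = dot_matrix("GATTACA", "GATTACA", word_size=3)
print([m[i][i] for i in range(7)])  # [1, 1, 1, 1, 1, 0, 0]
```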

The dot matrix might seem similar to the dynamic programming matrix, yet they differ a lot in their content:

the dot matrix only records whether the two words formed by the current few local residues of the two sequences match,

while dynamic programming will consider the optimal alignment (and its score) of previous subsequences [at each recursion].

We can see that a long perfect match in sequence alignment

will leave in the dot matrix a continuous segment parallel to the main diagonal.

The BLAST algorithm has taken advantage of this feature [of the dot matrix] to accelerate itself.

That's all for this supplemental video. I hope everyone has learned something from it. Thank you.