I’m honored to present our work here. I’m nervous but very exciting .

Our work is mainly about two papers: Needleman-Wunsch in 1970 and Smith-Waterman in 1981.

We primarily compared details of the original algorithms to the current ones and there might be some mistakes.

So I’d like to hear your advice after the speech.

This is the content of the speech.

Firstly, I’ll review the background.

Secondly, I’ll introduce the Needleman-Wunsch global alignment , its scoring function , gap penalty and the significance of a given alignment.

Thirdly, I’ll introduce the local alignment of Smith-Waterman. Finally I’ll make a comparison between the two algorithms.

Let’s start with the background.

The first question is why we do alignment. Then we see this saying which was cited heaps of times.

The goal of aligning two sequences is ,

to infer their functional or evolutional relationship, which provides conservative information and new ideas for research.

The second question is why we use computer to do sequence alignment.

The answer is that it is too slow to do alignment by hand and the computer had alrea dy been applied in scientific research at that time.

In addition, computer is efficient at multiple sequence alignment.

Also, I think most of you have tried aligning long sequences in NCBI which is also more efficient .

And the computer normalizes the process and makes it repeatable.

Now, the sequence alignment has also become routine especially after the development of PAM scoring matrix.

The third question is why we use dynamic programming.

The problem of sequence alignment itself has two features:

it could be divided into similar sub-problems, and a minimal substructure can be derived after optimization.

These two features apply to the requirement of using dynamic programming [efficiently].

Here is the scoring matrix from Needleman's original paper which is published in 1970.

Firstly we can see it is a two-dimension matrix.

The x and y axis are two sequence we are going to do an alignment respectively.

You may feel strange but in fact these abbreviations of Amino acid have been deprecated now.

As mentioned before, the scoring function was quite simple. They assigned 1 to the identical amino acid pairs and 0 to the non-identical pairs.

After initialization by such a scoring function, we can get this matrix and check at each step the score of cells [to be determined at that step]

The paper concluded that it was the simplest way to get the total number of matched amino acid.

We ignore all the zeroes in the matrix for convenience Then we will fill the matrix with the recursive formula in the paper.

I will explain why we use this formula until we get more cell values

There are some difference between this formula and Mr. Gao metioned in the video.

For example, if I reach this place, what shall I calculate next?

The paper said that by searching the row and the column which is highlighted in yellow, we got a maximun score in them.

So we need to search for the maximal score in the region highlighted in yellow, which is 4

By adding it to the original score of the cell, we can get the final score of the cell, which is 5.

We do the recursion from the bottom right to the top left. In this way, every score is the optimal score of the matrix.

Therefore, when we finish filling in the matrix, we will get the optimal alignment in these two sides, which is 8 here.

So do you find it different from what we’ve learnt in last class?

There is no initial gap penalty in this algorithm, right?

Now we only need to trace back from the endpoint. The score of the optimal alignment is eight.

It comes from the largest number in its nearby highlighted cells, which is seven.

So we can trace back by searching for the largest number in the lower-right region of each cell in the backtracking path.

If there are two largest numbers, trace them separately. In this way, we’ll trace back to the lower-right corner.

So when should we stop trace back?

The original paper said it would stop in the last row or the last column.

Why?

Because the sequence alignment will begin at a place that has the same amino acid and will stop when all the amino acids have been aligned.

We just need to ensure that it stops in the rightmost column or downmost row. Then we can get a complete alignment for the two sequences.

Actually, I think that it implies some of the basic idea of local alignment, which I will illustrate later.

Here we’ll introduce the scoring function，scoring matrix and gap penalty of Needleman-Wunsch Algorithm.

Since this paper was finished published at in 1970, before PAM 250 was introduced at 1978.

At that time, they’ve had no idea about the scoring function and scoring matrix.

Although in 1969 Dayhoff mentioned a similar concept.

The paper assigned 1 to the identical sequences, and clarified nonidentical sequences based on the codon variation.

Why they did this?

This was because at that time already knew all the amino acids are composed of three codons,

so he was looking for all possible amino acid substitutions

Both (amino acids) codons, each with three bases.

A possilbe outcome is that two corresponding bases are in codon .They think a mutation happened.

Another possible outcomes are that one correspongding in condon and zero corresponding base in condon .

The situation of zero corresponding base is unlikely to happen It is a rare event.So the homology is worst.

Then one corresponding base is OK, and two corresponding bases o are better.

So this paper recommended this method to give the penalty points.

The paper also said at that time, lots of attributes of protein ,as Professor Gao had mentioned ,can also be used for giving penalty points.

I can give many examples such as hydrophilicity, hydrophobicity, acid-base properties and a class of things, the location of the protein,

the protein sequence , position, the relationship between the longitudinal position and so on.

So the scoring function should be a complex function.

About gap penalty, this paper recommended a linear penalty, which means more gaps more penalties.

So pooling what I have mentioned, we get a general form of Needleman-Wunsch.

Here we can noticed that the first two lines is initialized with the penalty,

These two lines are gap penalties.

S (Ai, Bj) is the scoring function.

It seems that there is no obvious differences, except in the first step with a gap penalty of two lines.

In this situation, we need to scan a column and a row each time.

But in fact it is no need to scan so many times. Because the time complexity here is O (n3).

If we just scan its following three cells, The time complexity is O (n2), this will save a lot of time.

Here is an example

This is a nucleotide sequence, with no penalty we can see that such a result can be obtained.

Using the methods we mentioned earlier, this result can be obtained. Since the results are manually filled, so there may be errors.

I marked some of traces.We can see the results from two upper left 8. A bunch of complex traces here.

then you have to find out all the traces, and then pick the best out. Actually It is a very troublesome job.

Also we did a simulation.

Then there are seq 1 and seq 2. I swapped them and do the alignment again .However, the result is not the same by using the same program.

The first result gives five gaps, and the second result gives six gaps. They are both optimal, because the score is the same.

However the result is unstable depends on the order we inputed.

So in this case, if I give a penalty to them , then you can see the two results become to be identical.

Although two results are identical randomly, it happens to be selected from many different results.

At least you can say that giving penalty makes stability better and the given paths will be less.

Then I will talk about the significance evaluation.

An important question is that how to determine two sequence are real matched but not just a random event.

What if two random sequences matched or give similar results or scores?

The paper gives two significance test methods.

One of the method is comparing real-sequence alignment results with the distribution of random-sequence pair alignment results.

If the former result is significantly different from the distribution, there should be a real relationship .

The other method is comparing one of two sequence we aligned with many real-sequences selected randomly.

Then we get a score distribution and do a significant test in the use of our alignment score.

OK,now we com to local alignment.

The original algorithm in the 1980 paper is presented here.

Read the formula carefully you will find that two extra maximum functions and k ranges

which is different from the Smith-Waterman algorithm we use today. I will discuss it later.

This is a figure, which I think properly explained the local alignment algorithm.

In the figure we can find they compared the value in the whole row,column,diagonal cell and the zero.

The advantage of Smith-Waterman algorithm is exactly this extra zero,

which made it possible to find a local optimal result by setting the negative cells to zero.

Choose the maximum values from these cells and 0, assign it to the cell, fill the matrix, then we get the original Smith-Waterman method.

The method to find the trace is similiar (with global alignment), but there is an important difference.

The trace start from the cells with highest scores and end up with zero cells.

It is unnecessary to compare all these cells. To assign a cell (i,j), compare values in cell (i-1,j), (i-1,j-1), (i,j-1) and zero is well enough.

This is the Smith-Waterman algorithm we use today

Compared with global alignment (Needleman-Wunsch Algorithm), the Smith-Waterman algorithm has two features as local alignment.

Firstly, the Smith-Waterman algorithm is terminated by zero

Secondly, the mismatches must be negative scored,

otherwise all positive scores will make it difficult to find the optimal pathway from a mass of scores, just as mentioned before

Besides, the Smith-Waterman algorithm do have advantages in aligning highly conserved sequences,

for the algorithm is modified from an early LCS algorithm.

Smith-Waterman algorithm also guarantees an optimal alignment between the two sequences.

But when applied to some perfect aligned substrings separated by poorly aligned ones, the result may not be satisfied.

We can see that by aligning the sequences mentioned above with Smith-Waterman algorithm.

As we can see, the number of gaps reduces to five,

and the alignment terminats where Needleman-Wunsch algorithm said there is a gap at the end with two "A"

because the score in that cell has already reduced to zero.

Here is a simple comparison of these algorithms.

The simplified Needleman-Wunsch algorithm performs properly at sequence alignment.

But with gap penalty considered, the scores of the matrix become ruleless, which results in complex pathways.

Besides, it indicates the leftmost pathway as optimal rather than the expected one.

In contrast,

when applying Smith-Waterman algorithm in sequence alignment, the identical pathway stand out clearly, without any other pathways nearby.

That's why Smith-Waterman performs very well at local alignment.

Finally, thanks for the efforts of all our group members,

our supervisors and most importantly the pioneers who made the significant contributions!

Thank you for attention!