Welcome back to Peking University MOOC: "Bioinformatics: Introduction and Methods".

I am Ge Gao from the Center of Bioinformatics, Peking University.

This is a supplementary learning unit for this week's topic on Sequence Alignment.

This unit expands on and supplements the basic ideas in this week's three units with more advanced contents.

But these contents will not be tested in assignments or exams.

First, let's discuss how to improve the definition of the gap penalty.

As we mentioned in Unit 1,

a gap may have a length of more than 1 because multiple residues may be involved in an event of insertion or deletion.

This is different from substitutions.

Therefore, gap penalty is often defined as a linear combination

of gap opening penalty and gap extending penalty which are usually not equal.

In Units 2 and 3 for simplicity we used the same penalty score d

to both opening and extending a gap in global and local alignments.

To distinguish between opening and extending a gap, we need to introduce the concept of "status".

As we mentioned earlier in Unit 2, there are only three possible alignments for a given pair of residues.

We call them three possible "statuses".

"M" denotes an alignment of two residues, which are not necessarily the same.

"X" means that the residue in sequence X is aligned to a gap, or that there is an insertion in sequence X.

"Y" means that the residue in sequence Y is aligned to a gap, or that there is an insertion in sequence Y.

We can then apply the concept of Finite State Automation (FSA) from the field of computer sciences.

Sequence alignment can be formulated as transitions between different statuses.

Specifically, a transition from M to X (or Y) denotes the opening of a gap.

A transition from X to X (or Y to Y) denotes the extension of a gap.

A transition from M to M denotes the extension of the alignment without gaps.

We can then write the corresponding dynamic programming recursive formula.

M(i,j) stands for the score of the best alignment between the subsequence

of sequence X from the first base to the i-th base and the subsequence of

sequence Y from the first base to the j-th base, with Xi aligned to Yj.

X(i,j) or Y(i,j)) stand for the score of the best alignment between the same

X(i,j) or Y(i,j)) stand for the score of the best alignment between the same subsequences when Xi or Yj is aligned to a gap.

Let's take a closer look at each formula.

M(i-1,j-1) means that Xi-1 is aligned to Yj-1.

X(i-1,j-1) (or Y(i-1,j-1)) means that Xi-1 (or Yj-1) is aligned to a gap.

M(i-1,j) means that Xi-1 is aligned to Yj.

But because Xi is aligned to a gap, which is opening a new gap, a penalty of d is subtracted.

X(i-1,j) means that Xi-1 was already aligned to a gap.

So here the gap is extending and thus a penalty of e is subtracted.

Likewise, M(i,j-1) means that Xi is aligned to Yj-1.

But because Yj is aligned to a gap, which is opening a new gap, a penalty of d is subtracted.

Y(i,j-1) means that Yi-1 was already aligned to a gap.

So here the gap is extending and thus a penalty of e is subtracted.

Similar to the F function in previous units this week,

we can also represent these three recursive formular as filling in the cells on three planes

using the dynamic programming matrix.

Please note that the backtracking path might cross between planes.

Next we will discuss briefly about the time complexity of the global alignment algorithm.

As discussed before, let's put the sequences of X and Y on the x- and y-axis of a matrix.

The i and j in F(i,j) become the subscripts of the elements in this matrix.

The recursion of this formula then becomes filling in the blanks one by one from top to bottom and left to right.

Then for two sequences of length m and n, respectively, we need to fill in m*n cells.

Each filling takes a constant time c.

The total time needed will thus be directly proportionally to m*n, and the time complexity is O(mn)

We can see that Needleman-Wunsch algorithm reduces the time cost from the exponential time to the square time.

The time complexity is reduced significantly.

OK. Here are the summary questions for this unit.

You are encouraged to think about them

and discuss your answers and ideas in the online forum.

Thank you, and that's all for this week.

See you next week!