In this video we'll cover a second problem to whet your appetite for things

to come, namely the problem of sequence alignment.

So this is a fundamental problem in computational genomics.

If you take a class on the subject it's very likely to occupy the very first

couple of lectures. So in this problem you're given two

strings over an alphabet and no prizes for guessing which is the alphabet we're

most likely to care about. Typically, these strings represent

portions of one or more genomes. And just as a toy running example you can

just imagine that the two strings we're given are A, G, G, G, C, T and A,

G, G, C, A. Note that the two input strings do not

necessarily need to be of the same length.

And informally speaking, the goal of this sequence alignment problem is to figure

out how similar the two input strings are.

Obviously, I haven't told you what I mean by two strings being similar.

That's something we'll develop over the next couple of slides.

Why might you want to solve this problem? Well, there are actually a lot of reasons.

Let me just give you two of many examples.

One would be to conjecture the function of regions of a genome that you

don't understand, let's say the human genome,

from similar regions that exist in genomes that you do understand or at

least understand better, say the mouse genome.

If you see a string that has a known function in the well understood genome

and you see something similar in the poorly understood genome, you might

conjecture it has the same or similar function.

A second, totally different reason you might want to compare the genomes of two

different species is to understand their evolutionary relationship, such as

whether one evolved directly from the other, and when.

So for example, maybe you have three species A, B, and C, and you're wondering

whether B evolved from A and then C evolved from B, or whether B and C

evolved independently from a common ancestor, A. And you might then take

genome similarity as a measure of proximity in the evolutionary tree.

So having motivated the informal version of the problem, let's work toward making

it more formal. In particular, I owe you a discussion of

what I mean by two strings being similar. So to develop intuition for this, let's

revisit the two strings that we introduced on the previous slide A, G, G,

G, C, T, and A, G, G, C, A.

Now, if we just sort of eyeball these two

strings, I mean clearly they're not the same string.

But, we somehow feel like they're more similar than they are different.

So, where does that intuition come from? Well, one way to make it more precise is

to notice that these two strings can be nicely aligned in the following sense.

Let's write down the longer string, A, G, G, G, C, T.

And, I'm going to write the shorter string under it, and I'll insert a gap, a

space to make the two strings have the same length.

I'm going to put the space where there seems to be, quote unquote, a missing G.

In what sense is this a nice alignment? Well, it's clearly not

perfect. We don't get a character-by-character

match of the two strings, but there are only two minor flaws.

On the one hand, we did have to insert a gap, and on the other, we do have to suffer one

mismatch in the final column. So this intuition motivates defining

similarity between two strings with respect to their highest quality

alignment, their nicest alignment. So we're getting closer to a formal

problem statement, but it's still somewhat underdetermined.

Specifically, we need to make precise why we might prefer one

alignment over another. For example, is it better to have three

gaps and no mismatches or is it better to have one gap and one mismatch?

So in this video, we're effectively going to punt on this question. We're

going to assume this problem's already been solved experimentally, that it's

known and provided as part of the input how costly gaps and the various

types of mismatches. So here, then, is the formal problem

statement. So, in addition to the two strings over

A, C, G, T, we are provided as part of the

input a non-negative number indicating the cost we incur in an alignment for

each gap that we insert. Similarly, for each possible mismatch of

two characters, like, for example, mismatching an A and a

T, we're given as part of the input a

corresponding penalty. Given this input, the responsibility of a

sequence alignment algorithm is to output the alignment that minimizes the sum of

the penalties. Another way to think of this output, the

minimum-penalty alignment, is that we're trying to find, in effect, the minimum-cost

explanation for how one of these strings would've turned into the other.
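
To make the objective concrete, here is a minimal sketch of how the total penalty of one given alignment could be computed. The function names, the use of '-' to mark a gap, and the unit penalties are all assumptions made for illustration; in the problem statement the penalties are simply supplied as part of the input.

```python
def alignment_penalty(top, bottom, gap_penalty, mismatch_penalty):
    """Total penalty of one given alignment.

    top and bottom are the two rows of the alignment, padded with '-'
    (an assumed gap marker) so that they have the same length.
    mismatch_penalty is a function giving the cost of pairing two
    different characters.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot consist of two gaps")
        if a == "-" or b == "-":
            total += gap_penalty             # one character sits opposite a gap
        elif a != b:
            total += mismatch_penalty(a, b)  # two different characters are paired
    return total


def unit_mismatch(a, b):
    """Hypothetical penalty table: every mismatch costs 1."""
    return 1


# The running example: one gap plus one mismatch in the final column.
print(alignment_penalty("AGGGCT", "AGG-CA", gap_penalty=1,
                        mismatch_penalty=unit_mismatch))
```

With these assumed unit penalties, the alignment from the running example pays once for the gap and once for the T/A mismatch.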

So we can think of a gap as sort of undoing a deletion that occurred some

time in the past and we can think of a mismatch as representing a mutation.

So this minimum possible total penalty, that is, the value of this optimal

alignment, is famous and fundamental enough to have its own name, namely the

Needleman-Wunsch score. So this quantity is named after the two

authors who proposed an efficient algorithm for computing the optimal alignment,

which appeared way back in 1970 in the Journal of Molecular Biology.

And now, at last, we have a formal definition of what it means for two

strings to be similar. It means they have a small NW score, a

score close to 0. So for example, if you have

a database with a whole bunch of genome fragments,

according to this, you're going to define the most similar fragments to be those

with the smallest NW score. So, to bring the discussion back squarely

into the land of algorithms, let me point out that this definition of genome

similarity is intrinsically algorithmic. This definition would be totally useless

unless there existed an efficient algorithm that, given two strings and the

penalties, computes the best alignment between those two strings.

If you couldn't compute the score, you would never use it as a measure of

similarity. So this observation puts us under a lot

of pressure to devise an efficient algorithm for finding the best alignment.

So how are we going to do that? Well, we can always fall back to

brute-force search, where we iterate over all of the conceivable alignments of the

two strings, compute the total penalty of each of those alignments, and remember

the best one. Clearly, correctness is not going to be

an issue for brute-force search. It's correct essentially by definition.
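
As an illustration only, brute-force search might look like the sketch below for very short strings. The helper names, the '-' gap marker, and the default unit penalties are assumptions, and a single number stands in for the per-pair mismatch penalties to keep the sketch short.

```python
def all_alignments(x, y):
    """Yield every alignment of x and y as a (top_row, bottom_row) pair.

    '-' marks a gap; a column made of two gaps is never produced.
    """
    if not x and not y:
        yield "", ""
        return
    if x and y:  # pair the two leading characters (match or mismatch)
        for top, bottom in all_alignments(x[1:], y[1:]):
            yield x[0] + top, y[0] + bottom
    if x:        # leading character of x sits over a gap
        for top, bottom in all_alignments(x[1:], y):
            yield x[0] + top, "-" + bottom
    if y:        # a gap sits over the leading character of y
        for top, bottom in all_alignments(x, y[1:]):
            yield "-" + top, y[0] + bottom


def penalty(top, bottom, gap_penalty=1, mismatch_penalty=1):
    """Sum of gap and mismatch penalties over the columns of one alignment."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += gap_penalty
        elif a != b:
            total += mismatch_penalty
    return total


def brute_force_best(x, y, gap_penalty=1, mismatch_penalty=1):
    """Try every alignment and remember one with the smallest total penalty."""
    best = min(
        all_alignments(x, y),
        key=lambda rows: penalty(rows[0], rows[1], gap_penalty, mismatch_penalty),
    )
    return penalty(best[0], best[1], gap_penalty, mismatch_penalty), best


score, (top, bottom) = brute_force_best("AGGGCT", "AGGCA")
print(score)
print(top)
print(bottom)
```

This is only practical for toy inputs like the running example, because the generator visits every alignment one at a time.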

The issue is how long does it take? So let's ask a simpler question.

Let's just think about, how many different alignments there are?

How many possibilities do we have to try? So if [INAUDIBLE] let's imagine, I gave

you two strings of length 500, which is a knot of a reasonable length.

Which of the following English phrases best describes the number of

possibilities, the number of alignments, given two strings

with 500 characters each? So I realize this is sort of a cheeky

question, but I hope you can gather that what I was

looking for was part D.

So, how big are each of these quantities, anyways?

Well, in a typical version of this class, you might have about 50,000

students enrolled or so. So that's somewhere between 10^4 and

10^5. The number of people on earth is roughly

7,000,000,000. So that's somewhere between 10^9 and

10^10. The most common estimate I see for the

number of atoms in the known universe is 10^80.

And believe it or not, the number of possible alignments of two strings of

length 500 is even bigger than that. So I'll leave it for you to convince

yourself that the number of possibilities is at least two raised to the 500.
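
One way to check a claim like this is to count alignments with a simple recurrence rather than listing them. The sketch below assumes a particular counting convention that the lecture does not spell out: an alignment is a sequence of columns, each column pairs two characters or pairs one character with a gap, and two alignments that differ only in the order of adjacent gap columns count as different. The function name is hypothetical.

```python
def count_alignments(m, n):
    """Number of alignments of two strings with lengths m and n.

    The last column of any alignment either pairs the two final
    characters, or puts the final character of one string opposite a
    gap, which gives count(i, j) = count(i-1, j-1) + count(i-1, j)
    + count(i, j-1), with count(i, 0) = count(0, j) = 1.
    """
    row = [1] * (n + 1)          # counts for a prefix of length 0 of the first string
    for _ in range(m):
        new_row = [1]            # second string empty: only one alignment
        for j in range(1, n + 1):
            new_row.append(row[j] + new_row[j - 1] + row[j - 1])
        row = new_row
    return row[n]


total = count_alignments(500, 500)
print(total >= 2 ** 500)         # compare against two raised to the 500
print(total > 10 ** 80)          # compare against the atom estimate
print(len(str(total)))           # number of decimal digits in the count
```

Python integers have arbitrary precision, so the count for two strings of length 500 can be computed exactly under this convention.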

The real number is actually noticeably bigger than that. And because 10 is at

most 2^4, and 2^500 is (2^4)^125, we can lower bound this number by 10^125, quite a bit bigger than the

number of atoms in the universe. And the point of course, is just that

it's utterly absurd to envision implementing brute-force search even at a

scale of a few hundred characters. And you know, forgetting about these sort

of astronomical, if you will, comparisons, even if you had string lengths much

smaller, say, you know, a dozen or two, you'd never ever run brute-force search;

it's just not going to work. And of course, notice this is not the kind of problem

that's [INAUDIBLE] This just doesn't go away if you wait a little while for

Moore's law to help you. This is a fundamental limitation. It

says, you are never going to compute alignments

of the strings that you care about, unless you have a fast,

clever algorithm.

I'm happy to report that you will indeed learn such a fast and clever algorithm

later on in this course. Even better, it's just going to be a

straightforward instantiation of a much more general algorithm design paradigm,

that of dynamic programming.
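
As a preview only, and not the course's own presentation, here is a minimal sketch of how a dynamic programming computation of the minimum total penalty is commonly written. The function name, the table layout, and the unit penalties are assumptions; a single number again stands in for the per-pair mismatch penalties.

```python
def nw_score(x, y, gap_penalty=1, mismatch_penalty=1):
    """Minimum total penalty over all alignments of x and y."""
    m, n = len(x), len(y)
    # best[i][j] = minimum penalty of aligning x[:i] with y[:j]
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        best[i][0] = i * gap_penalty      # x[:i] against the empty string
    for j in range(1, n + 1):
        best[0][j] = j * gap_penalty      # the empty string against y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair_cost = 0 if x[i - 1] == y[j - 1] else mismatch_penalty
            best[i][j] = min(
                best[i - 1][j - 1] + pair_cost,   # pair the two last characters
                best[i - 1][j] + gap_penalty,     # last character of x over a gap
                best[i][j - 1] + gap_penalty,     # gap over the last character of y
            )
    return best[m][n]


print(nw_score("AGGGCT", "AGGCA"))
```

The table has (m + 1) times (n + 1) entries and each is filled in constant time, so two strings of length 500 need roughly a quarter of a million steps instead of an astronomical number of alignments.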