0:03

We're now at the point where we can formulate a computational problem

Â that when we solve it, will in turn solve the genome assembly problem.

Â This first formulation won't be perfect, but

Â it'll be a good starting point for our discussion of the assembly problem.

Â Kind of in the same way that the naive exact matching problem wasn't perfect, but

Â it was good starting point for our discussion of read alignment.

Â So, the computational problem is called the shortest common superstring problem,

Â which will sometimes abbreviate as SCS.

Â So the shortest common superstring problem is this.

Â Given a collection of input strings, a set of input strings, we'd like to find

Â the shortest string that contains all of our input strings, as substrings.

Â 0:47

So for example, here I have some strings and

Â I'd like to know their shortest common superstring,

Â the shortest string that contains each of these strings as a substring.

Â Well, if we didn't have the requirement that the superstring be the shortest

Â superstring, then this problem would be easy,

Â we could simply concatenate all the strings in the set S.

Â 1:16

So, it turns out this is the shortest common superstring,

Â so this is a string that contains all the input strings as substrings and

Â there is no shorter string that does this, that contains all the input strings.

Â 1:29

So let's say for a moment that we have an algorithm for finding the shortest common

Â superstring of a set of strings, and we'll discuss such an algorithm soon.

Â So if we took our sequencing reads and we gave them to this algorithm, and

Â asked the algorithm to find their shortest common superstring,

Â then the solution to the problem, the shortest common superstring problem,

Â would also be an assembly of the genome.

Â It would be a reconstruction of the original genome sequence.

Â So let's look at an example.

Â 1:59

Here's our overlap graph from the previous lecture.

Â So this graph is built from a synthetic data set that

Â contains the reads that we made by taking all the 6-mers of the genome.

Â And let's instead take all these 6-mers and then throw them

Â into an algorithm that solves the shortest common superstring problem.

Â So here I have some Python code that calls a function called scs which we haven't

Â implemented yet but we will implement it in an upcoming practical.

Â And we pass in a list, a Python list of strings and

Â those strings are the read sequences.

Â And then what we get back is the shortest common superstring which is shown here.

Â And you can see that this shortest common superstring is in fact

Â equal to the genome that we derive the reads from.

Â 2:47

Okay, so when we do that, we get the correct answer so

Â we should be pretty pleased with our progress so far.

Â We've formulated a computational problem that seems to correspond very closely to

Â the assembly problem that we would like to solve.

Â 3:02

Furthermore, this computational formulation has a nice feature,

Â it's going to find us the shortest common superstring,

Â not just any common superstring, but the shortest one.

Â And in some sense, this is like the most, the simplest explanation, or

Â the most parsimonious explanation for the set of input reads that we started with.

Â 3:27

Well, so now, unfortunately,

Â the shortest common superstring also has some substantial downsides.

Â So, we'll look at these and some potential solutions in some upcoming lectures, so

Â let's look at the first downside.

Â 3:41

The first down side to this problem is that it is just not tractable.

Â There are no efficient algorithms for solving it.

Â So the technical term is that the shortest common superstring problem is NP-complete.

Â And NP-complete problems are very, very unlikely to have an efficient solution,

Â 3:58

this doesn't mean it can't be solved.

Â In fact, we'll see an algorithm for solving it pretty soon.

Â But it's not going to be very fast,

Â and as the number of input strings grows, it's going to get slow very, very quickly.

Â 4:19

So let's say we have some input strings and

Â assume for a moment that they're all the same length.

Â And what we're going to do is,

Â we're going to put them in some order, like this.

Â Maybe these are our input strings and

Â this is the first order we're going to try to put them in.

Â This happens to be alphabetical order.

Â 4:40

And then, we're going to, for every adjacent pair of strings in

Â this ordered list we're going to find the longest overlap between them.

Â So that is, we're going to find the longest suffix of the string on the left

Â that matches a prefix of the string on the right.

Â And then we are going to glue them together accordingly.

Â We're going to merge them together accordingly.

Â So for example, these first two strings in our

Â ordered list overlap by two characters.

Â The suffix AA from the first string overlaps the prefix AA from

Â the second string.

Â And there are no longer overlaps between these strings so

Â we glue them together like this.

Â So we started with the first string, AAA.

Â And then we tacked on just the B from the second string.

Â We ignored the overlaps characters because those are already in our superstring that

Â we're building, and we just added on the B from the second string.

Â Okay, so now, let's look at the second pair.

Â And again, they have an overlap of length two because there's a suffix AB for

Â the string on the left and there's a prefix AB for the string on the right.

Â So this time, we're going to concatenate just the A from the second string.

Â So we're concatenating just one more A onto our superstring that we're building.

Â And we can keep going like this.

Â So for this third pair, the overlap is one, right?

Â Because the suffix A matches the prefix A.

Â So we're going to add on BB to the end of our superstring so far and so on.

Â We'll just keep doing this for every adjacent pair of strings until we've done

Â all those pairs, and at the end of the day, we have a candidate superstring.

Â It's not necessarily the shortest common superstring, but it might be.

Â If we pick the right ordering then it will be the shortest common superstring.

Â 6:25

So for a given ordering, that's how we produce the candidate superstring.

Â And again, it's not necessarily the shortest.

Â And for different orderings we'll get different superstrings.

Â So, for example, here's another kind of order that we could put the strings in.

Â It's just a slightly different order than the first one.

Â And then through the same procedure we'll produce a superstring.

Â And, lo and behold, you can see that it's shorter than the first superstring by

Â I guess that is four characters, it's shorter than the first superstring.

Â So if we want to find the absolute shortest common superstring

Â we're going to have to try all the ordering.

Â 7:02

So that's our algorithm for every possible way of ordering the N input strings, for

Â every permutations of the n input strings.

Â We're going to glue adjacent pairs of

Â strings together according to their maximal overlap.

Â And then, the shortest common superstring that we get over all those orderings

Â is the shortest common superstring overall.

Â 7:22

So why is this so slow?

Â Why do I say that this problem is intractable?

Â Well, this particular algorithm is slow because the number of possible

Â orderings that we have to try is equal to the number of input strings,

Â let's say that's n, that's equal to n factorial,

Â because that's how many different permutations of n there are.

Â 7:43

So n factorial is a quantity that grows very quickly as n grows.

Â As the number of input strings grows, the amount of work that we have to do,

Â the amount of permutations we have to try grows very, very rapidly.

Â That's why we call this problem an intractable problem.

Â So, in the coming practical session, we'll implement this idea and

Â then in the following lecture we'll talk about one way that we can speed things up

Â quite a bit, though, kind of phrase.

Â