As we've discussed already, this is just a small example of

a genome assembly problem, and it is actually an over simplification.

In real life, when we assemble a real genome, life is much more complicated,

because these strings are longer.

The resulting strings that we are looking for is of length millions or

even billions, and reads contain errors, I mean the substrings contain errors.

And also there are many repeated elements and so on.

But still, using this toy example, we will be able to

demonstrate the main ideas used in modern genome assemblers.

And as we've discussed already, we are going to use Hamiltonian cycles and

Eulerian cycles to solve this problem.

So in search for an inspiration for solving this problem,

let's consider some known string, and let's consider all its 3-strings.

For example let's consider the word DISCRETE and

all its possible substrings of length 3.

So what we see by looking at this example is that any two

consecutive strings have an overlap of size 2.

By saying overlap, we just mean a common part of size 2.

So for example, the first two substrings, this is DIS and

ISC, they have this common part, IC, right?

And this is, of course, true for any two consecutive strings.

So for example, for CRE and RET, the suffix of CRE,

it is just RE, is equal to prefix of RET, which is also RE.

This just reflects the fact that all these three strings come from the same

initial string, right?

And this gives us the following idea.

What we need to find is an order on our 3-strings

such that if we place these 3-strings in this order,

then there is an overlap of size 2 between any consecutive strings.

For example, the initial order doesn't work because right after AGC

goes ATC and they do not have this overlap of size two, right?

So what we need to find is a permutation of all these 3-strings

which satisfies this property that there is always an overlap of size two

between any two consecutive strings, okay?

But this is actually a Hamiltonian cycle problem, why is that?

Well, because we can draw the following natural graph.

Let's use our strings as nodes in our graph.

And let's add a directed edge from this string to that string if there

is an overlap of size two between them.

Namely, if the suffix of this string of length

two is equal to the prefix of length two of this string, okay?

Let me give you an example.

For example, in this graph, there is an edge from CCA to CAG,

exactly because the suffix of CCA is equal to two,

is equal CA, I'm sorry.

And the prefix of CAG of length two is equal to CA.

So this edge basically reflects the fact that CAG potentially might follow

CCA in the string that we are looking for.

At the same time, there is no edge from GCA, for example, to ATC,

because we know for sure that ATC should not follow GCA, okay?

Because there is no overlap of size two between these two strings, okay?

So we're looking for a Hamiltonian cycle in this graph.

And this is a Hamiltonian cycle.

And in fact, when you have a Hamiltonian cycle, it is not so

difficult to reconstruct the string.

So let's do this together, so we start with the stream TCA.

So at this point we have just a string TCA, and

then by traversing the Hamiltonian cycle that is shown here,

we will gradually extend our string till we get the resulting string.

So from TCA we go to CAG, so we append G to our current string.

From CAG we go to AGC so we append C.

Then we go to GCA, then to CAT, to ATC, to TCC,

and finally to CCA, which finally gives us the resulting string.

So let's just do the sanity check, let's check, so

the first substring is TCA, it is exactly what is needed.

The second is CAG, this is our string.

The next one is AGC, it is AGC, okay,

GCA, GCA, then CAT, CAT is here.

Then ATC, ATC is here, then TCC which is here,

and finally CCA, CCA.

So indeed all the substrings of length three are exactly our initial substrings.

So in fact, we solved the problem.

But the issue with this solution is that, in fact, we reduced our problem

of finding the unknown string to the Hamiltonian cycle problem.

But the issue is that for Hamiltonian cycle problem,

we do not currently have a provably efficient solution.

We do not have an algorithm or a program that solves Hamiltonian cycle problem for

all graphs in practice of size, for example, up to one million or something.

So unfortunately,

the Hamiltonian cycle problem is still an intractable problem for us.

So if the graph has millions of nodes and edges, we do not expect our current

technologies to be able to find a solution quickly for such graphs.

So we definitely need a different method to be able to solve the real

genome assembly problem where the number of the strings is counted in millions or

even billions.

And what we're looking for is also a huge string of size millions or billions, okay?

And as you might have guessed already,

Eulerian cycles are going to help us in this case.

So in the graph that we constructed, it is the so-called overlap graph because, well,

basically an edge represents an overlap.

In this case, each of our input strings is represented by a node.