0:02

We just saw how to create a de Bruijn graph, and

Â we also learned what an Eulerian walk is.

Â It's a walk through the graph that goes from node to node and

Â that crosses each of the edges exactly once.

Â And we said that this corresponds to a reconstruction of the original genome, and

Â so it gave us a new algorithm for

Â reconstructing the sequence of a genome from a collection of sequencing reads.

Â So let's apply this idea in a couple of scenarios, and

Â let's see how it works in a couple of more complicated examples.

Â And we also want to see whether it actually solved the issues that we

Â had with the shortest common superstring formulation.

Â For example, the shortest common superstring had a tendency

Â to overcollapse repeats, so we're going to see if we have the same problem here.

Â Here's a de Bruijn graph, for a familiar example.

Â This is the genome that the shortest common superstring had trouble with.

Â So we collapsed the three copies of the word long into just two copies.

Â 0:59

Well, when we draw the de Bruijn graph for this string, for this genome,

Â using a camera length of five, and assuming that the sequencer sequences each

Â camera exactly once, then we end up with the picture that you see here.

Â We said previously that de Bruijn graphs produced in this way are Eulerian,

Â so they have an Eulerian walk, and so if we look at this graph we ought to be able

Â to find a walk through it that reconstructs the original sequence.

Â And we can.

Â So here it is.

Â We start here, and we spell out

Â a long, long, long time.

Â So there you go, that's the Eulerian walk through the graph.

Â 1:40

So again, the Eulerian walk did what we said it should.

Â It gives us the reconstructed genome sequence back again.

Â But the point is that it, actually in this case, did not overcollapse that repeat.

Â It did not overcollapse the repeat long_long_long.

Â It gave us the original sequence back, and that's an improvement over the shortest

Â common superstring formulation of the problem.

Â So that's great.

Â But let's not be lulled into thinking that we've solved all our problems.

Â So we actually can't escape from the third law of assembly which states that

Â repetitive portions of the genome make assembly difficult.

Â And let's see exactly how they make assembly difficult when

Â we use an algorithm that's based on the de Bruijn graph, like we are here.

Â 2:22

So here's another example, and this time things won't be quite so straightforward.

Â So the de Bruijn graph that you see on the right is the graph that corresponds to

Â the genome on the left using a camera length of three, and

Â as always we're assuming that each camera is given to us exactly once.

Â And this graph on the right is Eulerian like we said before because there's a way

Â to walk along the nodes of the graph such that we cross each edge exactly once.

Â However, there is more than one way to do so,

Â so there's more than one Eulerian path for this graph.

Â So one of those ways looks like this.

Â Let's start in ZA, and we're going to start by going down to AB.

Â And then we can go up to BE, and then along here,

Â and then down here and along here, and then down to BY.

Â And that's one of the Eulerian paths.

Â The other one goes like this.

Â We start from ZA, we go to AB, and we start by going down to BC and

Â going around like this.

Â Then we go up to BE and go around like this.

Â Then we go down to BY.

Â So here are those two walks that we just took,

Â the series of nodes that we visited along those walks.

Â 3:39

So there's ambiguity.

Â We haven't escaped the third law of assembly here.

Â Instead, the way that it manifests itself is, it means that the de Bruijn graph for

Â repetitive genomes will, in general, have many different Eulerian walks, and only

Â one of those walks actually corresponds to the original genome sequence.

Â All the rest of the walks correspond to incorrect reshufflings

Â of the genome where portions of the genome that occur between repetitive sequences

Â are going to get shuffled around and put in the wrong order.

Â 4:41

And then we build a de Bruijn graph using a camera length of four, and

Â then we call a function of the de Bruijn graph class here which is called

Â eulerianWalk which simply finds the Eulerian walk and

Â just returns basically a list of all the nodes visited during that walk.

Â 5:12

So, this next line concatenates together the labels of all the nodes visited

Â over the eulerianWalk, which is a very simple thing to do.

Â It respects the overlaps between them so

Â it actually starts by appending the first node label, and

Â then it appends one more character for each subsequent node visited.

Â And then it simply prints out the superstring that it's put together.

Â 5:48

So now let's repeat this entire experiment but

Â using a camera length of three instead of a camera length of four.

Â And by decreasing the camera length, we're increasing the chance of being

Â affected by repeats since the smaller k means there's more likely to be

Â multiple occurrences of any given k-mer in the genome which is going to increase

Â the chance that we have multiple different Eulerian walks for the same graph.

Â 6:18

our reconstruction of the genome is incorrect, so

Â it mixes up the order of these words that start with T right here.

Â All right?

Â So this is the curse of the repetitive genome which

Â in the case of the de Bruijn graph it results in a spurious reshuffling

Â of portions of the genome like you can see here.

Â 6:39

So now back to one other issue that we've been putting off,

Â our assumption about the sequencing data.

Â So we've been assuming that the sequencer is kind enough

Â to give us exactly one read per k-mer, and

Â without accidentally leaving out any k-mers, or giving us any k-mers twice, or

Â anything like that, or without sequencing errors, and things like that.

Â But of course none of these assumptions are actually true in practice.

Â The sequencer gives us sequencing reads, which are usually going

Â to be much longer than the value of k that we pick, by the way.

Â So typical values of k are around 30 to 50 or so, whereas sequencing reads

Â are usually around 100 to 200 or tp 300 or so bases long.

Â But of course, the sequencer does make mistakes.

Â There are sequencing errors.

Â The coverage is uneven.

Â It's not necessarily going to give you the desired number of copies of any particular

Â bit of the genome.

Â So given this reality,

Â we can certainly still use our procedure for building the de Bruijn graph.

Â In other words, we can take each read and chop each of the reads up into k-mers,

Â and then for each k-mer split it into it's two k minus 1-mers and

Â add the corresponding nodes and edges to the graph.

Â We can still do that,

Â but what we end up with is a graph that's not necessarily Eulerian.

Â In fact, it'll never be Eulerian in practice.

Â So, for example, if we have this genome, and

Â the sequencing reads reported by the sequencer are the three

Â reads that you can see here, and we're going to use a k-mer length of eight.

Â So when we chop those reads up into k-mers we get the k-mers that you see down here.

Â 8:14

So we can still build a de Bruijn graph out of these k-mers.

Â Right? We can build a de bruijn graph out of any

Â set of k-mers.

Â Then we can do that, and this is the graph that we get.

Â And if you look at this hard enough,

Â you can figure out that this graph is not Eulerian.

Â There's no Eulerian walk of this graph, but it's almost Eulerian.

Â I mean, if you go like this, if you go a long, long, long,

Â time, you'll notice that it was almost Eulerian.

Â The problem here was right here,

Â the number of edges going from this node to this node.

Â So finding Eulerian walks will not necessarily solve the assembly

Â problem for us.

Â 8:56

Now that doesn't mean what we just investigated was pointless.

Â So for one thing, it was an instructive example of how repeats make assembly

Â different no matter what algorithm we use, and we solved two different algorithms in

Â two different ways that the repeats made the assembly problem difficult.

Â 9:12

But, another reason is because this de Bruijn that graph we studied,

Â the de Bruijn graph representation is actually a very common and useful way to

Â represent assemblies, to represent the assembly problem, and in fact,

Â many modern software tools for assembly use de Bruijn graphs as the internal

Â representation of the relationships between the sequencing reads.

Â So while shortest common superstring and Eulerian walk are both flawed

Â formulations of the assembly problem, the overlap graph and the de Bruijn graph

Â are both going to continue to be very useful to us in practice.

Â