0:03

We discussed how to build and query a k-mer index, an index that's built by

Â taking all the k-mers of the text T and adding them to a data structure that maps

Â each k-mer to a list of all the offsets where it occurred in the text.

Â So this kind of data structure is called a multimap.

Â It's a map because it associates keys, k-mers,

Â in this case with values, offsets in the genome.

Â And it's a multimap because a k-mer may be associated with many different offsets

Â in the genome, right?

Â A k-mer could occur many places within the genome, within the text.

Â So what kind of data structures can we use to implement a multimap?

Â So we'll discuss two of them here.

Â The first one is based on ordering, like the index of a book, and

Â the second one is based on grouping, like the aisles of a grocery store.

Â So let's talk about the first data structure, which is based on ordering.

Â So here at the top of this slide, I have a key-value pair.

Â The key is a 3-mer from the text T.

Â It's the very first 3-mer.

Â And the value is the offset where that 3-mer occurs.

Â So we make a key-value pair for every 3-mer at every offset within T.

Â 1:28

So now, this is our index data structure.

Â It's simply a list of 3-mer offset pairs ordered alphabetically by 3-mer.

Â So now how do we query this index?

Â Say we have a pattern P from which we extract a 3-mer, and

Â we'd like to query the index with this 3-mer.

Â And the way we're going to do this is with something called binary search.

Â 1:51

So, what is binary search?

Â So, going back to our index analogy for

Â a moment, let's say we're looking up a term in this index.

Â We're looking up the term memory.

Â And, so we flip to the exact middle of the index and we look in the middle.

Â And we find the key term that's there, and

Â let's say the term directly in the middle of the index is light.

Â So light comes alphabetically before memory, so we know that we can completely

Â ignore the first half of the index, up to and including the term light.

Â So memory must be in the second half, so then what do we do next?

Â Well we do the same thing but just for the second half of the index.

Â So in this way we can keep going iteratively throwing away half the index

Â each time for each iteration.

Â And eventually we can hone in on exactly the term that we are looking for.

Â So this is called binary search.

Â So let's see an example using our index.

Â 3:11

So we can ignore in the index, every entry up to and

Â including GTG, and we've effectively divided the problem in half.

Â And each time we divide the problem in half in this way,

Â we call this a bisection.

Â 3:56

So that match corresponds to an index hit.

Â We wanted to find offsets where TGG occurs.

Â We've now found the first one in the index, and it occurs at offset 7.

Â So, the total number of bisections that we need to perform

Â in order to find our key in the index is approximately

Â equal to the logarithm base 2 of the number of keys in the index.

Â Why the logarithm base two?

Â Well, because for each of our bisections,

Â we're repeatedly dividing the problem in two.

Â So we'll implement this idea and some related ideas in Python later.

Â But before we move on, let me make a final point, which is that Python actually

Â provides a set of functions related to binary search that are useful for us here.

Â So these functions are all in a Python module that's called bisect.

Â 4:46

And a Python module if you haven't encountered it before is just a collection

Â of related Python functions and classes.

Â And this module is about binary search.

Â So one function in this module is called bisect_left and

Â this function takes two parameters.

Â The first parameter a is a list that's already sorted.

Â It's already ordered in ascending order.

Â 5:11

So if this is a list of strings then the strings should already be in

Â alphabetical order.

Â Or if this is a list of integers then they should be in ascending order.

Â And the second parameter x is an item and the function returns the offset where

Â the item x can be inserted into the list a while maintaining the order of the list.

Â So it's the leftmost position where x can be inserted in a such that

Â a is still in order after that insertion occurs.

Â So, bisect left is a useful function to us.

Â In this example we call bisect left with the parameter a,

Â which is this list up here.

Â And then the argument 2 and so bisect left is going to return the left

Â most index where we can insert 2 into this list such that the list is still in order.

Â And that's offset one, right here, so if we wanted to insert 2 into this list and

Â have this list still be in order we would put it right here between this 1 and

Â the first 3.

Â We would stick it right here and then we would shift all these entries over by one.

Â Here we're calling bisect_left with the parameters a and 4.

Â Bisect_left in this case returns the offset 3.

Â That means if we wanted to stick 4 in this list,

Â we would do it between here and here.

Â So we'd do it between the 3 and the 6.

Â And then in this final example we're calling bisect_left with a and

Â the parameter 8.

Â And it's telling us that if we want to insert a,

Â 8 into this list, then the leftmost place we can insert it such that the list is

Â still in sorted order is at offset four.

Â 6:52

Now we could have also inserted it here at offset five or here at offset six.

Â And the list would still be in sorted order, but

Â bisect_left is always going to return the leftmost offset where we can insert it so

Â that the list is still in sorted order.

Â 7:09

So this bisect left function is exactly what we need in order to do queries of

Â this index data structure that we built.

Â So for example if we're querying with this pattern here and

Â we take some 3-mer from the pattern, let's say we take the threemer GTG.

Â We can use the bisect left function on our index and the parameter GTG.

Â And it will return the offset of the first position where we could insert GTG

Â while maintaining sorted order of the list.

Â And that corresponds to this entry right here,

Â which is the first entry that has GTG as its key.

Â So if we wanted to find all the places where GTG occurs,

Â we could use bisect_left.

Â It would point as here, and then we could look up this offset and then, keep going.

Â Find another GTG and find its offset.

Â And then, keep going and find another GTG and find its offset.

Â So we would report that GTG occurs at offsets zero, four and six.

Â