0:04

Now we're going to look at tries, which is a class of common material objects that

not really has only come under study in recent years due to computer applications.

They're not found in classical common but

they're actually a very interesting rich combinatory analytic properties.

So I'll spend some time really talking about what tries are and

the application before I talk about the analysis.

So when we look at the tries just as a binary tree,

where the external nodes can be marked.

0:51

Now there's a rule and that is that you can never have two void

nodes that are siblings of each other in a leaf.

So that a sibling of a void node is not void,

that's the rule, so that's what a trie is.

Now that seems kind of arbitrary but you'll see when we

look at applications, how these rules play a role.

Actually for an exercise, I might give a recursive definition of what a trie is.

So the most usual way we think of trie is as representing a set of good strings.

Each trie corresponds to a set of bitstring where each nonvoid,

external node represents one bitstring.

And you get the big string by taking the path from the root to a node,

taking 0 when you go left and 1 when you go right.

For example, if we choose 00, left, left, right,

right, left, we get down to that nonvoid external node.

And we say that node represents the bitstring that we got that defines

the path that we got there, 00110.

2:09

Now over here, 1010, that nonvoid

node represents the bitstream 1010.

The path from the root to a node defines the bitstring.

So now the void nodes,

I have a different interpretation that derives right from this definition.

For example, if we go right four times and then left, we come to a void node.

What that means is that no string with that prefix

is in the set of strings represented by this trie.

The trie represent in this case one two, three, four, five,

six, seven bitstrings and in the void node represent the prefix of all

the bitstrings that aren't represented.

So that's what we think of a try corresponds to set of bitstrings or

another way to look at it, is it's all worked out here.

So this trie represents as I said,

seven different bitstrings, and those are shown at the top.

Now on the left is the bitstrings from that set that start with 0, and trie on

the right represent the bitstrings of that set start with 1 with the 1 script of.

And so recursibly going down,

that's another way to see the sets of bitstrings that are represented.

Now this only works for sets of bitstrings are said to be prefix free.

So that is no member of this set of bit strings is

a prefix of another one in that set.

We can handle that by using void and not void, internal nodes.

But in applications, I'm going to talk about that are typical,

it's okay to just work with prefix free sets.

4:06

So for example, fixed length,

all those bit strings are the same length then it's prefix free.

Because they're all different, they're all the same length and they're different.

You can't have one being prefixed not another,

and that's a typical and useful application of tries.

So this is a trie that represents those three strings and

lots of applications of tries if you look in our algorithms

book you'll find trie code for sorting.

For simple tables with string keys and for

suffix arrays which I'll refer to in a minute.

They play a role in classic data compression algorithms Huffman code.

And Lempel–Ziv–Welch compression and we'll look at the use of tries to

understand decision making collision resolution leader election algorithms.

They play a very important role nowadays in network systmes and

bioinformatics, Internet search, all kinds of commercial data processing.

Very important data structure that's often overlooked that's

where I'm taking the time to talk about some of these applications to

motivate the analysis because it's not so easy to analyze as we'll see.

So here's the basic application which is for symbol tables.

Trie represents a set of bitstrings so what we have is,

just going from the definition.

A search algorithm for

determining whether a given bitstring is in the set represented by the trie.

And the basic idea is if the leading bit of your key is 0, go to the left,

if it's 1, go to the right then use the remainder of the string recursively.

If you get to a void external node, it means that the one you're looking for

is not in the set represented by the trie.

If you get to a nonvoid external node and

you're at the end of your bitstring, then you report success, you did find the key.

So for example, let's say we're going to search for the bitstring 0011 in this tri.

Start with a 0, go to the left.

Next one's a 0, go to the left.

Next one's a 1, go to the right.

Next one's a 1, go to the right.

We're at the end of our string and we're on a nonvoid external node.

So that string is in the set of business strings represented by our trie.

Let's look for 10110.

So we start with a 1 and go to the right, 0 go to the left,

1 go to the right, 1 go to the right.

We hit a void external node, so

that string is not in the set represented by our trie.

Very natural search algorithm have the trie represent a set of bitstring.

Of course everything can be represented as a bitstring,

so this is a natural algorithm for anything represented by in a computer.

7:17

Now what about inserting new keys or

a new bitstring into the set represented by the trie.

Well what we'll do is we'll

insert by searching until we get to a void external node.

So if we wind up at a internal node or nonvoid external node that means

that we'll have a prefix free violation and we can deal with that in some way.

But insert a new key, it's going to wind up at a void external node.

So to insert 01110, we go left for the 0,

1 right for the 1, and now we're at a void external node.

So now what we want to do is for each remaining bit

in our key that we want to insert we want to add a new internal node.

And with one void external child and then the other one corresponding to our bit.

So in this case our next bit is 1.

So we put if the key start at 010 then it is not in the set of strings, but

011 that could be this one.

Then we do it again for another one, and then the next bit is 0.

So we put the void external node to the right and nonvoid to the left.

So that's we would insert 01110 into this trial.

8:36

Now there are variants where you just keep track of the tail in

some way with pointers and people, and those are well studied,

and there's lots of reasons to do that sort of thing.

But the simplest version also is very effective and

that's what we'll stick with right now.

So that's a search algorithm and an insertion algorithm,

that gives the basis for simple table using the trie data structure,

which is a very useful algorithm.

And then a natural question that probably occurred to many of you is,

what about these void external nodes?

It seems kind of a waste to have all these void external nodes there in

this data structure.

And so we're going to want to analyze how many there are to make sure that we

understand how much space it trie it takes.

But it's a very compact data structure and

that analysis is certainly interesting and relevant in practice.

So here's another application of tries,

what we want to do is we have a given string, S.

And just for an example, I'll use a genomic string made up of As,

Cs, Ts, and Gs.

And these things could be huge, it would be billions of letters.

And we want to know is a particular substring in our string for

this string, is ACCTA in there?

And the answer is yes, starting at 0.

What about CCT?

Yeah, there's plenty of places where CCT occur in this string.

What about TGA?

No, there's no occurrence of TGA.

So we have a specific string that's huge and

we want to be able to quickly do substring search.

There's all kinds of application in genomics where it's important

to be able to do an operation like this quickly.

So searching genomic data, and this is also useful in Internet search.

When you do a Google search you not only get to the page that you're looking for,

but you get context you get where the substring is in that page.

And that uses data structure like this and many other applications.

10:57

So the solution method that I'll talk about

is the so-called suffix multiway trie.

Generalizing the trie for this problem and so

the idea is if you have given string well what were

going to do is work with all the suffix of the string.

So the original string, ACCTA GGCCT,

we leave off the A, then we have CCTA, and so forth.

So if the string is of size N, then we have N strings for all the suffixes.

And we're going to treat those as different strings,

and just insert them into the trie, that's called a suffix trie.

You now notice that's prefix free, none of these is a prefix

of another because they're all different lengths.

It's a prefix for each set, and we have them

all end with a character that isn't found anywhere else in the trie.

So then the idea is that every internal node of this

trie corresponds to some substring of our original string.

So to answer the question, is x a substring of x?

We use the characters of our query to traverse the trie.

So for example, if we're looking for

ACCTA then [COUGH] when we get to a nonvoid external

node that's tells us that that one is in there and

tells us what position it is so we built the trie AC.

Now if we see something that starts with AC, then we look starting

with position 0 and continue our search and this CCTA is there.

If we encounter a void node then so

for example, if CCT is there because

we found that in the internal node.

But something like TGA, we end up at a void node and router there.

If we reach the end of the string then we're fine,

if we hit a void node we're fine, where a node that is not there.

So this is a very simple algorithm to answer this is x a substring of s

question, and a fine application of tries.

And again the number of void nodes and what does it mean to be a random trie and

how long is the search and so forth.

All of these kinds of questions are going to be important and

relevant in practice.

Here's another application of tries.

Tries is a model for an algorithm, so called leader election algorithm.

This is important in distributed systems, so

the idea is you have a group of individuals.

And they would need to elect a leader, and

what they're going to do is each flip a coin.

So it's than distributed, there could be a large number of them,

they'll each flip a coin and we'll count 1s as winners and 0s as losers.

So everybody that gets a one when they flip a coin survives for

the next round and the ones that get zero are gone.

15:24

So obviously,

we are going to be interested in what's the chance of failure.

Well it's the probability that the right most path in a random trie

ends in a void node.

And what do I mean by a random try?

Well it turns out that the model that associates with this, and

also works for symbol tables is, it's a trie that you get by inserting

infinite length random bitstrings into an initially empty trie.

So there's three diverse applications of the trie data structure for

a simple table, for substring search and for distributed leader election.

And all of this is to motivate studying this in a combinatorial structure.