Today we're going to talk about tries, which is a data structure for searching

with string keys that is remarkable effective in enabling us to deal with huge

amounts of data that we have to process nowadays.

Just as a motivation, when we left off talking about searching algorithms, or the

symbol table implementations here's where we were.

The best algorithms that we examined were balanced red, black, binary search trees,

in hash tables. And With respect to the basic search,

insert and delete operations. Red black BSTs we're able to guarantee

that we can get those done in logorithmic time, actually, pretty efficiently.

And for hashing, we'd get'em done in constant time under certain assumptions.

That we could have a uniform, hash function.

And also, for hashing, what we need to compute a hash, code for binary search

trees, we use, a compare to function. And another difference is, binary search

trees, we can support more operations. Than we can support with hashing ordered

operations. For example, getting the rank of a key in

the tree, and other things. And those are terrific algorithms.

And we looked at plenty of go, good applications of those algorithms.

But still there's the question could we do better?

And the answer is that yes, we can do better.

As with string sorting, we can avoid examining the entire key.

Which is going to give us algorithms that can compete and even beat these classic

algorithms. So to get started we are going to

articulate in API for symbol tables that specialize for the case when the keys are

strings. So simply, we just add the modifier string

to our standard symbol table EPI. And we can take a generic value just put

the keys or strings. And then we take all the methods that we

articulated before. And for key type, instead of generic, we

use string. And this is going to allow us to develop

efficient algorithms that take advantage of properties of strings.

Our goal is to get, some algorithms that are even faster than hashing.

And maybe, more flexible than BSTs. And we're going to be able to achieve both

those goals. So, this again is the summary of the

running times for when, in the case that the keys are strings for red black BSTs

and for hashing. And let's take a look at those so if L is

the length of the string. And n is the number of strings and then

also the rate x is r is going to be a factor later on but its not going to be

for these two. Then for hashing for every operation we

have to compute a hash function, which basically means looking at every character

in the string. Actually in Java string hash functions are

cached, so sometimes there's efficiencies cuz you only have to comp, and they're

mutable, so you only have to compute the hash function once.

And then this is roughly the amount of space that takes into account the table

and the string size for red black BSTs, the analysis is a bit more complicated.

For a search hit, you have to look at the entire string.

Given a entire key to check that every character's the same and then the

comparison's depending on the nature of the keys for a typical case though its

something like log squared N. we had log N comparison's but usually you have to look

at something like a log N characters in the key in order to get down the tree.

For a search miss you might be able to find out before you get to the end of the

tree and sing a for an insert so the running times are something like this.

Then here's some experimental evidence that, that we, we can use to test out

these analytic hypothesis. And we'll just look at two examples one is

the entire text of Melville's Moby Dick which we've used before and that's about a

million characters. And maybe 200,000 strings.

And add to this 200,000 strings, about 32,000 are distinct.

And then this, There's another file of actors from the

Internet movie database that's much bigger maybe tourism magnitude bigger.

And it's got maybe about a million different words.

So we'll want our algorithms to do better than, or at least as well as red black

BSTs in hashing on these data sets. I think, you know, our test is to dedupe

these data sets. So that's our challenge.

Efficient performance for string keys and try to come up with algorithms that can

compete with the best classic algorithms. Now in order to do that we're going to

look at R way tries. That's a particular data structure, that

was, invented actually really quite a while ago.

This data structure, dates back, to the 60s.

And, the first thing to know about it is that, the name is based a little bit, on a

pun. It actually is the middle letters of the

word retrieval, but we don't pronounce it tree because then we couldn't distinguish

it from binary search trees, so we pronounce it try.

And this is a confusing fact of life that we've been living with for many decades

now with this data structure. And let's look at what a try is and we'll

look at some, examples before we get to the code.

So, for now, we're going to think of a try as some nodes.

A tree structure with nodes where we store characters in the nodes,

Not keys. And the other thing is that each node has

R children, where R is the rate, it's the number of possible characters.

And there's g-, the possibility of a child for each possible character.

Now in the standard implementation, this is going to involve on all links.

We have to have a placeholder for each possible character.

So there's a lot of null links and tries. And we'll get back to that in a minute.

But to get the concept, we'll use this graphical representation, where we have

characters and nodes and we don't show any null length.

And the other thing is, remember a symbol table is associating a key with a value,

so this try is built for this set of key value pairs.

And what we do is we. Store values in nodes that correspond to

the last characters in keys. And we'll look, that's what these numbers

here are. And we'll look at a little more in detail,

on how this try is built in just a second. So that's the basic idea.

We're going to have string keys. We're going to associate values, and we're

going to use this tree-like, tree data structure but we're not going to draw the

imply null links for now. So, for example, in this trie so the root

re-, represents all strings. And, coming out of the root is one length

for each possible letter. In particular in this try, the middle link

is the link to the, to a sub try that contains all the keys that start with S.

Then we go further down. Each time we go by a node, we pick off a

letter so that this length, for example, goes to the try for all keys that start

with s followed by h followed by e. And so forth.

So now all, out of all the keys that start with SHE, actually one of them, the one

that just has the letters in then ends, has a value associated with it.

So when the node corresponding to the last letter, the E, we put the value zero.

And so that's, how the try is going to look.

So just given that definition, then let's look at, how to do search, in a try.

All we do is use the characters and the key to guide our search down the data

structure. So, let's say we're looking for the key

shells. So we start at the root, and we go down

the S link, since we started with an S. Now our second letter is H, so we look to

see if there's a non-null link that, corresponding to H, and in this case there

is. Now our next letter is E, so we look for

an E. Now.

No need to examine the value here because well we haven't looked at all the letters

in our key yet. So now we look for L now we look for

another L and then finally we find the S and when we find the S we look to see if

there's a value there. In this case there is a value and so

that's what we return the value associated with the last key character.

So that's a typical search hit in a tri. Another way is...

So, that node was down at the bottom but if you had.

We're doing a search for SHE, you follow the same path.

But now when you get to the E, that's the last character of that key and so we just

know how to return that, value associated with that node.

That is, the search doesn't have to go all the way to the bottom.

It might terminate at an intermediate node.

And we just return the value associated with the node corresponding to the last

character in the key. What about a search miss?

Well, there's two cases. It's for a search miss.

So one of them is we've followed down the tree picking off the letters in the key

one letter at a time as, as before. And then when we get to the end of the key

we take a look to see if there is a value. In this case there's no value associated

with the last key character that mean there's no key in the data structure

that's been associated with a value. So we just return no.

Or, the other thing that can happen is, we can, reach a null link.

So, our key now is S, starts with S, and then H, and then E, and then L, and now

our next letter is T, so we look to see if there's a non null link coming from this

node associated with T. And in this case, there's no such link, so

we return null. That's evidence that, that string, there's

no key associated with that string in our data structure.

So that's a search miss. Iii.

Now, what about insertion? Well insertion is also pretty simple.

So we follow two, two rules. One is again, we go down links

corresponding each character in the key. If we come out on a link, we create a new

node that's associated with the character that we're on.

And just keep doing that until we get to the last character of the key, in which

case, we put the value. So if in this try, we're supposed to

associate the value seven with [inaudible].

We follow, our path from the root to s, 'cause our first letter is s, and then

h.'Cause our next letter is h. And then a O, we had an, [inaudible] kinda

ran into a null link. So we have to put, a node there, with the

character, associated with the character O.

So in later searches for this key, we'll be able to find it by passing through that

node. And now we move to the next letter, and

there's no node again. In the next letter, there's still no node.

Iii. When we get to the end, then that's the

last character in the key, and we put our value seven.

And now we've modified the data structure, adding the nodes that are necessary for a

later search to go through, character by character and find the value associated

with that key. So that's an insertion into a try.

Just to cement all these ideas let's do a demo of how that tree was constructed from

scratch. So we have a null try.

So we're going to start by putting the, associating the value zero with the string

SHE. So we start at the root node, which, and

I'll try, it just has that one node. Create a new node for S, create a new node

for H, create a new node for E. Associate zero.

So the key is the sequence of characters from the root to the value, and the value

is in the node corresponding to the last character.

This try represents the symbol table that contains just on string SHE, and

associated with zero. Every other string, if you search for any

other string in this try you'll either end in a node that doesn't have a value or

you'll end at a null link. All right, so now, let's say we put in the

key S, E, L, L, S and you can kinda tell where it's going, we made room for it in

the try. So we go for S, and now the second letter,

E, corresponds to a null link, so we create a new node.

And similarly we go through and create new nodes for each of the remaining characters

in the key and then associate the value one at the end.

So now we have a try that has two keys in it, SHE and SELLS.

Now our next is to insert SEA. So now we have S, we already have a node

corresponding to E. And so now we just have to create one new

node, the one corresponding to A. And put our value two there.

And now, we're going to put SHELLS in. So we already have the S, already have the

H, already have the E, so now we have to add nodes for the last three letters in

that string. And associate the value three with it.

Now we're going to put finally a key that doesn't start with S.

So that means we create a new node corresponding to the first letter of that

string, I, N to the other letter two. And then associate the value four.

And here's another one that doesn't start with s.

So we create new nodes corresponding to each one of its characters, and associate

the value five with the last one. Now here's SEA.

So this is, remember, our symbol table API is associative, which means that if we get

a new value for a key that's already in the table, we overwrite.

The old value with the new value. That's the way, the convention that we've

chosen for symbol tables. So that is easily done with tries as well.

Now here's one more key, S, H. And now we have to add a new node because

there's no node that's a child of H that corresponds letter O.

So we have to create new nodes for O, R, and E, and associate the value seven with

it. So that's a tri that has precisely eight

keys. If you look for any one of those eight

keys you'll get the value. If you look for any other string you'll

either come to a node that has a null value or go out on a null link and so the

search would be unsuccessful. That's is a demo of tri construction.