0:04

Next we're going to look at Ternary search tries, which is another data

structure that new data structure, that responds to the problem of too much

memory space used by our way tries. this is a very easy to describe and

implement data structure. And it came out of the same paper that

Jon Bentley and I wrote in the 1990s when we developed the three way Ternary

quicksort. the idea is now we're going to actually

store characters in nodes in values. but, we're not going to store whole keys

in there, just characters. but we're not going to use this idea of a

big array, where a nominal link is an implicit value of a character, actually

going to store the characters. but what we're going to do is, make each

node have three children not two, like in a binary search tree.

And not r, like in a Nr way trie, but exactly three and the three children are

for the trie corresponding to smaller keys.

The trie corresponding to trees that start with the same, character in the tri

corresponding to larger trees. And just given this, this des,

description and experience with many of the data structures that now, we've

talked about already in this course. many of you could go off and implement

this data structure, and we'll see how simple the implementations are.

2:38

now, but for the left and right links those are for keys that start with some

letter that appears before S on the left, and after S on the right.

and then say, on the left, the node for B is, if you go down this link, it's all

the keys that start with B. Otherwise the B is just to divide the

ones that are less, from the ones that are bigger, and so forth.

and again, if you [COUGH] search for a key by say, look search for SHE, so

that's going down middle links and matching keys.

Then some nodes will have values associated with them, that's the value

associated with the last letter, last character in the search key.

So, that's what the data structure looks like.

and I'd begin to describe the search algorithm, it's quite simple given the

definition of the data structure. Again, we follow links corresponding to

each character of a key. If we have a character that's less than

the character in the node, we go left, if we have one that's greater, we can go

right. and if it's equal we move to the next

character, and take the middle link. so this is an example of a search for a

SEA. Start with S, that's a match, we take the

middle link. Now the next character is E that's a

mismatch and E is less than 8, so we go to the left.

4:16

now we find a note that has E, and we didn't update our pointer into the key

character, because we didn't find a match.

So, we're still looking at the E, and now, now we can match that E, so we go

down the middle link. Now we're looking at the A in the search

key, and that's less than L, so we go to the left.

now we're still looking at the A in the search key, and that's a match and it's

the last character in key, so, we return, the value associated with the last

character in the key. So, it's the same basic algorithm, that

we used for trie's except, we just have a different way to decide whether we match

the character. We explicitly store characters in nodes.

We follow middle link on a match, and update the pointer in our key character.

Otherwise, we, we follow the left or right link, because we know that the key

is going to be found in the left or right sub-tree.

And each node has exactly three links. So here's a example of search in a TST,

again this is the example I just talked about in a dynamic form, form.

So, if S is the first character in the key, matches the S in the first node of

tree, so we take the middle link, and move on to the second character in the

key which is e. Compare that against h, it's not a match

in fact it's less, so we go left. So, now we're still looking at e and c,

and it's a match with the character in the node that we're currently processing.

So, we take the middle, and now we move to the next character in the key which is

a. And now we take a compared to l, and a is

less, so we go left, we're still looking at the a, didn't updated it, 'cuz it's

not a match, and now it is a match. and it's the last character in the key,

so we look at the value, and that's the value we return.

what about unsuccessful search? well, let's see how that works.

So, for shelter, it starts with an S, so we take the middle link, second, and move

to the second character. Second character's an h, and that's a

match. So, we take the middle link, and move to

the next character. Third character's an e and third

character, and that, that's also a match. So again, we take the middle link, and

move to the next character. the fourth character's an l, that's also

a match. So, again, we take the middle link, and

move to the next character. Now the next character is a t, which is

not a match, so we want to move to the right but moving to the right takes us to

a null link. So that means that shelter is not in the

TST and, and so we return null. Again, in every, in every step what we do

is completely well-defined. and any search for a key that in the TST,

it's going to return it's associated value.

Any search for any other key, is either going to, go out on a null link, or end

on a node that involves a mismatch. so, with that basic idea, let's take a

more detailed look in the demo, of how the tree gets constructed.

and following through this demo, will give you a pretty good feeling for how

these trees are constructed. So, we're going to associate the value 0

with a string, SHE. so, again every time we move to a new

letter, we have to create a new node, until we get to the end of the key.

And at which case, we put the, associated the value with the node that contains the

last character in the key. so, that's the starting point.

key is a sequence of characters down the middle links that [COUGH] goes to from

the root to some node. And the value is in the node that

corresponds to the last character. so, how do we put a new one in this tree

so the in this Ternary search trie. Our first character is S, so we go down

the middle. Our next character is not a match so

[COUGH] we go to the left, and it's a null link but we have a node that we want

to put there. That's the one corresponding to the

second letter, in sell's so we do that, and then we make nodes with middle links,

going until we get to the last character in sell's, and we associate that with the

value 1. So, what about SEA, let's see how that

one works. so, first character is S, so I take the

middle link, second character is e which is less than h and not a match so we move

to the left and continue looking for the e.

now this node we find the e, so we take a middle link, and move to the next node

[COUGH] next character which is a, a is less than l.

it's not a match so we go to the left, that's a null linl, and that's where we

put our a, and we associate it with the value 2.

OK, here's sh, shells so we have a match at s, go to the middle link, and go to

the next letter, match h middle link next letter.

and then we go down and add the three nodes corresponding to the last three

letters and put a three at the node corresponding to the last letter.

OK? now B Y we look at S not a match it's

less, so we're going to go to the left, that's a null link.

That's where we put our first character, and then we go down the middle link to

add the node for the last character,and then ah,associate the value for there.

and then B is a similar thing on the right.

And then go for the t and h and e, and that gets our share with the value 5.

10:37

SEA again, it's associative, so we do the search, so S is a match.

So we go down the middle link, and move to the next letter.

E is less than h so we go to left, and keep looking for e.

Now we find e is a match, so we go down the middle link, and move to the next

letter. that's a, which is less than l, so we go

to the left to look for the a, and there we find it.

and we're [COUGH] building an associate symbol table, so we update and overwrite

the old value with the new value, 6. S, H, O, R, E.

we match the S, match the h, now we're on O, that does not match e, so we go to the

right, and keep looking for O. Now, we don't find it, so we create a new

node for O, R, E, and then put the value 7 in the values associated with the last

character in that key. so, that's the construction of a Ternary

search tree containing eight keys. Let's take a look at a slightly bigger

Ternary search trie. so here's our example with three letter

words. now for that example, we didn't show all

the null links, but there's lots of null links, 26 null links in each leaf for

this 26 way trie. And there's other null links higher in

the tree, and actually counting up there's over a thousand null links in

this tiny 26 way trie. But in a TST, it's got about the same

number of nodes it's actually got more nodes, because it's got one for each

character. And the other ones correspond to links,

but not that many more nodes. But the key thing is, it has many fewer

null links, because there's only three links per node.

And this effect is obviously going to be much more dramatic when R is much bigger.

like a 256 way trie or even a unicode 65,000 way trie.

that's the key benefit of TSTs, that and they're much more space-efficient, then

our way trie's for large R. so, let's take a look at the

implementation in Java, and then we'll look at the code, and then we'll talk

some more about performance. oh, the representation in Java is very

straightforward, as I said many of you could code this up, just from the

description. what does a node have to have?

It's going to have a value, it's going to have a character.

it's going to have references to left, middle, and right TSTs.

and that's what it's built from. so whereas the standard trie

representation has a ray of links with characters implicit.

TST has explicit characters with only three links per node.

so that's TST representation and then given the representation the code is

follows almost immediately. with our standard setup if we're supposed

to, now we use, we call a recursive routine, giving the root as first

argument. And give the tree returned going to give

a reference that back to the root. so our recursive routine takes a node and

returns a node and for most of the time, it, it doesn't do much.

But when it can do node, is created this this code is effective, because any link

we go down, we replace with a link that we got back, our standard setup.

so, the recursive port routine if we're that character d, we pull out the d

character. if once we've done that, if our node is

null, we create a new node and give it that character.

15:25

otherwise, we go either down the left or right link under the same circumstances,

without moving to the next character in the key.

so that's extremely straightforward code for put in Ternary search trie's.

and again, get is even simpler. it's a similar set up that we've used

several times before. Use a recursive routine that returns a

node, and we pull our value out of that node.

we don't have to cast in this case, cause we're not running arrays, and our nodes

just has the value type in it. and if we get a null node back, yea, we

return null that its not there, the recursive routine will pull out the

character if its less than our if our search character is less we go to the

left. And if its greater we go to the right,

and if its equal we and we're not at the end of the key, we go down to the middle

and we move to the end of the key. If its equal and we are at the end of the

key, we return the node. And even that node, is going to contain a

value or it's not, but that's what we'e return.

so, extremely straightforward implementation of both get and put for

Ternary search drives. so this, adds this line to our table,

where, the amount of space taken for Ternary search trie's is just three

lengths you know, plus the value character in each node.

so similar to what red-black BST and hashes would use, so no space

disadvantage. the determining the cost of search hits

and search miss, these values are given for random keys, and they're the subject

of deep analysis. but it's not too many.

Le, Let's, let's put it that way. And our experimental results bear that

out. So, for, for a search miss in a Ternary

search trie, you don't have to look at the whole key.

Usually, you look at maybe fewer characters than than the whole key, and

actually these results are, are for typical, or random.

but actually TSD's really shine, in the case where data is not random.

they very well adapt to the data. like text written in English language, is

not really language really random. And you can see that Ternary search trees

actually out perform both hashing and red black BSTs for our benchmark, which is

deduping these big text files. [COUGH] now you, you can go ahead and try

to get worst case guarantees. By using rotations to balance the the the

internal binary search tree's corresponding to each character in, in

TSTs. although that's probably normally not

worth the trouble. but the bottom line is that TST is at

least it's as fast as hashing for string keys and it's it's space efficient for

sure. And it's pretty easy to speed it up.

the idea is to speed it up, that works extremely well in practice is to, use our

way of branching, at the be, at the top of the TST.

so rather than fool around with the first, couple of characters, simply do a

big branching at the root. You can do r way branching at the root of

26, but in practice, it's pr, proven it works fine to do, let's say, r squared

way branching at the root. and immediately take care of the first

two characters, which is the common case, and then get down to small TSTs.

and there's variations of this, that you might imagine, but this is a really

simple one that almost always works. you have to deal especially with one and

two letter words. and we'll leave that for exercise.

but with that change now we get a, an algorithm that's is definitely faster

than hashing in red-black BSTs for our benchmarks.

and that'll turn out to be the case, for many applications.

So, just to summarize what are the trade offs between TSTs and hashing?

for hashing you have to examine the entire key, in order to compute the hash

function. we talked about examples, where people

try to skip doing that, and wind up with performance problems.

Not much difference between a hit and a miss in a hashing, and there's a lot of,

of reliance and maybe difficulty of implementation ah,depending on the hash

function. and also, hashing is really only good for

search and insert. the other kinds of operations that we'd

like to perform in symbol table when there's order in the keys is not

supported by hashing. with TSTs, it only works for strings or

if you have keys that are numbers, then you can rework them into strings, or

string like things. And that's that's a restriction but does

cover a huge number of applications. it, it only examines the key characters

that it needs to, and in particular, a search miss might involve only a few

characters. So, in typical applications you'll get

sub-linear performance in TSTs. You can get searches done without looking

at the whole key. the other thing is there's other, the

ordered symbol table operations you can, you know easily support with TSTs.

just by extending, what we did for binary search tree.

and even more important there's other interesting operations that we can add

for string keys. that are very useful, and we'll see

applications, and implementations of these, in just a minute.

so the bottom line is that this data, is a relatively new data structure.

but, its better than the classic ones for important applications.

Its faster than hashing particular for search misses.

and its got more flexibility even than red black BSTs, and that's what we will

talk about soon, but that's an introduction to TSTs.