0:03

So, the last step is to study our implementation for the kinds of

applications that we face and see what happens with the performance.

This is more complicated than other things that we've considered

because the cost depends on something pretty complicated:

the order in which the keys are inserted.

We looked at a typical case, which is words coming from a text.

The best thing that could happen would be for the tree to be perfectly balanced.

But there's also a bad thing that can happen.

If the keys come in in order,

then it's no better than a linked list.

And this is a situation that has to be reckoned with when working with binary search trees.

But when the keys come in in random order,

the tree stays roughly balanced,

and it tends to stay that way as you can see from this simulation.

This is 255 random keys getting inserted into a binary search tree.

It's not perfectly balanced,

but it's pretty well-balanced.

And in plenty of applications, like the words in a text,

the model that the keys are coming in random order is not so bad.

So in lots of applications, we get pretty well-balanced trees in this way.
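To see the effect of insertion order concretely, here's a minimal sketch of a plain BST (the class and method names here are my own, not the course's code) that inserts 255 keys once in sorted order and once in shuffled order, and compares the resulting heights:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class HeightDemo {
    // Minimal BST node: keys only, no values.
    static class Node { int key; Node left, right; Node(int k) { key = k; } }

    // Standard recursive BST insert.
    static Node insert(Node x, int key) {
        if (x == null) return new Node(key);
        if (key < x.key) x.left = insert(x.left, key);
        else if (key > x.key) x.right = insert(x.right, key);
        return x;
    }

    // Number of nodes on the longest path from the root down.
    static int height(Node x) {
        if (x == null) return 0;
        return 1 + Math.max(height(x.left), height(x.right));
    }

    public static void main(String[] args) {
        int n = 255;
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < n; i++) keys.add(i);

        // Sorted order: every key goes right, so the "tree" is a linked list.
        Node sorted = null;
        for (int k : keys) sorted = insert(sorted, k);
        System.out.println("sorted height = " + height(sorted));   // 255

        // Random order: the tree stays roughly balanced.
        Collections.shuffle(keys);
        Node random = null;
        for (int k : keys) random = insert(random, k);
        System.out.println("random height = " + height(random));   // much smaller, typically under 20
    }
}
```

A perfectly balanced tree on 255 keys would have height 8; random insertion typically lands within a small factor of that, while sorted insertion gives the worst case.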

So, let's look at that situation.

So, let's take a model where we say we're going to insert the keys in random order.

Well, it's possible to prove that if the keys are randomly ordered,

the cost of building a binary search tree with those keys is

about 2 N natural log N, or about 1.39 N log base two of N, compares.

If it were perfectly balanced,

it would be exactly N log base two of N. So,

it's only about 40 percent worse than perfectly balanced,

if the keys are randomly ordered.
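To see where the 1.39 comes from: since lg N = ln N / ln 2, the bound 2 N ln N equals (2 ln 2) N lg N, and 2 ln 2 is about 1.39. A one-line check:

```java
public class Constant {
    public static void main(String[] args) {
        // 2 N ln N = (2 ln 2) * N lg N, so the constant relating
        // the two forms of the bound is 2 ln 2.
        double c = 2 * Math.log(2);
        System.out.printf("2 ln 2 = %.4f%n", c);   // prints 2 ln 2 = 1.3863
    }
}
```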

This is an interesting exercise in discrete math.

It's beyond our scope,

but it's right in the sweet spot for a course on algorithms.

So, with randomly ordered keys,

we get our logarithmic performance,

the time to search or insert any key in that tree,

it's going to be proportional to log N. So,

we can say that, first of all, we implement the API exactly.

For random keys, we get

our logarithmic performance and we'll talk about improving that in a second.

We have the linear memory use and

we have no limits within the code on the collection size.

And that's all made possible by the binary tree data structure.

So that's a pretty good outcome and we can run

empirical tests to try to validate that mathematical analysis.

And so we do frequency count without the output just for our empirical tests and run it,

and we can definitely do frequency count for large numbers of keys.

And our doubling ratio is about two,

which means that now we can even do a billion in a couple of minutes.

And that confirms the hypothesis that this method scales for this kind of application,

which is way better than we're able to get with

elementary arrays or linked list implementations that were quadratic and did not scale.
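A doubling test along these lines can be sketched as follows; the synthetic "words" and the timing harness are my own illustration, not the course's frequency-count client. A ratio of successive times near two, as N doubles, is what confirms near-linearithmic scaling:

```java
import java.util.Random;
import java.util.TreeMap;

public class DoublingTest {
    // Count word frequencies with a balanced-tree symbol table and
    // return the elapsed time in nanoseconds.
    static long timedFrequencyCount(int n, long seed) {
        Random rnd = new Random(seed);
        TreeMap<Integer, Integer> freq = new TreeMap<>();
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            int word = rnd.nextInt(n / 10 + 1);  // synthetic "word" stream
            freq.merge(word, 1, Integer::sum);   // increment this word's count
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long prev = timedFrequencyCount(125_000, 1);
        for (int n = 250_000; n <= 4_000_000; n *= 2) {
            long t = timedFrequencyCount(n, 1);
            System.out.printf("n = %9d  ratio = %.1f%n", n, (double) t / prev);
            prev = t;
        }
    }
}
```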

So, pretty easy to do

that 21 million word corpus and easy to do even a billion things off the net.

And again, without this kind of algorithm,

it's not possible to address this problem.

So, it's definitely a contribution of the algorithmic technology to

give us the efficient performance that we have for so many symbol table implementations.

Now in practice, it could be that the keys are not randomly ordered,

and the running time would be quadratic,

and in fact this happens in practice.

People try to work with keys that are already in sorted order for example.

So, in an industrial-strength implementation,

BSTs are not going to work.

But remarkably, we have algorithms that are not that much

more complicated than binary search trees, called balanced trees,

that perform simple transformations that guarantee balance.

And this has been known for a long time since the '60s,

and right after graduate school,

a fellow graduate student and I developed an implementation called

Red-black trees that are found in most modern systems including Java.

The Java symbol table is built on Red-black trees.

Here's a simulation of a Red-black tree for the 255 keys.

So, the idea is that some of the links are red and some are black.

And there's a restriction that there's never two red links in a row.

And another restriction that the number of black links from the root to the bottom is always the same.

Just maintaining those restrictions turns out to be not so difficult to do,

and the trees are very nearly balanced,

almost exactly log N, and it's guaranteed to stay that way.
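The representation behind this can be sketched as follows; the names are illustrative, but the idea of storing the color of each node's parent link, with null links counting as black, is the standard one:

```java
public class RedBlackSketch {
    static final boolean RED = true, BLACK = false;

    static class Node {
        int key;
        Node left, right;
        boolean color;   // color of the link from this node's parent
        Node(int key, boolean color) { this.key = key; this.color = color; }
    }

    static boolean isRed(Node x) {
        return x != null && x.color == RED;   // null links count as black
    }

    // The first restriction, checked locally at one node:
    // a red link must never be followed by another red link.
    static boolean twoRedLinksInARow(Node x) {
        return isRed(x) && (isRed(x.left) || isRed(x.right));
    }
}
```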

So, this really does meet the performance benchmark:

no matter what order the keys come in,

we can guarantee logarithmic performance for all the symbol table operations.

And not only that, it's based on the ordering of the keys,

so we can do more sophisticated extensions to the basic API,

like finding the [inaudible] largest keys and other things like that.

So, that's again what's used in Java's TreeMap library.

So, this is implementing our API using Red-black trees,

using Java's library, with just slightly different names for the methods.
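For example, the basic operations map directly onto java.util.TreeMap, and the ordered operations come along for free (the words here are just sample data):

```java
import java.util.TreeMap;

public class TreeMapDemo {
    public static void main(String[] args) {
        // TreeMap is Java's red-black-tree symbol table.
        TreeMap<String, Integer> st = new TreeMap<>();
        st.put("it", 2);
        st.put("was", 2);
        st.put("the", 1);
        st.put("best", 1);

        System.out.println(st.get("it"));          // 2
        System.out.println(st.containsKey("of"));  // false

        // Ordered operations, possible because the keys are kept in order:
        System.out.println(st.firstKey());         // best
        System.out.println(st.lastKey());          // was
        System.out.println(st.floorKey("t"));      // largest key <= "t": it
    }
}
```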

And so again, it's possible to prove that all the operations are

guaranteed to use fewer than two log base two N compares,

which is really an amazing result when you think about it.

So, proving this is definitely the stuff of an algorithms course,

and you'll certainly learn that if you take an algorithms course.

But our summary is,

with BSTs we have a pretty simple symbol table implementation that's usually efficient.

There's an alternative that we haven't talked about called

hashing, which is a bit more complicated,

and it can be efficient.

It doesn't support ordered operations,

and it's got some different performance characteristics.

That again, you can learn about in the course on algorithms.

With Red-black trees, that's a variant that's guaranteed to be efficient.

So, it's an industrial-strength implementation of symbol tables,

and there's many, many, many applications that are enabled by efficient symbol tables.

So, our basic question at the beginning,

can we implement associative arrays with just a log factor extra cost?

The answer is definitely yes.

And Alice, who took our algorithms course, is quite convinced of that.

And so, think about what that means:

if you manage to get a trillion customers,

which might be difficult,

you could search among them in fewer than 80 compares.

That's kind of amazing when you think about it,

or even if you had all the atoms in the universe,

it would only take 200 compares and we can guarantee that.

That's truly awesome in Bob's language.
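The trillion-customer arithmetic checks out directly: with the guaranteed bound of 2 lg N compares, N equal to one trillion gives just under 80:

```java
public class TrillionCheck {
    public static void main(String[] args) {
        // lg of one trillion, via the change-of-base formula.
        double lg = Math.log(1e12) / Math.log(2);
        System.out.printf("2 lg(10^12) = %.1f%n", 2 * lg);  // prints 2 lg(10^12) = 79.7
    }
}
```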