Just as in the last lecture,

we want to make the point that performance matters in these implementations.

All sorts of applications we've considered involve huge,

huge numbers of key-value pairs.

We want to start by specifying a performance requirement.

So just as an example,

let's look at a benchmark application: linguistic analysis.

There's a thing called Zipf's law for natural language,

which applies to a corpus, or huge text, in a natural language.

This is a theory that,

if the most frequent word occurs about t times in that text,

then the second most frequent occurs about t over two,

and the third about t over three and so forth.
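As a rough sketch of what the law predicts (the count t = 60000 here is a made-up figure, not a measurement from any real corpus):

```python
# Sketch: expected counts under Zipf's law, assuming the most
# frequent word appears t = 60000 times (an illustrative number).
t = 60000

# The k-th most frequent word occurs about t/k times.
expected = [t // k for k in range(1, 6)]
print(expected)  # [60000, 30000, 20000, 15000, 12000]
```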

So, linguists are interested in testing that law in real natural language text.

So that's the goal: we want to validate that law.

Well, we can use our frequency analysis program

that we just considered as a symbol table client.

But in order for that to complete for a huge text,

we need an efficient symbol table implementation.
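A minimal sketch of such a frequency-analysis client, using Python's built-in dict as the symbol table (the function name and sample text are illustrative, not the lecture's code):

```python
# Sketch: a word-frequency client that treats a dict as the symbol table.
def word_frequencies(text):
    freq = {}  # symbol table: word -> count
    for word in text.lower().split():
        freq[word] = freq.get(word, 0) + 1
    return freq

counts = word_frequencies("to be or not to be")
print(counts["to"], counts["be"], counts["or"])  # 2 2 1
```

Sorting the resulting counts in decreasing order is what lets you compare against the t/k prediction.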

So, for example, here's the result of running this

for Moby-Dick which has about a million words.

And a computation like that one is going to be difficult to

do unless you have a good symbol table implementation.

This is a validation of Zipf's law for Moby-Dick, and it's not too far off.

You can see the observation versus the hypothesis and it's pretty close.

People actually study this for huge text nowadays.

So, there's a thing called, for example,

the Wortschatz corpus from the Leipzig project.

And that one might have 20 million words.

Maybe half a million of them distinct.

And nowadays, if you look at big sets of text on the web,

you might even have billions of words.

So, you're not going to be able to study these

without an efficient symbol table implementation.

So, let's think about the data structures that

we've considered so far and see where we are.

So, one idea might be let's keep the keys in order in an array.

Keep the values in a parallel array.

So if we have the index of the key,

we have the index of the value.

And use it like we did in the sorting and searching lecture.

We have a fast sort, mergesort.

And we have to remember whenever we move a key,

we have to move the associated value.

And we get fast search with binary search.
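A sketch of the idea, using Python's bisect module for the binary search (the keys and counts are made-up, Moby-Dick-flavored examples):

```python
from bisect import bisect_left

# Sketch: sorted keys[] with a parallel vals[] array;
# binary search gives fast lookups.
keys = ["ahab", "ishmael", "queequeg", "whale"]
vals = [10, 1, 4, 1200]

def get(key):
    i = bisect_left(keys, key)  # binary search: O(log n) compares
    if i < len(keys) and keys[i] == key:
        return vals[i]           # same index into the parallel array
    return None

print(get("whale"), get("pequod"))  # 1200 None
```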

But one problem is how big do we make the array?

So, we talked about that in the sorting and searching lecture.

But there's really a fatal flaw in this, even if you

address that one: how do you insert a new key?

If you want to keep the key array in order when you get a new string,

then you have to move all the larger entries over by one,

in the same way as with insertion sort.

And that's too slow.
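The shifting step can be sketched like this (again with made-up keys); Python's list.insert does exactly the move-everything-over-by-one work that makes this slow:

```python
from bisect import bisect_left

# Sketch: inserting into sorted parallel arrays forces a shift of
# all larger entries, which is what makes the benchmark quadratic.
keys, vals = ["ahab", "queequeg"], [10, 4]

def put(key, val):
    i = bisect_left(keys, key)          # find the insertion point
    if i < len(keys) and keys[i] == key:
        vals[i] = val                    # key present: just update
    else:
        keys.insert(i, key)  # shifts every larger entry over by one
        vals.insert(i, val)

put("ishmael", 1)
print(keys)  # ['ahab', 'ishmael', 'queequeg']
```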

And you can easily see that in our benchmark it's going to take quadratic time,

quadratic in the number of distinct keys,

because every time you get a new key,

on average you're going to have to move about half of the entries.

And you can run experiments to validate that.

But we're not going to take the time to do that right now.

It's clear from the model.

So, what about a linked list? Would that work?

So, maybe the idea is to keep the keys in order in

a linked list and add a value field for each node.

And so, that's going to be good because, the same way as for stacks and queues,

it meets the memory-use performance spec.

But that one's got a fatal flaw, and it is:

how do you do the search?

With binary search, we needed that index to access into the array.

You need to be able to quickly get to the middle.

How do you get to the middle of a linked list?

Really the only thing you can do is search sequentially through

the list to get to the middle or to search for any specific key.
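A sketch of why the linked list fails (the node class and keys are illustrative): the only way to reach any key is to walk node by node from the front.

```python
# Sketch: an ordered linked list; each search must walk sequentially
# from the head, so a single lookup takes O(n) steps.
class Node:
    def __init__(self, key, val, nxt=None):
        self.key, self.val, self.next = key, val, nxt

# Build a three-node ordered list: ahab -> ishmael -> whale.
head = Node("ahab", 10, Node("ishmael", 1, Node("whale", 1200)))

def get(head, key):
    node, steps = head, 0
    while node is not None:
        steps += 1
        if node.key == key:
            return node.val, steps
        node = node.next
    return None, steps

print(get(head, "whale"))  # (1200, 3): had to visit all 3 nodes
```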

And again, that's going to be too slow for huge amounts of data.

Pretty easy to hypothesize that our benchmarks

are going to take quadratic time and it's just not going to

work for millions or billions of words.

And again you can easily validate that with experiments,

but we won't take the time to do so.

So, our design challenge is that,

of course, we want simple, safe,

clear, and efficient client code.

And we want that efficiency to mean that the symbol table is scalable.

That is, we can keep up with Moore's law with the symbol table.

So, now we're going to relax it slightly.

It's really a lot to expect to get constant running time for these kinds of operations.

But we'll relax that to say the cost has to be at most logarithmic.

So, since the log of a million is 20,

the log of a billion is 30,

it's only a little bit more costly than constant time.
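A quick check of those logarithms (base 2), as a sketch:

```python
import math

# Sketch: the base-2 logs behind "log of a million is 20,
# log of a billion is 30" (both are approximate).
print(round(math.log2(10**6)), round(math.log2(10**9)))  # 20 30
```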

And we're going to say that if you don't have that kind of performance,

then you don't have a symbol table.

And again, we want the memory usage to be

linear in the size of the collection when it's nonempty, just as for stacks

and queues, with no limits within the code on the collection size.

So, that's the challenge.

Can we really achieve such guarantees?

Can we implement associative arrays with just a log-factor extra cost?

That seems to be quite a difficult challenge, and the naive programmer might say

there's no way that you can do that:

search through a billion keys just looking at

30 things, and allow arbitrary insertion.

And that's what this lecture is about.

Yes, we can achieve that guarantee and that's what we're going to look at next.