0:00

So, in this video, we're going to start reasoning about the performance of hash

tables. In particular, we'll make precise this idea that, properly implemented, they

guarantee constant time lookup. So, let me just briefly remind you of the road map

that we're in the middle of. So, we observed that every fixed hash function is

subject to a pathological data set, and so we're exploring the solution of making a

real-time decision of what hash function to use. And we've already gone over this

really quite interesting definition of universal hash functions and that's the

proposed definition of a good random hash function. Moreover, in the previous video

I showed you that there are simple such families of hash functions. They don't

take much space to store, they don't take much time to evaluate. And the plan for

this video is to prove formally, that if you draw a hash function uniformly at

random from a universal family of hash functions, like the one we saw in the

previous video, then you're guaranteed expected constant time for all of the

supported operations. So, here's the definition once more of a universal family

of hash functions. We'll be using this definition, of course, when we prove that

these hash functions give good performance. So, remember, we're talking

now about a set of hash functions. These are all of the conceivable real time

decisions you might make about which hash function to use. So, the universe is

fixed, this is something like IP addresses, the number of buckets is fixed.

You know that's going to be something like 10,000, say, and what it means for a

family to be universal is that the probability that any given pair of items

collides is as small as with the gold standard of completely perfect

uniform random hashing. That is, for each pair x, y of distinct elements of the

universe, so for example, for each distinct pair of IP addresses, the

probability of a collision, over the choice of the random hash function little h from the

family script h, is no more than one over n, where n is the number of buckets. So, if you

have 10,000 buckets, the probability that any given pair of IP addresses winds up

getting mapped to the same bucket is at most one in 10,000. Let me now spell out

the precise guarantee we're going to prove if you use a randomly chosen hash function

from a universal family. So, for this video, we're going to only talk about hash

tables implemented with chaining, with one linked list per bucket. We'll be able to

give a completely precise mathematical analysis with this collision resolution

scheme. We're going to analyze the performance of this hash table in

expectation over the choice of a hash function little h drawn uniformly from a

universal family script h. So, for example, for the family that we

constructed in the previous video, this just amounts to choosing each of the four

coefficients uniformly at random. That's how you select a hash function uniformly at

random from the family. So, this theorem and also the

definition of universal hash functions dates back to a 1979 research paper by

Carter and Wegman. The idea of hashing dates back quite a bit before that,

certainly to the 50s. So, this just kind of shows us sometimes ideas have to be

percolating for a while before you find the right way to explain what's going on. So,

Carter and Wegman provided this very clean and modular way of thinking about

performance of hashing through this universal hashing definition. And the

guarantee is exactly the one that I promised way back when we talked about

what operations are supported by hash tables and what kind of performance should

you expect, you should expect constant time performance. As always, with hashing,

there is some fine print so let me just be precise about what the caveats of this

guarantee are. So, first of all, necessarily this guarantee is an

expectation. So, it's on average over the choice of the hash function, little h. But

I want to reiterate that this guarantee does hold for an arbitrary data set. So,

this guarantee is quite reminiscent of the one we had for the randomized Quicksort

algorithm. In Quicksort, we made no assumptions about the data. It was a

completely arbitrary input array, and the guarantee said: on average over the

real-time randomized decisions of the algorithm, no matter what the input is,

the expected running time was in log in. Here we're saying again, no assumptions

about the data. It doesn't matter what you're storing in the hash table, and in

expectation over the real-time random decision of what hash function you use,

you should expect constant time performance, no matter what that data set

is. So, the second caveat is something we've talked about before. Remember, the

key to having good hash table performance: not only do you need a good hash function,

which is what this universality property is providing us, but you also need to control

the load of the hash table. So, of course, to get constant time performance, as we've

discussed, a necessary condition is that you have enough buckets to hold more or

less the stuff that you're storing. That is the load, alpha, the number of objects

in the table divided by the number of buckets should be constant for this

theorem to hold. And finally, whenever you do a hash table operation, you have to in

particular invoke the hash function on whatever key you're given. So, in order to

have constant time performance, it better be the case that it only takes constant

time to evaluate your hash function and that's, of course, something we also

discussed in the previous video when we emphasized the importance of having simple

universal hash functions like those random linear combinations we discussed in the

previous video. In general, the mathematical analysis of hash table

performance is a quite deep field, and there are some quite mathematically

interesting results that are well outside the scope of this course. But what's cool

is that for this theorem I will be able to provide you a full and rigorous proof. So, for

hash tables with chaining and using randomly chosen universal hash functions,

I'm going to now prove that you do get constant time performance. Right, so hash

tables support various operations, Insert, Delete and Lookup. But really if we can

just bound the running time of an unsuccessful lookup, that's going to be

enough to bound the running time of all of these operations. So, remember in hash

table with chaining, you first hash to the appropriate bucket and then you do the

appropriate Insert, Delete or Lookup in the linked list in that bucket. So, the

worst case, as far as traversing this linked list goes, is if you're looking for

something but it's not there, because you have to look at every single element in the

list, all the way to the end, before you can conclude that the lookup has failed.
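To make the chaining picture concrete, here is a minimal Python sketch of a hash table with chaining. It is illustrative only: the class name is my own, a plain Python list stands in for each bucket's linked list, and the placeholder hash function used in the example (a fixed modular hash) is deliberately simple and is not universal.

```python
class ChainedHashTable:
    """Minimal sketch of a hash table with chaining: one list per
    bucket stands in for the linked list. The hash function h is
    supplied at construction time (the real-time random decision)."""

    def __init__(self, h, n):
        self.h = h                               # e.g. drawn from a universal family
        self.buckets = [[] for _ in range(n)]    # n empty buckets

    def insert(self, key):
        # Constant time: append to the list in key's bucket.
        self.buckets[self.h(key)].append(key)

    def lookup(self, key):
        # The worst case is an unsuccessful lookup: we scan the
        # whole list in key's bucket before concluding key is absent.
        return key in self.buckets[self.h(key)]


# Usage with a placeholder hash function (illustration only; a fixed
# modular hash like this is NOT universal):
table = ChainedHashTable(h=lambda k: k % 10, n=10)
table.insert(42)
found = table.lookup(42)       # successful lookup
missing = table.lookup(7)      # unsuccessful: scans bucket 7's entire list
```

The point of the sketch is that every operation starts by evaluating h and then walks the one list in that bucket, which is why list length governs everything.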

Of course, insertion, as we discussed, is always constant time, deletion and

successful searches, well, you might get lucky, and stop early before you hit the

end of the list. So, all we're going to bother to analyze is unsuccessful lookups,

and that will carry over to all of the other operations. So, a little more precisely,

let's let S be the data set. These are the objects that we are storing in our hash

table. And remember that to get constant time lookup, it really needs to be the

case that the load is constant. So, we are assuming that the size of S is big-O of

the number of buckets n. And let's suppose we are searching for some other object

which is not in S; call it x. And again, I want to emphasize we are making no

assumptions about what this data set S is, other than that its size is comparable to

the number of buckets. So, conceptually doing a lookup in a hash table with

chaining is a very simple thing. You just hash to the appropriate bucket and then

you scan through the linked list in that bucket. So, conceptually, it's very easy

to write down what the running time of this unsuccessful lookup is. It's got two

parts. So, the first thing you do is you evaluate the hash function to figure out

the right bucket. And again, remember we're assuming that we have a simple enough

hash function that it takes constant time to evaluate. Now, of course, the magic of hash

functions is that, given this hash value, we can zip right to where the

linked list is to search for x, using random access into our array of buckets.

So, we go straight to the appropriate place in our array of buckets and we just

search through the list, ultimately failing to find the object x we're looking for.

Traversing a linked list, as you all know, takes time proportional to the length

of the list. So, we find something that we discussed informally in the past which is

that the running time of hash table operations implemented with chaining is

governed by the list lengths. So, that's really the key quantity we have to

understand. So, let's give this quantity a name, Capital L; it's important

enough to deserve one. So, what I want you to appreciate at this point is that this

key quantity, Capital L, the length of the list in x's bucket, is a random variable.

It is a function of which hash function little h we wind up selecting in real

time. So, for some choices of our hash function, little h, this list length might

be as small as zero, but for other choices of this hash function h, this list

length could be bigger. So, this is exactly analogous to Quicksort, where

depending on the real-time decision of which random pivot element you use, you're

going to get a different number of comparisons, a different running time. So,

the list length, and hence the time for the unsuccessful lookup, depends on the hash

function little h, which we're choosing at random. So, let's recall what it is we're

trying to prove. We're trying to prove an upper bound on the running time of an

unsuccessful lookup on average, where the on average is over the choice of the hash

function little h. We've expressed this lookup time in terms of the average

list length in x's bucket where again the average is over the random choice of h.
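That real-time random choice of h can itself be sketched concretely. The following is a minimal Python sketch, assuming (as with the family from the previous video) that keys split into four parts, like the four octets of an IP address, and that the number of buckets n is prime; the function name and its details are illustrative, not the lecture's exact construction.

```python
import random

def draw_hash_function(n):
    """Draw h uniformly at random from the four-coefficient family:
    pick each of the four coefficients uniformly from {0, ..., n-1}.
    Assumes the number of buckets n is prime and that keys are
    4-tuples, e.g. the four octets (x1, x2, x3, x4) of an IP address."""
    a = [random.randrange(n) for _ in range(4)]   # the real-time random decision
    def h(key):
        # Random linear combination of the key's four parts, mod n.
        return sum(ai * xi for ai, xi in zip(a, key)) % n
    return h

h = draw_hash_function(10007)        # roughly 10,000 buckets, prime
bucket = h((172, 16, 254, 1))        # maps one IP address to its bucket
```

Once drawn, h is fixed and deterministic; the randomness, and hence the random variable L, lives entirely in the draw of the coefficients.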

Summarizing, we've reduced what we care about, the expected time for a lookup, to

understanding the expected value of this random variable Capital L, the

list length in x's bucket. So, that's what we've got to do, we've got to compute the

expected value of this random variable, Capital L. Now to do that, I want to jog

your memory of a general technique for analyzing expectations which you haven't

seen in a while. The last time we saw it, I believe, was when we were doing the

analysis of randomized Quicksort and counting its comparisons. So, here's a

general decomposition principle which we're now going to use in exactly the same

way as we did in Quicksort here to analyze the performance of hashing with chaining.

So, this is for when you want to understand the expectation of a random variable

which is complicated, but which you can express as the sum of much simpler random

variables. Ideally, 0,1 or indicator random variables. So, the first step is to

figure out the random variable, Capital Y as I'm calling it here, that you

really care about. Now, we finished the last slide by completing step one. What we

really care about is Capital L, the list length in x's bucket. So, that governs the

running time of an unsuccessful lookup, so clearly that's what we really care about.

Step two of the decomposition principle is well, you know, the random variable you

care about is complicated, hard to analyze directly, so decompose it as a sum of 0,1

indicator random variables. So, that's what we're going to do in the beginning of the

next slide. Why is it useful to decompose a complicated random variable into the sum

of 0,1 random variables? Well, then you're in the wheelhouse of linearity of

expectation. You get that the expected value of the random variable that you care

about is just the sum of the expected values of the different indicator random

variables and those expectations are generally much easier to understand. And

that will again be the case here in this theorem about the performance of hash

tables with chaining. So, let's apply this three-step decomposition principle to

complete the proof of the Carter-Wegman theorem. So, for the record, let me just

remind you about the random variable that we actually really care about, that's

Capital L. The reason that's a random variable is because it depends on the

choice of the hash function, little h. This number could vary between zero and

something much, much bigger than zero, depending on which h you choose. So,

this is complicated, hard to analyze directly, so let's try to express it as

the sum of 0,1 random variables. So, here are the 0,1 random variables that are going

to be the constituents of Capital L. We're going to have one such variable for each

object y in the data set. Now, remember this is an unsuccessful search, x is not

in the data set Capital S. So, if y is in the data set, x and y are necessarily

different. And we will define each random variable z sub y, as follows. We'll define

it as one if y collides with x under h and zero otherwise. So, for a given zy, we

have fixed objects x and y. So, x is some IP address, say, and y is some distinct IP

address, x is not in our hash table, y is in our hash table. And now, depending on

which hash function we wind up using, these two distinct IP addresses may or may

not get mapped to the same bucket of our hash table. So, this is an indicator random

variable, just indicating whether or not they collide, whether or not we unluckily

choose a hash function little h that sends these distinct IP addresses x and y to

exactly the same bucket. Okay, so, that's zy, clearly by definition, it's a 0,1

random variable. Now, here's what's cool about these random variables:

Capital L, the list length that we care about decomposes precisely into the sum of

the zy's. That is, we can write capital L as being equal to the sum over the objects

in the hash table of zy. So, if you think about it, this equation is always true, no

matter what the hash function h is. That is, if we choose some hash function that

maps the IP address x to, say, bucket number seventeen, Capital L is just

counting how many other objects in the hash table wind up getting mapped to

bucket number seventeen. So, maybe ten different objects got mapped to bucket

number seventeen. Those are exactly the ten different values of y that will have

their zy equal to 1, right? So, L is just counting the number of objects in the

data set S that collide with x. A given zy is just counting whether or not a given

object y in the hash table is colliding with x. So, summing over all the possible

things that could collide with x, summing over the zy's, we of course get the total

number of things that collide with x which is exactly equal to the number, the

population of x's bucket in the hash table. Alright, so we've got all of our

ducks lined up in a row. Now, if we just remember all of the things we have going

for us, we can just follow our nose and nail the proof of this theorem. So, what

is it that we have going for us? Well, in addition to this decomposition of the list

length into indicator random variables, we've got linearity of expectation going for us,

we've got the fact that our hash function is drawn from a universal family going for

us. And we've got the fact that we've chosen the number of buckets n to be

comparable to the data set size. So, we want to use all of those assumptions and

finish the proof that the expected performance is constant. So, we're going

to have a few inequalities and we're going to begin with the thing that we really

care about. We care about the average list length in x's bucket. Remember, we saw in

the previous slide, this is what governs the expected performance of the lookup. If

we can prove that the expected value of capital L is constant, we're done, we've

finished the theorem. So, the whole point of this decomposition principle is to

apply linearity of expectation, which says that the expected value of a sum of random

variables equals the sum of the expected values. So, because L can be expressed as

the sum of these zy's, we can reverse the summation and the expectation and we can

first sum over the y's, over the objects in the hash table and then take the

expected value of zy. Now, something which came up in our Quicksort analysis but

which you might have forgotten is that 0,1 random variables have particularly simple

expectations. So, the next quiz is just going to jog your memory about why 0,1

random variables are so nice in this context. Okay, so the answer to this quiz

is the third one, the expected value of zy is simply the probability that x and y

collide, that just follows from the definition of the random variable zy and

the definition of expectation. Namely, recall how we defined zy: it is

one if this object y in the hash table happens to collide with the object x

that we are looking for under the hash function h, and zero otherwise. Again, this

will be one for some hash functions and zero for other hash

functions. And then we just have to compute expectations. So, one way to

compute the expected value of a 0,1 random variable is, you just say, well, you know,

there are the cases where the random variable evaluates to zero and then

there's the cases where the random variable evaluates to one, and of course

we can cancel the zero. So, this just equals the probability that zy = one. And

since zy being one is exactly the same thing as x and y colliding, that's what

gives us the answer. Okay, so it's the probability that x collides with y. So,

let's plug that into our derivation. Now, we again have the sum over all the objects y

in our hash table of the expected value of zy; let's write that in

the more interpretable form, the probability that this particular object y in

the hash table collides with the thing we are looking for, x. Now, we know

something pretty cool about the probability that a given pair of distinct

elements like x and y collide with each other. What is it that we know? Okay, so I

hope you answered the second response to this quiz. This is really in some sense

the key point of the analysis. This is the role that being a universal family of

hash functions plays in this performance guarantee. What does it mean to be

universal? It means that for any pair of distinct objects, like the x and y in this

proof, if you make a random choice of hash function, the probability of

collision is as good as with perfectly random hashing, namely, at most

1/n, where n is the number of buckets. So, now I can return to the derivation. What

that quiz reminds you is that the definition of a universal family of hash

functions guarantees that this probability for each y is at most 1/n, where n is the

number of buckets in the hash table. So, let's just rewrite that. So, we've upper

bounded the expected list length by a sum over the objects in the data set of 1/n.
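Collected in one place, the derivation reads as follows (linearity of expectation gives the second equality, the expectation of a 0,1 random variable gives the third, and universality gives the inequality):

```latex
\mathbb{E}[L]
  = \mathbb{E}\Big[\sum_{y \in S} z_y\Big]
  = \sum_{y \in S} \mathbb{E}[z_y]
  = \sum_{y \in S} \Pr[h(x) = h(y)]
  \le \sum_{y \in S} \frac{1}{n}
  = \frac{|S|}{n}
  = \alpha.
```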

And this, of course, is just equal to the number of objects in the data set, the

cardinality of S, divided by n. And what is this? This is simply the load; this is the

definition of the load alpha which we are assuming is constant. Remember, that was

the second caveat in the theorem. So, that's why, as long as you have a hash

function which you can compute quickly, in constant time, and as long as you keep the

load under control, so the number of buckets is commensurate with the size of

the data set that you're storing, universal hash functions in a hash

table with chaining guarantee expected constant time performance.
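The theorem can also be sanity-checked numerically. Here is a Monte Carlo sketch that estimates the expected list length E[L] and compares it to the load alpha. As a stand-in for drawing from a universal family, it uses perfectly random hashing, the gold standard the definition is measured against, under which each distinct pair collides with probability exactly 1/n; the names and parameters are illustrative.

```python
import random

def average_list_length(S, x, n, trials=3000):
    """Monte Carlo estimate of E[L]: repeatedly draw a fresh hash
    function and compute L as the sum of the indicators z_y, one per
    object y in the data set S.  Perfectly random hashing (a fresh
    uniform bucket per key) stands in for a draw from a universal
    family, so each pair collides with probability exactly 1/n."""
    total = 0
    for _ in range(trials):
        h = {key: random.randrange(n) for key in S + [x]}  # fresh random h
        total += sum(1 for y in S if h[y] == h[x])         # L = sum of z_y
    return total / trials

S = list(range(500))                       # arbitrary data set, |S| = 500
n = 1000                                   # buckets, so load alpha = 0.5
estimate = average_list_length(S, -1, n)   # x = -1 is not in S
# estimate should hover near alpha = |S| / n = 0.5
```

Because the bound holds for every data set, swapping S for any other 500 keys leaves the estimate near the same load alpha.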