0:00

In the last video, we discussed the performance of hash tables that are implemented using chaining, with one linked list per bucket. In fact, we proved mathematically that if you use a hash function chosen uniformly at random from a universal family, and if you keep the number of buckets comparable to the size of your data set, then you're guaranteed constant-time expected performance. But recall that chaining is not the only implementation of hash tables. There's a second paradigm, also very important, called open addressing. This is where you're only allowed to store one object in each slot, and you keep searching for an empty slot when you need to insert a new object into your hash table. Now, it turns out to be much harder to mathematically analyze hash tables implemented using open addressing, but I do want to say a few words about it to give you the gist of what kind of performance you might hope to expect from those sorts of hash tables.

So recall how open addressing works. We're only permitted to store one object in each slot. This is unlike the case with chaining, where we can have an arbitrarily long list in a given bucket of the hash table. With at most one object per slot, open addressing obviously only makes sense when the load factor alpha is less than one, that is, when the number of objects you're storing in your table is less than the number of slots available. Because of this requirement of at most one object per slot, we need to demand more of our hash function. Our hash function might ask us to put a given object, say one with some IP address, into, say, bucket number 17, but bucket 17 might already be full, might already be populated. In that case, we go back to our hash function and ask it where to look for an empty slot next. So maybe it tells us to next look in bucket 41. If 41 is full, it tells us to look in bucket number 7, and so on. Two

specific strategies for producing a probe sequence that we mentioned earlier were double hashing and linear probing. Double hashing is where you use two different hash functions, h1 and h2. h1 tells you which slot to search first, and then every time you find a full slot you add an increment specified by the second hash function, h2. Linear probing is even simpler: you have just one hash function that tells you where to search first, and then you add one to the slot index until you find an empty slot. As I mentioned at the beginning, it is quite

nontrivial to mathematically analyze the performance of hash tables using these various open addressing strategies. It's not impossible; there is some quite beautiful and quite informative theoretical work that does tell us how such hash tables perform, but that's well outside the scope of this course. So instead, what I want to do is give you a quick and dirty calculation that suggests, at least in an idealized world, what kind of performance we should expect from a well-implemented hash table with open addressing, as a function of the load factor alpha. Precisely, I'm going to introduce a heuristic assumption. It's certainly not true, but we'll adopt it just for a quick and dirty calculation: that we're using a hash function for which each of the n! possible probe sequences is equally likely. Now, no hash function you're ever going to use will actually satisfy this assumption, and if you think about it for a little bit, you'll realize that if you use double hashing or linear probing, you're certainly not going to satisfy it. So this will give us a kind of best-case scenario against which you can compare the performance of your own hash table implementations. If you implement a hash table and you're seeing performance as good as what's suggested by this idealized analysis, then you're home free; you know your hash table is performing great.

So what is the line in the sand that gets drawn under this heuristic assumption? What is this idealized hash function performance, as a function of the load alpha? Well, here it is. What I'm going to argue next is that, under this heuristic assumption, the expected amount of time to insert a new object into the hash table is essentially 1/(1 - alpha), where alpha is the load. Remember, the load is the number of objects in the hash table divided by the number of available slots. So if the hash table is half full, alpha is 0.5; if it's 75 percent full, alpha is 3/4. So what this means is that, in this idealized scenario, insertions stay fast as long as you keep the load under control.
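To get a feel for the 1/(1 - alpha) bound, here's a tiny arithmetic check; the loads shown are the ones discussed in this video, and the function name is just one invented for illustration:

```python
# Idealized expected number of probes per insertion under the
# "all probe sequences equally likely" heuristic: 1 / (1 - alpha).
def expected_probes(alpha):
    assert 0 <= alpha < 1, "open addressing requires load factor < 1"
    return 1 / (1 - alpha)

for alpha in (0.5, 0.75, 0.9):
    print(alpha, expected_probes(alpha))
```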

So, say, if the load is 50%, then the insertion time is going to be great, right? If alpha is 0.5, then 1/(1 - alpha) = 2, so you expect just two probes before successfully inserting the new object. And of course, lookup is going to be at least as good as insert: if you're lucky, a lookup might terminate early when you find what you're looking for, and in the worst case an unsuccessful search scans all the way to an empty slot, which is the same as an insertion. So if alpha is small, bounded away from one, you're getting constant-time performance. On the other hand, as the hash table gets full, as alpha gets close to one, this operation time blows up; it goes to infinity as alpha approaches one. So if you have a 90 percent full hash table with open addressing, you're going to start seeing ten probes per operation. So with hash tables that use open addressing, you really want to keep the load under control, certainly no more than about 0.7, maybe even less than that. To refresh your

memory: with chaining, hash tables are perfectly well defined even with load factors bigger than one, and what we derived is that under universal hashing, a weaker assumption than the one here, we get an operation time of 1 + alpha for a load of alpha. So with chaining, you just have to keep alpha at most some reasonably small constant; with open addressing, you really have to keep it well bounded below one. So next, let's understand why this observation is true. Why, under the assumption that every probe sequence is equally likely, do we expect a 1/(1 - alpha) running time for hash tables with open addressing?

So, the reason is pretty simple, and we can derive it by analogy with a simple coin-flipping experiment. To motivate the experiment, think just about the very first probe that we do. So we get some new object, some new IP address, that we want to insert into our hash table. Let's say our hash table is currently 70 percent full: there are 100 slots, and 70 are already taken by objects. Well, when we make this first probe, by assumption it's equally likely to land in any one of the 100 slots, 70 of which are full and 30 of which are empty. So with probability 1 - alpha, or in this case 30 percent, our first probe will luckily find an empty slot and we'll be done; we just insert the new object into that slot. If we get unlucky, with probability 70 percent, we find a slot that's already occupied and we have to try again. So we try a new slot, drawn at random, and we again check: is it full or is it not? Again, with (essentially) 30 percent probability, it's going to be empty and we can stop; and if it's already full, then we try yet again.
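The retry process just described is easy to simulate directly. A minimal sketch, where the function name and the 30 percent figure (from the running example of a 70 percent full table) are just for illustration:

```python
import random

def probes_until_empty(p_empty, rng):
    """Probes until a uniformly random probe finds an empty slot.

    Each probe is modeled as an independent trial that succeeds with
    probability p_empty; this slightly overestimates the real process,
    which never re-probes the same slot."""
    probes = 1
    while rng.random() >= p_empty:  # landed on a full slot, try again
        probes += 1
    return probes

rng = random.Random(1)
trials = 100_000
avg = sum(probes_until_empty(0.3, rng) for _ in range(trials)) / trials
print(avg)  # close to 1 / 0.3, about 3.33, for a 70 percent full table
```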

So Doing random probes, looking for an empty slot, is tantamount to flipping a

coin with the probability of heads 1-alpha, or, in this example, 30 Percent

And the number of probes you need until you successfully insert is just the number

of times you flip this last coin until you see a heads. In fact this biased coin

flipping experiment slightly overestimates the expected time for insertions and the

heuristics assumptions and that's because in the insertion time whenever we're never

going to try the same slot twice. We're going to try all end buckets in some order

with each of the impact [inaudible] ordering equally likely So back to our

example, where we have a hash table with 100 slots, 70 of which are full. The first

probe indeed, we have a 30 in 100 chance of ge tting an empty slot. If that one

fails then we're not going to try the same slot again. So there is only 99 residual

possibilities. Again, 30 of which are empty. The one we checked last time was

full. So we actually have a 30 over 99 percent chance of getting an empty slot on

the second try. Like 30 over 98 on the third try, if the second one fails, and so

on But, a valid upper bond is just to assume a 30 percent success probability

with every single probe, and that's precisely, what this coin flipping

experiment will get us. So the next quiz will ask you to actually compute the

expected value of capital N, the number of coin flips, needed to get heads when you

have a probability of heads of one minus alpha. As a hint, we actually analyzed

this exact same coin flipping experiment when alpha equals a half, back when we

discussed the expected running time of randomized linear time selection. Alright,

so the correct answer is the first one: 1/(1 - alpha). To see why, let's return to our derivation, where we reduced analyzing the expected insertion time to this random variable, the expected number of coin flips until we see heads. I'm going to solve this exactly the same way we did back when we analyzed the randomized selection algorithm. It's quite a sneaky way, but very effective.
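Written out, the sneaky self-referential trick comes down to one linear equation (writing N for the number of coin flips, with tails probability alpha):

```latex
E[N] \;=\; 1 + \alpha \, E[N]
\quad\Longrightarrow\quad
(1-\alpha)\,E[N] \;=\; 1
\quad\Longrightarrow\quad
E[N] \;=\; \frac{1}{1-\alpha}
```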

What we're going to do is express the expected value of capital N in terms of itself, and then solve. So how do we do that? On the left-hand side, let's write the expected number of coin flips, the expected value of capital N. Then notice that there are two cases: either the first coin flip is heads or it's not. In either case you certainly pay for one coin flip, so let's separate that out and count it separately. With probability alpha, the first coin flip comes up tails, and then you start all over again; and because this is a memoryless process, the expected number of further coin flips required, given that the first flip was tails, is just the same as the expected number of coin flips in the first place. That gives the equation E[N] = 1 + alpha * E[N]. Now it's a simple matter to solve this one linear equation for the expected value of N, and we find that it is indeed 1/(1 - alpha), as claimed. Summarizing: under our idealized heuristic assumption, that every single probe sequence is equally likely, the expected insertion time is upper bounded by the expected number of coin flips, which by this argument is at most 1/(1 - alpha). So as long as your load alpha is well bounded below one, you're good; at least in this idealized analysis, your hash table will work extremely quickly. Now I hope you're

regarding this idealized analysis with a bit of skepticism. Right, from a false hypothesis you can derive literally anything you want, and we started from an assumption, the heuristic assumption that all probe sequences are equally likely, which is not satisfied by hash functions you're actually going to use in practice. So should you expect this 1/(1 - alpha) bound to hold in practice or not? Well, that depends to some extent: on what open addressing strategy you're using, on how good a hash function you're using, and on whether the data is pathological or not. So, just to give coarse rules of thumb: if you're using double hashing and you have non-pathological data, I would go ahead and look for this 1/(1 - alpha) bound in practice. So implement your hash table, check its performance as a function of the load factor alpha, and shoot for the 1/(1 - alpha) curve; that's really what you'd like to see. With linear probing, on the other hand, you should not expect this performance guarantee of 1/(1 - alpha), even in a totally idealized scenario.

Remember, linear probing is the strategy where your initial probe, the hash function, tells you where to look first, and then you just scan linearly through the hash table until you find what you're looking for: an empty slot for an insertion, or the object you're looking up. So linear probing, even in a best-case scenario, is subject to clumping: you're going to get contiguous groups of slots that are all full, and that's because of the linear probing strategy.
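To see clumping concretely, here's a minimal sketch (the table size, the simple modular hash function, and the random keys are all invented for illustration) that fills a table to 70 percent load with linear probing and reports the longest contiguous run of occupied slots:

```python
import random

def build_linear_probe_table(n_slots, n_keys, rng):
    """Fill a table with random keys using linear probing."""
    table = [None] * n_slots
    for _ in range(n_keys):
        key = rng.randrange(10**9)
        slot = key % n_slots            # initial probe (toy hash function)
        while table[slot] is not None:  # scan linearly for an empty slot
            slot = (slot + 1) % n_slots
        table[slot] = key
    return table

def longest_cluster(table):
    """Longest contiguous run of occupied slots (ignoring wrap-around)."""
    best = run = 0
    for entry in table:
        run = run + 1 if entry is not None else 0
        best = max(best, run)
    return best

rng = random.Random(7)
table = build_linear_probe_table(1000, 700, rng)
print(longest_cluster(table))
```

Runs of full slots grow as the table fills, and every insertion that lands anywhere in a run must scan to its far end, which is exactly why linear probing falls short of the 1/(1 - alpha) curve.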

Now, I encourage you to do some experiments with implementations to see this for yourself. So, because of clumping, with linear probing you're not going to see 1/(1 - alpha) even in the idealized scenario; you're going to see something worse, but, at least in idealized situations, still quite reasonable. That's the last thing I want to tell you about in this video. Now, needless to say, with linear probing the heuristic assumption is badly false. The heuristic assumption is pretty much always false, no matter what hashing strategy you're using, but with linear probing it's, quote-unquote, really false. To see that, recall the heuristic assumption says that all n! probe sequences are equally likely, so your next probe is uniformly random amongst everything you haven't probed so far. But with linear probing, it's totally the opposite: once you know the first slot you look in, say bucket 17, you know the rest of the sequence, because it's just a linear scan through the table. So it's the exact opposite of each successive probe being independent of the previous ones (apart from not probing anything twice). So, to state a conjectured or idealized performance guarantee for hash tables with linear probing, we're going to replace the blatantly false heuristic assumption with a still false, but more reasonable, heuristic assumption. So what do we know? We know that the initial probe with linear probing determines the rest of the sequence. So let's assume that these initial probes are uniform at random and independent for different keys. Of course, once you have the initial probe, you know everything else; but let's assume independence and uniformity amongst the initial probes. Now, this is a strong assumption, way stronger than assuming you have a universal family of hash functions, and it is not satisfied in practice. But performance guarantees we can derive under this assumption are typically satisfied in practice by well-implemented hash tables that use linear probing, so the assumption is still useful for deriving the correct idealized performance of this type of hash table. So what is that

performance? Well, this is an utterly classic result, from exactly 50 years ago, from 1962, by my colleague, the living legend, Don Knuth, author of The Art of Computer Programming. What Knuth proved was that, under this weaker heuristic assumption suitable for linear probing, the expected time to insert an object into a hash table with load factor alpha is roughly 1/(1 - alpha) squared. That's worse than 1/(1 - alpha), but it is still a function of the load alpha only, and not of the number of objects in the hash table. That is, with linear probing you don't get as good a performance guarantee, but it is still the case that if you keep the load factor bounded away from one, if you make sure the hash table doesn't get too full, you will enjoy constant-time operations on average. So, for example, with linear probing, if your hash table is 50 percent full, then you're going to get an expected insertion time of roughly four probes. Note, however, that this quantity blows up pretty rapidly as the hash table grows full: at 90 percent full, this is already something like a hundred probes on average. So you really don't want to let hash tables get too full when you're using linear probing. You

might well wonder if it's ever worth implementing linear probing, given that it has a worse performance curve, 1/(1 - alpha) squared, than the curve you'd hope for from something like double hashing, 1/(1 - alpha). It's a tricky cost-benefit analysis between linear probing and more complicated but better-performing strategies, and it really depends on the application. There are reasons you do sometimes want to use linear probing, and it is actually quite common in practice; for example, it often interacts very well with memory hierarchies. So again, as with all of this hashing discussion, the costs and benefits are very subtle trade-offs between the different approaches. If you have mission-critical code that's using a hash table and you really want to optimize it, try a bunch of prototypes and just test; figure out which one is best for your particular application. Let me conclude the video with a quote from Knuth himself, where he talks about the rapture of proving this 1/(1 - alpha) squared theorem and how it was life-changing. He says: I first formulated the following derivation, meaning the proof of that last theorem, in 1962. Ever since that day, the analysis of algorithms has, in fact, been one of the major themes in my life.
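If you want to run the kind of experiment the video encourages, here's a minimal sketch, with toy modular hash functions invented for illustration, that measures insertion cost under linear probing and double hashing as the load grows; compare what you see against the two idealized curves, 1/(1 - alpha) for double hashing and roughly 1/(1 - alpha) squared for linear probing:

```python
import random

def insert(table, key, step_fn):
    """Insert key by open addressing; returns the number of probes used."""
    n = len(table)
    slot = key % n          # initial probe (toy hash function h1)
    step = step_fn(key, n)  # fixed increment between successive probes
    probes = 1
    while table[slot] is not None:
        slot = (slot + step) % n
        probes += 1
    table[slot] = key
    return probes

def fill_table(n_slots, n_keys, step_fn, rng):
    table = [None] * n_slots
    for _ in range(n_keys):
        insert(table, rng.randrange(10**9), step_fn)
    return table

linear = lambda key, n: 1                  # linear probing: always step by one
double = lambda key, n: 1 + key % (n - 1)  # toy second hash h2 for double hashing

rng = random.Random(0)
n = 997  # prime table size: every nonzero step visits all slots
for alpha in (0.5, 0.7, 0.9):
    for name, step_fn in (("linear", linear), ("double", double)):
        table = fill_table(n, int(alpha * n), step_fn, rng)
        # average probes over a small batch of fresh insertions at load ~alpha
        cost = sum(insert(table, rng.randrange(10**9), step_fn)
                   for _ in range(30)) / 30
        print(name, alpha, round(cost, 2))
```

The prime table size matters for double hashing: it guarantees every step is coprime to the table size, so the probe sequence cycles through all slots and insertion always terminates while an empty slot remains.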