0:00

This sequence of videos are going to take it to the next level with Hash tables and

understand more deeply the conditions under which they perform well, amazingly

well in fact As you know, we've constant time performance for all of their

operations. The main point of this first video is to explain in sense in which

every hash function has its own Kyptonite. A pathological data set for

it which then motivates the need to tread carefully with mathematics in

the subsequent videos. So, a quick review So remember that the whole purpose of a

hash table is to enable extremely fast look ups, ideally constant time lookups.

Now, of course, to have anything to look up, you have to also allow insertions. So

all hash tables are going to export those two operations And then, sometimes, a hash

table also allows you to delete elements from it. That depends a little bit on the

underlying implementation. So certainly when you have it implemented using

chaining, that's when you have one linked list per bucket, it's very easy to

implement deletion. Sometimes, with open addressing, it's tricky enough, you're

just gonna wind up punting on deletion. So when we first started talking about hash

tables, I encouraged you to think about them logically. Much the way you do as an

array, except instead of being indexed just by the positions of an array, it's

indexed by the keys that you're storing. So just like an array via random access

supports constant time look up, so does a hash table. There was some fine print

however with hash tables. Remember there were these two caveats. So the first one

is that the hash table better be properly implemented And this means a couple of

things. So, one thing that it means is that the number of buckets better be

commensurate with the number of things that you're storing in the Hash table, we'll

talk more about that in a second. The second thing it means is you better be

using a descent Hash function. So we discussed in a previous video the perils

of bad Hash functions, and it will be even more stringent with our demands on Hash

functions in the videos to come. The second caveat which I'll try to

demystify in a few minutes is that you better have non-pathological data. So, in

some sense, for ever Hash table there's Kyptonite , a pathological

data set that will render its performance to be quite poor. So in the video on

implementation detail we also discussed how hash tables inevitably have to deal

with collisions. So you start seeing collisions way before your hash table's

start filling up so you need to have some sort of method for addressing two

different keys that map to exactly the same bucket. Which one do you put there? Do you

put both there or what? So there's two popular approaches let me remind you what

they are First called chaining. So this is a very natural idea, where you just keep

all of the elements that hash to a common bucket in that bucket. How do you keep

track of all of them? Well, you just use a linked list. So, in the seventeenth

bucket, you will find all of the elements which ever hashed to bucket number 17.

The second approach which also has plenty of applications in practice is

open addressing. Here the constraint is that you are only going to store one item

one key in each bucket. So if two things mapped the bucket number seventeen, you

gotta find a separate place for one of them And so the way that you handle that

is you demand from your hash function not merely one bucket but rather a whole probe

sequence. So the sequence of buckets so that if you try to hash something into

bucket number seventeen, 17's already occupied then you go on to the next.

Bucket in the probe sequence. You try to insert it there And if it, you fail again

you go to third bucket in the sequence. You try to insert it there, and so on. So

we mentioned briefly the sorta two ways you can specify probe sequences. One is

called linear probing. So this is where if you fail in bucket seventeen you move on

to eighteen and then nineteen and then twenty and then 21. And you stop once you

find an empty one And that's where you insert the new element And another one is

double hashing And this is where you use a combination of two hash functions Where

the first hash function specifies the initial bucket that you probe. The second

hash function specifies the offset for each subsequent probe. So for example if

you have a given elements say the name Alice and the two hash functions give you

the number 17 and 23 then the corresponding finding probe sequence is

going to be initially 17, failing that we'll try 40, still failing that

we'll try 63, failing that we'll try 86 and so on. So in a course on the design

and analysis of algorithms like this one you typically talk a little bit more about

chaining than you do open addressing. That's not to imply that chaining is

somehow the more important one both of these are important But chaining is a

little easier to talk about mathematically. So we will talk about it a

little bit more cuz I'll be able to give you complete proofs for chaining whereas

complete proofs for open addressing would be outside the scope of this course. So

I'll mention it in passing but the details will be more about chaining just for

mathematical ease. So, there's one very important parameter which plays a big role

in governing the performance of a hash table, and that's known as the load

factor, or simply the load, of a hash table. And it's a very simple definition.

It just talks about how populated, a typical bucket of the hash table is. So

it's often denoted by alpha And in the numerator is the number of things that

have been inserted, and not subsequently deleted in the hash table. Divided by

the number of buckets in the hash table. So, as you would expect, as you insert

more and more things into the hash table, the load grows, keeping the number of

items in the hash table fixed as you scale up the number of buckets, the load drops.

So just to make sure that the notion of the load is clear, and that also you're

clear on the different strategies for resolving collisions, the next quiz will

ask you about the range of relevant alphas for the chaining and open addressing

implementations of the hash table. Alright, so the correct answer to this

quiz question is the third answer. Load factors bigger than one do make sense they

may not be optimal but they at least make sense for hash tables that implement with

chaining but they don't make sense for hash tables with open addressing. And the

reason is simple remember in open addressing you are required to store only

one object per bucket so as soon as the number of objects exceeds the number of

buckets there is no where to put the remaining objects. So the hash the hash

table will simply crash if, if load factor is bigger than one. On the other hand a hash table

with chaining there is no obvious problems with load factor bigger than one so you

can imagine a load factor equal to two say, say you insert 2,000 objects into a

hash table with 1,000 buckets. You know that means, hopefully At least in the best

case. Each buckets just gonna have a length list with two objects in it. So

there's no big deal. With having load factors bigger than one, and hash tables

With chaining. Alright, so let's then make a, a quite easy but also very important

observation about a necessary condition for hash tables to have good performance

And this goes into the first caveat that you better properly implement the hash

table if you expect to have good performance. So the first point is that

you're only gonna have constant time look-ups if you keep the load to be

constant. So for a hash table with open addressing, this is really obvious,

because you need alpha not just O(1) but less than one, less than 100 percent full,

otherwise the hash table is just gonna crash, cuz you don't have enough room for

all of the items, but even for hash tables that you implement using chaining, where

they at least make sense for load factors which are bigger than one, you'd better

keep the load not too much bigger than one if you want to have constant-time

operations. Right, so if you have, say, a hash table with. N buckets and you hash in

NlogN objects, then the average number of objects in a given bucket is gonna be

logarithmic And remember, when you do a lookup, after you hash to the bucket, you

have to do an exhaustive search through the linked list in that bucket. So if you

have NlogN objects and you hashed it with N buckets, you're expecting more like

logarithmic lookup time - not constant lookup time And then, as we discussed with

open addressing, of course, you need not just alpha = O(1), but alpha less

than one. And in fact, alpha better be well below one. You don't want to let an

open addressing table get to a 90 percent load or something like that. So I'm going

to write need Alpha less than less than one. So that just means you don't want to

let the load grow too close to 100%, you will see performance degrade. So again, I

hope the point of this slide is clear. If you want good hash table performance, one

of the things you're responsible for is keeping the load factor under control.

Keep it at most a small constant with open address and keep it well below 100%. So

you might wonder what I mean by controlling the load. After all, you know

you writing this hash table when you have no idea what some client's gonna do with

it. They can insert or delete whatever they want. So how do you, how do you

control alpha? Well, what you can control under the hood of your hash table

implementation is the number of buckets. You can control the denominator Of this

alpha so if the numerator starts growing at some point the denominator is going to

grow as well. So what actual implementations of hash tables do is they

keep track of the population of the hash table. How many objects are being stored,

and as this numerator grows, as more and more stuff gets inserted, the

implementation ensures that the denominator grows at the same rate so that

the number of buckets also increases. So if alpha exceeds some target, you know,

that could be say 75%, .75, or maybe it's.5. Then what you can do is you can

double the number of buckets, say in your hash table. So you define a new hash

table, you have a new hash function with double the range, and now having doubled

the denominator. The load has dropped by a factor two. So that's how you can keep it

under control. Optionally, if space is at a real premium, you can also shrink the

hash table if there's a bunch of deletions, say, in a chaining

implementation. So that's the first take away point about what has to be happening

correctly under the hood in order to get the desired guarantees for hash table

performance you gotta control the load. So you have to have a hash table whose size

is roughly the same as the as the number of objects that you are storing. So the

second thing you've gotta get right and this is something we've touched on in the

implementation videos is you better use a good enough hash function And so what's a

good hash function? It's something that spreads the data out evenly amongst the

buckets And what would really be awesome would be a hash function which works well

independent of the data And that's really been the theme of this whole course so

far, algorithmic solutions which work independent of any domain assumptions. No

matter what the input is, the algorithm is guaranteed to, for example, run blazingly

fast And I can appreciate that this is exactly the sort of thing you would want

to learn from a course like this, right? Take a class in the design analysis of

algorithms and you learn the secret hash function which always works well.

Unfortunately I'm not going to tell you such a hash function And the reason is not

cuz, I didn't prepare this lecture. The reason is not because people just haven't

been clever enough to discover such a function. The problem is much more

fundamental. The problem is that such a function cannot exist. That is for every

hash function it has its own kryptonite. There is a pathological dataset under

which the performance of this hash function will be as bad as the most

miserable constant hash function you'd ever seen. And the reason is quite

simple; it's really an inevitable consequence of the compressing that hash

functions are effectively implementing from some massive universe to some

relatively modest number of buckets. Let me elaborate. Fix any hash function as

clever as you could possibly imagine. So this hash function maps some universe

through the buckets indexed from 0 to n -1. Remember in all of the

interesting situations of hash functions, the universe size is huge, so the

cardinality of U should be much, much bigger than n. That's what I'm going to

assume here. So for example, maybe you're remembering people's names, and then the

universe is strings, which have, say at most, 30 characters, and N, I assure you

in any application is going to be much, much, much smaller than say, 26 raised to

the thirtieth power. So now let's use a variant on the pigeon hole principle, and

acclaim that at least one of these n buckets has to have at least a 1/n

fraction of the number of keys in this universe. That is, there exists a bucket

I, somewhere between 0 and n -1 Such that, at least the cardinality of

the universe over N Keys Hash to I get mapped to I under this hash function H. So

the way to see this is just to remember the picture of a hash function mapping in

principal any key from the universe, all keys from the universe to one of these

buckets. So the hash function has to put each key somewhere in one of the n buckets

So one of the buckets has to have at least a 1/n fraction above all of the possible

keys. One more concrete way of thinking, thinking about it is that you might want

to think about a hash table implemented with chaining. You might want to imagine,

just in your mind, that you hash every single key into the hash table. So this

hash table is going to be insanely over-populated. You'll never be able to

store it on the computer because it will have the full cardinality of U objects

in it, but it has U objects, it only has n buckets. One of those buckets has to have

at least U /n fraction of the population. So the point here is that no

matter what the hash function is no matter how clever you build it there's gonna be

some buckets say bucket number 31 which gets its fair share of the universe maps

to it. So having identified this bucket, bucket number 31 where it gets at least

its fair share of the universe maps to it now to construct our pathological dataset

all we do is picked from amongst these elements that get mapped to bucket number

31. So for such as a data set and we can make this data set basically as large as

we want because the cardinality of U/n is unimaginably big, because U itself is

unimaginably big Then in this data set, everything collides. The hash function

maps every single one to bucket number 31 and that's gonna lead to Terrible hash

table performance. Hast table performance which is really no better than the naive

linked list solutions. So for example an hash table with collisions, in bucket 31,

you'll just find a link listed every single thing that's ever been inserted

into the hash table and for open addressing maybe it's a little harder to

see, what's going on but again if everything collides, you're gonna

basically wind up with linear time performance A far cry from constant time

performance. Now, for those of you to whom this seems like just sort of pointless,

abstract mathematics, I want to make two points. The first is that at the very

least these pathological data sets. Tells us, indicates, that we will have to

discuss hash functions in a way differently than how we've been discussing

algorithms so far. So when we talked about merge sort, we just said it runs in n log

n time. No matter what the input is. Whether we discuss the Dijkstra's algorithm

It runs in n log n time, no matter what the input is. That first search, that

first search many a time no matter what the input is. We're gonna have to say

something different about hash functions. We'll not be able to say that a hash table

has good performance, no matter what the input is. This slide shows that's false.

The second point I want to make is that while this pathological data

sets of course are not likely to arise just by random chance. Sometimes you're

concerned about somebody constructing a pathological data set for your hash

function, for example in a denial service attack. So there's a very clever

illustration of exactly this point in a research paper, from 2003 By Crosby and

Wallach. So the main point of Crosby and Wallach was that there's a number of real

world systems And maybe there, most interesting application was a network

intrusion detection system, for which you could bring them to their knees by

exploiting badly designed hash functions. So these were all applications that made

crucial use of hash tables and, the feasibility of these systems really,

completed depended on getting cost and time performance from the hash tables. So

if you could exhibit a pathological data set for these hash tables and make the

performance devolve to linear, devolve to that of a simple link list solution, the

systems would be broken, they would just crash of they'd fail to function. Now we

saw in the last slide that every hash table does have its own Kryptonite. Has a

pathological data set But the question is. How can you come up with such a

pathological data set if you're trying to do a denial of service attack on one of

these systems? And so the systems that Crosby and Wallach looked at generally

shared two properties. So first of all, they were open source. You could inspect

the code. You could see what hash function it was that they were using And second of

all, the hash, function was often very simplistic. So it was engineered for speed

more than anything else And as a result, it was easy to, just be inspecting the

code, reverse engineer a data set. That really did break the hash table. That led

it That devolved the performance to linear. So for example, in the network

intrusion detection application, there was some hash table that was just remembering

the IP addresses of packets that we re going through, because it was looking for

patterns of pack, data packets that seemed to indicate some sort of intrusion. And

Crosby and Wallach just showed how sending a bunch of data packets to this system

with cleverly chosen sender IP's really did just crash the system because the hash

table performance blew up to an unacceptable degree. So how should we

address this fact of life that every hash function has a pathological data set? And

that question is meaningful both from a practical perspective, so what hash

functions should we use if we are concerned about someone constructing

pathological data sets, for example, the implemented denial-of-service attack, and

secondly, mathematically, if we can't give the kinds of guarantees we've given so

far, data-independent guarantees, how can we mathematically say that hash functions

have good performance. So let me mention two solutions, the first solution is, is

meant more just on the practical point. You know what hash function should you

implement if you are concerned with someone creating pathological data sets?

So there are these things called cryptographic hash functions. There for

example, one cryptographic hash function or really family of hash functions, for

different numbers of buckets, is SHA-2 And these are really outside the scope of

this course. I mean, you'll learn more about this in a In a course on

cryptography And obviously using these keywords you can look it up on the web and

read more about it. The one point I want to make is, you know these cryptographic

hash functions like SHA2, they themselves, they do have pathological datasets. They

have their own version of kryptonite. The reason that they work well in practice is

because it's infeasible to figure out what this pathological dataset is So unlike the

very simplistic hash functions which Crosby and Wallach found in the source

code of their applications, where it was easy to reverse-engineer bad datasets, for

something like SHA2 nobody knows how to reverse-engineer a bad dataset for it And

when I say infeasible, I mean in the usual cryptographic sense, in a way similar to

how one would say it's infeasible to break the RSA cryptosystem if you implement it

properly, or it's infeasible to factor large numbers except in very special case,

and so on. So that's all I'm going to say about cryptographic hash functions. I also

want to mention a second solution, which would be reasonable both in practice, and

also for which we can say a number of things mathematically about it Which is to

use randomization. So more specifically we're not going to design, a single.

Clever hash function Because again, a single hash function we know Must have a

pathological data set But we're gonna design a very clever, family of hash

functions And then, at run time. We're going to pick, one of these hash functions

at random. Another kind of guarantee you are going to want, we are going to be able

to prove on a family affairs function would be very much in the spirit of quick sort.

So you would call that in the quick sort algorithm, pretty much. For any

fixed pivot sequence there is a pathological input for which quick sort

will devolve to quadratic running time. So our solution was randomize quick sort

which is rather than committing up front to any particular method of choosing

pivots at run time we are gonna pick pivots randomly. What did we prove about

quick sort? We proved that for any possible input for any possible array the

average running time with quick sort was O(nlogn) with the average was over the

run time random choices of quick sort. Here we are gonna do the same thing we'll

now be able to say for any dataset. On average, with respect to our run time

choice of a hash function. The hash function will perform well, in the sense

that it will spread out the data of the data set evenly. So we flip to the

quantifiers from the previous slide. There, we said if we pre-commit to a

single hash function. If we fix one hash function, Then there's a data set that

breaks the hash function. Here we're flipping it, we're saying for each fixed

data set a random choice of a hash function is going to do well on average on

that data set. Just like in Quick Sort. This doesn't mean notice that we can't

make our program open source. We can still publish code which says here is our family

of hash functions and in the code we would be making a random choice from this set of hash

functions But the point is by inspecting the code you'll have no idea what was the

real time random choices made by the algorithm. So you'll know nothing about

what the actual hash function is so you won't be able to reverse engineer

pathological dataset for the real time choice of the hash function. So the next

couple of videos are gonna elaborate on the second solution Of using a real time

random choice of a hash function as a way of saying. You do well on every data set

At least on average. So let me just give you a road map of where we're going to go

from here. So I'm going to break the discussion of the details of a randomized

solution into three parts, spread over two videos. So in the next video we're going

to begin with the definition of what do I mean by a family of hash functions, so

that if you pick one at random, you're likely to do pretty well. So that's a

definition that's called a universal family of hash functions. Now a

mathematical definition by itself is worth approximately nothing. For it to be

valuable, it has to satisfy two properties. So first of all there have to

be interesting and useful examples that meet the definition. So that is, there

better be Useful looking hash functions that meet this definition of a universal

family. So the second thing will be to show you that they do indeed exist And

then the other thing a mathematical definition needs is applications. So that

if you can meet, the definition. Then, good things happen. So that'll be part