0:00

So in this video we'll take a peek under the hood of hash functions. And I'll

discuss some of the high-level principles by which they're implemented. So let's

briefly review the raison d'être of a hash table. So the purpose in life for a hash

table is to support super-fast lookups. So maybe you're keeping track of the

transactions that happened on your website yesterday. Maybe you're keeping track of

your employees; maybe you're keeping track of IP addresses in an Internet router.

Maybe you're keeping track of chess configurations in a chess-playing program. Whatever. The point is, you want to be able to insert stuff into a hash

table, and later remember whether something's there or whether something's

not there. So the implementations we'll discuss will generally also support

deletions. But that's pretty much it. It's a very restricted set of operations. But

the hash table is going to execute them very, very well. So, basically in constant

time, again subject to some fine print, which we'll discuss a little bit in this

video but, then more deeply in a separate optional video. So, the two caveats are

first of all the hash table had better be properly implemented. It's actually pretty

easy to screw up a hash table, to screw up hash functions. We'll talk a bit about that in a few minutes. And then, also, the data should, in some sense, be non-pathological, and that we will discuss more deeply in a separate video.

Alright, so let me give you an initial glimpse of some of the magic that's

happening under the hood in hash functions. So, at first let me say exactly

what the setup is. The first step is to identify all the things that you might

want to be storing. So, in other words, the universe of your application. So this would be something like all possible IP addresses, of which there are 2^32.

All possible names you might encounter, perhaps with a maximum of, say,

30 characters. All possible configurations of a chessboard, and so on. And one thing

I hope you can appreciate from these examples is that, in many cases, this

universe is really big. So the number of IP addresses is, quote unquote, only 2^32.

The number of all names, you're probably talking more like 26 raised to the 30.

All chessboard configurations I don't even wanna think about. And what you

wanna accomplish is, you wanna maintain an evolving subset of this universe U. So

maybe you wanna keep track of all the IP addresses you've seen on your website in

the last 24 hours. You wanna keep track of the phone numbers of all of your friends.

You wanna keep track of the chessboard configurations that you've explored in the

past three seconds, whatever. And again I hope what is clear from the applications

we've been discussing, is that the set S is usually of a reasonable size. It's something you could store in main memory. You know, maybe it's tens of

thousands of IP addresses. Maybe it's, you know, a few hundred names of your various

friends. You know, maybe it's in the millions of chessboard

configurations, but still way, way, way smaller than the size of the universe. So

without data structures, you'd have to resort to other unsatisfactory solutions

to maintaining this set. So the first thing you could try, as we discussed in

the previous video, would be to just have an array with one position for every

imaginable thing you might want to store in your set. So this is the solution

that's going to work well if all of your friends happen to have names that are

integers between 1 and 10,000, but doesn't scale when the universe size

becomes really big, as in most of these applications. So the good news is, of course, that an array gives you fast random access, so you can access any position in constant time. So, if you have an array-based solution, indexed by all the elements of the universe, you can do constant-time insert, delete, and lookup.
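Just to make this naive array-based solution concrete, here's a minimal sketch in Python (my illustration, not code from the video; the integer universe and the class name are just assumptions for the example):

```python
# Direct-address table: one slot for every imaginable element of the
# universe {0, ..., universe_size - 1}. All operations are O(1), but
# note the catch: the space is O(universe size), not O(set size).
class DirectAddressTable:
    def __init__(self, universe_size):
        self.slots = [False] * universe_size

    def insert(self, x):      # O(1)
        self.slots[x] = True

    def delete(self, x):      # O(1)
        self.slots[x] = False

    def lookup(self, x):      # O(1)
        return self.slots[x]

t = DirectAddressTable(10000)   # fine for "friends numbered 1..10,000"
t.insert(42)
print(t.lookup(42))   # True
print(t.lookup(43))   # False
```

This works great as long as the universe fits in memory, which is exactly the assumption that breaks down in the applications above.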

The bad news is that the space requirement is proportional to the size of the universe. And again, forget about being unsatisfactory, that's just literally impossible, infeasible, in many of the applications in which you'd use hash tables. Now of course, to get memory proportional to the size of the set of stuff that you're storing, an easy solution would just be to use a list. You know, say a doubly-linked list, something like that. Now with a list-based solution, the good news is your memory is certainly proportional to the size of the set that

you're storing, and independent of the size of the universe from which these

elements are drawn. The bad news is that to figure out whether something is, or is

not, in a list you generally have to traverse through most of that list. And

that's gonna take up time proportional to the length of the list. So really, the question we're faced with in implementing a hash table is: can we get the best of both worlds of these two naive solutions? On the one hand, we want the constant time operations enjoyed by the array-based solution. But on the other hand, we wanna have the linear space, in the size of the set that we're storing, that we get in the

list-based solution. So to get the best of both worlds, we are going to use an array-based solution. But the array will not be big; it will not have size proportional to the universe. The array will only have size roughly the same as the set that we're storing, so somewhere in the ballpark of the cardinality of S. So

the first thing we do is we decide on how big we want our array to be. So that length is gonna be called n. We're gonna have an array of length n, and n is gonna be in the ballpark of the size of S. Exactly how n compares to the size of S is gonna depend on a few things, but for now, think of n as like double the size of S. We're gonna be

calling each entry of the array a bucket, so there's n buckets, and then, the size

of S is about 50 percent of the number of buckets, let's say. So one objection you might legitimately raise at this point is: you know, I said the set S was dynamic, right? Stuff can be added, stuff can be deleted, so the size isn't always the same; it can fluctuate over time. So what does it mean to define an array which is roughly the same length as this changing set? So for simplicity, for the purposes of this video, to focus on the key points, I'm going to assume that, while S itself can be changing, the size of S doesn't fluctuate too much. So there are additional bells and whistles

you can add to a hash table implementation, and they're all quite

natural. I think most of you could probably figure them out on your own, to

deal with the fact that S might be changing size. So for example, you can

just keep track of how many elements are in your hash table. And when it exceeds a certain threshold, so when it's too big relative to the size of your array,

you just double the array. And then you reinsert all of the elements into this new

doubled array. Similarly, if the set shrinks, you can have tricks

for shrinking the array dynamically as well. So I'm not gonna discuss these bells

and whistles for resizing your hash table dynamically. They are, of course, important for a real implementation, and they are part of the implementations in the standard programming libraries. But I view those as sort of a second-order point in the implementation of a hash table. And I wanna focus on the first-order points

in this video. So, summarizing, think of the set S. There are insertions and

deletions we have to accommodate. But, you know, S is gonna be roughly the same size.

And the number of buckets will be, you know, within a constant factor of the size

of the set. All right, so there we have our array with totally reasonable space: space proportional to the size of the set that we are storing. And now what we want is some way of translating between the things that we care about, say our friends' names or whatever the elements in the universe are, and the positions in this

array. So the object responsible for that translation from keys drawn from this universe to

positions in this array is called a hash function. So formally, a hash function takes as input a key. So this is gonna be an IP address or the name of somebody or a chessboard configuration or whatever. And it's going to spit out a position in this array. So I'm gonna label the array entries from 0 to n-1 for this lecture.
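Just as a concrete illustration of that signature (this is my toy example, not a function endorsed by the lecture; real hash functions do far more mixing of the bits), here's one mapping string keys to buckets 0 through n-1:

```python
def h(key, n):
    """Map a string key to a bucket in {0, ..., n-1}.

    Toy example only: summing character codes mod n has the right
    signature (key in, bucket out) but is a poor hash in practice.
    """
    return sum(ord(c) for c in key) % n

# Deterministic: the same key always lands in the same bucket.
print(h("Alice", 1000))
print(h("Alice", 1000) == h("Alice", 1000))   # True
```

The one property any such function must have is exactly this determinism: looking up Alice later has to hash to the same bucket where Alice was stored.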

Obviously at the moment this is super underspecified. There's a zillion

functions you could choose. Which one you use, we'll talk about that, but for now

there's just gonna be some hash function mapping from elements of the universe to

buckets, to positions in this array. Now, as far as the semantics of this hash

function, what the hash function is doing, it's telling us in which position we

should store a given key from the universe. So, if we have some new friend

named Alice, and we run Alice through the hash function and it gives us a 17, it says we should store Alice's phone number in position

17 of the array. If we have some crazy chessboard configuration, we feed it

into a hash function and it spits out 172, it says we should remember this chessboard

configuration in the 172nd bucket of this array. So again, given x, which is some

key from this universe, we invoke a hash function to get a position in this array,

to get a bucket. And then that is where we try to store this x and any associated

data with it. So that's the high-level idea of how you implement a hash table, but we're quite far from done. And in particular, there is a serious issue

that we're going to have to deal with, that's fundamental to implementing hash

tables, and that's the notion of a collision. So probably many of you may

have already noticed that this problem might occur, which is: well, what happens if

we're storing our friend's phone numbers, and you know Alice shows up and we ask our

hash function where to store Alice's phone number, and it says, oh, bucket number 17. And then our friend Bob shows up, and we ask our hash function where to store Bob's phone number, and what if the hash function also says bucket number 17 for Bob? What do we put in bucket 17? Do we put Alice

there, do we put Bob there, do we put them both there? How do we deal with these

so-called collisions? So, the next quiz is meant to get you thinking about collisions and, in some sense, how truly unavoidable they really are. All right. So the correct answer to this question is the first answer,

believe it or not. All you need is 23 people in a room before you're equally

likely to have two people with the same birthday as not. So if you're looking to skim a little money off of your non-mathematical friends, this is one way

you can do it. Go to cocktail parties with about 40 people and place bets with people

that there are two people in the room with the same birthday. So if you have 367

people, well there's only 366 distinct birthdays, I'm counting February 29th

here as one of them. So by the pigeonhole principle, certainly the probability is 100% by the time you get to 367. Now, by the time you're at 57, you're already at 99%. So you already have an overwhelming probability of having a

duplicate birthday with 57 people. So of course, with 184 you're gonna be almost at

100%: 99.99, who knows, some large number of 9's. And at 23, you're at 50%. So many

people find this quite counter-intuitive that you only need 23 people to get a

duplicate birthday on average. And so this is a, this is a quite famous example and

it sometimes goes by the birthday paradox. Calling it a paradox is sort of a

misnomer. A paradox, you know, often suggests some kind of logical

inconsistency. There's no logical inconsistency here. It's just that

people's brains are not really wired to have this intuition, for whatever reason.
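If you want to work out that math yourself: with k people and uniformly random birthdays over 365 days (ignoring February 29th for simplicity, so slightly different from the 366-day count above), the chance that all k birthdays are distinct is (365/365) × (364/365) × ... × ((365−k+1)/365), and the collision probability is one minus that. A quick sketch to verify the numbers from the quiz:

```python
def collision_probability(k, days=365):
    """P(at least two of k people share a birthday), assuming
    uniformly random, independent birthdays over `days` days."""
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (days - i) / days
    return 1 - p_distinct

print(round(collision_probability(22), 3))  # ~0.476 -- just under 50%
print(round(collision_probability(23), 3))  # ~0.507 -- just over 50%
print(round(collision_probability(57), 3))  # ~0.99
```

So 23 really is the first point where a shared birthday is more likely than not.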

But it's really just math. You can work out the math, and you can just solve it. So, more generally, the principle behind the birthday paradox is

the following. So suppose you have a calendar, perhaps on some different planet, which has k days, where everybody's equally likely to have each of the k days as their birthday. Then it's about the square root of k people that you

need in a room before you're equally likely to have a duplicate, or not have a

duplicate. Okay, and the reason that you get the square root effect is because, if you think about it, there's a quadratic number of pairs of people in the room, so that's a quadratic, in the number of people, number of opportunities to have a duplicate. Right? So each pair of people could be a duplicate, and there's a quadratic number of pairs. And so that's why, once the number of pairs starts reaching about the number of different days, you're likely to see a duplicate around that point. So you might be wondering why I'm telling you about the birthday paradox

in the middle of a lecture about hashing, but really it's quite relevant. So imagine

for example you defined a hash function in the following way. Now to be clear, this

is not a practical hash function, but just for the purposes of discussion, imagine

you have a hash function which randomly assigns every single key to a uniformly random bucket. Okay, so each of the n buckets is equally likely, probability 1/n each. Then what

the birthday paradox says is, even for a very small dataset, you are already gonna

have a pair of things colliding. All right, so if you have n buckets, maybe your n is like 10,000, all you need is roughly 100 elements in your data set,

and despite the fact that the table is only going to be one percent full, you're

already going to see a collision, okay? So 99 percent of the buckets are empty, but you're

going to have one bucket that has two, so that's sort of annoying. So the birthday

paradox says, you start getting collisions with a hash function even with really tiny data sets. So in this sense, if you're going to have hash tables, you've got to deal with collisions. There's going to be a fair number of them, and you need some method for resolving them. So, collisions are a fact of life

when you're talking about hashing. Where again, by a collision, what I mean is two different keys, two different elements x and y from the universe, that hash to the same bucket, that have the same hash value. So in general, we can think of a hash

function as doing a compression of sorts. So we have a huge universe U and we have

this very modest-size array A with only n buckets. Where n, we're thinking

of as being much, much, much smaller than U. So, of course, this hash function has

to map various elements of U to the same bucket. So what are we gonna do about it?

How are we going to resolve these collisions? Well, there's two different

solutions which are both quite prevalent in practice. So solution number one is

called chaining, or sometimes you'll also see it called separate chaining. And this

is a very natural solution; it's also the one that's relatively easy to analyze

mathematically. What you do is just for elements that hash to the same bucket, you

just revert to the list-based solution that we talked about in a previous slide.

So, each of the n buckets will not necessarily contain just 0 or 1 element; it will contain a list with, in principle, an unbounded number of elements.
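To make chaining concrete, here's a minimal sketch (my illustration, not the lecture's code; Python lists stand in for the per-bucket linked lists, and Python's built-in hash stands in for the hash function):

```python
class ChainedHashTable:
    def __init__(self, n=100):
        self.n = n
        self.buckets = [[] for _ in range(n)]  # one chain per bucket

    def _h(self, key):
        # Stand-in hash function mapping a key to a bucket in 0..n-1.
        return hash(key) % self.n

    def insert(self, key):
        chain = self.buckets[self._h(key)]   # hash to the right bucket...
        if key not in chain:                 # ...then work within its chain
            chain.append(key)

    def lookup(self, key):
        return key in self.buckets[self._h(key)]

    def delete(self, key):
        chain = self.buckets[self._h(key)]
        if key in chain:
            chain.remove(key)

t = ChainedHashTable(4)
for name in ["Alice", "Bob", "Daniel", "Carol"]:
    t.insert(name)
print(t.lookup("Bob"))    # True
t.delete("Bob")
print(t.lookup("Bob"))    # False
```

Notice that every operation first hashes to a bucket in constant time and then only scans that one bucket's chain, never the whole table.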

Okay, so when we use chaining, it's quite straightforward to figure out how to implement all of the hash table operations, namely insert, delete, and

look-up, you just hash something to the appropriate bucket and then you just do

insert, delete or look-up, as appropriate, in the list that's in that bucket. So just

to make clear that everything type-checks: so here, h(x) is the bucket

for x. That's what's specified by the hash function. And then, in the h(x) position of this array A, in the h(x)-th bucket, is where we find the linked list that is

going to contain x. So just to give a cartoon example, if you had, say, four buckets: maybe, you know, the first bucket has exactly one record, corresponding to Alice; maybe the second bucket just has a null pointer, no one's been inserted in the second bucket. And then in the third bucket we have, let's say, both Bob as well as

Daniel. And then maybe in the fourth bucket we have Carol. Okay, so because we

have a collision between Bob and Daniel, who both map to the third bucket, we

resolve that just by having a linked list, with Bob and Daniel in some order. So the

second solution which is trickier to talk about mathematically but still quite

important practically is called open addressing. And the principle in open addressing is that you're not going to use any space for pointers. You're not gonna

have lists. So you're only gonna have one object per bucket of the array. So the question is, what happens if, you know, you try to insert Daniel, and you invoke the hash function on Daniel and it takes you to a bucket that already

contains Bob? That means there's no room for Daniel. So what you're going to do is

you're gonna probe the hash table in some other position. So a hash function is now gonna be replaced by a probe sequence, where the hash function tells you

the first bucket to try to insert Daniel; failing that, a second bucket in which to

try to insert Daniel; failing that, a third bucket to try to insert Daniel; and

so on. And you just keep trying till you find an open position somewhere in the

array. So there's various strategies for trying to figure out the probe

sequence. One strategy is, if you fail in, say, bucket 17, which is where the hash function tells you to go first, you just try bucket 18, then 19, then 20, then 21, and so on, until you find your first open slot. So that's

called linear probing. And another approach is double hashing. So this is a

solution where you actually have two hash functions, hash function 1 and hash

function 2. And the idea is, suppose you're trying to insert, say, Daniel, into

a hash table with open addressing, and you evaluate both of the hash functions. And

the first one comes up 17 and the second one comes up 23. So, as usual, the first hash function will specify where you look first. So if it evaluates on Daniel to 17, you look in the 17th position of the array. And if it's empty, that's where you insert Daniel. Now, if it's full, what you

do is you use the second hash value as an additive shift. So, unlike linear probing, where after 17 you'd look at 18, with double hashing, if the second hash function gives you 23, that's gonna be your offset. So after 17,

you look at bucket 40. If 40 is already full, you look at bucket 63. If bucket 63

is already full then you look at bucket 86. So you keep adding increments of 23

until you finally find a bucket that's empty, and that's where you insert

Daniel. Now of course, if you try to insert some other name, if you try to

insert Elizabeth, you're gonna get two totally different numbers in general. So

maybe you'll get 42 and 27, and so here the probe sequence will be 42; failing that, 69; failing that, 96; failing that, 123; and so on. So a question you should have

at this point is: I've told you two solutions to resolving collisions in a

hash table. And you're probably asking, well, which one should you use if you

have to implement your own hash table? And, you know, as usual, if I present you

with two different solutions for the same problem, you can probably rest assured

that neither one dominates the other, right? Otherwise I wouldn't waste your

time by presenting both of them to you. So, sometimes chaining's gonna perform

better, and sometimes, open addressing's gonna perform better. And of course, it

also depends on what kind of metric you care about. So there are a couple of

rules of thumb that I can tell you. So first of all if space is at a real premium

you might want to consider open addressing instead of chaining, and that's because with chaining you do have this excess, not huge, but you do have a little bit of space overhead in dealing with all these pointers in the linked lists. So if you want

to avoid that, you might want to think about open addressing. The second rule of

thumb is deletion is trickier with open addressing than with chaining, but

deletion is clearly not difficult at all, either to code or to understand, when you use chaining, because it just reduces to deletion from a linked list, which of course you all know how to do. With open addressing, it's not impossible to implement deletion, but it's

much trickier. So if deletion's a crucial operation for you, that might

steer you towards thinking about chaining. But ultimately, if it's really kinda

mission-critical code, probably the best thing to do is implement both kinds of

solutions and just see which one works better. It's a little hard to predict how

they're gonna interact with memory hierarchies and that kind of thing.
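The two probe-sequence strategies described above can be sketched as follows (my illustration, not code from the video; the hash values 17, 23, 42, and 27 are just the numbers from the examples, hard-coded here rather than computed by real hash functions):

```python
def linear_probes(h1, n, limit=4):
    """First few buckets tried under linear probing:
    h1, then h1+1, h1+2, ..., wrapping around mod n."""
    return [(h1 + i) % n for i in range(limit)]

def double_hash_probes(h1, h2, n, limit=4):
    """First few buckets tried under double hashing: the second
    hash value h2 is the additive offset between successive probes."""
    return [(h1 + i * h2) % n for i in range(limit)]

n = 1000  # number of buckets, chosen large enough here to avoid wraparound
print(linear_probes(17, n))           # [17, 18, 19, 20]
print(double_hash_probes(17, 23, n))  # [17, 40, 63, 86] -- Daniel's sequence
print(double_hash_probes(42, 27, n))  # [42, 69, 96, 123] -- Elizabeth's sequence
```

In either scheme, insertion walks the probe sequence until it hits an empty bucket, and lookup walks the same sequence until it finds the key or an empty bucket.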

They're both useful in their own contexts. Alright so we've covered the two most

prevalent ways of handling collisions. And we argued that collisions are inevitable no matter how you design your hash function. You're stuck with collisions, and you can do chaining, with linked lists per bucket, or you can do open addressing, where you actually have a probe sequence, an order in which you look at buckets until you find an

empty one. And the elephant in the room at this point is, you know, what is this hash

function? I have told you nothing about hash functions. All I told you is there is

some mapping from the universe, so IP addresses or names or whatever, to

a bucket number. Well, what kind of function should you use? Excellent question. Tons of research on that question, and to this day it's as much art as science. But let's start talking about it.