0:00

Let's begin by building up some intuition about what we would want from a hash

function, now that we know how hash functions are usually implemented. So

let's start with a hash function which is implemented by chaining. So what's going

to be the running time of our lookup, insert, and delete operations in a hash

table with chaining? Well, the happy operation in a hash table with chaining is

insertion. Insertion, we can just say without any qualifications, is constant

time. This requires the sort of obvious optimization that when you do insert, you

insert something at the front of the list in its bucket. Like, there's no reason to

insert at the end. That would be silly. So the plot thickens when we think about the

other two operations, deletion and the lookup. So let's just think about lookup.

Deletion's basically the same. So how do we implement lookup? Well, remember when

we get some key x, we invoke the hash function. We call h(x). That tells us

what bucket to look in. So if it tells us 17, we know that, you know, x may

or may not be in the hash table. But at this point we know that if it's in the

hash table, it's got to be in the linked list that's in the seventeenth bucket. So

now we descend into this bucket. We find ourselves a linked list. And now, we have

to resort to just an exhaustive search through this list in the seventeenth

bucket, to see whether or not X is there. So we know how long it takes to search a

regular list for some element. It's just linear and the list length. And now we're

starting to see why the hash function might matter. Right, so suppose we insert

100 objects into a hash table with 100 buckets. If we have a super lucky hash

function, then perhaps each bucket will get its own object. There'll be one object

in each of the lists, in each of the buckets, so. Theta of the list length

is just theta of one. We're doing great. Okay? So, a constant, constant link to lists,

means constant time insert delete. A really stupid hash function would map

every single object to bucket number zero. Okay, so then if you insert 100

objects, they're all in bucket number zero. The other 99 buckets are empty. And

so every time you do insert or delete, it's just resorting to the naive linked list

solution. And the running time is going to be linear and the number of

objects in the hash table. So the largest list length could vary anywhere from m/n,

where m is the number of objects, this is when you have equal linked lists,

to if you use this ridiculous constant hash function, m, all the objects in the

same list. And so the main point I'm trying to make here, is that you know,

first of all, at least with chaining, where the running time is governed by the

list length, the running time depends crucially on the hash function. How well

the hash function distributes the data across the different buckets. And

something analogous is true for hash tables that use open addressing. Alright

so here there aren't any lists. So you don't, there's no linked lists to keep

track of. So the running time is going to be governed by the length of the probe

sequence. So the question is how many times do you have to look at different

buckets in the hash table before you either find the thing you're looking for,

or if you're doing insertion, before you find an empty bucket in which to insert

this new object. So the performance is governed by the length of the probe

sequence. And again, the probe sequence is going to depend on the hash function. For

really good hash functions in some sense, stuff that spreads data out evenly, you

expect probe sequences to be not too bad. At least intuitive, And for say the

constant function you are going to expect these probe sequences to grow linearly

with the numbers of object you insert into the table. So again this point remains

true, the performance of a hash table in either implementation really depends on

what hash function you use. So, having built up this intuition, we can now say

what it is what we want from a hash function. So first we want it to lead to

good performance. I'm using the chaining implementations as a guiding example. We

see that if we have a size of a hash table, n, that's comparable to the number

of objects, m, it would be really cool if all of the lists had a length that was

basically constant; therefore we had our constant time operations. So equal length

lists is way better than unequal length lists in a hash table with chaining. So

we, we want the hash function to do is to spread the data out as equally as possible

amongst the different buckets. And something similar is true with open

addressing; in some sense you want hash functions to spread the data uniformly

across the possible places you might probe, as much as possible. And in

hashing, usually the gold standard for spreading data out is the performance of a

completely random function. So you can imagine for each object that shows up you

flip some coins. With each of the n buckets equally likely, you put this

object in one of the n buckets. And you flip independent coins for every different

object. So this, you would expect, you know, because you're just throwing darts

at the buckets independently, you'd expect this to spread the data out quite well.

But, of course, it's not good enough just to spread data out. It's also important

that we don't have to work too hard to remember what our hash function is and to

evaluate it. Remember, every time we do any of these operations, an insert or a

delete or a lookup, we're going to be applying our hash function to some key x.

So every operation includes a hash function evaluation. So if we want our

operations to run a constant time, evaluating the hash function also better

run in constant time. And this second property is why we can't actually

implement completely random hashing. So there's no way we can actually adjust when

say we wanted to insert Alice's phone number, flip a new batch of random coins.

Suppose we did. Suppose we flipped some random coins and it tells us to put

Alice's phone number into the 39th bucket, while. Later on, we might do a

lookup for Alice's phone number, and we better remember the fact that we're

supposed to look in the 39th bucket for Alice's phone number. But what does that

mean? That means we have to explicitly remember what choice we made. We have to

write down. You get a list of effects that Alice is in bucket number 39. In every

single insertion, if they're all from in the point of coin flips, you have to

remember all of the different random choices independently. And this really

just devolves back to the naive list base solution that we discussed before. So,

evaluating the hash function is gonna take us linear time and that defeats the

purpose of a hash table. So we again we want the best of both worlds. We want a

hash function which we can store in ideally constant space, evaluate in

constant time, but it should spread the data out just as well as if we had this

gold standard of completely random hashing. So I want to touch briefly on the

topic of how you might design hash functions. And in particular good hash

functions that have the two properties we identified on the previous slide. But I

have to warn you, if you ask ten different, you know, serious hardcore

programmers, you know, about their approach to designing hash functions,

you're likely to get ten somewhat different answers. So the design of hash

functions is a tricky topic, and, it's as much art as science at this point. Despite

the fact that there's a ton of science, there's actually a very beautiful theory,

about what makes good hash functions. We'll touch on a little bit of that in a, in a

different optional video. And if you only remember one thing of, you know, from this

video or from these next couple slides, the thing to remember is the following.

Remember that it's really easy to inadvertently design bad hash functions

and bad hash functions lead to poor hash table performance. Much poorer than you

would expect given the other discussion we've had in this video. So if you have to

design your own hash function, do your homework, get some examples, learn what

other experts are doing and use your best judgment. If you do just something without

thinking about it, it's quite possible to lead to quite poor performance, much

poorer than you were expecting. So to drive home this point, suppose that you're

thinking about keys being phone numbers. So let's say, you know, I'm gonna just be

very kinda United States centric here. I'm just gonna focus on the, the ten digit

phone numbers inside the US. So the universe size here is ten to the ten,

which is quite big. That's probably not something you really want to implement

explicitly and let's consider an application where, you know, you're only

say, keeping track of at most, you know, 100 or 1,000 phone numbers or something

like that. So we need to choose a number of buckets, let's say we choose 1,000

buckets. Let's say we're expecting no more than 500 phone numbers or so. So we double

that, we get a number of buckets to be equal 1,000. And now I got to come up with

a hash function. And remember, you know, a hash functions by definition. All it does

is map anything in the universe to a bucket number. So that means it has to

take as input a ten digit phone number and spit as output some number between zero

and 999. And, beyond that we have flexibility of how to define this mapping.

Now, when you are dealing with things that have all these digits it's very tempting

to just project on to a subset of the digits. And, if you want a really terrible

hash function, just use the most significant digits of a phone number to

define a mapping from phone numbers to buckets. Alright, so I hope it's clear why

this is a terrible choice of a hash function. Alright, so maybe you're a

company based in the San Francisco Bay area. The area code for San Francisco is

415. So if you're storing phone numbers from customers in your area. You know

maybe twenty, 30, 40 percent of them are gonna have area codes 415. All of those

are going to hash to exactly the same bucket, bucket number 415 in this hash

table. So you're gonna get an overwhelming percentage of the data mapping just to

this one bucket. Meanwhile you know not all 1000 possibilities of, of these three

digits are even legitimate area codes. Not all three digit numbers are area codes in

the United States. So there'll be buckets of your hash table which are totally

guaranteed to be empty. So you waste a ton of space in your hash table, you have a

huge list in the bucket corresponding to 415, you have a huge list in the

bucket corresponding to 650, the area code at Stanford. You're gonna have a very slow

look up time for everything that hashes to those two buckets and there's gonna be a

lot of stuff that hashes to those two buckets, So terrible idea. So a better but

still mediocre hash function would be to do the same trick but using the last three

digits instead of the first three digits. This is better than our terrible hash

function because there aren't ridiculous clumps of phone numbers that have exactly

the same last three digits. But still, this is sort of assuming you're using this

hash function as tantamount to thinking that the last three digits of phone

numbers are uniformly distributed among all of the 1,000 possibilities. And really

there's no evidence if that's true. Okay? And so there's going to be patterns and

phone numbers that are maybe a little subtle to see with the naked eye, but

which will be exposed if you try to use a mediocre hash function like this. So let's

look at another example. Perhaps you are keeping track of objects just based by

where they are laid out in memory. So in other words the key for an object is just

gonna be its memory location and if these things are in bytes, then you are

guaranteed that every memory location will be a multiple of four. So for a second

example let's think about a universe where the possible keys are the possible memory

locations, So here you're just associating objects with where they're laid in memory,

and a hash function is responsible for taking in as input some memory location of

some object and spitting out some bucket number. Now generally, because of, you

know the structure of bytes and so on, our memory locations are going to be multiples

of some power of two. In particular, memory locations are going to be even, And

so a bad choice of a hash function. Would be to take, remember, the hash function

takes the input of the memory location, which is, you know, some possibly really

big number, and we wanna compress it, we want to output a bucket number. Now let's

think of a hash table where we choose N equals 10^3, or 1000 buckets.

So then the question is, you know, how is this hash function going to take this big

number, which is the memory location, and squeeze it down to a small number. Which

is one of the buckets and so let's just use the same idea as in the mediocre hash

function, which is we're gonna look at the least significant bits so we can express

that using the mod operator. So let's just think about we pick the hash function h(x)

where h is the memory location to be x mod 1000 There again, you know, the

meaning of 1,000 is that's the number of buckets we've chosen to put in our hash

table because, you know, we're gonna remember roughly at most 500 different

objects. So don't forget what the mod operation means, this means you just,

essentially subtract multiples of 1,000 until you get down to a number less than

1,000. So in this case, it means if you write out x base ten, then you just take

the last three digits. So in that sense, this is the same hash function as our

mediocre hash function when we were talking about phone numbers. So we

discussed how the keys here are all going to be memory locations; in particular

they'll be even numbers. And here we're taking their modulus with respect to an

even number. And what does that mean? That means every single output of this hash

function will itself be an even number. Right, you take an even number, you

subtract a bunch of multiples of a 1000, you're still going to have an even number.

So this hash function is incapable of outputting an odd number. So what does

that mean? That means at least half of the locations in the hash table will be

completely empty, guaranteed, no matter what the keys you're hashing is. And

that's ridiculous. It's ridiculous to have this hash table 50 percent of which is

guaranteed to be empty. And again, what I want you to remember, hopefully long after

this class is completed is not so much these specific examples, but more the

general point that I'm making. Which is, it's really easy to design bad hash

functions. And bad hash functions lead to hash table performance much poorer than

what you're probably counting on. Now that we're equipped with examples of bad hash

functions. It's natural to ask about, you know, what are some good hash functions?

Well it's actually quite tricky to answer that question. What are the good hash

functions, and I'm not really going to answer that on this slide. I don't promise

about hash functions that I'm going to tell you about right now, are good in a

very strong sense of the word. I will say these are not obliviously bad hash

functions, they're let's say, somewhat better hash functions. And in particular

if you just need a hash function, and you need a quick and dirty one, you don't want

to spend too much time on it. The method that I'm going to talk about on this slide

is a common way of doing it. On the other hand, if you're designing a hash function

for some really mission-critical code, you should learn more than what I'm gonna tell

you about on this slide. So you, you should do more research about what are the

best hash functions, what's the state of the art, if you have a super important

hash function. But if you just need a quick one what's, what we say on this

slide will do in many, in most situations. So the design of a hash function can be

thought of as two separate parts. So remember by definition a hash function

takes as input something from the universe. An IP address, a name, whatever

and spits out a bucket number. But, it can be useful to factor that into two separate

parts. So first you take an object which is not intrinsically numeric. So,

something like s string or something more abstract. And you somehow turn an object

into a number, possibly a very big number. And then you take a possibly big number and you map it to a much smaller number,

namely the index of a bucket. So in some cases I've seen these two steps given the

names like the first step is formulating the hash code For an object, and then the

second step is applying a compression function. In some cases, you can skip the

first step. So, for example, if your keys are social security numbers, they're

already integers. If they're phone number, they're already integers. Of course, there

are applications where the objects are not numeric. You know, for example, maybe

they're strings, maybe you're remembering names. And so then, the production of this

hash code basically boils down to writing a subroutine that takes, as input, a

string, and outputs some possibly very big number. There are standard methods for

doing that, it's easy to find resources to, to give you example code for

converting strings to integers you know, I'll just say one or two sentences about

it. So you know each character in a string it is easy to regard as a number in

various ways. Either you know just say it is ASCII, well ASCII code then you just

have to aggregate all of the different numbers, one number per character into

some overall number and so one thing you can do is you can iterate over the

characters one at a time. You can keep a running sum. And with each character, you

can multiply the running sum by some constant, and then add the new letter to

it, and then, if you need to, take a module list to prevent overflow. And the

point of me giving you this one to two sentence of the subroutine is just to give

you a flavor of what they're like, and to make sure th at you're just not scared of

it at all. Okay? So it's very simple programs you can write for doing things

like converting from strings to integers. But again, you know, I do encourage you to

look it up on the Web or in a programming textbook, to actually look at those

examples. Okay? But there are standard methods for doing it. So that leaves the

quest, question of how to design this compression function. So you take as input

this huge integer. Maybe your keys are already numeric, like Social Security

numbers or IP addresses. Or maybe you've already some subroutine to convert a

string, like your friend's name, into. Some big number, but the point is you have

a number in the millions or the billions, and you need to somehow take that and

output one of these buckets. And again think of there being maybe a thousand or

so buckets. So the easiest way to do that is something we already saw in a previous

slide, which is just to take the modulus, With respect to the number of buckets. Now

certainly one positive thing you can say about this compression function is its

super simple, Both in the sense that it's simple to code and in the sense that it's

simple to evaluate. Now remember that was our second goal for a hash function. It

should be simple to store, it is actually nothing to store. And it should be quick

to evaluate. And this certainly fits the bill. Now the problem is, remember the

first. Property of a hash function that we wanted is that it should spread the data

out equally. And what we saw in the previous slide is that at least if you

choose the number of buckets N poorly, then you can fail to have the first

property. And in that respect you can fail to be a good hash function. So if for

example if N is even and all of your objects are even, then it's a disaster,

all of the odd buckets go completely empty. And honestly, you know, this is a

pretty simplistic method. Like I said, this is a quick and dirty hash function.

So, no matter how you choose the number of buckets N, it's not gonna be a perfect

hash function in any sense of the word. That said, there are some rules of thumb

for how to pick the number of buckets, how to pick this modulus, so that you don't

get the problems that we saw on the previous slide. So, I'll conclude this

video just with some standard rules of thumb. You know, if you just need a quick

and dirty hash function, you're gonna use the, the modulus compression function, how

do you choose N? Well, the first thing is we definitely don't want to have the

problem we had in the last slide, where we're guaranteed to have these empty

buckets no matter what the data is. So what went wrong in the previous slide?

Well. The problem is that all of the data elements were divisible by two. And the

hash function modulus, the number of buckets, was also divisible by two. So

because they shared a common factor, namely two, that guaranteed that all of

the odd buckets remained empty. And this is a problem, more generally, if the data

shares any common factors with N, the number of buckets. So, in other words, if

all of your data elements are multiples of three, and the number of buckets is also a

multiple of three, you got a big problem. Then everything's gonna hash into bucket

numbers which are multiples of three, too, that's if your hash table will go

unfilled. So the upshot is, you really want the number of buckets to not share

any factors With the data that you're hashing. So, how do you reduce the number

of common factors? Well, you just make sure the number of buckets has very few

factors, which means you should choose N to be a prime number, 'kay? A number that

has no nontrivial factors, And let me remind you, the number of buckets should

also be comparable to the size of the set that you're planning on storing. Again, at

no point did we need "N" to be, you know, very closely connected to the number of

elements that you're storing just within, say some small constant factor. And you

can always find a prime within a small constant factor of a target number of

elements to store. If the number of buckets in your hash table isn't too big,

if it's just in say the thousands or maybe the tens of thousands, then, you know, you

can just look up a list of all known primes up to some point, and you can just

sort of pick out a prime which is about the magnitude that you're looking for. If

you're gonna use a really huge number of buckets in the millions or more, then

there are algorithms you can use for primarily testing which will help you find

a prime in about the range that you're looking for. >> So that's the first order

rule of thumb you should remember if you're using the modulus compression

function, which is set the number of buckets equal to a prime. So you're

guaranteed to not have non-trivial common factors of the modulus shared by all of

your data. So there's also a couple of second order optimizations, which people

often mention. And you also don't want the prime; you want the prime to be not too

close to patterns in your data. So what does that mean, Patterns in your data?

Well, in the phone number example we saw that patterns emerged in the data when we

expressed it base ten. So for example, there is crazy amounts of Lumping in the

first three digits when we expressed a phone number-based ten, Because that

corresponded with the area code. And then, with. Memory locations when we express,

express it base two, there are crazy correlations in the low orbits. And these

are the two most common examples. Either there's some digit, to the base ten

representation or digits in the base two representation where you have, you know,

patterns that is non-uniformity. So that. Suggests that the prime, that, N that you

choose, you know, all else being equal, shouldn't be too close to a power of two,

and shouldn't be too close to a power of ten. The thinking being that, that will

spread more evenly data sets that do have these patterns in terms of base two

representation, or base ten representations. So in closing, this is a

recipe I recommend for coding of hash functions if what you're looking to do is

sort of minimize program ming, programmer time, subject to not coming up with a hash

function, which is completely broken. But I want to reiterate, this is not the state

of the art in hash function design. There are hash functions which are in some sense

better than the ones that expand on this slide. If you're responsible for some

really mission critical code that involves a hash function, you should really study

more deeply than we've been able to do here. We'll touch on some issues in, of

the different optional video, but really you should do additional homework. You

should find out about the state-of-the-art about hash function design. You should

also look into implementations of open addressing in those probing strategies.

And above all you really should consider cold, coding up multiple prototypes and

seeing which one works the best. There's no silver bullet, there's no panacea in

the design of hash tables. I've given you some high-level guidance about the

different approaches. But ultimately it's gonna be up to you to find the optimal

implementation for your own application.