0:00

So, in this video, we're going to discuss Bloom filters which is a data structure

developed appropriately enough by Burton Bloom back in 1970. Bloom filters are

variant on hash tables, you'll recognize a lot of the ideas from our hash table

discussion. The win that you get in Bloom filters is that they are more space

efficient than run of the mill hash tables and they're going to handle, they

do allow for errors, there is a non zero false positive probability when you do

look ups but that's still a win for some applications. So, it's a very cool idea,

very cool data structure. You do see it used quite a bit in practice so let's

start talking about it. So, we'll go through the usual topics that we do

whenever we discuss a new data structure. So first, I want to tell you what

operations they support and what kind of performance you're going to expect from

those operations, in other words, what is the API corresponding to the data

structure. Secondly, I'm going to talk a little bit about what it's good for. So,

what are some potential application and then we'll take a peek under the hood.

I'll tell you some of the implementation details with an emphasis on explaining why

you get the kinds of performance trade offs that you do with Bloom filters. So,

to first order, the raison d'être of Bloom filters is exactly the same as a

hash table. It supports super fast inserts, super fast look ups. You can put

stuff in there and you can remember what you put in earlier. Now, of course, what

you should be wondering is what we already know what data structure that supports

super fast in certain look ups, a hash table. Why am I bothering to tell you

about yet another data structure with exactly those same operations? So, let me

tell you about the pros and cons of Bloom filters relative to run off the mill hash

tables as we've already discussed. The big win is that Bloom filters are more space

efficient than hash tables. No matter whether they are implemented with chaining

or with open addressing, you can store much less space per objects. In fact, as

we'll see, less space than that of an object itself using a Bloom filter. As

far as the cons, well, first of all, this is really for applications where you just

want to remember what kind of values you see. You are not trying to store pointers

to the objects themselves and just trying to remember values. So, the first drawback

of the Bloom filter is that because we want to be so space efficient, we don't

even want to remember the object itself just whether or not we've seen it before.

We're not going to be able to store the objects or even pointers to the objects in

a Bloom filter. We're just going to remember what we've seen and what we

haven't. So, some of you might know the terminology hash set for this kind of

variant of a hash table as opposed to a full blown hash table or hash map. The

second con is at least in the vanilla implementation of Bloom filters that I'm

going to describe here, deletions are not allowed. You can only insert, you can't

delete. The situation with deletions is very much similar to hash tables

implemented with open addressing. It's not that you can't have a Bloom filter that

accommodates deletion, you can, there are very instances of it but that requires

significantly more work and we're not going to discuss it here. So, the first

order at least for vanilla Bloom filters, you want to think of them as suitable for

applications or deletions or not a first order of operation. Now, the third con and

this is a drawback that we have not see previously using any data structures is

Bloom filters can actually make mistakes. Now, what kind of mistake could this kind

of data structure possibly make when all you're really doing is looking something

up. Well, one of mistake would be a false negative and that means you have inserted

something previously then you look it up and the hash table or the Bloom filter

says, it's not there. So, Bloom filters will not have false negatives of this

form. You've insert something, you look it up later, it's definitely going to

confirm that you inserted it in the past. But Bloom filters will have false

positives, that means that despite the fact you have never inserted say, a given

IP address into the, into the Bloom filter, if you look it up later, it will

say that you have. So, there will sometimes be in some sense phantom objects

in Bloom filters, objects which it thinks have been inserted even though they

haven't been. So, given that, I am now showing you two data structures with

essentially the same functionality, hash tables and Bloom filters, at least, if we

ignore the deletion issue. You might want to wonder which one is more appropriate,

which one is more useful. And because there is these trade offs between the two,

the answer as you expect is, it depends on the application, right? So, if it's an

application where space is really at a premium, you might want to turn to Bloom

filters especially if a small chance of a false positive is not deal breaker. If you

have some kind of application where false positives are absolutely out of the

question, of course, you should not use a Bloom filter and you want to think about a

hash table. So, what are some situations where people actually do use Bloom filters

where you either really care about space and/or you don't really care about this

false positive probability. For one of the earliest applications of Bloom filters,

this is not long time ago, this is something like 40 years ago, was the spell

checkers. So, how would you implement a spell checker using a Bloom filter? Well,

first you have this insert phase where you basically just go through the entire

dictionary word-by-word and you insert every valid word into the Bloom filter.

Then, afterwards, when you're presented with a new document that somebody has

written, you're going to go through the document word-by-word for each word, you

say, is this in the Bloom filter? That is, is this one of the legitimate word from

the dictionary which is previously inserted? If the Bloom filters says yes,

this word is in the dictionary as in we've stored and seen that before, then you

treat is as a correctly spelled word and if it's not in the Bloom filters, then you

treat it as incorrectly spelled word. Now, the false positive probability means this

isn't a perfect spell checker. I mean sometimes, you're going to look up a

misspelled word and the Bloom filter won't catch it and it willl actually say yes,

with small probability, we'll say, this is a legitimate word. So, you know, it's not

ideal but, you know, the, the English language is pretty big and space was

definitely at a premium, 40 plus years ago. So, it was a win for that application

at that time, to use Bloom filters to implement a spell checker. Another

application which, you know, remains relevant today is to keep track of a list

of forbidden passwords. Now, why would you have forbidden passwords? Well, maybe, you

want to keep track of password which are too weak or too easy to guess or too

common. You may, yourself, have used the piece of software or website at some point

where it asked you for a password and if you typed in something which is too simple

or too easy, rejected it and asked you to type in another one. So, one way to

implement a list of forbidden passwords is just with the Bloom filter and the idea is

similar to the spell checker. You first, insert into the Bloom filter all of the

passwords that you don't want anybody to use for whatever reason. Then, when a

client comes and tries to type in a new password, you look it up in the Bloom

filter and if you get a positive look up, then you tell the user, no, that's no

good, you can't use that password, choose another one. And this is an application

where you really don't care about the errors, you really don't care about the

fact that there's a false positive rate. Let's assume that the error rate is

something like one percent or 0.1%. So, what would that means in context, that

would just mean once in a while, one in a hundred clients or one in a thousand

clients actually types in a perfectly strong password that gets rejected by the

Bloom filter and they have to type in a second one. Okay, but big deal and if

space is at the, the premium, this is definitely a win to use this super

lightweight data structure to keep track of these blocked passwords. These days

certainly one the killer applications of Bloom filters is in software deployed on

network routers. So, the machinery out in the Internet which is responsible for

transmitting packets from one place to another. So, what are the reasons why

Bloom filters have found fertile application in network routers? Well,

first of all, you do have a budget on space, typically on network routers.

There's a lot of things that you got to do and you don't want to waste that much of

it on some random data structure to do one's specific task. So, you do have a

budget on space and also, you need super, super fast data structures, right? Since

these packets are coming in at this torrential rate which you can't even

imagine and you want to process these packets in real time, sending them off to

the next top. Bloom filters are the work force behind a lot of different tasks that

is done in the network router. You can imagine wanting to keep track of blocked

IP addresses, you can imagine keeping track of the contents of some cache so you

don't do spurious look ups. You can imagine maintaining statistics to check for denial

of service attacks and so on and so forth. So, summarizing as a expert programmer,

what is it that you should remember about Bloom filters, what purpose does this

tool serve in your tool box? Well, as far as the operation supported which is the

same as a hash table, the point is to have super fast inserts, super fast look ups.

But Bloom filters are more lightweight version of a hash table. So, they are more

space efficient but they do have this drawback of having a small error

probability. So, those are the key features you should remember when deciding

whether or not you are working on an application that could make good use of

this data structure. So, having discussed one of th e operations and what these data

structures are good for, let's take it to the next level, let's peer under the hood

and see how they are actually implemented. Cuz this is really a quite simple, quite

cool idea. So, like hash tables, Bloom filters have essentially two ingredients.

First of all, there's an array and second of all, there's a hash function or in

fact, several hash functions. So, we're going to have a random access array

except, instead of having n buckets or n slots as we've been calling them, each

entry in this array is just going to be a single bit. Each entry in this array can

only take on two values, zero or one. And the way they think about the space

occupied by Bloom filters is in terms of the number of bits per object that has

been inserted into the Bloom filter. So, if you have inserted the data set capital

S, then the total number of bits is n, the number of objects that have been inserted

is cardinality of s. So, n / |s| is the number of bits in this data structure that

you are using per entry in the data set. Now, you can tune a Bloom filter so this

ratio is any number of different quantities but for now, I encourage you to

think of this ratio as being eight, that is for each object stored in the Bloom

Filter, you are using only eight bits of memory. That will help you appreciate just

how amazing this data structures are, right, cuz maybe our data set is something

like IP addresses which is 32 bits so what I'm saying here, if this is eight, I'm

saying we are not, definitely not actually storing the IP address. So, we have this

32-bit object we are inserting and we are only using eight bits of memory. This is

how we are going to remember whether its there or whether its not. And again,

certainly, eight bits per object is way less than keeping a pointer to some

associated memory somewhere. So, this is a really impressive minimal use of space to

keep track of what we've seen and what we haven't. And secondly, we need mappings of

given an object to say, given the IP address, what are the relevant bits for

seeing if we've seen this IP address before or not? So, in a Bloom filter, its

important to have not one hash function, but several hash functions. So, k is going

to denote the number of hash functions in the Bloom filter which you think of k is

some small constant somewhere, you know, three, four, five, or something like that.

So, obviously it's a little bit more complicated to use multiple hash functions

as supposed to just one hash function. But it's really not that big of deal. So, we'll

call from our discussion of say, universal hashing, we have identified the entire

families of hash functions which will work well on average. So, instead of choosing

just using one hash function at random from universal family, you gave me k

independent random choices from universal family. In fact, in practice, it seems to

typically be enough to just use two different hash functions and then generate

k different linear combinations of those two hash functions. But for the purposes

of this video, let's just assume that we've done enough work to come up with k,

different good hash functions and that's what we're going to be using in our Bloom

filter. So, the code for both insert and delete is very elegant. So, let's

start by insertion. So, suppose we have some new IP address and we want to stick

into these Bloom filter, what we do? Well, we'll just evaluate each of our k hash

functions on this new object. Each of those tells us an index into our array of

bits and we'll just set those k bits equal to one. And when we do this insert, we

don't even bother to look at what the previous values of these bits were.. So,

zero or one, we don't care. We'll just blithely go in and set this k bits equal

to one, whatever they were before. So, what about looking up? How are we going to

implement that? Well, all you have to do is check for the footprint that was

inevitably left by a prior insertion. So, if we're looking up an IP address and we

know was inserted sometime in the past, what happened when we evaluated the k hash

functions, we went t o appropriate positions in the array and we set all of

those bits to one. So now, I'll just check that, that indeed happened, that is when

we get a new IP address, we're looking it up. We evaluate the hash functions, all k

of them. We look at the corresponding k positions and we verified that indeed

those k bits have been set to one. So, what I hope is clear fairly quickly from

inspecting this very elegant code is that we will not ever have false negatives,

yet, we might have false positives. So, let's discuss those one other time. So,

remember, a false negative would mean that the Bloom filter says, something isn't

there when in fact, it is, that is we insert something and we'll look it up

later and the Bloom filter rejects us. Well, that's not going to happen. Cuz when

we insert something, we set the relevant k bits to one. Notice when a bit is one, it

remains one forevermore. That bits are never reset back to zero. So, if anything

was ever inserted in the subs when we look it up, definitely we well confirm that all

those bits are one. So, we're never going to be rejected by something we inserted

before. On the other hand, it is totally possible that we will have a false

positive. It's totally possible that there will be a phantom object and we'll

do a look up and the Bloom filter will turn yes when we never inserted that

object. Suppose for example, the k = three. So, we're using three different

hash functions. Consider some IP address, fixed IP address, maybe the three hash

functions tell us the relevant bits are seventeen, 23, and 36. Maybe we never

inserted this IP address, but we have inserted IP address number two and in its

insertion, the seventeenth bit got set to one. We inserted some other IP address, IP

address number three and the twenty-third bit got set to one. And then we inserted

IP address number four and the 36th bit got set to one. So, three different IP

addresses were responsible for setting these three different bits but whatever,

its not like we are remembering that. And that once, once we look up this IP address

we really care about, what do we do, we just inspect bit seventeen, its one.

Inspect the 23, its one. We inspect the 36, its also one. For all we know, this

thing really was inserted and the Bloom filter is going to say, yes, it's in the

table. So, that's how we have false positives. All of the bits that are

indicating whether or not a given object are in, are in the Bloom filter were

previously set by insertions from other objects. So, there are two points that I

hope are clear at this stage of the discussion. First of all, that this Bloom

filter, the idea does suggests a possibility of a super space efficient

variant of a hash table, right. So, we've been talking about setting the number of

bits to be say roughly eight times the number of objects that you're

storing so you're only using eight bits per object and for most objects, that's

going to be radically smaller than just the simple array, storing the objects

themselves. Again, if their IP addresses we're only have 25 percent much space as

we actually stored those IP address in just an array with no extra bells and

whistles. The second point is that we're inevitably going to have errors in a

Bloom filter, we will have false positives or we look something up and it

says, its there when in fact, its not. So, those two points I hope are clear. What's

actually not clear is the bottom line. Is this actually a useful idea? For this to

be useful, it'd better be the case that the error probability can be quite small

even while the space per object is quite small. If we can't get those two things

small simultaneously, this is a bad idea and we should always just use a hash table

instead. So, to evaluate the quality of this idea, we're going to have to do

little bit of mathematical analysis. That's what I'm going to show you in the

next couple of slides.