0:03

So let's just look at a little bit of the context of hashing in practical

applications. As I mentioned, it's very widely used. So here's an example right from Java, from the first implementation, Java 1.1. The designers

found that the cost of computing the hash function for strings seemed to be

excessive, particularly for long strings. And one of the main uses of hashing was just to do searching with string keys. So what they decided, in that first implementation, was to look at only every eighth or ninth character, so they wouldn't have to spend a lot of time computing the

hash function. So they had a hash function pretty much like the one that we use.
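A minimal sketch of that full-string, base-31 hash (the same scheme as Java's own `String.hashCode()`):

```java
public class StringHash {
    // The familiar base-31 string hash: it uses every character of the key.
    // This matches the formula java.lang.String.hashCode() computes.
    static int hash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++)
            h = 31 * h + s.charAt(i);
        return h;
    }

    public static void main(String[] args) {
        // Agrees with the built-in hash for any string:
        System.out.println(hash("polynomial") == "polynomial".hashCode());  // true
    }
}
```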

The difference was that it computed a skip, so that it only looked at about every eighth character and didn't have to do quite so much work computing the hash function. And that's definitely one thing to consider when using hashing: the

cost of computing the hash function for a complicated key might exceed the cost of

searching in a simpler structure like a binary search tree. Anyway, for

Java 1.1, what happened was that there was a huge potential for really bad collision patterns on typical data. So here's an example of typical data: a URL.
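To make that concrete, here's a hypothetical sketch of such a skip-based hash (not the actual Java 1.1 code, and the URLs are made up for illustration). Long keys that agree on the sampled characters all get the same hash:

```java
public class SkipHash {
    // Hypothetical sketch of a Java 1.1-style shortcut hash: for long
    // strings, sample only about 8 characters instead of using them all.
    static int skipHash(String s) {
        int h = 0;
        int skip = Math.max(1, s.length() / 8);
        for (int i = 0; i < s.length(); i += skip)
            h = 31 * h + s.charAt(i);
        return h;
    }

    public static void main(String[] args) {
        // These URLs differ only in a character the hash skips over,
        // so they all collide:
        String[] urls = {
            "http://www.cs.princeton.edu/algs4/34hash/file1.html",
            "http://www.cs.princeton.edu/algs4/34hash/file2.html",
            "http://www.cs.princeton.edu/algs4/34hash/file3.html",
        };
        for (String url : urls)
            System.out.println(skipHash(url));  // same value for all three
    }
}
```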

All of these keys, which are totally different, would wind up having the same hash value and colliding. And so client programs and system programs on the Java system were

having terrible performance on their symbol table because of the shortcut in

hashing. So this illustrates well that you need to use all of the data in the hash function. And sometimes, when we do a closer analysis, the cost of computing the hash function can mean that something like red-black trees will even outperform hashing, even for just search and insert. Another thing about the uniform

hashing assumption is that it is an assumption. If you are writing code where you have to have guaranteed performance, like when your aircraft is landing, or you are controlling a nuclear reactor, or somebody's pacemaker, then if that assumption doesn't hold and you get bad performance, you're going to have

disastrous consequences. So that's another reason to think about maybe paying a little extra and going with the guarantee that you get from red-black search trees

2:47

instead of hashing. And there's another surprising situation that happens in today's world. For example, Java publishes its hash function. So if you're trying

to provide a service over the web, an adversary can learn your hash function and just send you data that causes huge performance problems, by making all of that data hash to one particular slot. And that's definitely something to worry

about. In the real world, you can nowadays find on the web particular sequences of keys that will cause particular services to crash. And again,

that's a little harder to do with something like a red-black tree, where we have performance guarantees. When you make an assumption and you're depending on it, you better be sure that it holds somehow. This

is different than, for example, quicksort, where our assumption was: we're going to create randomness and we're going to depend on that randomness. In this case, we're kind of hoping for randomness, and maybe that doesn't really

always hold. So that's certainly something to be aware of when using hashing in

practice. So here's just a simple example of hashing in Java. It's pretty easy to find a family of strings that all have the same hash code. With just a little fooling around (nowadays you can just look it up on the web), you can see that these two-character keys both have the same hash code: when you do the math for the base-31 hash code, it comes out the same. Well, what

that means is that, just like working in binary, you can combine those keys in all possible ways, and you can get two to the N strings, of length 2N, for any N, that all hash to the same value. And if somebody's implemented a

service in Java that uses a symbol table with string keys, you can cause that service to crash in this way. A little bit scary for some systems designers; at

least, reason for pause in using hashing. Now, hashing also has an extremely

important application in today's Internet commerce: the concept of so-called one-way hash functions, which we use to have secure fingerprints for use on the web. And there's been a lot of

research done to develop functions that take keys as input, and then produce

values that look random, in such a way that it's hard for someone else to find another key that collides with it. This technology is useful for storing

passwords and digital fingerprints and things, but it's too expensive for use in a symbol table. So, the bottom line on separate chaining versus linear probing as collision-resolution methods: there are a number of considerations to take

into account. Separate chaining is really easy to implement, both insert and delete; its performance degrades gracefully; and clustering is maybe

less of a problem if you have a bad hash function. Linear probing tends to make

better use of space, and it'll also perform better for huge tables, where caching is involved. Classic computer-science problems for people to think about are: what do we do to delete in these two situations,

and exactly how do we resize. Those are all at the level of exercises in the

context of the kinds of algorithms that we've seen. And as I mentioned, there have been many, many improved versions of hashing studied. I

mentioned the two-probe, or double-hashing, version. Another way to use two hash functions is just to hash to two positions and put the key in the shorter

of the two chains. In that case, the expected length of the longest chain will be lg lg N, which is quite an improvement: you get constant time expected and lg lg N worst case. Double hashing is the variant of linear probing

where you just skip a variable amount, not one each time. And that pretty much wipes

out clustering, but it is more difficult to implement delete for that one. And there's a relatively new method called cuckoo hashing. It's

another variant of linear probing where we hash a key to two positions and insert the

key in either one. If it's occupied, you reinsert the displaced key into its alternative position: each key can go to one of two places. And that one actually gives constant worst-case time for search. That's another variation on the theme. And

all of these things allow us to make better use of memory, allowing the table to become nearly full. These would have been very exciting things for researchers in the 1950s, who cared so much about memory. Nowadays, a little extra memory is not

something that people care about so much and most people just go with the easy

algorithm, except for really performance-critical applications. What about hash

tables versus balanced search trees? Well, hash tables are really simple to code, usually, if you don't have to write the hash function yourself. And if you don't have order in the keys at all, then you need a compareTo() to implement balanced search trees, so

you have to use hashing if you don't have the comparison. And it'll probably be

faster for simple keys to use hashing: it's a few arithmetic operations to do the hash, versus lg N compares for the balanced tree. And there's better

system support in Java for strings: the cached hash code means that you don't even have to recompute the hash if your symbol table for strings is in an inner

loop. On the other hand, balanced search trees have a much stronger performance

guarantee: you don't have to assume anything, and it's going to be less than about lg N compares. It's also got support for all those ordered ST operations, and compareTo() is a pretty easy and natural function to implement. So it's more flexible and

more broadly useful. And actually the Java system and other systems include both so

that programmers can make use of either one in different situations. That's our

context for hashing algorithms.
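As a concrete footnote to the collision-family point above, here's a quick check of the two-character keys using Java's built-in string hash. The math: 'A'·31 + 'a' = 65·31 + 97 = 2112 = 66·31 + 66 = 'B'·31 + 'B'.

```java
public class CollisionFamily {
    public static void main(String[] args) {
        // Two-character keys with the same base-31 hash code:
        System.out.println("Aa".hashCode());  // 2112
        System.out.println("BB".hashCode());  // 2112

        // Combining them in all possible ways gives 2^N strings of
        // length 2N that all hash to the same value:
        System.out.println("AaAaAa".hashCode() == "BBBBBB".hashCode());  // true
        System.out.println("AaBBAa".hashCode() == "BBAaBB".hashCode());  // true
    }
}
```

Sending a service a few thousand strings built this way is exactly the adversarial attack described earlier.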