0:00

We're finally in a position to describe the most important behaviors

that the cryptographic hash function needs to have.

Like most useful hash functions, we need it to be deterministic, meaning that,

given the same input, it will always produce the same output.

This sounds pretty obvious, but

there actually are hash functions that are probabilistic.

And there are even some cryptographic uses for them.

But the purposes we have been discussing require determinism.
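As a minimal sketch of what determinism means in practice, here is the property demonstrated with Python's standard hashlib (SHA-256 is just a convenient example, not the only choice):

```python
import hashlib

# Hashing the same input twice always yields the same digest.
d1 = hashlib.sha256(b"attack at dawn").hexdigest()
d2 = hashlib.sha256(b"attack at dawn").hexdigest()

assert d1 == d2  # deterministic: identical input, identical output
print(d1)
```

No hidden randomness is involved: the digest depends only on the message bytes.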

0:27

Although not a top priority, we do need it to be fast.

We don't want this to become the bottleneck in our ability to carry out

high speed communications.

But in general, we're usually willing to, and usually have to,

sacrifice a lot of speed before we start giving up performance in the other areas.

As a result, cryptographic hash functions are commonly hundreds of times slower

than their non-cryptographic peers.

But in addition to raw speed,

we also want the hash function to be computationally simple,

so that it can be implemented in hardware and run on slow,

memory-starved devices such as the chips used in credit cards.

While modern devices are incomparably faster and

more capable than devices from the latter half of the 20th century,

when many hardware-oriented cryptographic algorithms were pioneered,

the amount of complexity we can throw at the function and

still achieve acceptable speed and cost does require compromises in other areas.

1:27

For instance, there are many provably secure hash functions

based on known mathematical problems that are believed to be "hard", so hard,

in fact, that pretty much all of public key cryptography is based on them.

But these hash functions are seldom used, because they are too slow and

too computationally demanding to be acceptable for most applications.

Instead, most mainstream hash functions are actually quite ad hoc,

relying on design principles that have evolved somewhat by trial and error and

that really amount to little more than people hoping that they are secure enough

based on a lot of collective experience, but that they can't prove.

And this doesn't always work out in the long run as evidenced

by a host of eventual breaks against several of them.

2:14

Thus we have another example of the reality that in nearly

any form of engineering, there are seldom optimal solutions available and

practical solutions represent compromises.

At the end of the day, it comes down to carefully considering what

must be achieved for a solution to be considered good enough.

And then accepting that if you have a solution that is good enough,

then it's good enough.

As the saying goes, there comes a point in the history of every project when

it's necessary to shoot the engineers and begin production.

We often have to do this knowing that what is good enough today

may not be good enough tomorrow.

So this is something that will tend to haunt us as we try to achieve

the rest of our design goals.

One of the more important of those goals is that we want the digest for

any two messages, no matter how closely related, to be uncorrelated.

Meaning that any similarity between the digests of two different messages

is as close to pure chance as we can make it.

This is most strongly evidenced by the behavior that even a single bit changed

anywhere in the message,

even the very last bit, results in about half of the bits in the digest flipping.

This requires what is known as strong avalanche,

in that every bit of the message has the ability to affect every bit of

the digest with about an even chance of actually doing so at any given time.
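The single-bit experiment just described is easy to try. This sketch (again using SHA-256 as a stand-in for any modern cryptographic hash) flips only the last bit of a message and counts how many digest bits change:

```python
import hashlib

def bit_diff(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

msg = bytearray(b"the quick brown fox")
d1 = hashlib.sha256(bytes(msg)).digest()

msg[-1] ^= 0x01  # flip only the very last bit of the message
d2 = hashlib.sha256(bytes(msg)).digest()

# Strong avalanche: roughly half of the 256 digest bits should flip.
print(bit_diff(d1, d2), "of 256 digest bits flipped")
```

Running this for many messages, the count clusters around 128, exactly the binomial behavior you would expect if each digest bit flipped independently with probability one half.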

The effect of achieving this is that it becomes extremely difficult,

ideally impossible, to take a message and examine its digest.

And then make an incremental change to the message with the intent of moving

the digest a little bit closer to the desired target.

If we could do this, then we could perform some type of differential or

gradient descent attack on the hash function

to find collisions much more efficiently than brute force.

A number of hash functions have fallen prey to such attacks.

So achieving this lack of correlation is very important.

Unfortunately, proving that we have achieved it is impossible for

most existing computationally efficient hash functions.

4:18

In light of our earlier discussion of the attacker's goals,

we need the function to be preimage resistant and

also second preimage resistant, which largely comes down to it being

a non-invertible function, also known as a one-way function.

This, in turn, means that we can't take the hash value and

construct a message from it that will produce that hash.

If we could walk backward through the hashing algorithm,

then we could eventually work our way back to a suitable message.

It's tempting to think that since most hash functions are implemented as pretty

basic combinatorial building blocks, we should be able to take the outputs and

figure out the possible inputs that can produce them.

After all, if nothing else,

we should be able to write a truth table and then simply reverse it.

The result would be, for any given output, a set of possible inputs and

possibly a set of outputs that can never happen.

The problem is, once again, one of scale.

Many hash functions use 32-bit building blocks.

For instance, SHA-1 uses 5 such blocks for 160 bits of internal state.

And the primary hashing element is a block that combines 3 of these

blocks via simple but nonlinear bitwise logic functions.

So while there are only 4 billion possible outputs,

there are two to the 96th possible inputs.

Enumerating this table gets us right back into those

astronomically large numbers we've seen before.

And that's just one piece of the process.

To generate the final 32-bit output that becomes just part of the next state,

not only are 160 bits of the previous state used, but

also 32 bits that are derived from the message and

a 32-bit constant that changes depending on what stage the algorithm is at.

6:06

Thus just to determine 32 bits of the next state, not even the final

output of the hash, there are 224 bits of information that are used.

To make matters worse, this process is repeated a total of 80 times for

each 512 bits of the message.
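To make the structure just described concrete, here is a sketch of SHA-1's three-word logic functions and a single round step, following the FIPS 180 specification (this is only the round skeleton, not a complete hash; the message schedule and padding are omitted):

```python
MASK = 0xFFFFFFFF  # keep every value in 32 bits

def rotl(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def f(t, b, c, d):
    """SHA-1's per-round logic function, combining three 32-bit state words."""
    if t < 20:
        return (b & c) | (~b & d)        # "choose"
    if t < 40 or t >= 60:
        return b ^ c ^ d                 # "parity"
    return (b & c) | (b & d) | (c & d)   # "majority"

# One 32-bit stage constant per 20-round stage.
K = [0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6]

def round_step(t, a, b, c, d, e, w):
    """One of the 80 rounds: all 160 bits of state, 32 message-derived
    bits w, and the stage constant feed the new 32-bit word."""
    temp = (rotl(a, 5) + f(t, b, c, d) + e + w + K[t // 20]) & MASK
    return temp, a, rotl(b, 30), c, d
```

Note how the new word `temp` depends on the entire previous state plus the message word and constant, which is exactly why reversing even one round from partial information is so costly.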

6:23

This is not to say that this logic function can't be inverted.

As already emphasized, we'd like to base our functions on hard math problems and

this is possible.

It's just not computationally practical yet.

So we use functions that really mix things up and

then we repeat this process over and over.

But invariably,

some of the steps will be found to be somewhat invertible, which we can use to

reduce the amount of effort required to carry out a sophisticated attack.

When that happens, we use what has been learned to try to avoid incorporating

similar weaknesses into future generations of functions.

It is very much a black art as well as a science.

We also need it to be collision resistant,

meaning that it is hard to find any two messages that yield the same hash value.

The goal here is to make the best way of finding collisions a brute

force birthday attack.

Again, real hash functions that are not based on hard problems are subject to

death by a thousand cuts:

finding enough small but exploitable weaknesses can reduce the number of digests

needed to carry out a birthday attack to a doable number.

This is essentially what happened to SHA-1, where a number of weaknesses were

discovered, the net effect of which was to reduce the number of digests needed from 2

to the 80th down to about 2 to the 63rd, a reduction by about a factor of 100,000.

And even so it took about a year to carry out the attack using some massively

parallel computing resources.
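Those numbers check out with simple arithmetic: a brute force birthday attack on an n-bit hash needs about 2^(n/2) digests, so for SHA-1's 160-bit digest the ideal cost is about 2^80:

```python
# Ideal birthday bound for a 160-bit digest: about 2^(160/2) digests.
ideal = 2 ** (160 // 2)

# The accumulated weaknesses reduced the work to roughly 2^63.
broken = 2 ** 63

# The ratio is 2^17 = 131072, i.e. roughly a factor of 100,000.
print(ideal // broken)
```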

7:52

Next let's consider a couple of interactions between these properties.

Any hash function that is collision resistant

is also second-preimage resistant.

After all, if I have a function that is not second-preimage resistant,

then it is not collision resistant, since the result of a successful

second-preimage attack is a hash collision.

However, collision resistance does not imply preimage resistance,

at least in the strict sense.

That might seem rather strange,

since it implies that we could have a hash function for which, given a digest, or

at least some digest, we can efficiently find a message that produces that digest,

but for which we can't efficiently find any collisions.

8:35

But it's actually quite easy to construct hash functions for

which this is true, though they tend to be pathological,

meaning that they have a weakness, not because we discovered it, but

because we designed it specifically so as to have that weakness.

One example is our identity hash function

that simply sets the digest equal to the message.

It is trivially invertible, but no collisions exist.

But this requires that we restrict the message to being the same size as

the digest, which few real hash functions do.

So if this is the only way we can get this behavior,

then it is of no real consequence.

But can we construct a hash function that takes arbitrarily sized messages and yet

still has this behavior?

Consider the following N-bit hash function: if a message happens to be

exactly N-1 bits, then our digest is a 0 followed by the message.

Otherwise, we run the message through an N-1-bit collision resistant hash function,

perhaps a random oracle,

and our digest is a 1 followed by its output.

Now if you give me a digest that starts with a 0,

I can give you a message that produced that digest.

The fact that I can't efficiently do this for

a digest starting with a 1 doesn't matter.

In order to be preimage resistant, finding a preimage must take on the order of brute force effort for all digests.
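The two-branch construction above can be sketched directly. In this sketch, messages are strings of '0'/'1' characters, N is 257, and SHA-256 stands in for the N-1-bit collision resistant hash (these representation choices are mine, made just to keep the example small and self-contained):

```python
import hashlib

N = 257  # digest size in bits; N-1 = 256 matches SHA-256's output size

def pathological_hash(msg_bits: str) -> str:
    """msg_bits is a string of '0'/'1' characters of any length."""
    if len(msg_bits) == N - 1:
        # Trivially invertible branch: digest is a 0 followed by the message.
        return "0" + msg_bits
    # Otherwise: a 1 followed by an (N-1)-bit collision resistant hash
    # (SHA-256 standing in for the random oracle).
    h = hashlib.sha256(msg_bits.encode()).digest()
    return "1" + format(int.from_bytes(h, "big"), "0256b")

# Given a digest that starts with 0, we recover a preimage immediately:
digest = pathological_mask = pathological_hash("1" * 256)
assert digest[0] == "0"
preimage = digest[1:]          # invert by stripping the leading 0
assert pathological_hash(preimage) == digest
```

Collisions across the two branches are impossible because the leading bit differs, and collisions within either branch would require a SHA-256 collision or two identical 256-bit messages, so the function stays collision resistant despite being efficiently invertible on half of its digests.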

In practice, any hash function that is collision resistant

is probably preimage resistant, but we can't just assume that it is.

10:06

This has been a short, succinct lesson, so our review isn't going to be very long.

Cryptographic hash functions need to be deterministic, uncorrelated, efficient,

preimage resistant, second preimage resistant, and collision resistant.

No real hash function is likely to achieve all of these goals, and

instead each represents compromises between them.

Unfortunately, the nature of the compromises is often such that

the implications of how much we are giving up in the other areas

are not really known, which leaves the door open for Mallory and his friends.