0:00

In part one of this lesson, we looked at the preimage and second preimage attacks,

which are both order 2 to the n when carried out via brute force,

which is the only way they can be carried out against a random oracle.

Now we get to a very interesting attack that is quite non-intuitive.

In a hash collision attack, all we are interested in doing is finding

two messages that happen to have the same digest.

And we don't care what the specific digest is.

Intuitively, we probably expect this to be little different from a second

preimage attack.

After all, it would appear to be essentially the same thing.

We want two messages that have the same digest.

But appearances can be deceiving because the seemingly minor distinction that

we don't care what the actual digest is,

makes all the difference due to what is known as the birthday paradox.

And which is why this attack is usually referred to as a birthday attack.

Many practical breaks in real cryptosystems,

such as the breaking of WEP keys and

the recently demonstrated break in SHA-1, involve some kind of birthday attack.

So let's spend some time getting familiar with this.

The Birthday Paradox asks a very simple question that most people,

the first time they see it, can confidently guess an answer that

is almost always not even close to being correct.

How many randomly chosen people do I need to have in a room

before there is a 50/50 chance that at least two people have the same birthday?

Let's simplify the problem by ignoring leap years, so we have 365 possible

birthdays, and also assuming that the distribution of birthdays is uniform.

Before answering this question, let's look at the analog to a preimage attack

that would be how many people I have to gather in a room before there's

a 50/50 chance of there being someone that has a specific birthday?

I can make this a second preimage attack by just specifying that I want

to find someone that has the same birthday as me.

2:11

We already talked through the solution to this problem, which for small d,

the number of days or the number of digests, is given exactly by

this expression, which requires about 253 people on average.

This may or may not seem too surprising to you,

since it might not seem that different from 183 people.

But it's important to understand why it is significantly more than half.

Remember that we are asking how many people we need to consider before finding

someone that has a specific birthday.

If you think about it, you'll realize that you we're actually never guaranteed of

finding someone that has that birthday no mater how many people we look at.

One way to think of this is that there's a finite,

albeit vanishingly small, possibility that no one else on this planet

happens to have been born on the same day of the year as me.

3:04

Now let's consider the analog to the hash collision attack.

How many people do I need to consider before there is a 50/50 chance of finding

two people that happen to have the same birthday?

The first major difference is that because of the pigeonhole principle,

if I put 366 people in a room,

I am absolutely guaranteed that there are two people with the same birthday.

So as long as there are more messages than digests, I will find a collision.

But how many people do I need to be in that room before I have

a 50/50 chance of succeeding?

We can get at this by calculating the probability that

none of the k people currently in the room have the same birthday,

which we can do by adding people to the room one at a time.

If only one person is in the room,

then the likelihood that no two people in the room have the same birthday is 100%.

Or another way of saying this is that the one person can have any of

the 365 birthdays out of the 365 available birthdays.

When I put the second person into the room, the chance that they don't share

a birthday with the person already there is 364 divided by 365,

because they can have any birthday except one.

When I add the third person,

they can have any birthday except two, or 363 out of 365.

The next person can have any of 362 birthdays.

This pattern continues all the way down to the kth person.

We can now see mathematically why the 366th person

guarantees a solution, since the term associated with them is identically zero.

Meaning that the probability is zero

that no two people in the room have the same birthday.

We can write this using product notation.

We can also express it using factorial notation.

While there is no closed form solution for k, computing this probability using

one of these expressions is actually quite easy if D and

k are pretty small, especially if we use a simple spreadsheet.

We can work up incrementally from one person until the probability drops to less

than 50% since that means that the probability that at least two people do

have the same birthday has just risen above 50%.

Doing this, we find that the number of people needed to reach the 50/50

threshold is a surprisingly low 23 people.

6:25

This leaves us with a very simple expression that we can solve

directly for k.

We see that we just need a little more than the square root of the number

of digests.

As a sanity check for D equals 365 days,

we get 22.5 people, which is in excellent agreement with our previous results.

So we can have very good confidence that we haven't done anything wrong.

For a hash function, we can write D in terms of the size of the digest.

The key observation is that the number of hashes that must be computed is on

the order of the square root of the number of possible digests.

It also means that while the difficulty of a hash collision attack is still

exponential in the size of the hash, the effective number of bits is

only half compared to the difficulty of a preimage or second preimage attack.

This is extremely significant.

An example in a prior module described the related key attack against the original

WIFI WEP encryption which had a 64-bit key, but

only 24 bits of it actually changed with each message.

This meant that the effective number of keys were about 16 million.

But since it was only necessary to find two packets that used the same key,

a birthday attack only required about 5,000 packets in order to recover the key.

An attack that could be carried out in real time.

Recall from another example, that if we calculate a million digests a second,

then it would take half a million years to carry out an exhaustive preimage or

second preimage attack against a 64-bit hash function,

though we would have an even chance of finding a solution in about 400,000 years.

A birthday attack, on the other hand,

would have an even chance of finding a collision in less than an hour and a half.

This is why adversaries go to great lengths to figure out ways to perform

birthday attacks against cryptosystems, and

why some form of birthday attack is often the first successful crack into a system.

The first hash collision involving SHA-1 was published in early 2017, and

was the result of finding weaknesses that reduced the expected number of digests

that needed to be calculated from the 2 to the 80th needed against a random oracle,

down to about 2 to the 63rd,

which still cost about $100,000 worth of computer time to carry out.

While expensive, however, this is well within the reach

of large criminal organizations, not to mention national intelligence agencies.

Whether this capability is worth the cost for either is debatable.

And for the foreseeable future,

it is still highly unlikely that anyone will actually be able to benefit from it.

But security specialists don't think in these terms if they don't have to.

And thus, the use of SHA-1 is very quickly on the way out.

For instance, most major Internet browsers announced several years ago,

well before the first and so far only actual break,

that they will no longer accept SHA-1 signed certificates by the end of 2017.

Present best practices recommend that a 160-bit hash function

is the smallest that should be used if any kind of birthday attack is conceivable,

but 256-bit is required for most classified cryptosystems.

9:40

In the two parts of this lesson,

we've looked at the three attacks against the Random Oracle, and

seen that both preimage and second preimage attacks are ordered 2 to the n.

But the hash collision attack is order square root of 2 to the n,

which makes it extremely powerful.

Nearly all other attacks against real hash functions exploit

weaknesses in the algorithms in an attempt to reduce the amount of computational

effort to agree where some kind of brute force attack becomes feasible.

We are now at the point where we can discuss the properties that a real

cryptographic hash function should have.