In this module, we're gonna talk about a new concept called collision resistance, which plays an important role in providing message integrity. Our end goal is to describe a very popular MAC algorithm called HMAC, that's widely used in internet protocols. HMAC itself, is built from collision resistant hash functions. Before we do that, let's do a quick recap of where we are. In the last module we talked about message integrity where we said that the MAC system is secure if it's existentially unforgeable under a chosen message attack. This means that even an attacker who is given the tag on arbitrary messages of his choice cannot construct a tag for some new message. And then we showed that in fact any secure PRF immediately gives us a secure MAC. And so then we turned around and said well can we build secure PRFs that take large messages as inputs? And so we looked at four constructions. The first construction was based on CBC, we call it when we looked at two variants of it, one called encrypted CBC and one called CMAC. And we said that these are commonly used with AES. In fact, in the 802.11i standard, a CBC-MAC is used for message integrity. In particular, with the AES algorithm. We looked at another mode called NMAC, which also converts a PRF for short inputs into a PRF that's capable of taking very large messages as inputs. And these two were both sequential MAC-s. We then looked at the parallelizable MAC called PMAC which again was able to convert a PRF for small inputs into a PRF for very large inputs. But it did so in a parallel fashion so a multiprocessor system would be more efficient with PMAC than, say, with CBC-MAC. All three of these built a MAC for large messages by constructing a PRF for large messages. And then we looked at the Carter-Wegman MAC which is actually not a PRF. It's a randomized MAC so a single message could actually have many, many different valid tags and therefore a Carter-Wegman MAC is actually not a PRF. And if you remember, the Carter-Wegman MAC was built by first of all, taking the bulk message and hashing it down to a small tag using a fast one-time MAC and then encrypting that tag using a PRF. The benefit of the Carter-Wegman MAC was that, as we said, the hashing of the bulk message is done using a fast one time MAC. And then in this module, we're gonna construct MAC-s from this new concept called collision resistance. And so the first thing we're gonna do is construct collision resistant hash functions. So let's first of all start by defining what does it mean for a hash function to be collision resistant. So think of a hash function from some message space to a tag space T, and you should be thinking of the message space as much, much bigger than the tag space. So the messages could be gigabytes long, but the tags would only be like 160 bits. Now a collision for the function H is a pair of messages m0, m1, that happen to be distinct. However, when you apply the function H to them, you end up with the same output. So the image you should have in your mind is essentially there are two inputs, m0 and m1, and they belong to this gigantic message space. However, when we apply the hash function to them, they happen to collide and they both map to the same output in the tag space. Now we say that the function is collision resistant if it's hard to find collisions for this function. Now this should seem a little bit counterintuitive because. We know that the output space is tiny compared to the input space. So, by the pigeonhole principle, there must be lots and lots and lots of messages that map to the same output. Just because there isn't enough space in the output space to accommodate all the messages without collisions. And so, we know that there are lots of collisions, and the question is, is there an efficient algorithm that finds any such collisions explicitly. So we say the, the function is collision resistant, if, for all explicit efficient algorithms A. And these algorithms are not able to print the collision for the function H, okay? And as usual, we'll define the advantage as the probability that the algorithm A is able to output a collision. Now I'm not gonna formalize a term explicit here. All I'll say is that it's not enough to just say that an algorithm exists, and algorithm that simply prints a collision. Cause we know many collisions exist. What we actually want is an explicit algorithm that we can actually write down and run on a computer to generate these collisions. There are ways to formalize that, but I'm not gonna do that here. Now a classic example of a collision resistant hash function is SHA-256 which happens to output at 256 bits but can take arbitrary large input. For example, it can take gigabytes and gigabytes of data and it will map it all to 256 bits. And yet nobody knows how to find collisions for this particular function. So just to show you that this concept of collision resistance is very useful, let's see a very quick application for it. In particular, let's see how we can trivially build a MAC given a collision resistant hash function. So, suppose we have a MAC for short messages. So you should be thinking something like AES, which can MAC sixteen byte messages. And then, suppose we have a hash function, a collision resistant hash function from a large message space, that contains gigabyte messages into our small message space. Say, into sixteen byte outputs. Then, basically, we can define a new MAC, let's call it Ibig, which happens to be MAC-ing large messages. And we'll define it simply by applying the small MAC to the output of the hash function. And how do we verify a MAC? Well, basically, given a tag we verify it by rehashing the given message and then checking that small MAC actually verifies under the given tag. Okay, so this is a very simple way to show you how collision resistance can take a primitive that's built for small inputs and expand it into a primitive that's built for very large inputs. And in fact we're going to see this not just for MAC-s. Later on when we talk about digital signatures, we're going to do the same thing. We're going to build a digital signature scheme for small inputs and then we're going to use collision resistance to expand the input space and make it as large as we want. So the security theorem basically isn't something that's trivial here. Basically it says if the underlying MAC is secure and H is collision resistant, then the combination which can actually MAC large messages, is also a secure MAC. And as a quick example, let's apply this to AES. So let's use the one example that we mentioned, SHA-256. So in this case SHA-256 outputs 256 bit outputs, which is 32 bytes. So we have to build a MAC that can MAC 32 byte messages. And the way we could do that is basically by applying the sixteen byte AES, plugging it into a two block CBC. A two block CBC would expand AES from a PRF on sixteen bytes to a PRF on 32 bytes. And then take the output of SHA-256 and plug it into this two block CBC based on AES. And then we get a very, very simple, MAC which is secure assuming AES is a PRF and SHA-256 is collision resistant. So one thing I wanted to point out is that in fact collision resistance is necessary for this construction to work. So in fact, collision resistance is not just a made-up term. Collision resistance really is kind of the essence of why this combined MAC is secure. And so let's just assume for a minute that the function H, the hash function we're using, is not collision resistant. In other words, there is an algorithm that can find two distinct messages that happen to map to the same output. In this case, the combined MAC. Is not going to be secure because what the adversary can do is simply use a chosen message attack to get a tag for m0. And then output m1 comma that tag as a forgery and indeed T is a valid MAC for m1 because H(m1) happens to be equal to H(m0). And so in doing so just with a one chosen message attack the attacker was able to break this combined MAC simply because the hash function was not collision resistant. So it should be, again I want to emphasize that if someone advertised even one collision for SHA-256, you know two messages, just one pair of messages that happen to have the same output under SHA-256 that would already make this construction insecure cause an attacker could then ask for the tag on one message and in doing so he would obtain the tag on the other message as well, and that would count as an existential forgery. Okay, so already, we see the collision resistance is a very useful primitive, in that it lets us expand the input space of cryptographic primitives. I wanna show you one more application where a collision resistance is directly used for message integrity. Imagine again, we have files that we wanna protect. Let's imagine these files are actually software packages that, we wanna install on our system. So here are three different software packages. You know, maybe one is GCC, on is Emacs, and another one is, I don't know, VI. Now the user wants to download the software package and he wants to make sure that he really did get a version of the package that he downloaded and it's not some version that the attacker tampered with and modified its contents. So what he could do is basically refer to a read-only public space that's relatively small. All it has to do is hold small hashes of these software packages. So there isn't a lot of space needed here. The only requirement is that this space is read-only. In other words, the attacker cannot modify the hashes stored in this space. And then, once he consults this public space, he can very easily compute the hash of a package that he downloaded. Compare it to the value in the public space. And if the two match. Then he knows that the version of the package he downloaded is, in fact, a correct one. Why is that true? Well, because the function H is collision resistant. The attacker cannot find an F1 path, they would have the same hash as F1. And as a result, the attacker cannot modify F1 without being detected because there's no way that the hash of his F1 [inaudible] would map to the value that's stored in the public space. So, the reason I brought up this example is, I wanted to contrast this with the MAC example. We kinda saw a similar situation with MAC-s. In the MAC example, we needed a secret key to verify the individual file tags. But we didn't need a resource that was a read only public space. With collision-resistant hashes, we kind of get the exact compliment where in fact we don't need a key to verify. Anyone can verify. You don't need a secret key for it. However, now all of a sudden we need this extra resource which is some space that the attacker cannot modify. And, in fact, later on, we're gonna see that with digital signatures, we can kind get to the best of both worlds, where we have both public verifiability, and we don't need a read only space. But so far, with either MAC-s or collision resistance, we can have one, but not the other. So, I'll tell you that, in fact, this kind of scheme is very popular. In fact, Linux distributions often use public space where they advertise hashes of their software packages. And anyone can make sure that they downloaded the right software package before installing it on their computer. So this is, in fact, something that's used quite extensively in the real world. Okay, so in the next segment we'll talk about generic attack on collision resistance and then we'll go ahead and build collision resistant hash functions.