0:00

Over the years it became clear that DES and triple DES are simply not designed for

modern hardware and are too slow. As a result, NIS started a new process to

standardize in a new block cypher called the Advanced Encryption Standard or AES

for short. Nis started it's effort in 1997 when it requested, proposals for a new

block cipher. It received fifteen submissions a year later. And finally in

the year 2000, it adopted the cypher called Rindall as the advanced encryption

standard. This was a cypher designed in Belgium. We already said that it's block

size is 128 bits and it has three possible key sizes. 128 bits, 192, and 256. Now,

the assumption is that the larger the key size is, the more secure the block cipher

is as a pseudo random permutation. But because it also has more rounds involved

in its operation. The slower the cipher becomes. So the larger the key supposedly,

the more secure the cipher, but also the slower it becomes. So for example, AES 128

is the fastest of these ciphers and AES 256 is the slowest. Now AES is built as

what?s called a substitution permutation network. It's not a Faistl network.

Remember that in a Faistl network, half the bit were unchanged from round to

round. In a substitution permutation network all the bits are changed in every

round. And the network works as follows, so here we have the first round of the

substitution permutation network, where the first thing we do is we X or the

current state with the round key. In this case the first round key. Then we go thru

a substitution layer, where blocks of Date are replaced with other blocks based on

what the substitution table says. And then we go through a permutation layer where

bits are permuted and shuffled around. And then we do this again. We XR with an X

round key, we go thru a substitution phase, and we're permute to dance around.

And so on, and so on, and so forth Until we reach the final round where we x or

with the very last around key, and then out comes the output. Now, and important

point about this design. Is that, in fact, because of how it's built, every step in

this network needs to be reversible, so that the whole thing is reversible. And so

the way we would, decrypt, essentially, is we would take the output and simply apply

each step of the network in reverse order. So we start with the permutation step, and

we have to make sure that step is reversible. Then we look at the

substitution layer, and we have to make sure this step is reversible. And this is

very different from DES. In DES, if you remember, the substitution tables were not

reversible at all. In fact, they map six bits to four bits. Whereas

here, everything has to be reversible, otherwise it would be impossible to

decrypt. And of course the x or with the round key is reversible as well. Okay? So

inversion of a substitution permutation network is simply done by applying all of

the steps in the reverse order. So now that we understand the generic

construction. Lets look at the specifics of AS. So AS operates on a 128 bit block.

Which is sixteen bytes. So what we do with AS is we write those sixteen bytes as a

four by four matrix. Each cell in the matrix contains one byte. And then we

start with the first round so we X over with the first round key and then apply a

certain function. That, includes substitutions and permutations and other

operations on the state. And again, these three functions that are applied here have

to be invertible so that in fact the cypher can be decrypted. And then we xor

with the next round key and we do that again. Again we apply the round function

and xor the round key. And we do that again and again and again. We do it ten

times. Although interestingly in the last round, the mix column step is actually

missing. And then finally, we XOR with the last rounds key, and out comes the output.

Again, in every phase here, we always, always, always keep this four by four

array. And so the output is also four by four, which is sixteen bytes, which is 128

bits. Now the long key themselves of course come from a sixteen byte AS key

using key expansion. So the key expansion maps from a sixteen bytes AS key

into eleven keys, each one being sixteen bytes. So these keys themselves are also a

four by four array that's XORed into the current state. Okay, so that's the

schematic of how AES works. And the only thing that's left to do is specify these

three functions, byte sub, shift row, and mixed column. And those are fairly easy to

explain. So I'm just gonna give you the high level description of what they are.

And, those interested in the details can look it up online. So the way byte

substitution works, is literally it's one S box containing 256 bytes. And

essentially, what it does is it applies the S box to every byte in the current

states. So, let me explain what I mean by that. So the current state is gonna be

this four by four, table. So here we have the four by four table. And to each

element in this table, we apply the S box. So let's call it the A table. And then

what we do is, essentially, for all four by four entries, essentially, the next

step, Aij. Becomes the current step evaluated at the look up table. So we use

the current cell as an entry, as an index, into the look up table. And then the value

of the look up table is what's output. Okay. So, that's the first step. The next

step that happens is a shift pro step, which is basically just a permutation. So

essentially we kind of do a stick lick shift on each one of the rows. So you can

see the second row is stick lick shifted by one position. This third row is

[inaudible] shifted by two positions and the third row is [inaudible] shifted by

three positions. And the last thing we do is mix columns where literally we apply a

linear transformation to each one of these columns. So there's a certain matrix that

multiplies each one of these columns and it becomes, the next column. So these

linear transformation is applied independently to each one of the columns.

Now, I should point out that, so far, shift rows and mixed columns are very easy

to implement in code. And I should say that the [inaudible] institution itself is

also easily computable, so that you can actually write code that takes less than

256 bytes to write. And you can kind of shrink the description of AES by literally

storing code that computes the table rather than hardwiring the table into your

implementation. And in fact, this is kind of a generic fact about AES, that if you

can have allowed no pre computation at all, including computing the S box on the

fly. Then in fact you get a fairly small implementation of AES, so it, it could fit

on very constrained environments where there isn't enough room to hold,

complicated code. But of course, this will be the slowest implementation, because

everything is computed now on the fly, and as a result, the implementation,

obviously, is gonna be, slower than things that were pre-computed. And then there is

this trade off. For example if you have a lot of space, and you can support large

code. You can actually precompute quite a bit of the three steps that I just

mentioned. In fact, there are multiple options of pre-computation, you can build

a table that's only four kilobyte big. You can build a table that is even longer,

maybe 24 kilobytes. So basically you will have these big tables in your

implementation. But then your actual performance is going to be really good,

because all your doing is just table look-ups and XORs. You're not doing

any other complicated arithmetic. And as a result, if you can do a lot of

pre-computation, these three steps here, [inaudible] should. [inaudible] and mixed

columns can be converted just into a number, a small number of table lookups

and some [inaudible]. All you can do is just compute the S box, so now your

implementation would just have 256 bytes. Hard coded The rest would just be code

that's actually computing these three functions. The performance would be slower

than in the previous step but the code footprint would also be smaller. So in

overall, there's this nice tradeoff between code size and performance. So on

high-end machines, on high-end servers, where you can afford to have a lot of

code, you can precompute and store these big tables and get the best performance.

Whereas on low-end machines like eight byte smart cards or think of like an eight

byte wristwatch, you would actually have a relatively small implementation of AES.

But as a result of course it won't be so fast. So here's an example that's a little

unusual, suppose you wanted to implement AES in Javascript so you can send an AES

library to the browser and have the browser actually do AES by itself. So in

this case what you'd like to, to is you'd like to both shrink the code size, so that

on the network there's minimum traffic to send a library over to the browser but, at

the same time, you'd like the browser performance to be as fast as possible. And

so this is something that we did a while ago essentially the idea is that the code

that actually gets send to the browser doesn't have any pre computer table and as

a result is fairly small code. But then the minute it lands on the browser, what

the browser will do is actually pre compute all the tables. So in some sense

the code goes from just being small and compact. It gets bloated with all these

precomputed tables. But those are stored on the laptop, which presumably has a lot

of memory. And then once you have the precomputed tables you actually encrypt

using them. And that's how you get the best performance. Okay? So if you have to

stand in implementation AES over the network, you can kind of get the best of

all worlds. Whereas, the code over the network is small, but when it reaches the

target client, it can kind of inflate itself. And then get the best performance

as it's doing encryption on the clients. Now AES is such a popular block cipher,

now essentially when you build crypto into products essentially your supposed to be

using AES, and as a result Intel actually put AES support into the processor itself.

So since Westmere there are special instructions in the Intel processor to

help accelerate AES. And so I listed these instructions here. They come in two pairs,

aesenc and aesenclast. And then, there's aesekygenassist. So, let me explain

what they do. So, aesenc essentially implements one round of AES. Namely, apply

the three functions in the XOR with the round key. And aesenclast basically

implements the last round of AES. Remember, the last round didn't have the

mix columns phase. It only had the subs bytes and shift rows. And so that's why

the aesenclast does. And the way you call these instructions is using 128 byte

registers which correspond to the states of AES. And so you would have one register

containing the states and one register containing the current round key. And then

when you call AES on these two registers, basically, they would run one round of AES

and place the result inside of this XMM one state register. And as a result if you

wanted to implement the whole AES. All you would do is, call aesenc nine times. Then

you would call aesenclast one time. These ten instructions are basically the entire

implementation of AES. That's it. It's that easy to implement AES on this hardware

and they claim because these operations are now done inside the processor not

using external instructions that are implemented in the processor. They claim

that they can get a fourteen X speedup over say an implementation that's running

in the same hardware but implementing AES without these special instructions. So

this is quite a significant speed up and the facts are now lots of. Products that

make use of these special instructions. And let's just say that this is not

specific to Intel, if you're in [inaudible], and they also implemented

exactly kinda similar instructions in their bulldozer architecture and further

and future architectures. Okay, so let's talk about the security of AES. I wanna

mention just two attacks here. Obviously, AES has been studied quite a bit. But the

only two attacks on the full AES are the following two. So, first of all, if you

wanted to do key recovery, the best attack, basically, is only four times

faster than exhaustive search. Which mean that instead of a hundred and. 28 bits

key, really you should be thinking of AES. Is 126 bit key. Cause exhaustive search,

really it's gonna four times faster than it should. Of course due to the 126, it's

still. More time than we have to compute, and this really does not hurt the security

alias. The more significant attack is, actually, on AES-256. It turns out there's a

weakness in the key expansion design of aes which allows for, what's called a

related key attack. So, what's a related key attack? Essentially, if you give me

about two to the 100 input output pairs for aes, but from four related keys. So,

these are keys that are very closely related, namely key number one. Is just

one bit flip of key #two, which is just one flip, bit flip of key #three, which is

just one flip bit flip of key #four. These are very closely related keys, if you like

your [inaudible] distances very short. But if you do that, then, in fact, there is a

two to the 100 attack. Now you should say, well, two to the 100 is still impractical.

This is still more time than we can actually run today. But nevertheless, the

fact that it's so much better than an exhaustive search attack, it's so much

better than two to the 56, is kind of a limitation of the cipher. But generally,

it's not a significant limitation, because it requires related keys. And so, in

practice, of course, you're supposed to be choosing your keys at random, so that you

have no related keys in your system. And as a result, this attack wouldn't apply.

But if you do have related keys, then there's a problem. So this is the end of

the segment, and in the next segment we're gonna talk about more provably secure

constructions for block ciphers.