The overview of the module is the following. First, we remind ourselves of the history of this topic. Which goes back quite far with Gauss inventing fast Fourier transform, even before Fourier defined it. Then we will look at small examples, just to get the feeling of what is going on with this matrix-vector multiplications that can be sped up by the fast Fourier transform or FFT algorithm. And we present the most famous case which is the Cooley-Tukey FFT. And we do one example great detail which is the so-called Decimation-in-Time FFT for lengths 2 to the N fast Fourier transforms. And we conclude. So Fourier, of course, had the Fourier transform. If you are remembered of Fourier series, that's an invention by Joseph Fourier. And clearly if you compute Fourier series, it helps to have a fast algorithm but at that time it wasn't a discrete time problem. It was a continuous time problem with Fourier series. But it turned out another great mathematician, Gauss, had a similar problem computing some trajectories of planets. He had to compute trigonometric series and he was doing them by hand. He figured out a quick way to compute trigonometric series when the frequencies where actually harmonic. And it turns out this was essentially the fast Fourier transform already, even before Fourier defined Fourier series. So that leads us back to 1805, when Gauss wrote down in his personal notebook in Latin, a quick way to compute trigonometric series. Fourier, as we know, invented Fourier series just a little bit later. Then a lot of people started computing Fourier series, spending a lot of time developing tricks. And in 1958, an English statistician, Good, came up with an algorithm which is still used but which is not the most popular one. The most popular one was invented in 1965 by Cooley and Tukey. And it's a re-discovery of the fast Fourier transform algorithm for N, when N is a power of a prime, typically is a power of 2 or a power of 3. After this, Shmuel Winograd combined all the methods in giving the most efficient possible fast Fourier transform algorithms. Okay, so the DFT matrix. We have seen this object before. We need an Nth root of unity. That's here e to the -j2 pi over N. And we often write just W because WN becomes a little bit heavy in the matrix context. So powers of N can be taken modulo N because of this relationship. So why is this true? Well, if you raise this to the power N, it will take out the divider by N, and this is equal to 1, okay? So we can take the powers always modulo N, very important point. This will lead to a lot of simplification. So a DFT matrix of size N by N has entries. Remember, the powers go from 0 to N-1, from 0 to N-1, and here from 0, 1, 2, 3, N-1, etc. And we see this very particular matrix with these different entries and we are going to look at a few examples just to become very familiar with this structure. Okay, so when n is equal to 2, we have e to the -j2 pi over 2. And this is equal to -1, okay? So the first line of the W2 matrix is 1 and 1, the second line is 1, -1. Okay, no sweat. So the second case, W3, has powers, here, 1 and 2, 2 and 4. We take modulo 3, so you can see the 4 becomes a 1 here. So this matrix is already a little bit simpler than that one and here we wrote out the actual values in terms of complex numbers. So this is simply e to the -j2 pi over 3. 4 pi over 3, 4 pi over 3, and here again, 2 pi over 3. W4, okay, so what is W4? It's e to the -j2 pi over 4 which is e to the -j pi over 2. Which of course, is equal to -j, okay? And this is a matrix W4, written out as given in the general formula. Here we take modulo 4, so the W4 becomes a 1. W6 becomes a 2, okay. W6 becomes a 2, and W9, modulo 4 is just W, okay? And now we know that W is -j so we can write this out. And this matrix is very, very simple. First line is all 1's, first column is all 1's. Then -j, -1, j, 1, -1, 1, j,-1, -j. Very simple matrix. Now, when you see this matrix you really notice that if I multiply W4 times a vector x of length 4, I don't need many multiplications, right? I mean, for example, here it's just a sum. Here I have to exchange real and complex imaginary entries. Here I have to change the sign. Here I just have to take this sum but with change of signs, etc. So W4 times x is not going to require any multiplications, okay, only additions and subtractions. And interchanges of real and imaginary parts. So the reason we go through this small matrix is to see that there is a lot of symmetry, a lot of structure that can be taken advantage of. Okay, W5 is more complicated, okay? It doesn't have these obvious simplifications that's only in modulo 5. Simplifications of, for example, this character here, well, W16 modulo 5, the closest modulo 5 is 15. So that be W and so on. Okay, now this character does not have an obvious simplification. That requires quite a bit more thinking to see what structure is. But so be it for now. And just one more, W6 starts to be a momentous matrix with a lot of entries. This is when you take the exponents modulo 6. Okay, so what is the idea behind the fast algorithm for the discrete fourier transform. It is this latin phrase, divide et impera, or divide and conquer. That usually is attributed to Julius Caesar. It's the idea that you take a large problem. You divide it into subproblems that you have a better handle on, okay? So Take a problem, split it into two subproblems. And if you could do this once, you can do it again. And assumes these are simple problems by now. And so, simple problems are for example, w4, right. If we could take large problem and break it down into Four small problems like W4, right. We know this will be very simple to do. Now that we have the simple problems but you split the bigger problems into simple ones, we need emergenal operation to get some intermediate solution. You can do this here also. And then you have to undo the split you did here. And that will be the merger which gives you the solution, okay. And so the key idea is how to do this split and merge. And if you know how to do it once, that was one of the steps. And you can do it a second time and then combine again and then you have the result. In this case we enter with vector x, small x. That's a finite dimensional vector, let's say of size N. And you end up with DFT coefficient, which is a finite dimensional vector of size N which contains the result of the matrix vector multiplication. Okay, so is that going to help us? Okay, so if we don't have a good trick to compute W times X it's a matrix vector product. The matrix has size N by N, x of size N, of course. Then the complexity is order N squared. Okay, and so to Fourier, you have to take N products between lines with a column. So inner products that are N of them, each one takes order and computation. That's an N squared, a quadratic complexity problem. Okay, but when we study the problem when N is the power of 2 and we cut the problem into two problems of size N/2 and if the complexity is quadratic, then the problems of size N/4 have complexity, sorry, of the problems of size N/2 have complexity N squared/4. We might have some complexity that is involved in recovering of the full solution that might be a complexity order N. So our solution now needs 2 times N/4 times 2. That's N/2, that's N squared/2 + N, that's to do the recombination. Okay, now if N is bigger than 4, this is better than N squared, as you can immediately verify, okay? It's not a big saving, okay? It's a saving by essentially, a factor of 2 here because the dominant term, of course, is the N squared. Instead of having the original N squared complexity, we have the N squared over 2 complexity. Okay, so let's show this graphically. Take a DFT, cut it into into 2 pieces of size N/2. Compute two DFT's of size N/2. Merge the two results. Good, so let's do this. So we start with x, it's a size N vector. Okay, that entry, x[0], x[1], etc., up to x[N-1]. Split it into two subproblems. Okay, the upper and the lower halves, and compute DFTs of size N/2 here and there. Do the merging and get the result here, x also of size N. Okay, and we just saw the original problem, this problem, if we don't do anything special about it, it has complexity N squared, whereas these problems here have complexity N squared/4. This guy also, N squared/4. And so this overall thing now, after split and merge, has complexity N squared/2. Now comes the magic trick. If we knew how to divide the problem once and save some computation, well let's just do it again. So divide and conquer is reapplied and since N is the power of 2 we can cut the 2 problems of size N/2 into 4 problems of size N/4 and we can keep going. Again, there might be some complexity to recover the full solution. Assume the complexity is order N, as we will see in detail. Well, this division can be done order log 2 of N times, okay, because we divide N into N/2, N/4, N/8, etc. And if you do this log N- 1 times then you get sizes of problem 2, okay. Which are very simple if you remember W2 is very simple matrix which takes 1 addition to compute the first line versus times of vector N1 subtraction for the second line. Okay, so each time we have this order N complexity to do the merging, and we get this magic result that an algorithm that uses divide and conquer at every step divide the problems into half problems over and over again. We'll have a complexity N logging base 2 of N. And of course if N is large, this is much, much better than N because log 2 N is much, much smaller than N, okay. Very good. So graphically, again, split the DFT into sizes, which are of size N/2, N/4, N/8, for example. In this case, you end up, so you'll have 2 problems of size N/2, 4 problems of size N/4, 8 problems of size N/8, so this leads us to compute 8 DFTs of size N/8. Merge the results into DFTs first of size N/4. So that's the two N/8 becomes this. And this becomes N/2. And this becomes a result of size capital N. Graphically, this is a very complex diagram but you don't have to draw it yourself. You can just admire it here on the slide, okay? So you take the vector x of size N. You split into 2 problems of size N/2. You split into 4 problems of size N/4. And finally 8 problems of size N/8. Okay, so we have the original DFT of size N, which has been divided into smaller DFTs of size N/8, okay, exactly 8 of them. Then you have to merge this, and this will have complexity order N, as we announced. We'll see how this concretely done in an example. And you can see in this particular case, we have 3 times N + N times DFT of size N/8. Okay, samples of the complexity of this particular algorithm. After having seen a conceptual picture of the divide and conquer algorithm for the DFT, let's look specifically at an algorithm for the so-called division or decimation in time DFT. So, we start with the usual formula for the DFT. So it is summation over capital N terms of the time domain sequence times the root of unity raised to the nkth power. We have seen this many times now. To start the algorithm, a good guess is half of the answer. Okay, so a good guess please. Remember the good guess, it will help in the future. So the good guess is that we want to divide the problem into two halves. And we could do this easier by taking the first half of the sequence and the second half. Or because it's called decimation in time, we are going to take the even index samples and the odd index terms. That's why this is called decimation in time, like we have seen in certain operations in single processing on sequences. So we take the sequence and we derive two sequences, x[2n] and x[2n+1]. Now, the input was broken into even and odd index terms. So on the output side the DFT results, or that capital X[k] we break into the first half, X[k], plus the second half, X[k + N/2]. Where k now runs only from 0 to capital N over 2-1. So we did two things here, we took even and odd indexes. So let's call this the even ones and the odd ones that was on the input. And on the output, we took the first half and the second half, all right? And we now look at the even and odd are input separately, so this would be a bit of formula menagerie here. So let's write X[k] as the first half of the sum, so N goes from 0 to N over 2-1. And the second half, which is the odd index terms, okay, so the even guys, the odd guys. And we do some rearrangement here. It's very simple, we expand the exponent. And the exponent, of course, is now W2nk+k. Then we move this part here in front, okay? Because it doesn't depend on the summation. Because the summation is over n, so it only concerns this part. So first summation has not changed. It's the same from first line to second line to the last line. We'll just see this in a second. The only thing is that W2nk, well, we can write this W2nk is e to the -j2 pi over N times 2N, okay? And this is equal to e to the- j2 pi over N over 2. Times, sorry that was an nk here, times nk. And this, of course, is equal W of root N over 2 times nk. All right, so this is the simplification we have done here. And this will be very important because this is simply a DFT of size N over 2. Over the even index samples here with respect to root of unity of order and over 2. Okay, so here we identify a DFT which is of size N over 2. Now, the second half has exactly the same effect. Simply, we have this pre multiplication but it's outside. But the summation inside is exactly, again, a DFT of size n over 2. And this DFT is simply taken over the odd index samples, okay? So, we write this more compactly as two transforms, X'k, and Wk times X"K. And these are the results of these DFTs of size N over 2 over the even guys. And the DFT of size N over 2 of the odd guys, okay? So this is exactly this divide of the problem into two sub problems. In words, we can compute X, the DFT of x, with two half size DFTs, which we have called X', X''. Multiply the second DFT by Wk, k going from 0 to N over 2-1, and adding the results, okay? So it would be x, first half here will be X'+W to the k times X", and this for indices over the first half. The next step is, that we are going to consider separately the first half of the output and the second half, okay? So the first half, we just keep the formula we have shown on the previous two slides, okay. So X[k] is the sum of X' and X''. There is a weight here Wk, k goes from 0 to N over 2-1. For the second half, k still goes from 0 to N over 2- 1. We simply add the shift here by N over 2. So we just write out the formula as we had, and it simply creates this little elements here in the exponent. And we have to worry what this will do, okay. Now, we remember that W N over 2 is e to the -j2 pi over N times N over 2. This is equal to e to the –j pi, which is equal to –1, okay? And because of this, we can go to the following simplifications. So in front, we have an N over 2 here, that is at the power of W N over 2. Okay, so there we remember that this is equal to 1. We saw this at the very beginning. So this thing disappears and the first half here is very simple, okay? What do we recognize here? It's a DFT of size N over 2, nothing too surprising here. And this character here we just saw before, just gives us -1 in front. And so instead of the plus that we have above here we have the minus sign, so W to the k, And last but not the least we do the same simplification here. So this falls out, it's an integral power N/2 of, root of order N/2. So this is the second DFT, okay? So what we end up having is something very simple is that the second half looks very much like the first half except that we have a change of sign here, okay. We are going to put these in words. We can compute X[k] and X[k+N/2] where k goes from 0 to N/2- 1 by dividing the input into even and odd indexed samples. Compute two DFTs of size N/2, that's the division. Multiplication of the output of the second DFT by WK, that would cost us N/2 multiplications. Combine the output with sum and difference, that's what we have just seen on the last slide. Graphically, let's draw this specifically on lengths 8 DFT. So the input is x[0] to x[7]. We have here DFTs of size 4. Then, the second output, we multiply by W N to the k. So that's W N 0, 1, 2, 3. Then we take sum here and difference. That gives us X[0] plus X[0] plus N/2, that's X[4]. That's one combination. Then plus minus gives us X[1] and X[5] etc until we get x(7), all right? And so we have very clearly mapped out this division, so it takes the input. So we start with x(0), x(1) to x(7) and we divide x(2) will come here, etc, x(3) will go there, x(4) goes there, x(5) goes there, x(6) goes there, x(7) goes there. Okay, so we have done the division of the input into two halves, the even and the odd. We have taken two DFTs of half size. We have weighted the output of the second DFT by these complex numbers. We have done sum and differences and we get our final result here, capital X, okay? What was the cost? Well, we have 2 DFTs to compute of size 4 in this case. Then we have N/2 multiplications. And we have N, additions, okay. So what is the complexity now? Splits the DFT into 2 pieces of size N/2. This is essentially free. Compute 2 DFTs of size N/2, but if we don't know anything better, that's going to be N/2 squared which is N squared/4 times 2 it's N squared/2. You've seen this before. We merge the two results that's N/2 complex numbers multiplication by W k. So a total of N squared + 2 + N/2, which is indeed smaller than N squared for N large enough. So in general, it's how the complexity of the initial problem, okay? Why, because we have moved from N squared into 2 times N/2 squared which is N squared/2, all right? That's if we do only one step of division and merging. Of course, as announced, we can repeat this, okay? And we'll repeat this until we get to trivial, small DFTs of size 2, so sum and difference. This requires log in base 2/N- 1 steps. Each step requires a merger, so a merger as we have seen requires N/2 multiplications and additions. So the total now is N/2, log in base 2 to N- 1 multiplications and N log 2 times of N times N additions. So savings is of the order of log2 N/N, and the key result that I want everybody to remember because it's such an important result a DFT of size N requires order N log2 N operations. Why do I say it's such a key result? Because essentially signal processing has lived and died by fast algorithms. So the invention of the fast Fourier transform, this algorithms that I showed you by Cooley and Tukey in 1965 created an explosion of interest in digital signal processing. Because all of a sudden, computations which seemed very hard before were actually feasible even on very simple computers at some time, okay? So the birth of digital signal processing is really linked to the invention or the rediscovery of the fast Fourier transform algorithm. Okay, now we can see this as matrix factorizations. Okay, so let's see this first on a DFT of size 4. What do we do? We separate in even and odd samples then we compute two DFT's, we compute the sum and difference. Okay we've wrote it out here I'm not expecting you that you understand this immediately by looking at 4 matrices of size 4 by 4, so let me try to explain. This matrix here does the division into even and odd. Okay, how do we see this? The first entry takes X1, sorry X0, the second entry here will take out X2. The third entry takes out X1, and the last one X3. Okay, so now we have separated odd and even. Then we do 2 DFTs of size 2, right? Remember the initial problem is of size 4, so half size problems are of size 2. These are DFTs of size 2. And then we have to multiply the outputs by the roots of unity meaning the outputs of this character here. So that's these two guys. And we have to recombine them by sum and difference and this was combined into this matrix here which does both the multiplication here by the roots of unity and the sum and difference. And this is exactly W4, this matrix we had seen before. This uses 8 additions, no multiplications. Right over the additions here, 1, 2, 3, 4, 5, 6, 7, 8. Okay, 8 additions no multiplications. So a DFT of size 4 is very cheap, please remember. Okay, if we could do it for N = 4, it's tempting to do it for N = 8. Now, this is going to be big, and it's going to be too big for a single slide. So we first write down what W8 is. It's a usual matrix, first line and first column of 1s. And then successive rules of unity to power 1, 2, 3, 4 up to 7. Then even powers etc up to W49. So step 1 is to separate even and odd index samples. This is exactly what we have seen before. This picks out the even sample. This picks out the odd samples. This we call D8, or a decimation of size 8 matrix. It requires only index changes, no arithmetic operations. Okay, then we want two DFTs of size 4. Okay, now we're not going to factorize these guys, it's already long hand, just a couple of slides ago but you recognize a DFT 4 here. So this is a 0 matrix and we write this compactly as a matrix which is a block matrix with two W4 matrices and two zero matrices. Okay, now, this requires two DFTs of size four. So that's 16 additions, right. So far so good. Now we have to multiply the output of the second DFT of size four by Wk, with k going from zero to three, okay? So the first half is untouched. The second half is weighted by 1, W, W squared, W cubed, okay? And in compact notation, we can again use block matrices so this is an identity matrix of size 4. This is a diagonal matrix with the weights given by 1, W, W squared, W3. Okay, this is this character here. And we have again zero matrices of size 4, okay? This requires two multiplications, two complex multiplications, why? It's this multiplication, here's this one. W square is actually equal to minus j. And this doesn't require complex operations. It just requires interchanging real and imaginary [INAUDIBLE]. Okay, so so far we have 16 additions, two multiplications. Okay, now we have to recombine the final output. If you remember, it's sum and differences of X[k] and X[k+N/2]. So we do this for the first output, it's the sum here. Second output is a sum. Third output is a sum. Fourth is a sum. The fifth output is a difference. Again, a difference for the sixth, seventh, and eighth output. And this we can drive compactly by recognizing that these guys are identity matrices. And here there's simply a sign change. Okay, so in block matrix from we have this very simple formula. The identity of size four, identity of size four, identity of size four minus identity of size four. This requires eight additions. So if my calculations were correct we had 16 additions with our two DFTs of size 4. Then we have two complex multiplications, and now we have, again, eight additions. So these are additions and multiplications. So we have a total of 24 additions and two multiplications. Just as a summary, we rolled down this factorization. D8 is the separation into even and odd index terms. These are the two DFTs of size four. This is weighting by 1WW squared W cubed. This is the sum and difference to recover the output which is W8, okay? Please remember, this nice result here, that's a small DFT, has very low complexity. So we can write this out in a flowgraph view. I'm not going to go through this detail, but you can sort of see what is happening. We separate the inputs now, not just evens and odds because we do this twice. So we end up having x0 and x4 plus neighbors, x2, x6, x1, x5, x3, x7. Then we do the weight gains, the sum and differences and we keep going and here is a last sum and difference. Which indeed computes the result here. Okay so this flow diagram is everything we have seen before put into a flow diagram where everything moves from left to right. It gets weighted when we have a complex number next to it. And it gets summed or differenced when we have a merger of two arrows, okay? And you can count here the number of multiplications, you will get the same number of 24 adds and two multiplications. These operations are always on complex numbers. Okay, so you might say this was a lot of effort on a very small sized DFT. So let's see where this has a real impact in practice. So in image processing, if you have a digital camera, you take large images, 1,000 by 1,000 pixels or something like this. And out of these large images, small blocks are derived. Okay, so we take a large image. Okay, let's say 1024 by 1024 pixels to make it simple. And then you divide it into small blocks here. And the small blocks are of size 8 by 8, okay? On that 8 by 8 block we take a transform called the discrete cosine transform. [INAUDIBLE] transform, discrete cosine transform. And a DCT is essentially a real version of a DFT. I don't want to get into the trigonometric mess of defining the DCT in detail. We could do this in a supplementary lecture. For now let's just assume it's a real version of the DFT. And it has a fast algorithm that is very close to a DFT algorithm. So it's a little bit more complicated, little details, phase shifts and so on. But to a first view, it is like a real version of a discrete Fourier transform, okay? And if it takes a direct transform, well, 8 by 8 is 64. So there is a matrix of size 64 by 64 that multiplies a vector where we have the pixels from the 8 by 8 block. So 64 squared is 4,000 and something, 4,096. And for every block you have to take this transform. Okay, and each multiplication has to be computed with fixed point multiplication. It's expensive, takes time or hardware. Okay, now there is a first trick, that this transform is a so-called separable transform. So you can take your eight by eight block, okay let me try to make an approximation of an eight by eight block. Okay, sorry it takes me longer than it should [LAUGH] but that's an eight by eight block. It turns out this [INAUDIBLE] DCT transform you can take, transform over each line. Okay, that's eight transforms and transforms over eight columns. That's another eight transformed. Okay, so that's 16 DFTs let's say. Okay, now the algorithm we saw has two multiplications and so we have a total of 32 multiplications. 32 rather than 4,000, that's two orders of more than two orders of magnitude, right. Okay, so if you have a fixed point processor in your camera instead of working for a hundred seconds, it will work for half a second or so. So that's a big deal. If you have taken pictures with a digital camera you know that you like to take the next picture. And you don't want to wait for a hundred seconds. So this is a real big deal. And this is at the heart of an image compression algorithm called JPEG. And that we will discuss later in this digital signal processing class. Okay, it's time to conclude. The short answer to this module is don't worry, be happy. There is a Cooley-Tukey algorithm, very popular. That for N which is a power of 2 gives a complexity that is of order N lg and base 2 N, really big deal, okay? Maybe one of the most important algorithms ever invented. Now this is the one we explained here. I want to be clear that for every possible length, there is an FFT algorithm. It's not going to be a Cooley-Turkey. It's not going to be the simple divide and conquer in pieces which are house pieces of the initial problem, etc. But there are algorithms for every possible length. Even lengths which are prime, even lengths which are the product of different primes, etc. Okay, so when you compute an FFT you should never zero pad the vector so as to get the length that is a power of two. This is something that people do. And it's a mistake because you add zeros when maybe there shouldn't be any zeros. And you should just take the initial lengths and look at the package that is out there. Like for example the Fastest Fourier Transform in the West, FFTW. Which has a whole series of routines and works very well. There are also other the packages, one is called SPIRAL. That also computes very efficiently all sorts of different fast Fourier transforms. And if you use these packages, it will make a big difference. It will make differences by orders of magnitude, okay? And with this, I think we have finished this entire module on the Fourier transform. The Fourier series, and the DFT, so discrete Fourier transform. That we concluded with this topic on computational complexity and fast algorithms. That has made a big difference in the implementation of signal processing algorithms.