Learn the fundamentals of digital signal processing theory and discover the myriad ways DSP makes everyday life more productive and fun.

Loading...

来自 École Polytechnique Fédérale de Lausanne 的课程

数字信号处理

284 个评分

Learn the fundamentals of digital signal processing theory and discover the myriad ways DSP makes everyday life more productive and fun.

从本节课中

Module 6: Digital Communication Systems - Module 7: Image Processing

- Paolo PrandoniLecturer

School of Computer and Communication Science - Martin VetterliProfessor

School of Computer and Communication Sciences

We have seen in the previous module, that the key ingredients for

successful image compression algorithm are compressing at block level,

using a suitable transform to change the basis that represent our blocks,

smart quantization and entropy coding.

In JPEG these elements are implemented like so.

The images split into 8x8 non overlapping blocks.

Each block is transformed using the DCT, the discrete cosine transform.

Each DCT coefficient is that quantized using

psycovisually-tuned quantization tables, that we will analyze in just a minute.

And finally, the resulting bit stream is compressed using run-length encoding and

Huffman coding.

So we will now examine each of these components in more detail.

So the first step, as we said, is we take our image and

we divide it into 8x8 blocks.

Then, each block is transformed using the DCT.

Now, here in this image you see a detail of a transformed image.

You can imagine that this is one block, and this is another block, and

so on and so forth.

So we have a detail of about 20x20 blocks in the dog picture we used so far.

We're plotting here the magnitude of the DCT coefficients.

And if we zoom in, we can see that for

each block, only very few coefficients are significantly larger than zero.

We will exploit this fact to reduce

the amount of information used in encoding the image.

The first idea is to use the deadzone of the quantizer

to capture all the small valued coefficients and send them to zero.

The second point, however, is that with the remaining bit budget we're going to

allocate more bits to the most important coefficients in the transform.

The level of importance of each coefficient has been determined by running

psychoviual tests.

In these tests, subjects are asked to express a preference between two

images encoded with different quantization strategies.

So by running this experiment several times,

using different allocation strategies, one can arrive at an average allocation

strategy that works pretty well for all images and all viewers.

JPEG allows you to specify your own quantization table,

if you feel like running the psychovisual experiments at home.

But also comes with a default table, which is the one shown here.

The way this works is each entry in the matrix is the normalization factor

that we will have to use for the corresponding DCT coefficient.

And remember we're using a deadzone quantizer,

which corresponds to a rounding operator.

And so what we do is we take each DCT coefficient, say for index k1 and k2.

We divide it by the corresponding entry in the quantization matrix.

And then we round the result.

And this will give us the quantized DCT coefficient.

Now for instance, here, the DCT coefficient as 0, 0,

which is the DC component of the block, will be quantized with the factor of 16.

What means, is that there will be a deadzone extending from 16 to -16.

So the width of that zone,

which is also the width of the quantization interval is 33.

Everything that falls into this deadzone, will be quantized to 0.

Everything that falls in the zone from 16 to the next level, which is 16 plus 33,

99, will be quantized with the first level of the quantizer, and so on and so forth.

And the same thing for negative values of course.

So if the entries in this matrix specified the width of the quantization interval,

small values will denote a fine grain quantization and

therefore, more bits allocated to that DCT coefficient.

Whereas, large values like we have here, denote a very coarse quantization,

which means that most of the values will fall into the deadzone and will be 0,

pretty much all the time.

One may wonder whether it's worth it to go through the trouble of specifying

a nonuniform bit allocation strategy, and here is an example.

We have a JPEG encoded image at a very low bit rate, about 0.2 bits per pixel.

And on the left, we have a uniform quantifizer that

allocates the same number of bits to each DCT coefficient.

Very few bits per coefficient, given at the low bit rate.

And here you have psychovisually-tuned quantifizer.

And it's clear that the image on the right is visually much

better than the one on the left.

If you zoom in, you can see that for instance the details here,

around the edges, are much better when we use a psychovisually-tuned quantizer.

Because the resultant bit allocation manages to mask

the block artifacts of JPEG encoding more successfully.

Okay, now, so the coefficients are quantized, most of them will be zero.

And in general, they will decrease in amplitude as we move away from the 0,

0 index of DCT.

The trick here is to unroll the 2D representation of the DCT coefficients

into a string of quantized coefficients.

And the idea is to obtain a string where, in probability, the largest values

are in the beginning and the rest, at the tail of the string, is just zeros.

So to do that, we will operate a zigzag scan on the 8x8 DCT coefficient matrix,

that will create this long tail of zeros.

Graphically, it looks like this.

We start at the 0, 0 coefficient, and

then we start scanning the matrix in this direction here.

So as we proceed in the scan, we will reach this zone,

that if you remember from the quanitzation matrix before,

has very large factors anyway, so it is likely to be filled with zeros.

Here is an example for instance, suppose that we have DCT block and

that we quantize that block, and we obtain the following quantized DCT matrix.

So as expected, most of the values will be 0 and

there will be a few nonzero values that we have to encode.

If we unwrap this in a zigzag scan like this,

we get the following string of values.

So we have a few nonzeros scattered around here,

and most of the 64 values will be 0, as expected.

So now we have to try and code this string of numbers in a smart way.

The idea is to employ run-length encoding.

Each nonzero value in the sequence of quantized DCT coefficients,

will be encoded with three numbers, which we call r, s, and c.

r is the run-length,

which indicates the number of zeros that came before the current nonzero value.

s is the size of the nonzero value in binary format,

namely the number of bits that you have to use to encode that value.

And c is the actual value.

An additional trick is that if r and

s are both 0, that means that every subsequent coefficient is 0, and so

we can stop encoding the string of coefficients at that point.

So let's look at an example, and

assume that this is the matrix of quantized DCT coefficients.

Remember, the zigzag scan proceeds in this order here, like this.

We start at the first position.

The number of zeros that can be 4 is, of course, 0,

because this is the first number in our string.

The value of the number is 100, and so you will need 7 bits,

there is one implicit sign bit, so we don't count that, 7 bits to represent

a 100 because the logarithm in base 2 of 100 is a little bit more than 6.

Then the zigzag scan proceeds, and the second character that we encounter is -60.

So again, there were no preceding zeros.

60 will require 6 bits to encode, again, with explicit sign bit,

and this is our second symbol in the sequence.

Then we continue, and we start encountering zeros in our zigzag scan.

So we go here, then we go here, then we go here, then we go here, and

then we find the number 6.

So we had 4 zeros preceding the current nonzero value.

6 requires 3 bits and so this is the third triple in our sequence.

And then we continue, 1, 2, 3 zeros, and then we we get 13.

So 3, 13 and these four bits, and we got this.

Now we continue some more.

And we have that we have 8 zeros and then the value -1, which will require one bit.

And then, after this guy here, it's going to be all zeros.

So, instead of encoding everything separately,

we just put the pair 0,0, and we're done.

So this whole block of 64 coefficients,

is actually encoded by this sequence of triples.

By design in JPEG, the maximum number of bits that can be allocated to a single DCT

coefficient is 11.

Which means that s will be between 0 and 11,

which means that s can be encoded with 4 bits.

And similarly, by design, the maximum run-length is 15,

so r will be between 0 and 15.

In the rare case of a longer run,

we simply split that run into two or more shorter runs.

So that means that r, again, can be encoded over 4 bits.

So the pair r, s will use a total of 8 bits.

And therefore, it belongs to an alphabet where the size of the alphabet is 256.

If we don't do anything special, each pair r, s will require 8 bits for storage.

But it turns out that some pairs are much more common than others.

For instance, if you have a long run of zeros,

it is likely that the size of the nonzero coefficient that follows, will be small.

So a lot of space can be saved if we are smart and we encode

this symbol using the same philosophy that we have seen for the morse code.

In other words, we need to find which are the r,

s pairs that occur most often in the encoding of an image.

And then, encode these pairs with fewer bits than we would use for

less common pairs.

Now, if we start to use a variable number of bits to encode a symbol, we need

to know how to parse the resulting bit stream, to separate the different words.

Each symbol will be a binary word from now on.

So in English, for instance, we have spaces that separate words.

Although, we incur some waste because we need to define an extra symbol

to specify the boundaries between words.

In Morse code, the problem is exactly the same.

Different letters are separated by pauses in the transmission.

So you have gaps that identify the boundaries between letters.

And longer pauses, that identify the boundaries between words.

All that is wasteful because it requires extra space.

And the question is, can we do away with separators, and have a bit stream that is,

sort of self-parsing, without the need for an extra symbol?

The answer is yes.

And the way we go about doing this, is by using so-called prefix-free codes.

The fundamental intuition behind prefix-free codes

is that no valid sequence can be the beginning of another valid sequence.

This will be much clearer in just a second, when we see this graphically.

But the result is we can then parse a bit stream sequentially,

without the need to look ahead for the next symbol.

And be absolutely sure of the boundaries between different binary words.

It's very easy to understand this if we use a tree structure to

represent the code and to represent the the coding process.

So, this is a binary tree that represents the encoding of four symbols, A, B,

C, and D, with a prefix-free code.

The way this works is that we associate, at each node, two branches.

The upper branch is associated to the binary digit 0, and

the lower branch to the the binary digit 1.

The prefix-free condition is obtained by associating the symbols only to

leaves of the binary tree and never to intermediate nodes.

The binary word that encodes any of these symbols,

is obtained by following the path from the root of the tree to the final symbol.

So for instance, the symbol D, would be encoded by 101.

The fact that the symbols are only at the leaves of the tree,

implies that no symbol sequence is the prefix of another symbol sequence.

Now, when we get a binary string, we can proceed sequentially like so.

Suppose this is the binary string we want to decode,

we start pushing the binary digits on the tree.

And we will take the upper branch at each node, or

the lower branch, according whether we have a 0 or 1.

So we start, the first digit is a 0,

and the 0 takes us directly to the encoding for the symbol A.

So the first symbol in the sequence will be A.

The second symbol is 0, and it's exactly the same thing, so we have another A.

Then, we have a sequence that starts with 1.

So the first 1 will take us to this intermediate node, so we're not done.

And the second 1 will take us to a leaf that corresponds to the symbol B.

So the third symbol in the sequence is B.

Then 0 will give us A, 0 will give us A,

11 will give us B again, then 0 will give us A.

And then we have the sequence 101,

which, if we follow, will take us to an intermediate node, so we're not done.

0 will take us to an intermediate node, we're not done.

And then 1 will take us to the symbol D.

And finally, 100 will take us to the symbol C.

So as you can see,

the binary sequence is uniquely decoded into a sequence of symbols.

Now that we know how prefix-free codes work,

the goal becomes minimizing the total bit stream length.

And the intuition is rather clear.

For the symbols that appear the most often in the input sequence,

we should use short binary sequences.

For instance, in the previous example, we had a symbol that was associated to

the shortest possible binary sequence, just the digit 0.

While it would make sense to choose the most probable symbol for that position.

There is a famous algorithm by Huffman that allows you build the optimal

prefix-free binary code for a given set of symbol probabilities.

The JPEG algorithm provides you with a pre-canned Huffman code,

which is of course not optimal for every image that you encode.

But which performs reasonably well under most circumstances.

Alternatively, you can run the encoder,

compute the probabilities of each r, s pair, and build your own

Huffman code using the Huffman algorithm that we will see in just a second.

Of course, in this case, you will have to send the code along with an image to

inform the decoder on how to parse the bit stream.

And so there will be a side-information penalty that you will have to pay.

But sometimes,

when the image is very large, this could be advantages nonetheless.

Lets see how the Huffman algorithm works.

Suppose, once again, we have four symbols, and suppose

that we found that the probability table for the four symbols is the following.

The first symbol occurs with 38% probability, the second with 32%,

the third with 10%, and the fourth one with 20%.

The Huffman algorithm proceeds,

by selecting at each step the two symbols which are least probable.

So, in this case, it would be C and D.

And we build a subtree, where the two symbols are the leaves and

they are connected to an intermediate node.

Now, the idea is that the probability of going through this node, at this point,

will be the combined probability of the leaves.

So at next step of the algorithm we will have three probabilities,

the two symbols that are left in the pool, A and B.

And a composite symbol that will have a probability of 30%.

So, at the following step we select the two symbols from the set,

including the composite symbol, with the lowest probabilities.

And in this case it will be this and this.

And we repeat the process, so this is what we did at the first step and

now we're combining this node with this node, into another subtree.

Now here, B is a leaf, because it's a symbol.

And we get an intermediate node,

whose probability is the sum of the two children.

Again, we're left with a remaining symbol from the pool, with probability 38%.

And a composite symbol here, where you have 62% of total probability.

So at the last step, we finally combine the two remaining items.

And we have the final tree, where of course,

the probability at the root of the tree is is 1.

This binary tree represents the optimum prefix-free code for the set of

probabilities that we found experimentally from analyzing the frequency symbols.

And a JPEG encoder would use a similar tree to encode the r,

s pairs that were found in the run-length paths as the algorithm.