0:09

We see that it is not easy to measure the entropy or the uncertainty of a source.

There is typically structure, or there are dependencies, in the data, which means that there is redundant information that needs to be acknowledged when measuring the entropy of the source.

1:21

The answer is that estimating the entropy of a source is a very difficult task.

Shannon himself actually looked into this question in trying to estimate the

entropy of the English language in his 1948 seminal paper and in other papers.

So if we assume a discrete memoryless source (DMS) model for the language and we assume that all the letters are equally represented, that is, have equal probabilities, then this is the estimate of the entropy we obtain. It's about 4.76 bits per letter, if we include the space as the 27th letter.
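As a quick sanity check on that number, here is a minimal sketch; the only assumption is a uniform 27-symbol alphabet (26 letters plus the space):

```python
import math

# Entropy of a discrete memoryless source with 27 equally likely symbols:
# H = log2(27) bits per letter.
print(math.log2(27))  # ~4.75, which the lecture quotes as about 4.76
```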

2:00

We can do better than that.

We still use a DMS model, but now we go

and measure the probabilities of appearance of each individual letter.

And in this case, the estimate of

the entropy becomes smaller, 4.14 bits per letter.

We still know, however, that this is not a

very accurate model, because there are dependencies amongst letters.

There are combinations of letters that are more probable than

others and combinations that do not exist in the English language.

So if one accounts for these dependencies by considering blocks of up to eight letters, then this is the estimate of the entropy of the source that one can obtain. This is one of the most accurate estimates of the entropy of the English language.

And therefore, this is the number one should try to come

close to in designing efficient codes for encoding the English language.

In general, the more we know about the structure of the

data, the better our estimate of the entropy is going to be.

We're going to demonstrate this with two simple toy examples.

So assume we have this stream here of symbols.

This source has six symbols in its alphabet. In this stream we count the appearances of each symbol: the symbol 2, for example, appears two times out of ten, so the probability of 2 is 0.2, and so is the probability of 4, and so on and so forth.

We use the entropy formula for a DMS and we find that, in this particular case, the entropy is 2.44 bits per symbol.
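Here is a minimal sketch of that computation. The exact ten-symbol stream is not reproduced in the transcript, so the stream below is a made-up stand-in; the entropy routine is just the DMS formula H = -sum_i p_i log2(p_i) applied to the measured frequencies:

```python
import math
from collections import Counter

def dms_entropy(symbols):
    """DMS entropy estimate (bits/symbol) from the observed symbol frequencies."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical ten-symbol stream over a six-symbol alphabet (illustrative only;
# the lecture's actual stream gives 2.44 bits per symbol).
stream = [1, 2, 3, 4, 2, 4, 1, 1, 5, 6]
print(round(dms_entropy(stream), 2))
```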

By the way, since we have six symbols, if I were to use a fixed-length code I would need three bits per symbol, because two to the third is eight. So with three bits I can express eight different codewords. Here I have only six symbols, but using a variable-length code, in principle, I could get closer to the entropy of 2.44 bits per symbol.

Now, if we stare at this particular stream for a little while, we can possibly see some structure in it, and to express that structure we propose a prediction model.

4:32

So, first of all, x(n) is the value in this stream at location n. If this is location n, then the one to the left is at location n minus 1, and so on. This first-order linear prediction model tells us that we can find the value of the signal at location n by assuming it is equal to the value at the previous location, n minus 1, plus the residual, also called the correction term or the error: x(n) = x(n-1) + e(n).

5:12

So the residual consists only of plus 1 and minus 1. I measure the probability of plus 1 and the probability of minus 1, I use the DMS formula for the entropy, and we see that the entropy is 0.7 bits per symbol. So this estimate is more than three times lower than the entropy we first obtained.
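A minimal sketch of this prediction-plus-residual idea. Again, the actual stream from the slide is not in the transcript, so the one below is a made-up stand-in whose neighbouring values always differ by plus or minus 1:

```python
import math
from collections import Counter

def dms_entropy(symbols):
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Stand-in stream whose consecutive values differ only by +1 or -1 (illustrative).
x = [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, 2]

# First-order linear prediction x(n) = x(n-1) + e(n): encode only the residual e(n).
residuals = [x[n] - x[n - 1] for n in range(1, len(x))]
print(residuals)                         # only +1 and -1 appear
print(round(dms_entropy(x), 2))          # entropy of the raw symbols
print(round(dms_entropy(residuals), 2))  # much smaller (the lecture's stream gives 0.7)
```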

We're going to look at predictors of this nature multiple times in this segment of the course, when we talk about compression.

And the idea here is that we build a

predictor to predict, as here, the value of x

at location n, and then all we have to

encode is the unpredictable part, which is the residual.

5:59

Of course, in decoding, I need to have the predictor model; it's a static predictor in this particular case. So given the predictor model and the residual, I am able to reconstruct the original sequence.
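And a matching sketch of the decoder side, under the same assumptions: knowing the static predictor, the first value, and the residual stream, the original sequence is recovered exactly:

```python
def reconstruct(first_value, residuals):
    """Rebuild the sequence from its first value and the residuals e(n) = x(n) - x(n-1)."""
    x = [first_value]
    for e in residuals:
        x.append(x[-1] + e)  # x(n) = x(n-1) + e(n)
    return x

# Illustrative: first value 1 and a short run of +/-1 residuals.
print(reconstruct(1, [+1, +1, +1, -1, -1]))  # [1, 2, 3, 4, 3, 2]
```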

A similar example is presented here. This is now the generated message, the stream of symbols. Here I see three different symbols, so I measure their probabilities. The source generates only the symbols 3, 4, and 5. And I see that I need 1.5 bits per symbol to represent this, according to the entropy.

Now, if we again stare at it for a second, we see that, in essence, if I look at two symbols at a time, so it's like I'm extending the source, I now use a block code, you might say, then I only have the symbol pairs 33 and 45 (the slide should read 45 here). And then, in this case, since they are equally probable, the entropy is one bit per new symbol, and therefore half a bit per original symbol.

So again, we can reduce the entropy estimate considerably by looking into the structure of the data, exploiting it, and capitalizing on it.
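A minimal sketch of this "extended source" idea. The transcript does not reproduce the stream, so the one below is a stand-in built only from the two-symbol blocks 33 and 45, consistent with the description:

```python
import math
from collections import Counter

def dms_entropy(symbols):
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Stand-in stream built only from the blocks 33 and 45, in equal proportion.
stream = [3, 3, 4, 5, 4, 5, 3, 3, 3, 3, 4, 5, 4, 5, 3, 3]
pairs = [(stream[i], stream[i + 1]) for i in range(0, len(stream), 2)]

print(dms_entropy(stream))  # 1.5 bits per original symbol
print(dms_entropy(pairs))   # 1.0 bit per pair, i.e. 0.5 bits per original symbol
```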

Let us now look into a simple example of codes, to get a better understanding of them and also to introduce and discuss some other properties of codes.

We are given this toy source.

It emits one of four symbols and

here are the associated probabilities of the symbols.

8:04

And we can find that for this particular

source here is its entropy, 1.75 bits per symbol.
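The symbol probabilities themselves are not listed in the transcript; the sketch below assumes the common textbook choice {1/2, 1/4, 1/8, 1/8}, which is consistent with the quoted 1.75 bits per symbol:

```python
import math

# Assumed probabilities for S1..S4 (an assumption, not stated in the transcript).
p = [0.5, 0.25, 0.125, 0.125]
print(-sum(pi * math.log2(pi) for pi in p))  # 1.75 bits per symbol
```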

We are given this code, Code 1, and we are asked to figure out whether it is a good code, a useful code.

It takes only a quick second of looking at it to realize that if we are given, for example, the code word 0, it's not clear whether we should decipher it as symbol S1 or as symbol S2.

So this code is not uniquely decodable or uniquely decipherable.

We can find the average codeword length for this code, or the rate of this code: it's 1.125 bits per symbol, and since this is smaller than the entropy, it is already a red flag that there's probably something wrong with this code.

We are also presented with Code 2 and we're asked the same question.

It probably takes an additional second to also realize that this code is not uniquely decodable. If we are given, for example, the sequence of code words 00, we could decipher it as S1, S1, or we could decipher it as the symbol S3.

So this code is not uniquely decodable.

Again, we see that the rate of the code is below the entropy, so the source coding theorem is violated; it's not a useful code.

9:41

We're given Code 3.

One of the observations about this code is that it is a prefix code.

The name might be a bit misleading.

It's actually a prefix(-free) code, which means no

code word is a prefix of another code word.

As such, it's intuitively clear that this code is uniquely decodable, and more than that, the decoding is instantaneous.

In other words, if we look at a bit stream, let's say 010110, we start deciphering it. As soon as we see 0, we know that this should be decoded as S1, because 0 is a code word that exists in Code 3. Then I see that 10 is a code word that exists there, so this is deciphered as S2. And 110 is deciphered as S3.

So this is a prefix code, and any prefix code is uniquely decodable. It is also instantaneous.

We see, actually, that its rate, the average code word length, is exactly the entropy, so this is, you might say, an optimal code. It achieves the entropy; we cannot do better than that.
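A minimal sketch of instantaneous decoding. The transcript gives the Code 3 words 0, 10, and 110 for S1, S2, and S3; the fourth word is assumed here to be 111, which is consistent with an average length of 1.75 bits per symbol under the probabilities assumed earlier:

```python
# Assumed Code 3 codebook (S4 -> '111' is an assumption; the rest is from the lecture).
code3 = {'0': 'S1', '10': 'S2', '110': 'S3', '111': 'S4'}

def decode_instantaneous(bits, codebook):
    """Decode a prefix(-free) code: a symbol is emitted the moment a code word is seen."""
    decoded, current = [], ''
    for b in bits:
        current += b
        if current in codebook:                # no code word is a prefix of another,
            decoded.append(codebook[current])  # so we can emit immediately
            current = ''
    if current:
        raise ValueError("bit stream ended in the middle of a code word")
    return decoded

print(decode_instantaneous('010110', code3))  # ['S1', 'S2', 'S3']
```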

We are also given Code 4.

11:08

It takes a bit more thinking now to see whether this is uniquely decodable or not.

Actually, there is a specific algorithmic test with which you can check whether a given code is uniquely decodable.

And this is indeed a uniquely decodable code.

It's not a prefix code, because zero, for example,

appears as a prefix to all other code words.

So it is not a prefix code.

11:40

And it is not instantaneous either, since it is not a prefix code.

So we see that the relationship between unique decodability and the prefix property goes only one way: prefix implies uniquely decodable, but uniquely decodable does not mean that the code is a prefix code, as indicated by Code 4.

We see that the rate of this last code is a bit higher than entropy.

So if one were to choose one out of the four codes, Code 3 would be the one of choice, because being prefix and instantaneous is an advantage, and its rate is also the lowest among the four, although, as we've already discussed, Codes 1 and 2 are not useful or acceptable codes to begin with.

13:21

So clearly, to get to the code word for S1, I go through a single 0, so S1 is represented by 0. S3 is represented by 00, and S4 is represented by 11. These ending nodes are called leaf nodes. And actually, the root is also a branch node, because the tree branches out at that particular location.

So let's do the same thing for Code 3.

14:05

So to obtain the code word for symbol S3, we see that we

have to go through a 1 here, a 1 here, and a 0 here.

So the code word for S3 is 110.

Now, an important observation here is that Code 3, if you remember, is a prefix code, which you can observe right away by looking at the tree representation: in order to get to any of the code words, we don't need to go through another code word, which was the case, for example, with Code 2.

To get to S3 in Code 2, for example, we have to go through S1, so S1 is a prefix of S3; but here this is not the case. None of the code words is a prefix of another code word, and therefore this is a prefix code.

And equally important is that the symbols are represented by leaf nodes.
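A small sketch of this leaf-node observation: placing the code words in a binary tree, a code is prefix(-free) exactly when every code word sits at a leaf, which is the same as saying that no code word is a proper prefix of another. The second example below uses hypothetical code words in the style of Code 4, where 0 prefixes everything:

```python
def is_prefix_free(codewords):
    """True if no code word is a proper prefix of another (all symbols sit on leaves)."""
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

print(is_prefix_free(['0', '10', '110', '111']))   # Code 3 (as assumed above): True
print(is_prefix_free(['0', '01', '011', '0111']))  # Code 4-style (hypothetical): False
```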

15:44

So I have also put here the various terms that I introduced. This is an important point that we see here for prefix codes, and we'll see, when we talk pretty soon about the Huffman code, that due to its construction we end up with a tree like this. In other words, the symbols are represented by leaf nodes, and therefore we know that it is a prefix code.

16:27

So, as mentioned, it's hard to tell if a code is uniquely decodable. However, there is a systematic procedure based on which one can determine, in a finite number of steps, whether a code is indeed uniquely decodable.
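The systematic procedure being referred to is usually presented as the Sardinas-Patterson test; here is a minimal sketch (a compact formulation, not necessarily the one used in the course). It repeatedly forms "dangling suffixes" and declares the code ambiguous as soon as one of them is itself a code word:

```python
def is_uniquely_decodable(codewords):
    """Sardinas-Patterson style test: build sets of dangling suffixes until they
    repeat or die out; the code is ambiguous iff some dangling suffix is a code word."""
    code = set(codewords)

    def dangling(prefixes, words):
        """Suffixes left over when a word in `prefixes` is a proper prefix of one in `words`."""
        return {w[len(p):] for p in prefixes for w in words
                if p != w and w.startswith(p)}

    seen = set()
    current = dangling(code, code)
    while current:
        if current & code:
            return False                 # ambiguity: two different parsings exist
        if frozenset(current) in seen:
            return True                  # sets cycle without hitting a code word
        seen.add(frozenset(current))
        current = dangling(current, code) | dangling(code, current)
    return True                          # no dangling suffixes remain

print(is_uniquely_decodable(['0', '10', '110', '111']))   # prefix code: True
print(is_uniquely_decodable(['0', '01', '011', '0111']))  # not prefix, still True
print(is_uniquely_decodable(['0', '1', '00', '11']))      # Code 2-style: False
```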