The lectures of this course are based on the first 11 chapters of Prof. Raymond Yeung’s textbook entitled Information Theory and Network Coding (Springer 2008). This book and its predecessor, A First Course in Information Theory (Kluwer 2002, essentially the first edition of the 2008 book), have been adopted by over 60 universities around the world as either a textbook or reference text.
At the completion of this course, the student should be able to:
1) Demonstrate knowledge and understanding of the fundamentals of information theory.
2) Appreciate the notion of fundamental limits in communication systems and, more generally, in all systems.
3) Develop a deeper understanding of communication systems.
4) Apply the concepts of information theory to various disciplines in information science.

Choh-Ming Li Professor of Information Engineering, and Co-Director, Institute of Network Coding

Why is the channel capacity C related to the mutual information between X and Y?

Consider the block diagram for the communication system and observe

that the random variables W, X, Y, and W hat form a Markov chain in that order.

Now consider the information diagram for this Markov

chain. Since we assume that the encoder is

deterministic, the entropy of the X sequence, given W, is equal

to zero. This is shown in the information diagram.

Likewise, we assume that the decoder is deterministic so that the

entropy of W hat, given the sequence Y, is equal to zero.

This is shown in the information diagram.

Now for reliable communication, W, the message, and W hat, which

is the estimate of the message, should be essentially the same.

So they are essentially functions of each other.

That is, entropy of W hat given W, is equal to zero.

This is shown in the information diagram.

And entropy of W given W hat is equal to zero.

This is shown in the information diagram.

Now, we see that the setup of the problem

actually imposes a lot of zeros in the information diagram.

From the information diagram, we see that the entropy of W is

equal to the measure on the atom marked by the red dot.

Likewise, the mutual information between the input sequence and the output sequence

of the channel is also equal to the measure of the atom marked by the red dot.

Therefore we see that H(W) is equal to I(X;Y).

That is, the entropy of the message is equal to the mutual

information between the input sequence and the output sequence of the channel.

This suggests that the channel capacity is obtained by maximizing

the mutual information between the input sequence and the output sequence.

And we are going to see that this is equivalent to maximizing I(X;Y),

the mutual information between the input and the

output of a single use of the channel.
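The maximization over input distributions can be illustrated numerically. The following sketch (not from the lecture) assumes a binary symmetric channel with crossover probability 0.1 and finds the capacity by a grid search over input distributions, comparing the result against the known closed form C = 1 - H_b(0.1):

```python
import numpy as np

def mutual_information(p_x, channel):
    """I(X;Y) in bits, for input distribution p_x and channel
    transition matrix channel[x, y] = p(y | x)."""
    p_xy = p_x[:, None] * channel              # joint distribution p(x, y)
    p_y = p_xy.sum(axis=0)                     # output marginal p(y)
    prod = p_x[:, None] * p_y[None, :]         # product of the marginals
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / prod[mask])))

# Assumed example: binary symmetric channel with crossover probability 0.1.
eps = 0.1
bsc = np.array([[1 - eps, eps],
                [eps, 1 - eps]])

# Maximize I(X;Y) over input distributions p_x = (a, 1 - a) by grid search.
grid = np.linspace(0.001, 0.999, 999)
capacity = max(mutual_information(np.array([a, 1 - a]), bsc) for a in grid)

# Known closed form for the BSC: C = 1 - H_b(eps).
h_b = -eps * np.log2(eps) - (1 - eps) * np.log2(1 - eps)
print(capacity, 1 - h_b)
```

For the binary symmetric channel the maximum is attained at the uniform input distribution, and the grid search recovers the closed-form value 1 - H_b(0.1), about 0.531 bits per channel use.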

We now look at the building blocks of the converse proof.

First observe that for all i from 1 up to n, the mutual information

between X_i and Y_i, by definition, is less than or equal to C,

because C is the maximum of I(X;Y) over all input distributions p(x).

Then summation i equals 1 up to n, I(X_i;Y_i) is less than or equal to n times C.

By Lemma 7.16, to be established shortly, the

mutual information between the input sequence X and the

output sequence Y is upper bounded by summation i equals 1 up to n, I(X_i;Y_i).

Therefore, one over n times log M, which is the rate of the code,

is equal to one over n times log of the size of the message set.

Because the message is chosen uniformly, this is equal

to one over n times the entropy of W.

As we have seen, for reliable communication, this is approximately equal

to one over n times the mutual information between the input sequence X and the

output sequence Y, which from the above is upper bounded by one over

n, times summation i equals 1 up to n, I(X_i;Y_i).

And that is upper bounded by C.

So, essentially we have the ingredients for proving that, for reliable

communication, the rate of the code cannot exceed C, the channel capacity.
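In symbols, the chain of steps just described is (writing the input and output sequences in boldface):

```latex
\frac{1}{n}\log M \;=\; \frac{1}{n}H(W)
\;\approx\; \frac{1}{n}I(\mathbf{X};\mathbf{Y})
\;\le\; \frac{1}{n}\sum_{i=1}^{n} I(X_i;Y_i)
\;\le\; C
```

The approximation step is the one that Fano's inequality will make rigorous in the formal proof.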

We now prove Lemma 7.16, which asserts that

the mutual information between the X sequence and the Y

sequence is upper bounded by summation i equals 1 up

to n, the mutual information between X_i and Y_i.

First, from the previous proposition, we have that the entropy

of the Y sequence, given the X sequence, is equal to summation i equals one up to

n, the entropy of Y_i given X_i.

Then the mutual information between the X sequence and the Y sequence,

which is equal to the entropy of the Y sequence minus the entropy of

the Y sequence, given X sequence, by the independence

bound, is less than or equal to summation i equals 1 up to

n, the entropy of Y_i, minus summation i equals 1

up to n, the entropy of Y_i given X_i.

And this is equal to summation i equals 1 up

to n, the mutual information between X_i and Y_i.
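Written out, the proof of Lemma 7.16 is the one-line chain:

```latex
I(\mathbf{X};\mathbf{Y})
= H(\mathbf{Y}) - H(\mathbf{Y}\,|\,\mathbf{X})
= H(\mathbf{Y}) - \sum_{i=1}^{n} H(Y_i\,|\,X_i)
\le \sum_{i=1}^{n} H(Y_i) - \sum_{i=1}^{n} H(Y_i\,|\,X_i)
= \sum_{i=1}^{n} I(X_i;Y_i)
```

Here the second equality is the previous proposition, and the inequality is the independence bound for entropy applied to the Y sequence.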

We now give the formal proof for the converse of the channel coding theorem.

The key idea is to apply Fano's inequality to

make rigorous the approximation that we have used in the argument.

Let R be an achievable rate.

That is, for any epsilon greater than zero, there exists for

sufficiently large n, an (n,M) code such that 1 over

n log M, which is the rate of the code, is

greater than R minus epsilon, and lambda_max is less than epsilon.

Now, consider log M equals entropy of W, because

the message is chosen uniformly from the message set.

Now, the entropy of W is equal to entropy of W

given W hat plus the mutual information between W and W hat.

By the data processing theorem, the mutual information between

W and W hat is upper bounded by the

mutual information between the X sequence and the Y sequence.

By Lemma 7.16, the mutual information

between the X sequence and the Y sequence, is upper

bounded by summation i equals 1 up to n, the mutual

information between X_i and Y_i, which is at most n times

C. By Fano's inequality, the conditional

entropy of W given W hat, that is, the first

term above, is less than one plus P_e, the

average probability of error, that is, the

probability that W is not equal to W hat, times

the log of the size of the message set.

And this is equal to one plus P_e times log M.

Then, from step 2, we have log M is less than or equal to H(W|W hat)

plus nC,

where by Fano's inequality,

H(W|W hat) is less than one plus P_e times log M.

Now P_e,

the average probability of error, as we have seen,

is less than or equal to lambda_max. And by our assumption

on the code, lambda_max is less than epsilon.

Now moving the term

epsilon times log M to the left hand side, we have 1 minus epsilon times

log M is less than one plus nC.

And moving 1 minus epsilon to the right hand side, we have log M less than one plus nC divided by one minus
epsilon.

Upon dividing by n on both sides,

we have one over n times log M less than one over n

plus C

divided by one minus epsilon. Therefore, following our assumption on the code, we have

R minus epsilon less than one over n times log M. And one over n times log M is less than one over n plus C
divided by one minus epsilon. So we obtain R minus epsilon
is less than one over n plus C divided by one minus epsilon.
To complete the proof,

we let n go to infinity so that one over n goes to zero, and then let epsilon go to zero, to conclude that R is less
than or equal to C.
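The whole converse argument can be summarized in display form (with W hat the decoded message):

```latex
\log M = H(W) = H(W\,|\,\hat{W}) + I(W;\hat{W})
\le H(W\,|\,\hat{W}) + I(\mathbf{X};\mathbf{Y})
< (1 + P_e \log M) + nC
\;\Longrightarrow\;
(1-\epsilon)\log M < 1 + nC
\;\Longrightarrow\;
\frac{1}{n}\log M < \frac{\frac{1}{n}+C}{1-\epsilon}
```

The strict inequality uses Fano's inequality together with I(X;Y) at most nC, and the first implication uses P_e at most lambda_max, which is less than epsilon. Letting n tend to infinity and then epsilon tend to zero gives R at most C.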

We now discuss a corollary that gives a lower bound on P_e, the average probability of error, when n is large.

Consider the highlighted inequality we obtained in step 4 in the proof of the converse of the channel coding theorem.

From this inequality, we obtain P_e greater than or equal to one minus the quantity one plus nC

divided by log M. Multiplying by one over n upstairs

and downstairs in the fraction, we obtain one minus one over n plus C divided by one over n times log M.

When n is large, the one over n upstairs is approximately equal to zero.

So this can be approximated by one minus C divided by one over n times log M,

when n is large.
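A quick numeric check of this bound (an illustrative sketch, with an assumed capacity C = 0.5 bit and an assumed rate R = 0.75 bit above capacity) shows it converging to one minus C over R as n grows:

```python
# Asymptotic lower bound from the converse: P_e >= 1 - (1/n + C) / R,
# where R = (1/n) log M is the rate of the code.  C = 0.5 bit and
# R = 0.75 bit are assumed, illustrative values with R > C.
def pe_lower_bound(n, rate, capacity):
    """Lower bound on the average error probability for an (n, M) code."""
    return 1 - (1.0 / n + capacity) / rate

C, R = 0.5, 0.75
for n in (10, 100, 1000, 10000):
    print(n, round(pe_lower_bound(n, R, C), 4))
# The bound increases toward 1 - C/R = 1/3 as n grows, so P_e stays
# bounded away from zero whenever the rate exceeds the capacity.
```

With these assumed values the bound rises from 0.2 at n = 10 toward its limit of one third, matching the figure described above.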

This lower bound on P_e is illustrated by the figure

on the left hand side. We now give

an asymptotic analysis of P_e when n is large.

As we have shown, for large n, P_e is lower bounded by one minus C

divided by one over n times log M,

where one over n times log M is the rate of the channel code.

The above lower bound on P_e says that if one over n times log M is greater

than C, so that C divided by one over n times log M is less than one,

then P_e is bounded away from zero for large n.

This is illustrated in the figure on the left hand side.

If we take the rate of the code to be strictly greater

than C, then P_e, the average probability of error of the code,

when the block length n is large, can only

take a value in the range colored in blue,

which is bounded away from zero.

This asymptotic lower bound on P_e also implies that if the rate of

the code is greater than C, then P_e is strictly positive for all n.

For the details, please see the textbook.

Also note that this asymptotic lower bound on P_e tends to one, as

one over n log M, that is, the rate of the code, tends to infinity.