0:00

One of the more interesting stories in the field of probabilistic graphical models relates to the connection between the algorithm of loopy belief propagation and the problem of decoding messages sent over a noisy channel.

This is a connection that was discovered fairly recently, and it has proven to have a tremendous impact on both disciplines: that of probabilistic graphical models and that of message decoding. So let's understand the message decoding problem. Imagine that I want to send K bits across a noisy channel. If I just naively send those K bits, then it's impossible for me to know, from the noisy bits that I receive at the other end, what the correct bits ought to have been, because we don't know which ones got corrupted and which ones didn't.

And so what people have developed is the notion of coding theory, where those K bits are passed through an encoder, and what we get at the end are N bits, where N is usually greater than K. Those are the bits that are actually transmitted across the noisy channel, giving us a set of noisy bits Y1 up to Yn, which we now hope to be able to decode in a way that ideally gives us bits V1 up to Vk that are close, and hopefully completely accurate, relative to the bits that were originally sent, U1 up to Uk. That's message coding and decoding in a nutshell.

And the rate of a code such as this is K over N; that is, you send N bits, of which K are actual information. The bit error rate, which is a function of the channel itself, is the probability that a bit we get on this side, say the first bit, is not correct, that is, not the same as the original bit. And so the bit error rate is the average of this probability of disagreement across the K decoded bits.
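These two quantities are simple to compute; here is a minimal sketch (the bit vectors are made up purely for illustration):

```python
def bit_error_rate(sent, decoded):
    """Average fraction of positions where the decoded bit differs from the sent bit."""
    assert len(sent) == len(decoded)
    return sum(u != v for u, v in zip(sent, decoded)) / len(sent)

# Rate of a code that sends n bits, of which k carry actual information:
k, n = 4, 7
rate = k / n  # 4/7, about 0.571

print(bit_error_rate([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.25: one of four bits is wrong
```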

1:57

So we can now think about different channels and their properties, and there's this very important notion called the channel capacity. Let's first think about the different error modes that a channel might exhibit.

So here's something called the binary symmetric channel, often abbreviated BSC. The binary symmetric channel says that when I send a zero, then with probability 0.9 I get a zero, but with probability 0.1 I get a one. And conversely, when I send a one, I get a one with probability 0.9 and a zero with probability 0.1. You can see why it's called the binary symmetric channel: the error is symmetric between zero and one. A different error model is the erasure channel, where bits don't get corrupted, they just get destroyed. The big difference with the binary erasure channel, abbreviated BEC, is that when a bit gets destroyed, I know it. What I get at the other end is a question mark, as opposed to a bit where I don't know whether it is the original bit that was sent or a corrupted one; here, I know the bit was corrupted.

And finally, here is what's called a Gaussian channel, where the noise that gets added on top of the signal is actually analog noise, so there is a Gaussian distribution relative to the signal that was sent. And it turns out these different channels have very different notions of what capacity means.

We'll define exactly what implications capacity has in a moment. Information theorists have computed the capacity for these different kinds of channels, and it turns out, for example, that the capacity of this binary symmetric channel is a little bit over 0.5, whereas the capacity of the binary erasure channel is 0.9, where 0.9 is the probability of getting the correct bit. And if you think about it, that makes perfect sense: the capacity of a channel where you know which bits were erased is higher than that of one where you have to figure out whether bits are right or wrong. So this is a more benign type of noise.
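These capacity numbers follow from the standard closed-form expressions, 1 - H(p) for the BSC and 1 - alpha for the BEC; a quick sketch using the parameters from the diagrams (flip probability 0.1, erasure probability 0.1):

```python
import math

def binary_entropy(p):
    """H(p) in bits; defined as 0 at the endpoints by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(flip_prob):
    # Binary symmetric channel: C = 1 - H(p)
    return 1 - binary_entropy(flip_prob)

def bec_capacity(erasure_prob):
    # Binary erasure channel: C = 1 - alpha
    return 1 - erasure_prob

print(round(bsc_capacity(0.1), 3))  # 0.531, "a little bit over 0.5"
print(round(bec_capacity(0.1), 3))  # 0.9
```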

And what you see over here is the capacity of the Gaussian channel. It's an expression that involves, for example, the width of this Gaussian: the wider it is, the lower the capacity. Now, Shannon, in a very famous result known as Shannon's theorem, related the notion of channel capacity and bit error probability in a way that defines an extremely sharp boundary between codes that are feasible and codes that are infeasible.

So let's look at the diagram. The x axis is the rate of the code; remember, the rate was K over N in the example that we gave, and here it is measured in multiples of the channel capacity. So, once you define the channel capacity, we can look at the rate of a code, and each code could sit at a different point of this spectrum in terms of its rate.

On this axis, we have the bit error probability.

So obviously, lower is better, in terms of bit error probability.

And what Shannon proved is that this region over here, the region that I'm marking here in blue, is the attainable region: for any point in this two-dimensional space of rate and bit error probability, you can construct a code that achieves it. He didn't show this with a constructive argument; it was a non-constructive argument, but he proved that such codes exist. Conversely, he showed that anything on the other side of the boundary, the forbidden region, is just not attainable. That is, no matter how clever a coding theorist you are, you could not construct a code that had a rate above a certain value and a bit error probability below a certain value, which is the shape of this boundary that we see over here. And you can see why the channel capacity is called a capacity, because this boundary sits at one times the channel capacity: you cannot transmit reliably at rates beyond it.

So this was Shannon's theorem, and it provided, as we said, a non-constructive proof that the blue region was attainable. But the question is how you can actually achieve something close to the Shannon limit. Up to a certain point in time, the mid '90s, this was a diagram of what was achievable in terms of codes. It's a diagram we'll see several more times in a minute. Here, on the x axis, we have the signal to noise ratio, measured in dB, and on the y axis, we have the log of the bit error probability. And what you see here are codes that were in use. First, up here, we see the uncoded version.

And you can see that the uncoded version is not very good: it has very high bit error rates (higher on this axis is worse), so, very bad. And, of course, as the signal to noise ratio moves to the left, the error rate grows. What you see over here in these two other lines are the bit error rate curves achieved by two NASA space missions, Voyager and Cassini, and you can see that there was a good improvement between Voyager and Cassini, which span 1977 to 2004.

7:59

And then in 1993, a revolution happened in coding theory. This was a paper by Berrou et al., and you can read the title of the paper. It was a shocking title: near Shannon limit error-correcting codes. And they called these turbo codes. Initially, when they tried to submit this paper, people didn't believe that this was possible, because it was so much better than any of the previous codes that had been proposed. People said, no, this can't possibly be the case. But they checked it, and it turned out that, in fact, their code worked. Nobody really understood why, though. Okay, so now you might wonder why the heck I am telling you all this. What does it have to do with probabilistic graphical models? And why was this so surprising?

Because if you look at the turbo codes in terms of the diagram that I showed you a minute ago: this is the same kind of diagram, signal to noise ratio on the x axis, log bit error probability on the y axis, and over here, again, we have the uncoded version. Here once again we have Voyager and Cassini. And now look at the turbo codes.

9:25

The turbo codes are way to the left of any of these previous codes. Now, again, remember left is better: left means that you are achieving the same bit error probability at a lower signal to noise ratio. So what is the magic of turbo codes? And this is where we bring ourselves back to probabilistic graphical models, because you might be sitting there thinking, well, you know, fine, coding theory's great, but this isn't a coding theory class. So what does this have to do with probabilistic graphical models?

So to understand that, let's look at what turbo codes do. Turbo codes took the same set of bits that we want to transmit, U1 up to Uk, and passed them through two encoders, encoder one and encoder two. Then those bits were passed through our noisy channel over here, so that what we get at the end are noisy versions of those two sets of bits. And really, decoding is a probabilistic inference problem when you think about it, because what it's trying to do is compute a probability distribution over the message bits given my noisy evidence Y1 and Y2, the noisy bits I received on the channel. But instead of trying to solve the probabilistic inference problem exactly, what Berrou et al. proposed in this turbo coding paper is a sort of iterative approach, where we have a decoder that matches the first encoder. So decoder one matches encoder one.

Decoder one takes the noisy bits that come from encoder one, after they pass through the noisy channel. And decoder two gets a separate set of bits: the bits that were passed through encoder two and then the noisy channel. And what happens, roughly speaking, in turbo codes is that the two decoders start to communicate with each other.
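As a toy, single-bit illustration of that communication (real turbo decoders pass per-bit extrinsic information over structured codes; this sketch only shows the "one decoder's posterior becomes the other's prior" idea, with an assumed BSC flip probability of 0.1):

```python
def bsc_likelihood(observed, flip_prob=0.1):
    """P(observed | u) for u in {0, 1} under a binary symmetric channel."""
    return {u: (1 - flip_prob) if observed == u else flip_prob for u in (0, 1)}

def combine(prior, likelihood):
    """Bayes update: posterior over u proportional to prior[u] * likelihood[u]."""
    unnorm = {u: prior[u] * likelihood[u] for u in (0, 1)}
    z = sum(unnorm.values())
    return {u: p / z for u, p in unnorm.items()}

uniform = {0: 0.5, 1: 0.5}
# Decoder 1 sees its noisy copy y1 = 1 and computes a posterior...
belief = combine(uniform, bsc_likelihood(1))
# ...which decoder 2 then uses as its prior, folding in its own copy y2 = 1.
belief = combine(belief, bsc_likelihood(1))
print(round(belief[1], 3))  # 0.988: two agreeing noisy observations reinforce each other
```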

So decoder one looks at its bits; it doesn't exactly know what the correct bits are, but it computes a posterior over them. And it takes the posterior over each of the bits and passes it to decoder two, which now uses it as a prior over the values of these target bits U, combines it with the evidence from its own noisy bits Y, and computes a more informed posterior over the U's, which it then sends back to decoder one. And this process continues to iterate until convergence: each decoder computes a posterior and passes it to the other decoder, which uses it as a prior. And if you look at the implications of this iterative algorithm, you can see here the curves for successive iterations of turbo decoding. What you see, again, is the signal to noise ratio on the x axis and the log of the bit error probability on the y axis, and you can see that on the first iteration of turbo decoding, the curve is not that great. But then, as the iterations proceed, the bit error rate goes down. So, for example, for a fixed signal to noise ratio, say here, the bit error rate goes down from here to here over the iterations. And what you see as the dark line at the bottom is the optimal bit decision if we were to do exact probabilistic inference. And so the surprising part was how close this totally heuristic and, as it seemed at the time, unmotivated algorithm comes

to the optimum of the bit decision. The revolution that happened in this field came about when other people, people like McEliece, MacKay, and Frey, realized that what was really going on here is that this algorithm is running a variant of loopy belief propagation. Each of these decoders is actually running exact inference over a network that is tractable, and then it is passing messages; the beliefs being passed are what we have called the delta i,j's, in this case between these two components. Really, what was going on here are two decoders communicating with each other via a process that is exactly identical to loopy belief propagation.

14:20

And this caused a revolution in the

field in two different ways. Actually, it caused a revolution in two different fields. It caused a revolution in the field of probabilistic graphical models, because up until that point, people had known that you could pass loopy messages over the graph. In fact, this was an observation made as early as 1988 by Judea Pearl, when he wrote the seminal book that introduced Bayesian networks to the field of artificial intelligence. He wrote at the time that, sure, you can pass these messages over a loopy graph, but you have no guarantees of convergence, and no guarantees that it gets the right answers, and so it's perhaps not a good idea.

And from 1988 until the mid-1990s, the algorithm had been essentially abandoned, because people said that it had these limitations, so it wasn't a good idea. But when people realized that, in the context of coding theory, this algorithm, which is obviously heuristic and has all these potential limitations, nevertheless succeeds in achieving performance that looks like this, then people said, wait a minute. If it works so well for decoding, maybe it's a good algorithm to think about after all.

And from the mid-1990s until today, this set off a huge amount of work in the field of probabilistic graphical models on understanding when and why loopy belief propagation works, as well as on constructing improved versions of loopy belief propagation, some of which we briefly mentioned in this course.

So that's the first revolution. The second revolution occurred in the field of coding, where people said, well, if all that's going on is that we're running loopy belief propagation over a graph that is trying to compute the posterior over the U's, the message bits, given the noisy bits Y, then we can construct lots of other codes, and use those codes and run loopy belief propagation on the resulting graph. And sure enough, some of the more successful codes in use today are, in fact, not necessarily the turbo codes that Berrou et al. originally came up with, but a class of codes called low-density parity-check codes, which are actually a really elegant match to loopy belief propagation. Now, it turns out low-density parity-check codes were actually invented by Robert Gallager in 1962.

And from 1962 up until, sort of, the late '90s, they had been totally abandoned, because exact probabilistic inference for decoding them was computationally intractable. But the realization that one could run loopy belief propagation as a way of decoding with these codes reintroduced them into the field, and now they're wildly successful, and there are all sorts of variations on them, and extensions, and so on.

So what are these parity-check codes? In a parity-check code, we are actually sending two types of bits. These are the original bits, U1 up to U4. The first set of sent bits are just the original message bits; I denote them separately here even though X1 is actually identical to U1. I just wanted to denote specifically that X1 is the sent bit. So X1 is just a copy of U1, and X4 is a copy of U4. And once again, the Y's are the noisy bits that are received after transmission. On the other hand, the X's that you see here on the bottom are parity bits. Parity bits are like checksums on a file: each one looks at the parity of a different subset of the bits. So X5 here is the parity of U1, U2, and U3: zero if the number of ones among them is even, and one if it is odd. X6 tests the parity of a different set of bits, U1, U2, and U4, and X7 does U2, U3, and U4. So here we're sending four original bits and three parity bits, so this, in total, is a code whose rate is four over seven, because there are seven bits sent, of which four are message bits.
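A minimal sketch of this encoder (XOR computes exactly the even/odd parity described above; the function name is just for illustration):

```python
def encode_parity_7_4(u1, u2, u3, u4):
    """Rate-4/7 parity-check encoding: 4 message bits plus 3 parity bits."""
    x5 = u1 ^ u2 ^ u3  # parity of u1, u2, u3
    x6 = u1 ^ u2 ^ u4  # parity of u1, u2, u4
    x7 = u2 ^ u3 ^ u4  # parity of u2, u3, u4
    # The first four sent bits are copies of the message bits themselves.
    return [u1, u2, u3, u4, x5, x6, x7]

print(encode_parity_7_4(1, 0, 1, 1))  # [1, 0, 1, 1, 0, 0, 0]
```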

19:50

And then these parity checks are the large factors, and some of you will recognize this as a Bethe cluster graph, a Bethe-structured cluster graph.

And this is just a list of applications. This is probably one of the most ubiquitous applications of probabilistic graphical models in use today, simply because it exists in so many instantiations: in many set-top boxes, in digital video broadcasting, in communications, mobile televisions, mobile telephones, satellites, NASA missions, wireless networks, and so on. So this is one of the really powerful applications of loopy belief propagation and of probabilistic graphical models, and it arose from this unexpected confluence of the two disciplines.

So, really, you could say that loopy belief propagation, which was originally proposed by Judea Pearl in 1988, was rediscovered by practitioners of coding theory. These were not theoreticians of coding theory; these were people who actually sat down and engineered codes. And understanding those turbo codes in terms of loopy belief propagation, which was done by a bunch of more machine-learning and information-theory type people, led to the development of many, many new and better codes, and the current codes are coming gradually closer and closer to the Shannon limit in a constructive, as opposed to merely theoretical, way. And at the same time, the resurgence of interest in belief propagation led to much of the work that we see today on approximate inference in graphical models, both on the theoretical-understanding side and in the new algorithms that people are coming up with, derived from that new understanding.