0:00

An alternative class of algorithms to the variable elimination algorithms is the

class of message passing algorithms. And, as we'll see, this is a class that

is in some ways closely related to variable elimination.

It also offers us additional flexibility in how we do the summation and factor-product

steps, so as to potentially come up with a lower complexity than would be

required by even the minimal elimination ordering.

So let's consider a simple Markov network of the following type, and let's assume

that I don't want to go to the cost of eliminating variables, although in this

network it's not going to be too expensive. Instead, what we're going to do is

we're going to construct what we call a cluster graph.

A cluster graph is a data structure in which we're

going to take little bits of knowledge from this graphical model and we're going

to place them in things called clusters. So here, we have four clusters in this

example. Cluster one is a cluster whose

jurisdiction of influence is the pair of variables A, B.

Cluster two has jurisdiction over B and C.

Three is over C and D. And four is over A and D.

And these clusters are going to sit there, and they're going to talk to each

other, and they're going to try and convince each other that what they think

about a variable that they both consider to be under their jurisdiction is

correct. So, for example, cluster one is going to

talk to cluster two about the variable B, and it's going to tell cluster two what

it thinks about B, so that cluster two might become more informed about the

distribution over B. And so what we're going to do is

initially each cluster is going to have its own little piece of information, so

phi one is going to go here, phi two is going to go there, phi three goes there

and phi four goes there. And now the clusters are going to

communicate via these things called messages.

So we're going to slightly rename things.

We're going to call the initial set of beliefs, if you will, or

evidence, that a cluster has over the variables in its

jurisdiction; we're going to call those psis.

In the example we just had, the psis were just the phis in the original model, but

as we'll see, sometimes it becomes a little bit more complicated than that.

So now cluster one has the factor psi one,

cluster two has psi two, and so on. Now let's imagine that cluster two wants to send a

message to cluster one. So it has to figure out, what it believes

and, a priori, we're going to assume that it starts

out with a totally uninformed message,

because initially the clusters haven't even started talking to each other.

So all messages are initialized to be one.

But now cluster two can send over its

information, uninformative as it was, and notice that I

call this message delta two one: delta two being the "from" and one being the "to".

So, having taken delta two one, cluster one is going to say,

well, I'm going to take that and I'm going to multiply it with my current thoughts

about the variables A, B, and that's going to give me a more informed factor.
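As a minimal sketch of this multiplication step (the numbers here are hypothetical, and numpy arrays stand in for factor tables):

```python
import numpy as np

# Hypothetical tables: cluster 1's beliefs psi1 over (A, B), and an
# incoming message delta_2_1 over the shared variable B.
psi1 = np.array([[0.3, 0.7],
                 [0.9, 0.1]])        # psi1[a, b]
delta_2_1 = np.array([1.0, 1.0])     # initial, totally uninformed message

# Multiplying the message in leaves a (more) informed factor over (A, B).
informed = psi1 * delta_2_1[np.newaxis, :]
```

While the message is still all ones, `informed` is identical to `psi1`; once real messages start flowing, the product incorporates what cluster two believes about B.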

And now I'm going to communicate that information to cluster

four. But cluster four doesn't care about B,

and so what I'm going to communicate to

cluster four, which is this message delta one four, from one to four, is the sum

over B, which is the variable that four doesn't want to hear about,

of the incoming message times my initial beliefs, my initial

factors. Now, this general process is what keeps

the message-passing algorithm going.

Each cluster is going to send messages to its adjacent clusters that reflect this

exact same process. So, to take a different

example, here is delta three four.

So this is the message that goes from three to four.

And that message takes into consideration the evidence that two sent to three,

which is this guy over here, times whatever three thought about C, D to begin

with. Notice, and this is important, that the

message that three sends to four doesn't take into consideration the information

that it got from four. So there isn't, in this expression over

here, the contribution of delta four three.

Because you want to avoid the case where I repeat back to you a

rumor that you just told me, and we all become more convinced about the truth of

this rumor: I thought about this first, but now you're

reinforcing me by telling it to me again, and the beliefs are just going to go up

and up and up. And so what happens here is that we

deliberately restrict attention to evidence that

comes in from other sources. So three only uses evidence from two when

reporting to four, and conversely, it only uses evidence from four

when reporting to two. And so this defines

a communication protocol by which one cluster in the

graph can communicate information to its neighbors.
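The no-feedback rule can be sketched in a few lines of numpy, say for delta three four on this loop. The potentials are hypothetical random tables; the point is that delta four three is deliberately left out of the computation:

```python
import numpy as np

rng = np.random.default_rng(0)

# The A-B-C-D loop: clusters 1:{A,B}, 2:{B,C}, 3:{C,D}, 4:{A,D},
# binary variables, hypothetical random potentials.
psi3 = rng.random((2, 2))      # psi3[c, d]
delta_2_3 = np.ones(2)         # message from 2 to 3, over C (all ones so far)
delta_4_3 = np.ones(2)         # message from 4 to 3 -- deliberately NOT used

# delta_3_4: evidence from two, times three's own beliefs, summing out C
# (the variable that four doesn't share with three).
delta_3_4 = (delta_2_3[:, np.newaxis] * psi3).sum(axis=0)
print(delta_3_4.shape)   # (2,) -- a factor over D alone
```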

So what do we do with this? How do we generalize this message

passing process? Let's construct a more general version of

it. This uses a data structure called a

cluster graph. A cluster graph is an undirected graph

whose nodes are not variables anymore; this is not a

graphical model of the type that we've seen.

The nodes in this undirected graph are clusters that correspond to subsets of

variables. Just like we had before.

6:52

And we're going to connect two adjacent clusters, Ci and Cj,

via this thing called the sepset, which is the set of variables that they choose to talk

about. And clearly each one can only talk about

variables that it knows about, which is why the sepset Sij has to be a subset of

both Ci and Cj. So once again, Ci is the jurisdiction of

cluster i. These are the variables that it

understands. And Sij is the channel of communication between two

adjacent clusters in the cluster graph. So now, given a set of factors Phi, we're

going to initialize the model by giving each cluster in the graph a certain

amount of information. So each of

my initial factors phi k is going to be assigned to one and only one

cluster. And this is important: there needs to be

at least one, so that the information is taken into account somewhere, and it

shouldn't be more than one, because if you give it to more

than one cluster, you're going to double count the evidence;

each cluster is going to treat it as an independent piece of evidence, and it's going to be

counted twice. Now, where do we put the information

corresponding to factor k? We can only

put it in a cluster that understands every single variable in that factor.

So if we have a factor with a certain scope, that

scope had better be a subset of the variables that the cluster

understands, because otherwise we can't even talk

about those variables.

So, once we've done that, if I can erase this, okay.
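Both structural requirements, that each factor is assigned to a cluster covering its scope and that each sepset is contained in the two clusters it connects, can be checked mechanically. A minimal sketch, with hypothetical cluster scopes matching the earlier loop example:

```python
# Hypothetical cluster graph: the A-B-C-D loop from earlier.
clusters = {1: {"A", "B"}, 2: {"B", "C"}, 3: {"C", "D"}, 4: {"A", "D"}}
sepsets = {(1, 2): {"B"}, (2, 3): {"C"}, (3, 4): {"D"}, (1, 4): {"A"}}
factor_scopes = {"phi1": {"A", "B"}, "phi2": {"B", "C"},
                 "phi3": {"C", "D"}, "phi4": {"A", "D"}}
assignment = {"phi1": 1, "phi2": 2, "phi3": 3, "phi4": 4}

# Factor k may only live in a cluster that understands all its variables.
for k, cluster_id in assignment.items():
    assert factor_scopes[k] <= clusters[cluster_id]

# Each sepset S_ij must be a subset of both C_i and C_j.
for (i, j), s in sepsets.items():
    assert s <= clusters[i] and s <= clusters[j]
print("cluster graph is valid")
```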

We can now define the initial beliefs of a particular cluster

as the product of all of the factors that are assigned to it.
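As a sketch of that product, suppose (hypothetically) a cluster over B, C, D is assigned two binary factors, one over B, C and one over C, D; broadcasting aligns each table to the cluster's full scope before multiplying:

```python
import numpy as np

# Hypothetical factors assigned to one cluster over (B, C, D):
phi_a = np.array([[1.0, 2.0],
                  [3.0, 4.0]])    # phi_a[b, c]
phi_b = np.array([[5.0, 6.0],
                  [7.0, 8.0]])    # phi_b[c, d]

# Align both tables to the (B, C, D) axis order, then multiply:
psi = phi_a[:, :, np.newaxis] * phi_b[np.newaxis, :, :]    # psi[b, c, d]
print(psi.shape)   # (2, 2, 2)
```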

9:21

Now, some clusters might be assigned one factor, in which

case psi is just equal to that phi; that was the case

in the example that we just saw. Some clusters might have several factors

assigned to them, in which case we need to multiply them to create a single

factor that is sort of the total informed beliefs of the cluster.

And some clusters might have no factors assigned to them, in which case this is a

null product and is equal to one. So now let's look at an example of a

somewhat more interesting cluster graph than the trivial one that we showed

earlier. Here we have a set of factors.

phi one over A, B, C; phi two over B, C; phi three; phi four; phi five; phi six; and so on.

So initially we have to figure out for each of those factors a cluster in which

to put it. For phi one over A, B, C, there's really only one choice,

because there's only one cluster in this entire graph that understands all

of A, B, and C, and that's this cluster over here.

Phi two, over B and C, however, has two choices: it can go into

cluster one or it can go into cluster two, because both cluster one and cluster two

understand B and C. We're going to put it in cluster two.

I mean, we could've put it in either one. Both are fine.

Phi three has, again, two choices: it can go into cluster two, or it can go

into cluster three. We're going to go ahead and make the

decision to put it in cluster two. Phi four has only one

choice, because only one cluster has both D and E in its scope, so it goes here.

Phi five, similarly, can only go over here. The factor over B, D:

again, there is more than one choice. We can put it in cluster two, or we can

put it in cluster three. Let's, for simplicity, put it in cluster

three. And the factor over B, D, F has only one choice.

This is one possible way of assigning the factors to clusters.

There are other alternatives, as I said, that would work.

If we do this, we end up, for example, with psi two

being the product of phi two times phi three.

12:14

Okay. This was one cluster graph for that set

of factors. Here's another cluster graph,

for the exact same set of factors, so let's compare them.

And notice that in fact the clusters haven't changed;

what's changed is the edges and the sepsets between them.

Different cluster graph. Okay.

So now, let's think about message passing in the context of this more richly

structured cluster graph, to see what it looks like here.

So here, for example, if we're interested in passing a message from cluster one to

cluster four, we are going to take psi one, which is

the product of the factors that were initially assigned to cluster one.

We're going to take in the message that this cluster got from its other

neighbor, two. We are going to multiply them, and then

we are going to sum out all of the variables that one understands but four

doesn't. So here, for example, one understands A

and C and four doesn't, so we have to sum out over A and C, and that is the message

delta one four. What about delta four one?

Delta four one is the message that goes in the other direction.
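In numpy terms, assuming (hypothetically) that cluster one covers A, B, C, that its sepset with cluster four is just B, and that the incoming message delta two one arrives over B, C, the computation of delta one four looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical scopes: cluster 1 over (A, B, C), sepset with 4 is {B},
# and the incoming message from 2 is over (B, C).
psi1 = rng.random((2, 2, 2))        # psi1[a, b, c]
delta_2_1 = rng.random((2, 2))      # delta_2_1[b, c]

# delta_1_4: multiply, then sum out A and C (known to 1, not to 4).
delta_1_4 = (psi1 * delta_2_1[np.newaxis, :, :]).sum(axis=(0, 2))
print(delta_1_4.shape)   # (2,) -- a factor over B alone
```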

15:06

In general, cluster i takes its own initial beliefs, multiplies in the

incoming messages from all of its neighbors other than j, and sums out the variables that cluster j doesn't

know about: everything that's in the scope of

Ci but not in the sepset Sij. And that gives us a factor over the

sepset that is produced as the message. Putting that together, that gives us

an algorithm which is generally called belief propagation, for the reason

that clusters are propagating, if you will, informed beliefs to each other.

And here is a summary of the algorithm. Each factor phi is first assigned to a

cluster. That is used to construct our initial

potentials, these psis that we talked about.

We initialize all of the messages, before anybody starts communicating, to be one, and

then we repeat the following process: we select some edge in the graph, between

adjacent clusters, and we pass the message between cluster i and cluster j over that

edge, okay? This is the

expression for that message, the one that we saw in the previous slide,

and that process is repeated again and again.
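Putting the pieces together, here is a self-contained sketch of the whole algorithm on the four-cluster loop from earlier. The dict-based factor representation, the cluster scopes, and the potentials are all hypothetical illustrations, not code from the course:

```python
import itertools

# A minimal loopy-BP sketch over binary variables.  A factor is a
# (scope, table) pair: scope is a tuple of variable names, table maps
# 0/1 assignment tuples (in scope order) to values.

def make_factor(scope, fn):
    scope = tuple(scope)
    return (scope, {vals: fn(dict(zip(scope, vals)))
                    for vals in itertools.product((0, 1), repeat=len(scope))})

def multiply(f, g):
    fs, ft = f
    gs, gt = g
    scope = fs + tuple(v for v in gs if v not in fs)
    table = {}
    for vals in itertools.product((0, 1), repeat=len(scope)):
        a = dict(zip(scope, vals))
        table[vals] = (ft[tuple(a[v] for v in fs)] *
                       gt[tuple(a[v] for v in gs)])
    return (scope, table)

def marginalize_to(f, keep):
    fs, ft = f
    new_scope = tuple(v for v in fs if v in keep)
    table = {}
    for vals, p in ft.items():
        key = tuple(x for x, name in zip(vals, fs) if name in new_scope)
        table[key] = table.get(key, 0.0) + p
    return (new_scope, table)

# Cluster graph for the A-B-C-D loop: 1:{A,B}, 2:{B,C}, 3:{C,D}, 4:{A,D}.
edges = {(1, 2): ("B",), (2, 3): ("C",), (3, 4): ("D",), (1, 4): ("A",)}
psi = {1: make_factor("AB", lambda a: 1.0 + a["A"] * a["B"]),
       2: make_factor("BC", lambda a: 1.0 + a["B"]),
       3: make_factor("CD", lambda a: 2.0 - a["C"]),
       4: make_factor("AD", lambda a: 1.0 + a["D"])}

neighbors = {i: [] for i in psi}
for i, j in edges:
    neighbors[i].append(j)
    neighbors[j].append(i)

def sepset(i, j):
    return edges[(i, j)] if (i, j) in edges else edges[(j, i)]

# Step 1: initialize every directed message to the all-ones factor.
messages = {(i, j): make_factor(sepset(i, j), lambda a: 1.0)
            for i in psi for j in neighbors[i]}

# Step 2: repeatedly recompute delta_{i->j}: psi_i times all incoming
# messages EXCEPT the one from j, summed down to the sepset S_ij.
for _ in range(20):
    new = {}
    for (i, j) in messages:
        belief = psi[i]
        for k in neighbors[i]:
            if k != j:
                belief = multiply(belief, messages[(k, i)])
        new[(i, j)] = marginalize_to(belief, sepset(i, j))
    messages = new

# Final (pseudo-marginal) beliefs of cluster 1 over {A, B}:
belief1 = psi[1]
for k in neighbors[1]:
    belief1 = multiply(belief1, messages[(k, 1)])
print(belief1[0])   # prints ('A', 'B')
```

Real implementations normalize each message to avoid numerical blow-up and test convergence by comparing messages across iterations; both are omitted here for brevity.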

16:58

Now, there are several important aspects of this algorithm that are left undefined,

and that we're going to need to talk about later on.

The first of these is: repeat until when? When do we decide that we're done and

that we can stop? We'll talk about that in the context

of different variants of this algorithm later on.

The second thing that's left undefined is how

I select the edges. Which edge do I select to pass messages

over? Here there's the obvious

simple thing, which is just to go in some pre-specified order;

that option is round-robin passing.

But it turns out that there are better strategies than that, and we'll talk about

that too. So, is this algorithm any good?

Well, it turns out that the answer is yes and no.

First of all, if you pass messages over a graph like the one that I showed before,

the little A, B, C, D loop, eventually you see that we achieve

convergence: we do converge to some

probability, but that probability is a little bit off.

18:29

That is, the answer is not exact. So, in general, with exceptions that we

will talk about, this is an approximate algorithm,

which shows that there's no free lunch: if the problem was NP-hard, it's not

like we have an easy way of solving it. So given all these problems with the

belief propagation algorithm, what makes us think that it's an effective algorithm

to use? When the algorithm was rediscovered in

the 1990's, Murphy, Weiss and Jordan, as well as others, looked at its performance

in the context of the real networks. And specifically networks that have a lot

of loops, which is what causes the belief propagation algorithm to misbehave.

And so here is an example network; it's called the pyramid network, a

network that is analogous to one that arises in image analysis.

And what they showed is this: on the

X axis are the exact marginals, for different marginals in the network,

and on the Y axis are the marginals computed by loopy belief propagation.

We see that the marginals, by and large, sit almost exactly on a straight line,

with few exceptions, showing that loopy belief propagation

is very close to accurate on this network.

Here's another network. This is a simplified version of a medical

diagnosis network. And once again, when

we compare the correct marginals on the X axis to the belief propagation marginals

on the Y axis, the marginals sit almost exactly on the straight line,

which shows that, again, they're very close.

So belief propagation is very close to accurate.

So, to summarize: the belief propagation algorithm passes messages over a graph of

clusters that are connected to each other via sepsets.

Adjacent clusters pass information to each other in these messages,

transmitting information only about the variables in the sepset, which are the

ones that they have in common. The message that cluster i sends to

cluster j summarizes everything that i knows about the variables in the

sepset, except for information that it obtains

from j, so that one avoids direct double counting of the same piece of evidence,

where j gets its own information back again via i.

We've seen that the algorithm may not

converge, and that it can have oscillatory

behavior, and that the resulting beliefs are

pseudo-marginals, in that they are not necessarily the exact marginals from a

theoretical perspective. But nevertheless, as we've seen, the

algorithm actually performs quite well in a range of practical applications, which

is why it's quite commonly used.