0:01

We defined the belief propagation algorithm as a message passing algorithm over a fairly general data structure called a cluster graph, and we showed that if the cluster graph has certain properties, then so does the belief propagation algorithm. What we're going to do now is look at a special case of the belief propagation algorithm, where we pass messages not over a general cluster graph, but rather over a data structure called a clique tree, which, as we'll show, is a special case of a cluster graph. Interestingly, when we pass messages over a clique tree, we can show that the belief propagation algorithm has much stronger performance guarantees than it does over a general graph. Specifically, we can show that not only do we get convergence much faster, we also get convergence to the exact answers for probabilistic inference.

So let's consider the case of passing messages over a tree, and let's start with the simplest kind of tree, which is just a chain. In this case we have a chain A, B, C, D, E, and we're going to assume this is an undirected MRF or CRF. For simplicity, and without loss of generality, we're only going to consider pairwise potentials. So, for example, in the cluster A, B, we're going to have the pairwise potential over A and B, and so on. So here we have psi one, psi two, psi three, and psi four.

Now let's think about the message passing process, and we're going to start passing messages from the left. Cluster one sends a message to cluster two, and since there are no incoming messages at this point, the message that gets passed is simply the sum over A of psi one of A, B. That's a message whose scope is B.

Now we're going to have cluster two take that message and pass a message to cluster three. So now it's taking its own potential, psi two, and multiplying this message into it. The result is a factor over B and C, and we're going to sum out over B, because we only want to restrict to the scope C, which is the shared scope between cluster two and cluster three. And so delta two, three is this message over here.

Now we're going to pass messages one more time, all the way down to the end. So now, again, we're taking delta two, three, multiplying it with psi three, which comes from there, and that gives us a message over scope D, which gets sent to cluster four.
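This forward pass is easy to write down concretely. Here is a minimal sketch, assuming binary variables and randomly chosen pairwise potentials (the psi and delta names follow the lecture; the numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# psi[0] = psi_1(A, B), ..., psi[3] = psi_4(D, E); each array's axis order
# matches the variable order in the cluster's scope.
psi = [rng.random((2, 2)) for _ in range(4)]

# delta_{1,2}(B) = sum_A psi_1(A, B): no incoming messages, just sum out A.
delta_12 = psi[0].sum(axis=0)

# delta_{2,3}(C) = sum_B psi_2(B, C) * delta_{1,2}(B): multiply the incoming
# message onto the B axis, then sum B out to restrict to the shared scope C.
delta_23 = (psi[1] * delta_12[:, None]).sum(axis=0)

# delta_{3,4}(D) = sum_C psi_3(C, D) * delta_{2,3}(C).
delta_34 = (psi[2] * delta_23[:, None]).sum(axis=0)

print(delta_34.shape)  # → (2,): a message whose scope is the single variable D
```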

2:54

We could also pass messages in the other direction. This, for example, is what would happen if we passed messages from cluster four to cluster three. Regardless of when that message gets passed, there are no other messages that cluster four would multiply in when sending a message to cluster three. Whether it got the message from cluster three or not, it's not going to multiply that in, because, remember, you only send a cluster a message that doesn't involve messages that you received from it. And so, in this case, the message that you get is the sum over E of the factor psi four, which we get from here.

Passing messages in the other direction again, we see that the message that cluster three passes to cluster two, which is delta three, two, involves psi three and this message over here. And finally, exactly the same idea.

3:54

These are the messages that get passed. Okay. The one important property that you have when you run message passing in the way that we just showed is that, somewhat surprisingly relative to what we've seen before, the resulting beliefs that you get are correct.

Let's convince ourselves of that relative to cluster three. In the order that we defined, cluster one passed delta one, two to cluster two, which was then used to define delta two, three; at the same time we had delta four, three. And so the belief that we get here is the product of psi three and delta two, three and delta four, three. Let's unpack each of those messages. Delta two, three, as we remember, was defined here, and so we're going to rewrite delta two, three as the summation over B of psi two times delta one, two. And similarly, we unpack delta four, three using this expression that we have over here, right there.

5:08

Now we're going to do one further level of unpacking, and we're going to remind ourselves of what delta one, two is, which is as defined over here. And so that gives us this expression over there.

Now look at what we have. We have a product of all four initial factors in the network: psi one, psi two, psi three, and psi four. And we have a summation over all the variables that are not in cluster three, which are A, B, and E. C and D are in that cluster, so we sum out over everything else. So what do we see? We have a product of factors, marginalized out.

6:12

So this is really reminiscent of variable elimination. If you recall, when we talked about the correctness of variable elimination, we said that variable elimination is correct as long as, when you sum out a variable, you first multiply in all factors that involve that variable in their scope.

So let's convince ourselves that that is in fact what's going on here. Let's look first at the variable E. When we sum out E, we're only involving psi four, but that's okay, because only cluster four has E in its scope. Here we're summing out A, and once again only psi one involves A in its scope. Here we're summing out B, and really there should be parentheses here; they were there implicitly before. And so you can see that when I'm summing out B, I've multiplied in psi two, which involves B in its scope, as well as what was left over from summing out psi one, which was the other factor that has B in its scope.

And so you can see that I've marginalized out the variables in a legal order, multiplying in all of the expressions that I needed to multiply before I did the summation. So this is a legal order of operations, and therefore it gives me a correct result at the end.

7:49

So now let's expand that into a somewhat broader case. We're now going to define this notion called the clique tree. A clique tree is an undirected tree, kind of like a cluster graph was an undirected graph; in this case, it's a tree. Once again, the nodes are clusters, and once again we have edges between the clusters, which are annotated with sepsets. Whereas before the sepset needed to be only a subset of Ci intersection Cj (it could be equal or it could be smaller), here we're going to require that the sepset is actually the intersection of the two adjacent clusters.
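As a quick illustration of that definition, using the chain example from before as a hypothetical clique tree:

```python
# In a clique tree, the sepset on each edge must be exactly the intersection
# of the two adjacent cliques; in a general cluster graph it only needed to
# be a subset of that intersection.
cliques = {1: {'A', 'B'}, 2: {'B', 'C'}, 3: {'C', 'D'}, 4: {'D', 'E'}}
edges = [(1, 2), (2, 3), (3, 4)]

sepsets = {(i, j): cliques[i] & cliques[j] for i, j in edges}
print(sepsets[(2, 3)])  # → {'C'}, the intersection of {B, C} and {C, D}
```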

8:39

So now let's go and understand what properties the clique tree that we showed has, as a special case of a cluster graph. For a cluster graph we had the notion of family preservation, and the reason we had family preservation is that if we have a set of factors phi, we need to take each phi k in phi and assign it to some cluster, so that each piece of information we have is represented somewhere. And the factor needs to be assigned to a cluster that can accommodate its scope, so that the scope of the factor is a subset of the set of variables in the cluster. The same requirement holds here, because a clique tree is a special case of a cluster graph: for every factor phi k, there must exist at least one cluster whose scope is big enough to accommodate that factor.

The second property that we had in cluster graphs was the running intersection property, and let me clarify that the property that we see here is the cluster graph variant; we're going to make it specific to clique trees in just a minute. The cluster graph variant told us before that for each pair of clusters Ci and Cj and a variable X that is present in both of them, there needs to exist a path between Ci and Cj for which all clusters and sepsets contain X, and that path needs to be unique.

So what does that mean in the context of clique trees? Let's assume that we have C7 containing X and C3 containing X. The running intersection property says there needs to exist a path between them through which all clusters and sepsets contain X. Well, there's only one path in a tree between C7 and C3, it's this one. And so what that property now tells me is that X has to be here, here, here, and here, because otherwise the running intersection property is going to be violated.

So it means that in the context of trees, the running intersection property can actually be written much more simply. It can be read as saying that for each pair of clusters Ci and Cj and a variable X in their intersection, each cluster and sepset on the unique path between them must contain X. This is just the instantiation of the running intersection property to the context of trees.
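The tree version of the property can be checked mechanically. A sketch (cliques as sets, edges as pairs; a hypothetical representation, not from the lecture) that verifies, for every variable, that the cliques containing it form a connected subtree, which by uniqueness of tree paths is equivalent to the statement above:

```python
from collections import defaultdict, deque

def satisfies_rip(cliques, edges):
    """Running intersection property on a clique tree: for every variable X,
    the cliques containing X must induce a connected subtree. Since the path
    between any two cliques in a tree is unique, this is equivalent to every
    clique (and hence every sepset) on that path containing X."""
    adj = defaultdict(list)
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    for x in set().union(*cliques.values()):
        holders = {c for c, scope in cliques.items() if x in scope}
        start = next(iter(holders))
        seen, queue = {start}, deque([start])
        while queue:  # BFS restricted to cliques that contain x
            for n in adj[queue.popleft()]:
                if n in holders and n not in seen:
                    seen.add(n)
                    queue.append(n)
        if seen != holders:
            return False
    return True

# The chain clique tree from earlier satisfies the property:
cliques = {1: {'A', 'B'}, 2: {'B', 'C'}, 3: {'C', 'D'}, 4: {'D', 'E'}}
print(satisfies_rip(cliques, [(1, 2), (2, 3), (3, 4)]))  # → True
```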

11:25

So now let's look at a more complicated clique tree. This is the clique tree that corresponds to the network that we previously used to demonstrate variable elimination; this is the elaboration of our student network. Unlike the previous example, this is actually a clique tree for a Bayesian network as opposed to a Markov network, but as you'll see, there's really no difference.

So here we have a set of factors; in this case, the factors are CPDs. And because of the family preservation property, we can be assured that each factor must be assignable to some clique. And so here we have P of C and P of D given C assigned to this clique, P of G given I, D here, P of I and P of S given I here, and so on and so forth. So each of the CPDs we have is assigned to one and exactly one of the cliques in this clique tree.
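One simple way to produce such an assignment is to give each factor to the first clique whose scope contains the factor's scope. A sketch, assuming the clique scopes shown on the slide ({C,D}, {G,I,D}, {G,S,I}, {G,J,S,L}, {H,G,J}); the greedy "first clique" policy is just one illustrative choice among the valid ones:

```python
# Clique scopes, assuming the clique tree shown on the slide.
cliques = {1: {'C', 'D'}, 2: {'G', 'I', 'D'}, 3: {'G', 'S', 'I'},
           4: {'G', 'J', 'S', 'L'}, 5: {'H', 'G', 'J'}}

# Scopes of the student-network CPDs.
factor_scopes = {
    'P(C)': {'C'}, 'P(D|C)': {'D', 'C'}, 'P(I)': {'I'},
    'P(G|I,D)': {'G', 'I', 'D'}, 'P(S|I)': {'S', 'I'},
    'P(L|G)': {'L', 'G'}, 'P(J|L,S)': {'J', 'L', 'S'},
    'P(H|G,J)': {'H', 'G', 'J'},
}

# Family preservation guarantees each factor fits in at least one clique;
# assign each one to the first clique that accommodates its scope.
assignment = {name: next(c for c, cs in sorted(cliques.items()) if scope <= cs)
              for name, scope in factor_scopes.items()}
print(assignment['P(G|I,D)'])  # → 2
```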

And we can convince ourselves that this tree satisfies, for example, the running intersection property. So let's look, for example, at the variable G. The variable G exists here, it exists here, and sure enough it exists here, here, here, and here. And you can similarly convince yourselves that the running intersection property holds for any of the other variables in this clique tree.

Let's look at what message passing looks like for this more elaborate clique tree. It's actually only a slight bit more complicated. So here's delta two, three, for example, which is the message passed between cluster two and cluster three. It takes psi two, which is the product of the factors assigned to this clique (there's actually only one in this case, so it's only P of G given I and D), multiplies it with the incoming message delta one, two, and sums out over D. The only slight difference between this and the previous example is that here, for example, when cluster four passes a message to cluster three, we can see that over here there are two variables, J and L, both of which must be summed out in order to produce the message over the requisite sepset.
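That two-variable case looks like this in the same array sketch (axes ordered G, J, S, L; the random potential and the incoming message over {G, J} are hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

# Cluster 4 has scope {G, J, S, L}; its sepset with cluster 3 is {G, S},
# so both J and L must be summed out. Axis order here is (G, J, S, L).
psi4 = rng.random((2, 2, 2, 2))
delta_54 = rng.random((2, 2))  # an incoming message over (G, J), assumed given

# Multiply the incoming message onto the (G, J) axes, then sum out J and L.
delta_43 = (psi4 * delta_54[:, :, None, None]).sum(axis=(1, 3))
print(delta_43.shape)  # → (2, 2): a factor over the sepset (G, S)
```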

14:01

Now, previously, we defined the running intersection property. One of the really nice aspects of the running intersection property is that it, by itself, together with family preservation, suffices to prove the correctness of message passing in a clique tree. And the reason for that is the following. Let's consider a variable X over here. And what we want to prove

14:41

is that if we eliminate X at some point in the tree, at some point in the message passing process, then we have multiplied in all factors that involve X. So here we have, for example, X here and here, and by the running intersection property, we have that X is here and here as well, and on this edge. Now let's imagine that X is eliminated on this edge. That is, the message that's passed from C3 to C5 doesn't involve X, and so X must be summed out at this point.

16:03

And the reason we know that if X is in C5 then X is in S3,5 is because of the running intersection property. So the running intersection property relates the structure of the graph to the time at which variables are eliminated.

So now let's convince ourselves of the correctness in this slightly more complicated example, which, as we've seen, satisfies the running intersection property. Here, once again, we have psi one, which is the product of these two factors; psi two, which is this one; psi three, which is the product of these two factors; psi four; psi five.

16:57

And once again, let's, for example, consider what we get at cluster three. We can see that we have multiplied in all of the potentials: psi one, psi two, psi three, psi four, and psi five. And we have eliminated all of the variables except for G, S, I. So here we've eliminated C, D, J, L, and H, leaving us only with the variables G, S, I, which we want to keep. And we can similarly convince ourselves, as we did before, that when we eliminate a variable, we've multiplied in all of the factors that involve that variable.
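This whole argument can be replayed numerically. A hedged sketch, assuming the clique scopes from the slide and using random binary factors in place of the real CPDs: the belief at cluster three computed from messages matches the product of all five potentials with C, D, J, L, and H summed out.

```python
import numpy as np

rng = np.random.default_rng(3)

# Random stand-ins for the clique potentials, with hypothetical axis orders:
# psi1(C,D), psi2(G,I,D), psi3(G,S,I), psi4(G,J,S,L), psi5(H,G,J).
psi1 = rng.random((2, 2))
psi2 = rng.random((2, 2, 2))
psi3 = rng.random((2, 2, 2))
psi4 = rng.random((2, 2, 2, 2))
psi5 = rng.random((2, 2, 2))

# Messages toward cluster 3 (scope {G, S, I}).
delta_12 = psi1.sum(axis=0)                          # sum out C -> scope D
delta_23 = np.einsum('gid,d->gi', psi2, delta_12)    # sum out D -> scope {G, I}
delta_54 = psi5.sum(axis=0)                          # sum out H -> scope {G, J}
delta_43 = np.einsum('gjsl,gj->gs', psi4, delta_54)  # sum out J, L -> {G, S}

# Belief at cluster 3: its own potential times both incoming messages.
beta3 = psi3 * delta_23[:, None, :] * delta_43[:, :, None]

# Brute force: multiply all five factors, sum out everything but G, S, I.
exact = np.einsum('cd,gid,gsi,gjsl,hgj->gsi', psi1, psi2, psi3, psi4, psi5)
print(np.allclose(beta3, exact))  # → True
```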