0:00

A lot of our thinking so far has gone into constructing algorithms for the problem of computing marginals in a graphical model. But as we previously discussed, a very different type of inference problem, one that has many applications in itself, is that of finding a single coherent joint assignment that has the highest probability. And we showed already that this doesn't reduce to the marginals problem: you can't just compute the marginals, pick the highest-probability value for each variable, and expect that to give you the MAP assignment. That's not what happens. And so we need a different set of algorithms for solving the MAP problem. We're going to talk now about the first of these, which effectively follows the exact same lines that we did for exact inference for computing marginals; only it turns out that one can repurpose those methods, with a small change, to compute the MAP assignment.

Now, the first operation that we need in order to get this to work is to take products and turn them into summations. And we're going to do that by observing that our distribution P of X is proportional, as we know, to a product of factors phi k. And if we're trying to find the argmax of that particular product, it can also be reformulated as the argmax of a summation, where each of these theta k's is the log of the potential phi k, which is basically the same as taking the table that represents the factor and converting each entry into its logarithm. That gives you something that, rather than being a product of factors, is a sum of these log-factors. And there are really good reasons for doing that: first, because summations are easier to handle than products; but also because, if you're multiplying a whole bunch of very small numbers, you're going to get numerical underflow. Converting this into summations gives a much more numerically robust algorithm in terms of a practical implementation.
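As a sanity check, here is a minimal numpy sketch of that log transform, using made-up factor tables over a pair of binary variables: multiplying many small entries risks underflow, but the argmax of the product equals the argmax of the sum of logs.

```python
import numpy as np

# Two hypothetical factors over the same scope (A, B), with small,
# strictly positive entries; the numbers here are made up.
phi_1 = np.array([[0.001, 0.02],
                  [0.5,   0.3]])
phi_2 = np.array([[0.004, 0.1],
                  [0.05,  0.2]])

# theta_k = log(phi_k): the table with each entry replaced by its log.
theta_1 = np.log(phi_1)
theta_2 = np.log(phi_2)

# The product of factors and the sum of log-factors have the same
# argmax, because log is monotonic; the sum also avoids underflow.
prod = phi_1 * phi_2
sum_of_logs = theta_1 + theta_2
assert np.argmax(prod) == np.argmax(sum_of_logs)
```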

And so now we're going to consider the problem that we just wrote here, which is the argmax of an expression that we're going to call theta of X1 up to XN, defined as the sum of factors over smaller scopes, which are these theta k's. And this is the example that I meant to show a moment ago: it takes a factor over scope A and B and converts it to its log-factor by taking log base two. And you'll notice that the numbers were carefully selected to give me nice outputs, but that doesn't have to be the case.

3:06

So now that we've reduced the problem to one where we're taking the max over a summation, we're going to go back to our simple example of a chain, and think about how we can do max-sum variable elimination in a chain. And this is going to look almost identical to the sum-product algorithm that we had before. So let's understand what we're trying to do here. Assume for the moment that we forget the argmax, and we just want to find the max of this expression, theta of A, B, C, D, and E. Just like in the sum-product case, we can break up the max into a max over A, and then a max over B, a max over C, and a max over D. And while we might not be quite as used to the kind of operation that I'm about to show, it's equally valid to point out that, because none of these other factors, theta 2, theta 3, or theta 4, depends on A, we can push the max over A inward, onto theta 1 of A, B alone, and add the other factors in after computing the max. And so that's going to give me a factor here which looks like max over A of theta 1 of A, B, and I'm going to call it lambda 1 of B, because notice that this is not a constant: it's a function that depends on B. For different values of B, you're going to have different A's that maximize, and different values of the max expression. And so this lambda 1 of B is simply the max over A of theta 1 of A, B. That process has effectively now eliminated A from this expression, and given me a maximization problem over one fewer variable: only B, C, D, and E.

Just like in the context of sum-product algorithms, we can view all of these operations as operations over factors, rather than just thinking about them in terms of expressions. So whereas before we defined things like factor product and factor marginalization, we can now define analogous operations that correspond to factor summation and factor maximization. Factor summation, a very obvious operation, does exactly what we did here: just as in the case of factor product, if we want to define the row for a1, b1, c1, we're going to add up the entry 3 from a1, b1 and the entry 4 from b1, c1, and that's going to give me 7. And I can similarly define all of the other entries in the factor summation, which we're showing here on the right.
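A small numpy sketch of factor summation: each factor is broadcast over the union scope and the aligned entries are added. The tables are hypothetical, except that theta1[a1, b1] = 3 and theta2[b1, c1] = 4, so the summed entry for a1, b1, c1 comes out to 7 as in the example above.

```python
import numpy as np

# Hypothetical log-factors: theta1 over scope (A, B), theta2 over (B, C).
theta1 = np.array([[3.0, -1.0],    # rows: a1, a2; cols: b1, b2
                   [0.5,  1.0]])
theta2 = np.array([[4.0,  2.0],    # rows: b1, b2; cols: c1, c2
                   [-1.0, 2.0]])

# Factor summation: broadcast both tables over the union scope (A, B, C)
# and add the entries that agree on the shared variable B.
psi = theta1[:, :, None] + theta2[None, :, :]
print(psi[0, 0, 0])   # entry for (a1, b1, c1): 3 + 4 = 7.0
```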

Factor maximization is the analog of sum-marginalizing a factor: it's an operation that max-marginalizes, and it's called max-marginalization. What it does is this: say I'm trying to get rid of B. I have these two rows here, for example a1, b1, c1 and a1, b2, c1, which differ only in B, and my new entry here for a1, c1 is going to be the max of those two entries, which in this case is 7. And similarly for, say, a1, c2, I'm going to end up with the max of 4.5 and 2, which is 4.5. So this is another form of marginalization, where I remove the variable B, but using an operation other than summation.

So now that we've defined these two operations, we can go back and define max-sum variable elimination in chains as a set of factor operations, where I eliminate A, then B, and so on and so forth. For example, having eliminated A as in the previous step, I can now do the exact same operation by noticing that the factors that depend on B are theta 2 of B, C and lambda 1 of B; the others I can move outside of the maximization. And that's going to give me a max over B of a factor that is the sum of two factors: theta 2 of B, C, and the lambda 1 of B that I got from the previous elimination step. That gives me a new factor, lambda 2 of C, and the process continues in exactly the same way as variable elimination did in the case of sum-product. And that's basically the algorithm.
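The chain algorithm can be sketched in a few lines. This is an illustrative implementation with random 2x2 log-factors over consecutive pairs (not the lecture's numbers), plus a brute-force check that elimination finds the true maximum.

```python
import numpy as np

# Chain A - B - C - D - E, with a log-factor theta_i over each
# consecutive pair; random binary tables stand in for real factors.
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(2, 2)) for _ in range(4)]

# Max-sum variable elimination, left to right: maximize out the
# leftmost variable, then add the next factor, giving lambda_1,
# lambda_2, and so on.
lam = thetas[0]                               # theta_1(A, B)
for theta in thetas[1:]:
    lam = lam.max(axis=0)[:, None] + theta    # eliminate one variable
lam_final = lam.max(axis=0)                   # max-marginal over E

# Brute force over all 2^5 joint assignments for comparison.
brute = max(
    sum(thetas[i][a[i], a[i + 1]] for i in range(4))
    for a in np.ndindex(2, 2, 2, 2, 2)
)
assert np.isclose(lam_final.max(), brute)
```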

So now let's think about what we get at the end of this execution for the final factor. Having gone through lambda 2 and lambda 3, we now get lambda 4 of E. What is lambda 4 for a given value little e?

9:01

It's the maximum of theta over all assignments in which we mandate that E equals little e. And so this is a factor: it gives me back, for each value little e, the score, effectively, of the best assignment consistent with that value. This is called a max-marginal. In the same way that we took variable elimination for sum-product and used it to define a clique tree algorithm, we can do exactly the same thing for max-product, or rather max-sum, using the exact same type of data structure. So here we're going to use the exact same clique tree: we have cliques over AB, BC, CD, and DE, and we're going to, as usual, assign the potentials to the appropriate cliques using the family preservation property. And now let's go ahead and see how messages are passed in this clique tree architecture.

So initially, the AB clique defines the message lambda 1,2, which is obtained by maximizing A out of theta 1, and the resulting message over B gets passed to clique 2. Clique 2 can take that message and combine it with its own factor theta 2; this is max-sum, remember, so it adds theta 2 to lambda 1,2, maximizes out B, and that gives me the message lambda 2,3. And the same thing happens when clique 3 passes a message to clique 4, in this case over the scope D. So it's exactly the same message passing process, except using max and sum instead of sum and product.

We can equally well pass messages in the other direction: 4 to 3 over scope D, then 3 to 2, and 2 to 1. And all of the properties that we said hold in the context of sum-product clique trees hold here as well. First, notice, for example, that once this message lambda 1,2 is sent, it never changes again: it's defined once and for all. Lambda 2,3, once it receives the message lambda 1,2, has stabilized too and will never change, and similarly for lambda 3,4. So we can do one pass from left to right to compute all of the left-to-right messages, and a similar pass from right to left to compute all of the right-to-left messages. And as soon as we do that, they've all converged.

The second thing to notice is the value of this clique over here. If we look at it, we can see that when I've finally gotten all of the messages from both sides, I have integrated in theta 1 and theta 2 coming in from the message on the left, theta 3, which is stored in the clique itself, and theta 4 coming in from the message on the right, and maximized out all of the variables A, B, and E. So what I have left is a factor over clique 3's scope, which is the max over A, B, and E of the sum of all of the factors in this network. So just to summarize: once a clique C_i receives the final message from all of its neighbors except C_j, then the message it sends to C_j is also final and will never change; and the messages from all leaves are immediately final. So what we have is an algorithm that converges after two passes and gives us correct max-marginals at every single one of the cliques.
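The two-pass schedule and the resulting beliefs can be sketched directly; the 2x2 log-factors below are random stand-ins. Each clique's belief is its own factor plus both incoming messages, and since every belief is a max-marginal, all of them attain the same maximum: the MAP score.

```python
import numpy as np

# Chain clique tree C1=(A,B), C2=(B,C), C3=(C,D), C4=(D,E); thetas[i]
# is the (random, stand-in) log-factor assigned to clique i.
rng = np.random.default_rng(1)
thetas = [rng.normal(size=(2, 2)) for _ in range(4)]
n = len(thetas)

# Forward pass: fwd[i] is the message into clique i from the left,
# a function of that clique's left variable.
fwd = [np.zeros(2)] + [None] * (n - 1)
for i in range(1, n):
    fwd[i] = (fwd[i - 1][:, None] + thetas[i - 1]).max(axis=0)

# Backward pass: bwd[i] is the message into clique i from the right.
bwd = [None] * (n - 1) + [np.zeros(2)]
for i in range(n - 2, -1, -1):
    bwd[i] = (thetas[i + 1] + bwd[i + 1][None, :]).max(axis=1)

# Belief at clique i: own factor plus all incoming messages.
beliefs = [fwd[i][:, None] + thetas[i] + bwd[i][None, :]
           for i in range(n)]

# Every belief is a max-marginal, so each attains the global MAP score.
scores = [b.max() for b in beliefs]
assert np.allclose(scores, scores[0])
```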

So let's take a simple example, just to convince ourselves that this is doing the right thing. I'm looking now at a much simpler network, where we only have A, B, and C. So there are two factors: theta 1 over A, B, and theta 2 over B, C. And first we're going to construct the overall theta, which is the sum of theta 1 and theta 2; so a factor summation went into this expression. Let's look at it and see what the MAP assignment is in this example. If we just look at the numbers that we computed, we see that the MAP assignment is a1, b1, c1, with a value of 7.

14:03

Now let's look at the message passing process that we would have on this very simple clique tree with two cliques. So theta 1 is assigned to clique 1, theta 2 is assigned to clique 2, and let's look at the two messages that are passed. The AB clique passes a message to BC, and that message is the max-marginalization over the variable A. We can see that here we have a max between 3 and negative 1, so we get 3, because this is the max over the two values of A that are consistent with b1. And for b2 we have the max over 0 and 1, so we get 1. Exactly the same process, where I'm max-marginalizing C, gives us this message that's passed from right to left. Now, each of these two cliques takes its message; notice that I've gotten immediate convergence here, because there's only one message to pass in each direction. And so what I get here is the sum of this factor plus the incoming message.
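That computation, belief equals own factor plus incoming message, can be sketched numerically. The tables below are hypothetical stand-ins rather than the slide's values, and the loop checks the property discussed next: each belief entry scores the best completion of that assignment.

```python
import numpy as np

# Hypothetical log-factors for the two-clique example over A - B - C.
theta1 = np.array([[3.0, -1.0],   # theta1(A, B)
                   [0.5,  1.0]])
theta2 = np.array([[4.0,  2.0],   # theta2(B, C)
                   [-1.0, 2.0]])

# Message into clique (A, B): max-marginalize C out of theta2.
msg = theta2.max(axis=1)

# Belief at clique (A, B): its own factor plus the incoming message.
belief1 = theta1 + msg[None, :]

# Max-marginal property: each entry is the score of the best joint
# assignment consistent with that (a, b).
for a in range(2):
    for b in range(2):
        best = max(theta1[a, b] + theta2[b, c] for c in range(2))
        assert belief1[a, b] == best
```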

15:55

And I can do exactly the same operation to get this factor on the right: I get it by summing up this with that, and I get this factor over here. And you'll notice that, miraculously, the max in each of these factors separately is a1, b1 on the left and b1, c1 on the right. So sure enough, what I got here was the most likely assignment consistent with a1, b1, and over here the most likely assignment consistent with b1, c1. And if you go back and check this big table over here, which I'm not going to do right now, you can convince yourselves that this doesn't only hold true for a1, b1. If you look, for example, at a2, b1, you'll see that what you get here, the value 3, is the value of the most likely assignment that is consistent with a2, b1. So actually, let's go back and check that. Here is a2, b1: we have two assignments consistent with a2, b1, one with value 0.5 and one with value 3. 3 is the best one, and sure enough, the value that we get here is 3, so it all works.

So what can we say about this algorithm once it converges? The important thing is that we can compute beliefs at each clique that represent exactly the max-marginals of that clique. So how do we compute these beliefs? Analogously to what we had in the context of sum-product, we look at the factor assigned to the clique, which is theta i, and we add to it the incoming messages. Before, you'll remember, we had the factor assigned to the clique multiplied by the incoming messages, because we were doing products instead of summations. And what does this belief encode? This belief is exactly the max-marginal. Specifically, we can consider, for any given assignment to the clique C_i, the score of the best possible completion of that assignment; and that is the value of the belief at that clique. It's the max over all of the variables W_i that are not in the clique.

One important consequence of the fact that we have max-marginals is that we get a calibration property here as well: the cliques must agree on the variables that they share. To understand that, let's look at the two cliques that we have in the simple two-clique example. Here are the beliefs that we had for clique 1 and for clique 2: the sum of theta 1 plus the incoming message over here, and theta 2 plus lambda 2. Now let's look at what the calibration property tells us. If we look at the implications of this clique regarding the variable B, this clique tells us that the best possible completion for b1 has a score of 7, and the best possible completion for b2 has a score of 3. And this clique, if we look at it, tells us that the best possible completion for b1 also has a score of 7, and the best possible completion for b2 has a score of 3. So these two cliques agree regarding the variable in their shared scope, which is the variable B.

So to summarize: in the context of max-sum, we can apply exactly the same clique tree algorithm used for sum-product. Messages are passed in the same way, and the clique tree is constructed in the same way. The only difference is that the message passing operations use max and sum as opposed to sum and product. Exactly as in sum-product, convergence is achieved after a single up-and-down pass, and the result is a set of beliefs that represent max-marginals at each of the cliques, where, as a reminder, the max-marginals tell us, for each assignment, the score of the best possible completion of that assignment.
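As a closing sanity check, here is a small numpy sketch of the calibration property on the two-clique example. The factor values are hypothetical, but chosen so that the shared-variable scores come out to 7 for b1 and 3 for b2, matching the numbers in the example: max-marginalizing either clique's belief onto B yields the same function.

```python
import numpy as np

# Hypothetical log-factors for cliques (A, B) and (B, C).
theta1 = np.array([[3.0, -1.0],
                   [0.5,  1.0]])
theta2 = np.array([[4.0,  2.0],
                   [-1.0, 2.0]])

msg_12 = theta1.max(axis=0)          # lambda_{1,2}(B): maximize out A
msg_21 = theta2.max(axis=1)          # lambda_{2,1}(B): maximize out C

belief1 = theta1 + msg_21[None, :]   # belief over (A, B)
belief2 = msg_12[:, None] + theta2   # belief over (B, C)

# Calibration: both cliques report the same max-marginal over B.
print(belief1.max(axis=0))   # [7. 3.]
print(belief2.max(axis=1))   # [7. 3.]
```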