0:00

We previously defined the MAP inference problem,

which is that of finding the highest probability assignment in a graphical

model. And we talked about particular classes of

models for which the MAP problem was tractable.

One was the class of models that had sufficiently low tree width, so that one

could use techniques such as the variable elimination or clique tree methods.

And then we also talked about a variety of other special purpose classes where

one could do tractable inference. What we're going to talk about now is a

general purpose algorithm that can be used to solve any MAP problem.

Keeping in mind, of course, that the MAP problem is an NP-hard problem, and so

you're not going to get a fully polynomial time algorithm that one

can utilize in the context of any problem.

So, the class of methods that we're going to talk about is called dual

decomposition. And, it's derived from the increasingly

prevalent view of the map inference problem as an optimization problem and

building on techniques from the field of optimization theory.

So, let's go ahead and first reformulate the problem in a way that makes it

amenable to this kind of analysis and this is just a convenience.

So, here we're going to assume that the MAP problem that we are trying to

optimize, remember, we reformulated this as a max-sum as opposed to a max-product,

which is a convenient reformulation. And we're going to assume that the theta

that we are trying to optimize is composed of the sum of two different

kinds of factors. The first is singleton factors which are

over the scope of a single variable Xi and the second are larger factors that

are over the scope of multiple variables. And, clearly, one can fold in the

singleton factors into some larger factor that contains a variable in its scope,

but it turns out to be convenient to keep them in the model for reasons that will

become clear in a minute. So now, our goal, for the moment, is just to find

the value. We're going to talk about how to find an

actual assignment in a little bit. But imagine that our goal is to just figure

out what is the value of the highest probability assignment X and so we're

trying to find the X that maximizes this total summation.
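
In symbols, writing x_i for the assignment to the single variable Xi and x_F for the assignment to the scope of a larger factor F, one standard way of writing the objective just described is:

```latex
\theta(x) \;=\; \sum_{i} \theta_i(x_i) \;+\; \sum_{F} \theta_F(x_F),
\qquad
\text{MAP value} \;=\; \max_{x}\, \theta(x).
```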

2:29

But one thing that we could do, which would be wrong, is to

say, well, let's forget about the fact that we're trying to do a single joint

assignment; rather, we're going to think about each

problem with blinders on. So we're going to think about each theta

i, Xi by itself and we're going to find the Xi that optimizes that factor and

we're going to do the same for the larger factors.

Now, that of course, is going to be a very poor solution, because the Xi that's

going to optimize one theta i is going to be completely inconsistent with the Xi

that is selected by one of the larger factors that contains Xi.

So you're going to get a completely incoherent joint assignment. It's not

going to be a joint assignment, it's going to be a bunch of different

assignments that don't agree with each other.

So, what dual decomposition does is it's going to try and do the same kind of

divide and conquer, but it's going to do it in a way that tries to force these

local decision making problems to agree with each other.
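
To make this failure mode concrete, here is a minimal sketch in Python, with made-up numbers that are not from the lecture, of optimizing each factor with blinders on:

```python
import itertools

# A tiny max-sum MAP problem over two binary variables (made-up numbers).
theta_1 = {0: 0.0, 1: 2.0}            # singleton factor over X1: prefers X1 = 1
theta_2 = {0: 0.0, 1: 2.0}            # singleton factor over X2: prefers X2 = 1
theta_F = {(0, 0): 5.0, (0, 1): 0.0,  # pairwise factor: strongly prefers
           (1, 0): 0.0, (1, 1): 0.0}  # X1 = X2 = 0

# "Blinders on": optimize every factor separately.
x1_local = max(theta_1, key=theta_1.get)   # 1
x2_local = max(theta_2, key=theta_2.get)   # 1
xF_local = max(theta_F, key=theta_F.get)   # (0, 0)

# The pairwise factor's choice disagrees with both singleton choices, so these
# local optima do not stitch together into a coherent joint assignment.
print(x1_local, x2_local, xF_local)        # 1 1 (0, 0)

# The true joint MAP must use a single assignment for every factor:
joint = max(itertools.product([0, 1], repeat=2),
            key=lambda a: theta_1[a[0]] + theta_2[a[1]] + theta_F[a])
print(joint)                               # (0, 0), with total score 5.0
```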

7:00

And it's only a function of lambda, because it's no longer a function of X, because

we've maximized over the X. So you're going to get a different value

of this function depending on the choices of the penalty parameters.

This function is an upper bound on that theta for any value of lambda.

Now let's understand why that is. We've seen in this line over here

that this is actually equal to that.

And we, we've shown that everything cancels here, and what I've done here is,

over here I'm forcing the single X to maximize this entire expression together,

and here, I'm letting each term be optimized separately.

And that gives me more degrees of freedom,

because here, I had to pick the same X across the board, and here, this one gets

to pick one X and this one gets to pick another value of X.

And so, since the assignment where they all agree is one of the assignments

that can be considered in this optimization problem over here, the decoupled

optimization can only achieve a higher value, which is why we get an upper bound.
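
The underlying fact is simply that a maximum of a sum is at most the sum of the separate maximums; since the lambda penalty terms cancel when the slave objectives are summed back together, this yields an upper bound for every value of lambda:

```latex
\max_{x}\,\bigl[f(x) + g(x)\bigr] \;\le\; \max_{x} f(x) \;+\; \max_{x'} g(x'),
```

because the right-hand side is free to pick two different maximizers, while the left-hand side must use a single x for both terms.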

8:39

So, introducing some notation which we'll use in the

rest of this presentation. We're going to call the function that the

i slave is optimizing theta i lambda and the function that is being optimized by

the F slave theta F lambda. So let's take an example to make this a

little bit more concrete, because this was a little bit abstract.
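
Written out under one common sign convention (the particular signs here are an assumption; all that matters is that the lambda terms cancel when all the slave objectives are summed), the slave objectives are:

```latex
\theta_i^{\lambda}(x_i) \;=\; \theta_i(x_i) + \sum_{F \,:\, i \in F} \lambda_{F i}(x_i),
\qquad
\theta_F^{\lambda}(x_F) \;=\; \theta_F(x_F) - \sum_{i \in F} \lambda_{F i}(x_i).
```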

So let's go back to our example of a four-way loop, where we have X1, X2,

X3, and X4. And we're going to have these four

pairwise potentials, which we're going to denote with letters this time to

disambiguate indices a little bit. So we have theta F(X1, X2), theta G(X2,

X3), and theta H and theta K. So now, how is this decomposition going

to work here? Let's assume for the moment that we're

going to decompose this over edges. So our factors are going to be, for

example, the edge X1, X2. So now, what is the optimization problem

that this factor chooses? Well, it has its own potential,

theta F(X1, X2), and then it's going to have these little

costs that are going to try and get it to agree with the one slave on X1

and with the two slave on X2. We're similarly going to have a slave for X1, X4.

This is the K, so this is the F slave, this is the K slave and it's going to

have its own potential, theta K, and two penalty terms that are going to

try to make it agree with the one slave on X1 and with the four slave

on X4. And we similarly have exactly the same

form for the other two slaves, the G slave and the H slave.

In addition, we have slaves for the individual variables, the one, two,

three, four slaves. So here is the one slave and notice that the one slave has

its own potential theta one X1 and at the same time it has these two terms that are

trying to make it agree on the one side with F, which is over here, so this is

because of this and on the other side with K because of that.
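
For this four-way loop, under the same assumed sign convention, the F slave and the one slave would read:

```latex
\theta_F^{\lambda}(x_1, x_2) \;=\; \theta_F(x_1, x_2) - \lambda_{F 1}(x_1) - \lambda_{F 2}(x_2),
\qquad
\theta_1^{\lambda}(x_1) \;=\; \theta_1(x_1) + \lambda_{F 1}(x_1) + \lambda_{K 1}(x_1).
```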

11:48

Now I've done this in the context of a very simple scenario, where I've broken

up my model into the factors that existed in the original specification.

So, for example, I had factors that were pairwise potential, so I had a slave for

each such factor. But, that doesn't have to be the case.

In fact, the only constraint that I need to satisfy is that each slave

must be defined as a subset of factors that admits a tractable

solution, so that each one can be optimized

efficiently when considered in isolation. So for example, if we return

to this network, one thing we could do is introduce two larger slaves

instead of the four smaller F, G, K, H slaves: one corresponding to F

and G together, so this is the F, G slave, and the other

one representing the K, H slave. And in this case, if we think about the

agreement, we still have the one, two,

individual variables. And, so now, we're going to have to

require that the F, G slave agrees with the one slave on X1, with the two slave on X2,

and with the three slave on X3. And so we have this expression, which has

a lambda term for each of these three variables.

And we have a similar expression for the K, H slave, where in

this case, the K, H slave has to agree with the one slave on X1, the four slave

on X4, and the three slave on X3. If we now think about the singleton

slaves, for example X1, we see that X1 has to agree with the F, G slave so there

is a term to encourage that and with the K, H slave,

because it's present in both. On the other hand, the X2 variable only

occurs in the first slave, and so it only has a single agreement term, with the

F, G slave, and similarly for X3 and X4. So, how do we then construct the slaves?

In pairwise networks, what we typically do is we divide the factors into a set of

disjoint trees, and this is in fact, what I showed in the previous slide. We

divided it into one tree that went through the first two edges and the

second one that went through the second two edges.
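
As a sketch of why tree-structured slaves are tractable, here is a chain of pairwise factors, with made-up scores, optimized exactly by max-sum variable elimination with backpointers:

```python
# Max-sum variable elimination on a chain slave X1 - X2 - X3 (made-up scores).
# pairwise[t][a][b] is the score of the t-th edge factor at (a, b); binary variables.
pairwise = [
    [[2.0, 0.0], [0.0, 1.0]],   # theta_F(X1, X2)
    [[0.0, 3.0], [1.0, 0.0]],   # theta_G(X2, X3)
]

def chain_map(pairwise):
    """Exact MAP of a chain of pairwise factors via max-sum DP with backpointers."""
    msg = [0.0, 0.0]    # best score of any prefix ending with the variable in each state
    back = []           # backpointers for decoding the argmax
    for table in pairwise:
        new_msg, ptr = [], []
        for b in range(2):
            scores = [msg[a] + table[a][b] for a in range(2)]
            best_a = max(range(2), key=scores.__getitem__)
            new_msg.append(scores[best_a])
            ptr.append(best_a)
        msg, back = new_msg, back + [ptr]
    # Decode: pick the best final state, then walk the backpointers in reverse.
    x = [max(range(2), key=msg.__getitem__)]
    for ptr in reversed(back):
        x.append(ptr[x[-1]])
    x.reverse()
    return max(msg), x

value, assignment = chain_map(pairwise)
print(value, assignment)   # 5.0 [0, 0, 1]
```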

And these trees are disjoint, which means every edge factor

is assigned to exactly one tree. And because they're trees, we know how to

optimize them efficiently using techniques like variable elimination or

clique tree methods. But we also discussed that there are other

classes of factors that offer tractable inference.

So, for example, we talked about matchings or models that are associative,

or regular, or supermodular, or whatever we call them.

And all of these are tractable classes, and we can take a whole bunch of factors

that fit into one of these

tractable categories and optimize them together in one fell swoop.

We don't need to divide into separate edges.

So let's look at an example of this. Remember, a while ago, we talked about

the problem of 3D cell reconstruction, where we have a cell over here that

we're imaging, and we're taking various two-dimensional

slices through that cell, and we'd like to reconstruct its three-dimensional

structure. So here, as we discussed, we have the

slices and we would like to correspond the beads that we see, these little gold

beads that we see in these different images to each other,

so trying to figure out that this bead corresponds to this bead over there.

And from that, we can then subsequently obtain the 3D reconstruction.

Now, when we last talked about this, we described this as a matching problem

where the matching weights between a point here and a point here are derived

from the similarity of both the location in the 2D slice and the appearance of the

local neighborhood. But it turns out that neither of these is

sufficiently distinctive in order to really make that determination robustly,

that one point in an image matches another point in the other image.

And so, in order to do this with a reasonable success, we need to have

another set of potentials, which are pairwise potentials that basically

preserve the distances between these marker positions.

So, they tell us that if we want to correspond this point over here to this

point, and this point to that point,

then the distances between them should also roughly be preserved.

Now, this set of potentials is relatively sparse, and therefore can usually be

solved using exact inference. That set of potentials is not complicated

on its own, because it's a matching problem.

But if you put them together, it turns out to be a difficult problem to solve.

In fact, you can show that a joint set of potentials like this, with these

two different categories, is NP-hard, but using dual

decomposition, we can solve each of these separately and then communicate

information between them, so each of these would form a tractable

slave.
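
As a sketch of why the matching slave is tractable on its own, a maximum-weight bipartite matching over hypothetical bead-similarity scores (the numbers below are made up) can be solved in polynomial time, here using SciPy's linear-sum-assignment solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# scores[i][j] is a made-up similarity between bead i in one slice and bead j
# in the other; the matching slave picks the correspondence maximizing the sum.
scores = np.array([
    [9.0, 1.0, 2.0],
    [2.0, 8.0, 1.0],
    [1.0, 2.0, 7.0],
])
rows, cols = linear_sum_assignment(scores, maximize=True)
matching = list(zip(rows.tolist(), cols.tolist()))
print(matching, scores[rows, cols].sum())   # [(0, 0), (1, 1), (2, 2)] 24.0
```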