
We previously developed the intuition behind the dual decomposition algorithm, which takes an optimization objective, decomposes it into a number of different slaves, and uses communication between the slaves to try to get them to agree with each other about the overall MAP assignment. So now, let's take this intuition and convert it into an actual algorithm.

What we have here is that each slave gets its own objective function. Here is the objective function for the ith slave, the slave corresponding to the variable X_i.

And here is the objective function that the slave corresponding to the larger set F is trying to optimize. Each of these initially has, as a component, its own local objective function. But in addition, we introduce for each of them a set of terms that modify its objective function so as to try to get it to agree with the other slaves. Specifically, we try to get the i slave to agree with the factors F in which i appears, and conversely, we try to get the F slave to agree with the slaves of the variables in its scope.

So, let's see how that gets done. We initialize all of these incentive terms, all of these lambdas, to be zero. Initially, each slave makes its decision based purely on its own local piece of the objective. But then, as the algorithm evolves, these incentives start to change. So, how does that happen?

At each iteration t, each slave optimizes its current objective. The i slave optimizes theta_i hat, which depends on the current set of penalties lambda, and that gives it an assignment which we call x_i star. The F slave does the exact same thing: it optimizes its own local objective, theta_F hat, using the same set of lambdas, and that gives it x_F star, which is this term over here.
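The two slave optimizations can be sketched concretely. Here is a minimal illustration assuming a plain dictionary representation (all names here, theta_i, theta_F, lam, are made up for this sketch, not from any library): each variable slave adds its lambda terms to its local scores, while each factor slave subtracts them.

```python
def optimize_slaves(theta_i, theta_F, lam):
    """Each slave maximizes its own lambda-modified local objective.

    theta_i maps variable -> {value: score}; theta_F maps a scope tuple of
    variables -> {joint assignment tuple: score}; lam[(F, i)][v] holds the
    penalty term lambda_{F,i}(v), initialized to zero.
    """
    # Variable slave i:  argmax_v  theta_i(v) + sum_{F : i in F} lambda_{F,i}(v)
    x_star = {
        i: max(scores, key=lambda v: scores[v]
               + sum(lam[(F, i)][v] for F in theta_F if i in F))
        for i, scores in theta_i.items()
    }
    # Factor slave F:  argmax_{x_F}  theta_F(x_F) - sum_{i in F} lambda_{F,i}(x_F[i])
    xF_star = {
        F: max(table, key=lambda xF: table[xF]
               - sum(lam[(F, i)][xF[k]] for k, i in enumerate(F)))
        for F, table in theta_F.items()
    }
    return x_star, xF_star
```

With all lambdas at zero, this reduces to each slave optimizing its purely local objective, exactly as in the first iteration described above.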

Now, let's see what happens if there's a disagreement between a factor slave F and a variable i in its scope. We would like to incentivize them to agree with each other. So, for every F and every variable i in F, if they disagree, that is, if the value for X_i in the assignment chosen by the F slave differs from the value chosen by the i slave, then we need to change the incentive function. We do that by changing the lambdas. Specifically, we look at the penalty for the pair F, i, which is over here.

We decrease, by an amount alpha t, which is greater than zero, the preference for x_i star, the value chosen by the i slave. Looking at that, we see that this is going to disincentivize the i slave from picking the value that it picked. Conversely, we increase the preference for the value chosen by the F slave, which is this value over here, by the same amount alpha t. Now, that changes the preference of the i slave, but it also changes the preference of the F slave by the exact same amount, in the opposite direction.

Notice that here we have a negative sign, whereas here we have a positive sign. So the update moves the i slave and the F slave toward each other, so that they come to agree. What happens as we continue these iterations over time? Well, one can show that under fairly weak conditions on the parameters alpha t, the lambdas are guaranteed to converge. What are these conditions?

The alphas need to be big enough that they make a difference: if we sum up all of the alpha t's over all t, we end up with a diverging series that sums to infinity. On the other hand, they need to be small enough that we eventually achieve convergence, and that is guaranteed by the second condition, the summation of the squared alpha t's over here being finite. Now, in fact, one can also show that not only do we get convergence, we get convergence to a unique global optimum regardless of the initialization. So, the lambdas always converge.
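The update step and one common step-size schedule can be sketched as follows, using the same illustrative dictionary representation as before (names are made up for this sketch). The schedule alpha_t = 1/(t+1) is just one choice satisfying the two summation conditions: its sum diverges while the sum of its squares is finite.

```python
def dual_decomposition_update(x_star, xF_star, lam, t):
    """One round of lambda updates after the slaves have optimized.

    On each disagreement between factor slave F and variable slave i, we
    shift alpha_t of preference mass away from the variable slave's choice
    and toward the factor slave's choice, as described above.
    """
    alpha_t = 1.0 / (t + 1)  # sum_t alpha_t = inf, sum_t alpha_t^2 < inf
    for F, xF in xF_star.items():
        for k, i in enumerate(F):
            if xF[k] != x_star[i]:                 # slaves disagree on X_i
                lam[(F, i)][x_star[i]] -= alpha_t  # discourage the i slave's value
                lam[(F, i)][xF[k]] += alpha_t      # encourage the F slave's value
    return lam
```

Because the lambda terms enter the i slave's objective with a plus sign and the F slave's objective with a minus sign, this single update nudges both slaves toward the same value.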

Now, what are the implications of the convergence of the lambdas? What happens at convergence? At convergence, you can show that there is no more change, and at that point, each of our slaves has its own locally optimal solution over its own variables, the ones in its scope. So, is that good enough? Unfortunately, not always. Even though we have achieved convergence, the solutions of different slaves may or may not agree on the variables that are shared between them.

However, if they do agree, if we can get all of the slaves to agree with each other, then we can show that the shared solution is a guaranteed MAP assignment. So, if the slaves agree, we have effectively solved our initial optimization problem, that of finding a MAP assignment. If that doesn't happen, that is, if the slaves have not agreed on their shared variables, then we need to solve what is called a decoding problem, where we need to somehow put together the different solutions that the different slaves have converged to and come up with a single joint assignment.

So, how do we solve this so-called decoding problem? It turns out that people have constructed a range of different solutions, none of which have particularly strong performance guarantees, but they often work quite well in practice. For example, if we use a decomposition of the original problem into spanning trees, then each of the spanning trees covers all of the variables, so we can take the MAP solution of any individual tree and treat it as an assignment to all of the variables. So, that's one approach.

A second approach, which is not winner-takes-all, is to have each of our slaves vote on the X_i's that are in its scope, and then for each X_i, you simply take a majority vote.
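This majority-vote heuristic can be sketched directly, assuming the slaves' chosen assignments are available in the dictionary form used in the earlier sketches (names are illustrative):

```python
from collections import Counter

def majority_vote(x_star, xF_star):
    """Decode a joint assignment by letting every slave vote on its scope."""
    votes = {i: Counter() for i in x_star}
    for i, v in x_star.items():            # each variable slave votes once
        votes[i][v] += 1
    for F, xF in xF_star.items():          # each factor slave votes for its scope
        for k, i in enumerate(F):
            votes[i][xF[k]] += 1
    # For each X_i, keep the value with the most votes.
    return {i: counts.most_common(1)[0][0] for i, counts in votes.items()}
```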

So, that's another approach. A different approach uses some kind of weighted averaging of the sequence of messages sent regarding each X_i over the course of the algorithm's iterations. Each of these works reasonably well, but it turns out that the right solution is often to simply generate multiple candidate solutions and then pick the best one.

The critical observation here is that the scoring function theta for any assignment is something we can easily evaluate. So, why not apply all of these heuristics, use multiple trees, generate a whole collection of candidate x's, evaluate theta of x for each of those solutions, and then pick the best one? It turns out that when you do that, you generally get solutions that are better in terms of their overall score than any fixed strategy.
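This pick-the-best strategy is cheap precisely because theta of x is just a sum of local terms. A small sketch, reusing the illustrative theta_i / theta_F dictionary representation from the earlier sketches:

```python
def score(x, theta_i, theta_F):
    """theta(x): sum of all local scores under assignment x (a dict var -> value)."""
    return (sum(theta_i[i][x[i]] for i in theta_i)
            + sum(theta_F[F][tuple(x[i] for i in F)] for F in theta_F))

def best_candidate(candidates, theta_i, theta_F):
    """Evaluate theta(x) for every decoded candidate and keep the highest-scoring one."""
    return max(candidates, key=lambda x: score(x, theta_i, theta_F))
```

The candidates here could come from any mix of the heuristics above: individual tree solutions, majority votes, or weighted averages.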

Now, the other nice aspect of this analysis comes back to a point that we made earlier in this module: for any lambda, the function that we defined as L of lambda is an upper bound on the value of the true MAP assignment. What does that imply? It means that if you give me any x that you think is a reasonable candidate for the MAP assignment, then its score theta of x is at most the true MAP value, which in turn is at most L of lambda, so the difference between L of lambda and theta of x bounds how far the candidate is from the optimum.
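Under the same illustrative representation as the earlier sketches, this bound gives a simple optimality certificate: L of lambda is the sum of the slaves' optimal modified scores, theta of x lower-bounds the true MAP value, and a gap of zero certifies that x is an exact MAP assignment. The function names here are made up for this sketch.

```python
def dual_bound(theta_i, theta_F, lam):
    """L(lambda): the sum of each slave's optimal lambda-modified objective."""
    L = 0.0
    for i, scores in theta_i.items():
        L += max(scores[v] + sum(lam[(F, i)][v] for F in theta_F if i in F)
                 for v in scores)
    for F, table in theta_F.items():
        L += max(table[xF] - sum(lam[(F, i)][xF[k]] for k, i in enumerate(F))
                 for xF in table)
    return L

def optimality_gap(x, theta_i, theta_F, lam):
    """Upper bound on how far candidate x is from the true MAP value (>= 0)."""
    theta_x = (sum(theta_i[i][x[i]] for i in theta_i)
               + sum(theta_F[F][tuple(x[i] for i in F)] for F in theta_F))
    return dual_bound(theta_i, theta_F, lam) - theta_x
```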


The third design choice, which we briefly mentioned before, is how to adjust the step size alpha t so as to move things toward convergence quickly, but still slowly enough to actually achieve convergence. There are some hard and fast rules that guarantee convergence, but they're often quite slow, and there are cleverer rules that still guarantee convergence and tend to work better in practice. This is an active area of research right now.

And then finally, we've talked about the decoding problem and the fact that there are many heuristic choices there. So, that's another design choice that one needs to make when implementing dual decomposition.

So, let's summarize different aspects of what we've talked about.

First of all, let's summarize the algorithm.

Dual decomposition is a general-purpose algorithm for MAP inference. It works by dividing the model into tractable components, potentially with multiple factors in a single component, or slave. It solves each slave locally and then passes these lambda messages between them to try to get them to agree. We've talked before about tractable MAP subclasses; any tractable MAP subclass can be used in this setting as a slave, which allows slaves to be large and richly structured, as opposed to just tiny factors that make very myopic decisions based on minimal information.

One of the nice things about the dual decomposition algorithm is that it has a lot of theoretical backing to support it. Formally, and we're not going to go into the details of what this means, but just to mention what this is actually doing from a formal perspective: this is a subgradient optimization algorithm, which is kind of like gradient optimization but not exactly, applied to a problem that is dual to the MAP problem. If you didn't understand that, that's okay. This is just a pointer for those of you who might have seen some optimization before, to help fit this into that world.

But this view does provide some important guarantees, some of which I've mentioned in this presentation. First, it gives us an upper bound on the distance between where we are and the true MAP assignment. Second, it provides us with conditions, such as agreement among all of the slaves, under which we have actually found a true, exact MAP solution. And there's even some interesting analysis, which I'm not going to present, that tells us when one decomposition of the model into slaves is better than another.

Now, in practice, this algorithm has some significant pros and cons. On the pro side, it is very general purpose.

You can really use it for any model if you're willing, for example, to decompose the model into fine enough slaves. Of the algorithms currently out there, it provides some of the best theoretical guarantees in terms of convergence properties. And because of the way it is structured, it can use very fast, specialized MAP subroutines for solving large model components, like matching problems, for example, or subsets of edges that are submodular, that is, associative.

On the other hand, it's really not the fastest algorithm out there. That is, there are algorithms that, when they work, work considerably faster than dual decomposition. And, more so than most, it has a lot of tunable parameters and design choices, which make it a somewhat finicky algorithm to apply in practice, because one has to play around quite a bit in order to get good performance.