0:00

We defined the notion of a general Markov chain. We talked about how certain conditions guarantee that if we sample from the chain, we achieve convergence to a stationary distribution. We then described how, once we generate samples from the Markov chain, we can use them to estimate properties of that stationary distribution, like various statistics that we might care about. But the question of where the Markov chain might come from was left very vague.

It turns out that for graphical models, people have constructed a general-purpose sampler that can be used to sample from any graphical model, and that class of chains is called the Gibbs chain. So now we're going to talk about the Gibbs chain and how to use it in the context of graphical models.

So, here our target distribution, the one from which we would like to sample, is the good old Gibbs distribution, P Phi of X1 up to Xn. And just as a reminder, P Phi can equally well represent a Gibbs distribution that we got from a product of undirected factors, a Markov-net-style distribution, or it could just as well represent a set of factors that we got from a Bayesian network, reduced by evidence. And so that is the factor set Phi, and everything that I'm going to say from now on is agnostic to where those factors Phi came from, whether they came from a directed or an undirected model.

So, now our Markov chain state space is the set of complete assignments, little x, to the set of variables big X. So note that we're sampling over an exponentially large state space: the space of all possible assignments to the random variables in the graphical model.

So those little circles that we saw when we were talking about Markov chains, when we were going from one state to the other, each of those is now a complete assignment. So it turns out that the Gibbs chain is actually a very simple and easy-to-understand Markov chain.

And here's what it does. Assuming that I have a starting state, little x, what I'm going to do is take each variable in its turn, using some arbitrary ordering, and sample a new value for that variable given the values of all the rest. So let me introduce this notation: x minus i is an assignment to all of x1 up to xn except xi. So, for example, if we had three random variables X1, X2, and X3, we would sample X1 given x2 and x3, X2 given x1 and x3, and X3 given x1 and x2.

Right, so one at a time. Okay, so let me draw that out in order to make that clear. So here's my current assignment: here's X1, X2, X3, and let's say that we have zero, zero, zero as my initial assignment. So I'm going to start by sampling X1 from the probability of X1 given X2 equals zero

3:32

and X3 equals zero, and let's say I end up with one; the other two haven't changed yet. I now sample from P of X2 given X1 equals the new value one and X3 equals zero, and let's say that stays zero. And then I sample X3 from the probability of X3 given X1 equals one and X2 equals zero, and say that changes to the value one. And so now my new assignment is one, zero, one, which becomes my new state x prime. That is a single step of the Gibbs chain. It resamples every single variable in its turn, and potentially modifies it; it doesn't have to modify it, because you might end up staying with the same value. And when all of them are done, that's my new state.
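As a sketch, a single Gibbs step like the one just described might look like this in Python. Everything here is a toy illustration: the conditional sampler below is a made-up example, not derived from any particular graphical model.

```python
import random

def gibbs_step(assignment, variables, sample_conditional):
    """One full Gibbs step: resample every variable in turn,
    each conditioned on the current values of all the others."""
    x = dict(assignment)  # work on a copy of the current state
    for v in variables:
        # sample_conditional draws from P(v | all other variables)
        x[v] = sample_conditional(v, x)
    return x

# Toy conditional for three binary variables (a hypothetical model):
# each variable tends to agree with the others' values.
def sample_conditional(v, x):
    others = [val for k, val in x.items() if k != v]
    p_one = 0.9 if sum(others) >= 1 else 0.1
    return 1 if random.random() < p_one else 0

state = {"X1": 0, "X2": 0, "X3": 0}
state = gibbs_step(state, ["X1", "X2", "X3"], sample_conditional)
```

Note that the new value of X1 is already in place when X2 is resampled, exactly as in the worked example above.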

4:37

So let's do this example in the context of a real graphical model. So here is the good old student Bayesian network, and I'm assuming that I have observed values for two of the variables: I've observed L equals l0 and S equals s1. And so now I have a product of reduced factors that represent the evidence that I received, and only three variables are now getting sampled, because the other two are fixed to their observed values. So now I'm sampling from P tilde Phi of D, I, G, given little s1 and l0. And so what I'm going to do in this sampling process is start out with some arbitrary assignment to these three variables, D, I, and G. For example, since it's arbitrary: d0, i0, g1 (it couldn't be g0, because G only takes the values one, two, three). And now I'm going to go ahead and sample each of these in turn. And so I'm going to initially sample D from P tilde Phi of D given i0, g1, l0, and s1.

6:18

Okay? And we'll talk about how that's done in just a moment. And that might give me d1 instead of d0. And now I'll continue and sample I: I'm going to sample I from the probability of I given d1, g1, l0, and s1, and say I end up with i1. Finally, I sample G from the probability of G given d1, i1 (that's where I am right now), l0, and s1, and let's say I end up with g3 as the new value. So my new state is d1, i1, g3.

So how do I do this in a tractable way? Here is the trick. You might say, well, you're dealing with this intractable distribution; you've just moved the problem from the original inference problem to computing these conditional probabilities. Why is that any easier? Well, it turns out it is. So let's look at the sampling step over here, where I sample xi from this conditional distribution, and let's break that down into its constituent definitions. This conditional distribution is a ratio: the joint distribution of the value xi together with the stuff on the right-hand side of the conditioning bar, divided by the stuff on the right-hand side of the conditioning bar on its own. Now, importantly, because each of these is an unnormalized measure divided by the partition function, the partition function cancels: if I had a one over Z here, the one over Z cancels, and I end up with a ratio of two unnormalized measures. The point being now that this, xi together with x minus i, is a complete assignment,

9:00

and therefore its unnormalized measure is just a product of factors. So let's look at that in the context of a concrete example and see how that might help us. So now we're doing an example that's a Markov network, in order to demonstrate that this is equally applicable to both. So this is our good old misconception network, where we have these four factors phi1, phi2, phi3, and phi4.

And let's imagine that we're trying to compute P Phi of A given B, C, and D. And we can see that what we have here is P tilde Phi of A, B, C, D divided by the sum over A prime (and I'm using A prime to distinguish it from this A over here, just to avoid ambiguity) of P tilde Phi of A prime, B, C, D. And now, since each of these unnormalized measures is a product of the factors, I'm going to inject the product of factors into the numerator and the denominator separately. And so over here I end up with phi1 of A, b, times phi2 of b, c, times phi3 of c, d, times phi4 of A, d. And I similarly end up with exactly the same expression in the denominator, except I sum out over A prime.
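Written out, this spoken derivation amounts to the following (a reconstruction; the factors that don't mention A cancel between numerator and denominator):

```latex
P_\Phi(A \mid b, c, d)
  = \frac{\phi_1(A,b)\,\phi_2(b,c)\,\phi_3(c,d)\,\phi_4(A,d)}
         {\sum_{A'} \phi_1(A',b)\,\phi_2(b,c)\,\phi_3(c,d)\,\phi_4(A',d)}
  = \frac{\phi_1(A,b)\,\phi_4(A,d)}{\sum_{A'} \phi_1(A',b)\,\phi_4(A',d)}
  \propto \phi_1(A,b)\,\phi_4(A,d)
```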

11:16

Now notice that phi2 of b, c and phi3 of c, d don't involve A at all, so they cancel between the numerator and the denominator, and what remains in the denominator is just our good old normalizing constant. So really what we have here is that this is proportional to phi1 of A, little b, times phi4 of A, little d. So this is a factor over A that I can compute by multiplying two factors, in this case singleton factors over A: multiply them together and renormalize, and I get a distribution over A from which I can sample. And we already talked about how to sample from a discrete distribution.

So what is the computational cost of this algorithm? The sampling step for xi given all of the others is simply this expression over here, which is proportional to a product of factors. And not just any old product of factors, but just the factors that involve xi,

12:39

that is, just the factors that have xi in their scope. Which means that I only care about xi and its neighbors in the graph, which, for a large graph with tens of thousands of nodes, can be a considerable savings, okay? So, only xi and its neighbors. So, do we like the Gibbs chain?

We've certainly defined it, but does it have the properties that we would like it to have? Specifically, is it a regular chain, so that we have guaranteed convergence to a unique stationary distribution? Well, the answer is, unfortunately, not always. And to see that, we're going to return to our old enemy, the exclusive-or network, which, as I've said in other examples, is the counterexample to pretty much everything.

So let's look at the exclusive-or network. So here we have Y, which is the exclusive or of X1 and X2, and let's imagine that I observe Y equals one, so the factors get reduced by that evidence. And now I'm going to try and sample X1 and X2, where, if things worked out, each of the two possible states for X1 and X2 would occur with probability one half. So let's imagine that I start out with the state zero, one for X1 and X2, and of course Y equals one is my evidence. And now I'm going to sample X1 given X2 and Y. That is a deterministic dependence, and the only possible value that I can get is zero for X1. Well, am I going to have any better luck with X2? Nope. I'm going to end up with exactly the same zero, one that I started out with.

And I can do the exact same thing again, and you know what? It'll work the same way every single time, and I will always end up with zero, one. Conversely, if I start out with one, zero, I will never leave one, zero. This is a classical example of a non-mixing chain. On the other hand, if all the factors are positive, that is, if there are no zero entries in any of the factors, then you can show that the Gibbs chain is regular. However, it's important to realize

that regular doesn't mean good. So, I can make this chain regular by adding a tiny probability epsilon of having Y not be the exact exclusive-or of X1 and X2. Now the chain is regular, because all possibilities have non-zero probability, but it's still going to mix extremely slowly for small values of epsilon. So remember, regularity is a very weak condition: it says that if you sample long enough, you will eventually converge. But it doesn't mean it'll happen in a reasonable time frame.
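The stuck chain is easy to see by simulation. Here is a minimal sketch of Gibbs sweeps on the exclusive-or network with Y observed to be one; the `eps` parameter is the softening just described (with `eps = 0` the constraint is hard):

```python
import random

def xor_factor(x1, x2, y, eps=0.0):
    """Factor enforcing y = XOR(x1, x2); eps > 0 softens the constraint."""
    return 1.0 - eps if (x1 ^ x2) == y else eps

def gibbs_sweep(x1, x2, y=1, eps=0.0):
    """Resample x1 given (x2, y), then x2 given (x1, y)."""
    for which in (0, 1):
        if which == 0:
            w = [xor_factor(v, x2, y, eps) for v in (0, 1)]
        else:
            w = [xor_factor(x1, v, y, eps) for v in (0, 1)]
        z = sum(w)  # local normalizing constant
        pick = 0 if random.random() < w[0] / z else 1
        if which == 0:
            x1 = pick
        else:
            x2 = pick
    return x1, x2

# With a hard XOR (eps = 0), the chain started at (0, 1) never moves:
state = (0, 1)
for _ in range(1000):
    state = gibbs_sweep(*state, y=1, eps=0.0)
# state is still (0, 1): the chain does not mix
```

Setting a small positive `eps` makes every transition possible, but the expected waiting time to flip between (0, 1) and (1, 0) grows as epsilon shrinks.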

16:16

And one such example of a very slow-to-mix chain actually occurs in practical applications. So here's one example, from our image segmentation domain. Imagine that we're trying to resample, say, the class of this superpixel here (let me use a different color), and that we have some fairly strong potentials that tell me that the class of this superpixel should be very highly correlated with the class of its adjacent superpixels, because their appearances are very similar to each other. That is, I have a pairwise Markov network with some very strong potentials that try to enforce consistency between adjacent superpixels. So suppose that, for whatever reason, my initial assignment was actually wrong. For example, a common error mode for segmentation models is that road gets labeled water or sky, because the appearances are actually similar, both kind of grayish. Say that I happen to label all of these superpixels as, say, water, and now I'm going to use Gibbs sampling. None of them is likely to move away from its current assignment, because each of the superpixels is really constrained by strong pairwise potentials with its neighbors. And so this notion of slow-to-mix Markov chains is not just a theoretical construct that occurs with exclusive-or; it actually happens in practice a fair amount. Now, to summarize: what Gibbs sampling does

is take the very challenging problem of inference in a Gibbs distribution and convert it into a sequence of easy sampling steps, where each step samples one variable given all of the others. This method has the advantage that it's probably the simplest Markov chain for probabilistic graphical models, and each of the individual sampling steps is very efficient to compute. The cons are that it is very slow to mix when the probabilities are peaked and the variables are very highly correlated, and that it's only applicable in cases where we can sample from a product of factors. Now, we can always do that in a discrete graphical model, if our factors aren't too large. But if we have, for example, a very densely connected Markov network, or one with continuous distributions where the product of the factors isn't something with a nice parametric form that we can sample from, then Gibbs sampling might not be applicable. And so there are non-trivial limitations to the applicability of Gibbs sampling, and even when it applies, it can be very slow to mix. So we often want to do something more clever, and we'll talk about more clever algorithms later on.