0:00

We previously defined the notion of a Markov chain that allows us to generate

samples from an intractable distribution. But we left unanswered the question of,

assuming we have a Markov chain that has the desired stationary distribution,

how do we go about using it in the context of a concrete algorithmic

procedure? Well, let's imagine that somebody has

provided us with a Markov chain.

Imagine that our goal is to compute some probability relative to a

distribution P. But, for whatever reason, P is too hard

to sample from directly. So how do I use the Markov chain to

get around that? First, we construct a Markov chain

whose unique stationary distribution is P. So it needs to be a regular Markov

chain, usually. And then we're going to generate samples

from this chain. So we're going to start out by sampling

my initial state from some arbitrary distribution, P0.

P0 is not the same as P, so obviously I can't use that zeroth sample

as a way of estimating an event relative to P.

And so what I do now is I start walking along the chain,

sampling x(t+1) from the transition model.
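As a minimal sketch of this walk, here is a hypothetical two-state chain; the states, transition probabilities, and function names are all invented for illustration and are not from the lecture:

```python
import random

# Hypothetical two-state chain, made up for illustration. Its transition
# probabilities are chosen so the stationary distribution is
# P = {0: 0.25, 1: 0.75}.
def sample_next(x):
    """Sample x(t+1) from the transition model, given the current state x."""
    if x == 0:
        return 1 if random.random() < 0.75 else 0
    return 0 if random.random() < 0.25 else 1

def run_chain(x0, num_steps):
    """Start at x0 (drawn from some arbitrary P0) and walk the chain."""
    x = x0
    trajectory = [x]
    for _ in range(num_steps):
        x = sample_next(x)
        trajectory.append(x)
    return trajectory
```

For this particular toy transition model, the long-run fraction of time spent in state 1 approaches 0.75, the stationary probability.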

And because the chain converges to a stationary distribution, eventually I'll

have a sample that is very close to being a sample from the

distribution P that I care about. So, now, here's the sticky issue.

And this really is a sticky issue: when have we walked long enough that we

can actually use our samples? Because we don't want to use samples at

the beginning, because they're far from P.

And so we need to walk around long enough for the chain to get close to its

stationary distribution. And that state is called mixing.

So the chain has mixed when P(t) is close enough to

2:14

π, so that your time-t distribution is close

to the stationary distribution. And that's the point at which I can say, good enough, I

can start collecting samples. So how do we know if a chain has mixed or

not? And the short answer is you don't.

And so this is, as I said, the sticky part of Markov chain methods.

So in general, you can never really prove that a chain has mixed.

But in some cases you can show that it hasn't mixed.

And if you run lots and lots of tests, and none of them has convinced

you that the chain hasn't mixed, then you sort of resign yourself to assuming that

it has, in fact, mixed. So how do you convince yourself that a

chain hasn't mixed? You compute chain statistics, and we'll

show some examples of that in a moment, in different windows within a single run

of a chain. So for example, here I have a single run

of the chain, and I run for a while, and then I look at

little windows (usually they won't be this small), and I compare this window

to that window, computing various statistics. For example, what is the probability of

hitting a particular state? And I say, is the probability of that in

this window close to the probability of that in the other window?
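A sketch of this window comparison, where `samples` stands in for one long run of the chain and `event` for the indicator of the state we're tracking; the names and the placeholder run are mine, not from the lecture:

```python
# Placeholder for one long run of the chain (a real run would come from
# an actual sampler; this one just alternates states for illustration).
samples = [t % 2 for t in range(100_000)]

def window_prob(samples, start, size, event):
    """Estimate the probability of an event within one window of the run."""
    window = samples[start:start + size]
    return sum(1 for x in window if event(x)) / size

# Probability of hitting a particular state, in two different windows:
p_a = window_prob(samples, start=10_000, size=5_000, event=lambda x: x == 1)
p_b = window_prob(samples, start=50_000, size=5_000, event=lambda x: x == 1)
similar = abs(p_a - p_b) < 0.01   # similar => no evidence against mixing
```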

And if it is, then maybe I've reached at least some kind of convergence point,

in terms of where the chain has reached. Now this, of course, is not a sufficient

answer, because the same picture could also arise in a chain that has these sort of

two very high-probability regions that are very hard to get between.

4:15

And if the chain is kind of ambling along in this part of the space and never

hitting the other part of the space, then you're still going to get very

similar probabilities across a window of a single run of the chain.

Because all of the samples are taken from this part, and none of them are ever taken

from that part. And so we don't know that the chain has

mixed. So a more reliable evaluation criterion is to take these statistics across

different runs that are initialized in different parts of the space.

And then you might hope that one run of the chain is traversing this region,

and another run of the chain is traversing that region. And so now the statistics would show a

difference and indicate that mixing hasn't taken place.
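A sketch of that idea, using a made-up toy chain with two high-probability regions that the walk (here) never crosses between; the chain, statistic, and threshold are all invented for illustration:

```python
import random

def half_walk(x0, steps):
    """Toy chain on {0,...,9} with two high-probability regions,
    {0..4} and {5..9}, that this walk never crosses between."""
    lo, hi = (0, 4) if x0 <= 4 else (5, 9)
    x = x0
    xs = [x]
    for _ in range(steps):
        x = min(hi, max(lo, x + random.choice([-1, 1])))
        xs.append(x)
    return xs

random.seed(1)
run1 = half_walk(0, 20_000)   # initialized in one region
run2 = half_walk(9, 20_000)   # initialized in the other region

# The same statistic computed on each run: P(state in the low region).
p1 = sum(1 for x in run1 if x <= 4) / len(run1)
p2 = sum(1 for x in run2 if x <= 4) / len(run2)
not_mixed = abs(p1 - p2) > 0.5   # large gap => mixing has not taken place
```

A single run of this chain would look perfectly stable within its own windows; only the cross-run comparison exposes the problem.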

So what might we do, more concretely? Let's look at two examples.

Here are two example runs of a chain that we'll describe a little bit later.

And what we measure here, and this is the first statistic to

measure in Markov chains, is the log probability of a sample.

So you compute the log probability of the sample.

You can't always compute the log probability directly.

You might compute an unnormalized log probability, as

we'll talk about later. But basically, you compute the log

probability, or some constant factor thereof.
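As a sketch, with a made-up unnormalized log-density standing in for whatever model we're actually sampling from (the function and its Gaussian shape are assumptions for illustration):

```python
def unnormalized_log_prob(x):
    """Hypothetical score: log of P(x) up to an additive constant
    (the log of the unknown normalizing constant). Here just a
    Gaussian-shaped example."""
    return -0.5 * x * x

def log_prob_trace(samples):
    """The log-probability statistic tracked over a run: one value per
    sample. Comparable across runs, since the unknown constant is the
    same for both."""
    return [unnormalized_log_prob(x) for x in samples]

trace = log_prob_trace([0.0, 1.0, 2.0])
```

Plotting such a trace for each run, one initialized arbitrarily and one from a high-probability state, gives exactly the kind of comparison described next.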

And now, you can compare two runs. And this is a run that's initialized from

an arbitrary state. And this is one that's initialized from a

high probability state. And you look at those, and you say, has

it mixed, relative to this criterion? The answer is maybe.

It looks okay, but you're not entirely sure.

But we can look at other statistics. Let's look at another example.

What about this one? Here is again an example of two

runs, one of which is initialized from an

arbitrary state, and the other of which is initialized from a high-probability state.

And you can see that the log probability values are nowhere close to each other,

and so in this case the answer is definitely not: these are two runs of the chain, and really,

neither has mixed. And so you need to run for a lot longer.

This goes up to 600,000, which kind of indicates

how much time this might take. A different way of

looking at this uses a different kind of statistic.

So now we have, for example, a probability relative to a window that we

compute in the chain. Remember, all of this is relative to a

window in the chain, after we hope that mixing has taken place.

And now we compute the probability that the states within this window

fall in some set, for example the set of states where X3 is equal to 2.

And now we compute that statistic using the two initializations of the chain, so

this is chain one, or run one, and this is run two.

And now we do a scatter plot for different statistics.

So each of these points is a statistic: this one is, say, the probability that

X3 equals 2; this might be the probability that X1

equals 0; this might be some other probability,

that, you know, X5 equals 7. And what you see here is a scatter plot:

one axis is the estimate we get from the one run, and the other axis is

the estimate from the other run. And looking at this, it should be obvious that

on the left-hand side we have not got mixing,

because you can see that there are all these points here that have high

probability in one of the two runs and probability zero in the other, and vice

versa. Whereas here, most of the estimates are

clustered around the diagonal, so you're getting similar estimates

from the two runs. And so again,

the first one is a clear no, and the second is a maybe.

And if I do a lot of these statistics and they all come up with maybe, then I'm

willing to trust the answers and believe that mixing is taking place.
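The scatter-plot comparison boils down to computing, for each event, one estimate per run and checking how far the pairs sit from the diagonal. A sketch, with invented events and placeholder windows standing in for real post-mixing runs:

```python
def event_probs(run, events):
    """Estimate P(event) from one run's post-mixing window, for each
    event in a dict of indicator functions."""
    return {name: sum(1 for x in run if ev(x)) / len(run)
            for name, ev in events.items()}

# Each event is one point in the scatter plot:
# (estimate from run 1, estimate from run 2).
events = {
    "x == 0": lambda x: x == 0,
    "x == 1": lambda x: x == 1,
    "x >= 1": lambda x: x >= 1,
}
run1 = [0, 1, 1, 1, 0, 1, 1, 0]   # placeholder window from run 1
run2 = [1, 0, 1, 1, 1, 0, 1, 1]   # placeholder window from run 2

p1 = event_probs(run1, events)
p2 = event_probs(run2, events)
off_diagonal = max(abs(p1[k] - p2[k]) for k in events)
# A small maximum gap means the points cluster near the diagonal (a
# "maybe"); a large gap is clear evidence mixing has not occurred.
```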

So now that I've started collecting samples, how do I use these samples?

Well, one important observation to keep in mind is that once the chain is mixed,

all of my samples are from the stationary distribution.

That is, if x(t) is from π, so are x(t+1), x(t+2), x(t+3), and so on and so forth.

And so we can use every single one of those samples.

Because they are all from the correct distribution.

So once I've determined that I've sampled long enough for mixing,

or believe that I've sampled long enough for mixing,

we should collect and use all of the samples.

And in fact, you might read some papers that say:

oh, I'm only going to collect every hundredth sample.

There are actually papers that prove that using every single sample is better than

collecting every hundredth sample. But then you might ask: well, why would

those papers tell me that I should only collect every hundredth sample, as opposed to

collecting all the samples, if they're all from the correct distribution?

Because, and this is undoubtedly true, adjacent samples, ones that are nearby

each other in time, are correlated with each other.

Even if x(t) is from the right distribution π, and so is x(t+1),

x(t+1) is still going to be close to x(t).

And so you're not really getting two different samples;

you're getting two that are very close relatives of each other.

Now, as I said, it's important to recognize that phenomenon,

because just because you've reached mixing and

collected a thousand samples doesn't mean you have a thousand samples'

worth of information. So you shouldn't go back and apply

one of the bounds that we saw, assuming that you have i.i.d.

samples. The samples are not i.i.d.

That doesn't mean you shouldn't use them, though;

using them is still better than not.

Now, this is where you get bitten twice by the same phenomenon.

The slower a chain is to mix, that is, the longer you need to wait for the

initial samples to be good enough, the more correlated

the samples are, because you are moving around the space more slowly in

general. And so if a chain is bad, it's bad in two

different ways: it's bad because it takes longer to

mix, and it's bad because the samples you are collecting are not as useful,

because of the correlation structure between the samples.

13:12

And then finally, when I've collected enough samples, I can go ahead and

compute the expectation of anything that I care about, whether it's an indicator

function or a more complicated function. Summarizing the implications: Markov

chains are a very general-purpose class of methods for doing approximate

inference in general probabilistic models, not even necessarily

probabilistic graphical models. They're often very easy to implement,

because the local sampling that we're doing is often quite straightforward to

execute. And they have good theoretical properties

once we generate enough samples that are sufficiently far away

from our starting distribution. Unfortunately, these theoretical

guarantees are very theoretical in many cases.

And so this method also has some significant cons.

First, it has a very large number of tunable parameters and design choices.

What's my mixing time? Which statistics do I measure?

How close are the statistics to each other in order for me to declare that

mixing has taken place? How many samples do I collect?

What window size do I use to evaluate mixing?

All of these are design choices, they all make a difference, and so there's a lot

of finicky tuning that needs to be done when running an MCMC method in

practice. The second is that, depending on the

design of the chain, this can be quite slow to converge because it's very

difficult to design chains that have good mixing properties.

And finally, it's very hard to tell whether a chain is working.

That is, it's not straightforward to determine whether the chain has mixed,

and whether my samples are sufficiently uncorrelated with each other

that I'm getting a reliable estimate of whatever it is that I'm trying to figure

out. So, a lot of advantages but also some

significant disadvantages.