And if the chain is kind of ambling along in this part of the space and never

hitting the other part of the space, then you're still going to get very

similar probabilities across a window of a single run of the chain,

because all of the samples are taken from this part, and none of them are ever taken

from that part. And so we have no way of telling that the chain hasn't

mixed. So a more reliable evaluation criterion is to compare

these statistics across

different runs that are initialized in different parts of the space.

And then you might hope that one run of the chain is traversing this region,

and another run of the chain is

traversing that region. And so now the statistics would show a

difference and indicate that mixing hasn't taken place.
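
To make that concrete, here is a minimal sketch in Python. It assumes a made-up target with two well-separated modes on the states 0 to 30, and a simple Metropolis chain that proposes a step of plus or minus one. The names `unnorm_p`, `run_chain`, and `frac_left` are hypothetical and are not from the lecture. Within one run, the two halves of the window agree with each other, while the two runs disagree.

```python
import math
import random

def unnorm_p(x):
    """Unnormalized toy target on states 0..30 with modes at 5 and 25 (an assumption)."""
    if x < 0 or x > 30:
        return 0.0
    return math.exp(-0.5 * (x - 5) ** 2) + math.exp(-0.5 * (x - 25) ** 2)

def run_chain(x0, n_steps, rng):
    """Metropolis chain with symmetric +/-1 proposals, started at x0."""
    x, states = x0, []
    for _ in range(n_steps):
        proposal = x + rng.choice((-1, 1))
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < min(1.0, unnorm_p(proposal) / unnorm_p(x)):
            x = proposal
        states.append(x)
    return states

def frac_left(states):
    """Window statistic: fraction of samples that fall in the left half of the space."""
    return sum(1 for s in states if s < 15) / len(states)

rng = random.Random(0)
run_a = run_chain(5, 4000, rng)    # initialized near the left mode
run_b = run_chain(25, 4000, rng)   # initialized near the right mode

# Discard the first 2000 steps of each run and use the rest as the window.
within_run_gap = abs(frac_left(run_a[2000:3000]) - frac_left(run_a[3000:]))
across_run_gap = abs(frac_left(run_a[2000:]) - frac_left(run_b[2000:]))
print("gap between two halves of one run:", within_run_gap)
print("gap between the two runs:", across_run_gap)
```

Looking only at the first number, you would think everything is fine. It is the second number, the comparison across differently initialized runs, that exposes the problem.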

So how might we do this more

concretely? Let's look at two examples.

Here are two example runs of a chain that we'll describe a little bit later.

And what we measure here, and this is the first statistic to

measure in Markov chains, is the log probability of a sample.

So for each sample, you compute its log probability.

You can't always compute the log probability directly.

You might compute an unnormalized log probability, as

we'll talk about later. But basically, you compute the log

probability, or the log probability up to an additive constant.

And now, you can compare two runs. And this is a run that's initialized from

an arbitrary state. And this is one that's initialized from a

high probability state. And you look at those, and you say, has

it mixed, relative to this criterion? The answer is maybe.

It looks okay, but you're not entirely sure.
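
As a rough illustration of this kind of comparison, here is a sketch in Python. The model is an assumption made only for this example, not the chain shown in the lecture: 20 binary variables in a line, where the unnormalized log probability adds `J` for every adjacent pair that agrees, sampled with Gibbs sampling. All names (`J`, `N`, `log_p_unnorm`, `gibbs_sweep`, `log_prob_trace`) are hypothetical.

```python
import math
import random

J = 1.0   # hypothetical coupling strength
N = 20    # number of binary variables in the toy model

def log_p_unnorm(x):
    """Unnormalized log probability: J for every adjacent pair that agrees."""
    return J * sum(1 for a, b in zip(x, x[1:]) if a == b)

def gibbs_sweep(x, rng):
    """Resample each variable once from its conditional given its neighbors."""
    for i in range(len(x)):
        nbrs = [x[j] for j in (i - 1, i + 1) if 0 <= j < len(x)]
        w1 = math.exp(J * sum(1 for v in nbrs if v == 1))
        w0 = math.exp(J * sum(1 for v in nbrs if v == 0))
        x[i] = 1 if rng.random() < w1 / (w0 + w1) else 0

def log_prob_trace(x0, n_sweeps, rng):
    """Record the unnormalized log probability of the state after every sweep."""
    x = list(x0)
    trace = [log_p_unnorm(x)]
    for _ in range(n_sweeps):
        gibbs_sweep(x, rng)
        trace.append(log_p_unnorm(x))
    return trace

def window_mean(trace, start):
    window = trace[start:]
    return sum(window) / len(window)

rng = random.Random(0)
arbitrary_init = [rng.randint(0, 1) for _ in range(N)]
high_prob_init = [0] * N   # all variables agree: a highest-probability state

trace_arbitrary = log_prob_trace(arbitrary_init, 1000, rng)
trace_high = log_prob_trace(high_prob_init, 1000, rng)

gap = abs(window_mean(trace_arbitrary, 500) - window_mean(trace_high, 500))
print("start values:", trace_arbitrary[0], trace_high[0])
print("gap between late-window means:", gap)
```

In practice you would plot the two traces against the iteration number and look at whether they overlap. A small gap here is only a "maybe", for the same reason as in the lecture: the two runs agreeing on one statistic does not prove mixing.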

Let's look at another example. What about this one?

Here is, again, an example of two runs, one of which is initialized from an

arbitrary state and the other from a high probability state. And you can see

that the log probability values are nowhere close to each other. And so in this

case the answer is definitely not. These are two runs of the chain, and really,

neither has mixed. And so you need to run for a lot longer,

and, you know, this goes up to 600,000, so it kind of indicates

to you how much time this might take. A different way of

looking at this is to use a different kind of statistic.

So now we have, for example, a probability that we

compute relative to a window in the chain. So remember, all of this is relative to a

window in the chain, after we hope that mixing has taken place.

And now we compute, within the states in this window, the

probability that the state is in some set, so for

example the set of states where X3 is equal to two.

And now we compute that statistic using the two initializations of the chain, so

this is chain one, or run one, and this is run two.

And now we do a scatter plot for different statistics.

So each of these points is one statistic. This one is the probability, say, that X3

equals two. This might be the probability that X1 is

equal to zero. This might be some other probability,

that, you know, X5 is equal to seven. And so each of these is a statistic, and

what you see here is a scatter plot: one axis is the estimate that we get from

the one run, and the other is the estimate from the other run. And looking at this, it should be obvious that

we have not got mixing on the left hand side,

because you can see that there are all these points here that have high

probability in one of the two runs and probability zero in the other, and vice

versa. Whereas here, most of the estimates are

clustered around the diagonal, so that you're getting similar estimates

from the two runs. And so again,

the first one is a clear no and the second is a maybe.

And if I compute a lot of these statistics and they all come up with maybe, then I'm

willing to trust the answers and believe that mixing has taken place.
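
The points of such a scatter plot can be computed along these lines. This is only a sketch: the two windows below are tiny made-up lists, and `window_estimate` and `compare_runs` are hypothetical helper names. Each sample is a dictionary from variable name to value.

```python
def window_estimate(samples, var, value):
    """Fraction of samples in the window where variable `var` equals `value`."""
    return sum(1 for s in samples if s[var] == value) / len(samples)

def compare_runs(window_one, window_two, queries):
    """One (query, estimate from run one, estimate from run two) triple per statistic."""
    return [
        (q, window_estimate(window_one, *q), window_estimate(window_two, *q))
        for q in queries
    ]

# Made-up windows from two runs, for illustration only.
window_one = [
    {"X1": 0, "X3": 2}, {"X1": 0, "X3": 2}, {"X1": 1, "X3": 2}, {"X1": 0, "X3": 1},
]
window_two = [
    {"X1": 1, "X3": 0}, {"X1": 1, "X3": 0}, {"X1": 1, "X3": 2}, {"X1": 1, "X3": 0},
]
queries = [("X1", 0), ("X3", 2)]   # "X1 equals zero", "X3 equals two"

points = compare_runs(window_one, window_two, queries)
worst_gap = max(abs(a - b) for _, a, b in points)
for query, est_one, est_two in points:
    print(query, est_one, est_two)
print("largest distance from the diagonal:", worst_gap)
```

Each triple is one point: the estimate from run one on one axis, the estimate from run two on the other. Points far from the diagonal, like the ones this toy data produces, are the "clear no" case.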

So now that I've started collecting samples, how do I use these samples?

Well, one important observation to keep in mind is that once the chain has mixed,

all of my samples are from the stationary distribution.

That is, x(t) is from pi, and so are x(t+1), x(t+2), x(t+3), and so on and so forth.

And so we can use every single one of those samples,

because they are all from the correct distribution.

So once I've determined that I've sampled long enough for mixing,

or believe that I've sampled long enough for mixing,

I should collect and use all of the samples.

And in fact, you might read some papers that say,

oh, I'm only going to collect every hundredth sample.

There are actually papers that prove that using every single sample is better than

collecting every hundredth sample. But then you might ask, well, why would

those papers tell me that I should only collect every hundredth sample, as opposed to

collecting all the samples, if they're all from the correct distribution?

Because, and this is undoubtedly true, adjacent samples, ones that are nearby

each other in time, are correlated with each other.

Because even if x(t) is from the right distribution pi, and so is x(t+1),

x(t+1) is still going to be close to x(t).

And so you're not really getting two different samples.

You're getting two that are very close relatives of each other.
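
You can see this correlation directly by estimating the autocorrelation of a run. The sketch below is an illustration under assumptions of my own choosing: a random-walk Metropolis chain on a standard normal target with a small step size. The names `metropolis_normal` and `autocorr` are hypothetical.

```python
import math
import random

def metropolis_normal(n_steps, step, rng):
    """Random-walk Metropolis targeting a standard normal (a toy target, assumed here)."""
    x, xs = 0.0, []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step, step)
        log_ratio = -0.5 * (proposal ** 2 - x ** 2)
        # Accept with probability min(1, exp(log_ratio)).
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = proposal
        xs.append(x)
    return xs

def autocorr(xs, lag):
    """Empirical autocorrelation of the sequence at the given lag."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((v - mean) ** 2 for v in xs) / n
    cov = sum((xs[t] - mean) * (xs[t + lag] - mean) for t in range(n - lag)) / n
    return cov / var

rng = random.Random(0)
samples = metropolis_normal(21000, 0.5, rng)[1000:]   # drop an initial stretch

lag1 = autocorr(samples, 1)
lag200 = autocorr(samples, 200)
print("autocorrelation at lag 1:", lag1)
print("autocorrelation at lag 200:", lag200)
```

With small steps like these, neighboring samples are strongly correlated, and the correlation fades only for samples that are far apart in time.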

Now, as I said, it's important to recognize that phenomenon,

because it's important to realize that just because you've reached mixing and

collected a thousand samples doesn't mean you have a thousand samples' worth of

information. So you shouldn't go back and

apply, you know, one of the bounds that we saw earlier, assuming that you have IID

samples. The samples are not

IID. But that doesn't mean you shouldn't

use them; using them is still better than not.
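
Both points can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that the quantity you are averaging has variance one and autocorrelation `rho**k` at lag `k` once the chain has mixed. Under that assumption the variance of the sample average has an exact closed form. The function name `variance_of_mean` and the numbers 1000 and 0.9 are hypothetical choices.

```python
def variance_of_mean(n, rho, sigma2=1.0):
    """Exact variance of the average of n consecutive values of a stationary
    sequence with variance sigma2 and lag-k autocorrelation rho**k (assumed model)."""
    weighted = sum((1.0 - k / n) * rho ** k for k in range(1, n))
    return sigma2 / n * (1.0 + 2.0 * weighted)

n, rho = 1000, 0.9

# Use all 1000 correlated samples.
var_all = variance_of_mean(n, rho)

# Keep only every hundredth sample: 10 samples whose lag-1 correlation is rho**100.
var_thinned = variance_of_mean(n // 100, rho ** 100)

# With sigma2 = 1, n IID samples would give variance 1/n, so 1/var_all is the
# number of IID samples that would be equally informative.
effective_n = 1.0 / var_all

print("variance using all samples:", var_all)
print("variance using every hundredth sample:", var_thinned)
print("equivalent number of IID samples:", effective_n)
```

Under these assumptions, the thousand correlated samples are worth far fewer than a thousand IID samples, yet averaging all of them still gives a lower variance than averaging the ten thinned ones.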

Now, this is where you get bitten twice by the same phenomenon.

The slower a chain is to mix, so the longer you need to wait for the

initial samples to be good enough, the more correlated

the samples are, because you are moving around the space more slowly in

general. And so if a chain is bad, it's bad in two

different ways. It's bad because it takes you longer to

mix, and it's bad because the samples you are collecting are not as useful,

because of the correlation structure between the samples.
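
A very small example where both effects can be written down exactly is a chain with two states that flips state with probability `eps` at each step. This example is an assumption for illustration and was not part of the lecture. The transition matrix has second eigenvalue `1 - 2*eps`. Started from a fixed state, the distance to the uniform stationary distribution after `t` steps is `0.5 * (1 - 2*eps)**t`, and the autocorrelation at lag `k` is `(1 - 2*eps)**k`, so the same number controls both. The function names are hypothetical.

```python
import math

def burn_in_steps(eps, tol=0.01):
    """Smallest t with 0.5 * (1 - 2*eps)**t <= tol, for 0 < eps < 0.5."""
    lam = 1.0 - 2.0 * eps
    return math.ceil(math.log(2.0 * tol) / math.log(lam))

def samples_per_independent_draw(eps):
    """1 + 2 * sum of lam**k over k >= 1, which equals (1 + lam) / (1 - lam)."""
    lam = 1.0 - 2.0 * eps
    return (1.0 + lam) / (1.0 - lam)

for eps in (0.25, 0.05, 0.01):
    print(eps, burn_in_steps(eps), samples_per_independent_draw(eps))
```

As `eps` shrinks, the wait before the samples are usable grows, and so does the number of correlated samples you need to match one independent draw.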