So, we have just proved that the Gibbs sampling scheme indeed gives you

a correct way of sampling from the desired distribution.

So the underlying Markov chain indeed converges to the desired distribution.

But let us look at a small demo of how it can work in practice.

So, let's look at this simple two-dimensional distribution which looks like a Gaussian.

And note that we actually don't need Gibbs sampling here.

Because first of all, this distribution is two dimensional.

So the rejection sampling scheme can also be applied here, because the dimensionality is

not large, and also this is a Gaussian.

So we can efficiently sample from it without the trouble of any fancy techniques.

But this is only for illustration purposes.

Gibbs sampling works even in million dimensional spaces with complicated distributions.

So let's start with the point (0, 0) as the initialization point.

So first of all, on the first substep of the first iteration,

Gibbs sampling suggests that you find the conditional distribution on X1 given X2 = 0,

and it will look like this, because if we know that X2 is 0,

then X1 should be somewhere around

5 or 6 because of the shape of our Gaussian distribution.

It has more mass in that region than in other regions for this particular value of X2.

Let's sample a point from that.

It can look like this for example.

Okay, so this is the first substep.

On the next substep,

we have to compute the conditional distribution on

X2 given that X1 is the one we have just generated.

So X1 is 6 or somewhere around there, and

the conditional distribution on X2 will look like this.

Again, it just reflects the positions where our Gaussian distribution

places its mass when X1 equals 6.

So let's sample from that and get a point like this.

This is the result of our iteration one.

So this is the next point in our Markov chain process.

Note that here we have not converged yet.

We have not even found the region of space where the density of our distribution is high.

We are still somewhere in a low-density region.

So we cannot use these samples yet because they are not from the true distribution.

So let us proceed.

Next iteration, we again sample from the conditional distribution on X1 given that

X2 is around 1, and this is somewhere here.

Okay, let's sample from it and move to this point, and this is

the end of substep one of the second iteration. On substep two,

we will generate a point like this, again from the conditional distribution.

So that is one more substep and the end of

the second iteration. Finally, if we repeat this process for about 10 more iterations,

we will get points like this, which already look like samples from the Gaussian.
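The two substeps per iteration can be sketched in a few lines of Python. This is a minimal sketch, not the exact demo from the slides: the means `mu1`, `mu2`, the correlation `rho`, the chain length, and the burn-in size are all illustrative assumptions, and it uses the known closed-form conditionals of a bivariate Gaussian with unit variances.

```python
import random

# Illustrative parameters (assumptions, not the lecture's exact Gaussian):
# means mu1, mu2, correlation rho, unit marginal variances.
mu1, mu2, rho = 5.0, 5.0, 0.8
cond_std = (1 - rho ** 2) ** 0.5  # std of each one-dimensional conditional

def sample_x1_given_x2(x2):
    # Conditional of a bivariate normal: N(mu1 + rho * (x2 - mu2), 1 - rho^2)
    return random.gauss(mu1 + rho * (x2 - mu2), cond_std)

def sample_x2_given_x1(x1):
    return random.gauss(mu2 + rho * (x1 - mu1), cond_std)

def gibbs(n_iters, x1=0.0, x2=0.0):
    chain = []
    for _ in range(n_iters):
        x1 = sample_x1_given_x2(x2)  # substep 1: update X1
        x2 = sample_x2_given_x1(x1)  # substep 2: update X2
        chain.append((x1, x2))
    return chain

samples = gibbs(2000)     # start from (0, 0), as in the demo
kept = samples[500:]      # discard the first iterations as burn-in
```

Each iteration only needs one-dimensional Gaussian draws, which is exactly the point of the scheme: the hard multidimensional problem is replaced by easy conditional ones.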

So, this was a demo of how Gibbs sampling works and to summarize its properties.

So first of all,

the Gibbs sampling idea is to take

one complicated problem,

generating samples from a multidimensional distribution,

which you may know only up to a normalization constant,

and reduce it to a sequence of one-dimensional sampling problems, and this is very nice.

It makes everything more efficient.

But note that, in the first video this week,

we discussed some schemes where we can generate a point and that is it:

we get one independent sample by spending some time on generating it.

Here, instead, we have a chain of samples.

So you cannot use even the first sample until the chain has, for example, converged.

So we have to wait for, say, the first five hundred iterations and then throw

away those first samples just to get one sample that you care about.

Another positive feature of Gibbs sampling is that it is really easy to implement.

It takes just a few lines of code in Python.

And it is really universally applicable.

So you can almost always sample from

the one-dimensional conditional distributions, and

this way you can sample from the multidimensional one.

And the negative feature is that it can give you really correlated samples.

So if you recall the demo,

all samples were kind of close to the previous one.

So you cannot take large steps here in the Gibbs scheme.

You are always taking some kind of local steps,

because you are moving one dimension at a time.

And this means that your samples are similar to each other, which in turn means

that you are not as efficient in estimating expected values as you would wish.

So let us say that we waited for the first

1,000 iterations of the Gibbs sampling and threw these points away,

saying that they were useless for us because the Markov chain had not converged yet.

And then, starting from iteration 1,001,

we collect samples and write them

down to use for our expected value estimation.

And for example, we may say that we collected 1,000 samples.

Well, we may think that we have 1,000 samples,

but they are so close to each other that in effect it is as if we

had, for example, one hundred independent samples.

So the efficiency of estimating expected values with correlated samples is much lower

than the efficiency of estimating the same thing with

uncorrelated samples, and this can be a problem.

So it is an efficiency problem.
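The "1,000 correlated samples behave like 100 independent ones" intuition can be made concrete with an effective-sample-size estimate. The function `ess_lag1` below and the AR(1)-style formula n(1 - r)/(1 + r) based on the lag-1 autocorrelation are a simplification of my own for illustration, not the only (or best) estimator used in practice.

```python
import random

def ess_lag1(xs):
    # Rough effective sample size from the lag-1 autocorrelation r:
    # n * (1 - r) / (1 + r), an AR(1)-style approximation.
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    lag1 = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1)) / n
    r = lag1 / var
    return n * (1 - r) / (1 + r)

random.seed(0)

# A strongly correlated chain: each value stays close to the previous one,
# mimicking the small local steps of Gibbs sampling.
chain = [0.0]
for _ in range(999):
    chain.append(0.9 * chain[-1] + random.gauss(0, 1))

# Truly independent draws for comparison.
independent = [random.gauss(0, 1) for _ in range(1000)]

print(ess_lag1(chain))        # far below 1000
print(ess_lag1(independent))  # close to 1000
```

Both lists contain 1,000 numbers, but the correlated chain carries far less information about the distribution's mean, which is exactly the efficiency problem described above.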

One other negative feature is that because of the small steps,

it can be hard to converge.

The convergence here in Markov chains is sometimes called mixing.

So you may have to take many steps to find

the high-density region of your probability distribution,

and you have to throw all these points

away because you have not converged yet and you cannot use them.

So because of these small local steps,

you have to wait longer for convergence than you would have if the steps were large,

and finally as we have already discussed,

this method is not parallelizable, and if you want to

sample from, say, a one-million-dimensional space,

you anyway have to sample each dimension one at a time,

which can be really slow.

So in the next video,

we will discuss some other schemes for building Markov chains that

can be used to mitigate some of these problems.