
[MUSIC]

Let's see how Metropolis-Hastings can work in this simple, one-dimensional case.

Again, it's not that useful in one dimension, but

it's chosen just for illustrative purposes.

So let's say we want to sample from this bimodal distribution, the blue curve.

And let's say we start with the orange point, which is at position 0

at iteration 0, so it's our initialization.

Let's see how the iterations work here.

First of all, let's choose the proposal distribution Q.

And to make everything simpler,

let's just use a standard normal centered around the current point.

So at each iteration, the Markov chain Q will propose to move

somewhere around the current point, with some small variance, 1.

And for example, at the first iteration,

the Markov chain may propose this point, the red one.

So now we have to compute the acceptance probability, so

whether we want to accept this point or not.

This is the definition of the acceptance probability, which we proved

in the previous video. But note the ratio we have in this case:

Q(x -> x') and Q(x' -> x) are just the same.

It is a property of our current proposal distribution Q

that it doesn't depend on the order of the arguments.

Which means that we can just cancel this Q, and what is left is just

the ratio of the densities at the new point and at the previous one.

Which kind of makes sense: it says that if the new point is more

probable than the previous one, we'll definitely accept it, with probability 1.

If it's not the case, then maybe we'll accept it and maybe we won't,

depending on how much less probable the new point is than the previous one.

So in this particular case, for this particular red dot proposal,

the new point is much more probable than the previous one.

It's almost four times more probable, which means that the probability

of acceptance here is 1 and we'll definitely keep this point.

So the end of iteration 1 is moving to this new point.
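This acceptance step can be sketched in a few lines. This is a minimal sketch, assuming a hypothetical unnormalized bimodal target (two Gaussian bumps standing in for the blue curve; the specific mixture is my own choice, not from the lecture):

```python
import math

def target_density(x):
    # Hypothetical unnormalized bimodal target: Gaussian bumps at -2 and +2,
    # standing in for the blue curve from the lecture.
    return math.exp(-0.5 * (x - 2.0) ** 2) + math.exp(-0.5 * (x + 2.0) ** 2)

def acceptance_probability(x_old, x_new):
    # With a symmetric proposal Q, the Q terms cancel, so only the
    # ratio of target densities at the new and old points remains.
    return min(1.0, target_density(x_new) / target_density(x_old))

# Moving from the valley at 0 toward the mode at 2: the new point is
# a few times more probable, so the move is accepted with probability 1.
print(acceptance_probability(0.0, 2.0))
print(acceptance_probability(2.0, 0.0))  # the reverse move may be rejected
```

For this particular mixture, the point at the mode happens to be almost four times more probable than the point in the valley, which mirrors the example in the lecture.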

Okay what about iteration two?

Let's sample again a point from the proposal and

we'll be somewhere around here for example.

Again computing the acceptance probability:

here we moved a little bit toward the higher-density region.

It's not that much higher, but since it's higher, we will definitely accept

this point, so with probability 1 we're keeping this point, okay.

New proposal; this one is trying to move

to a really improbable region according to the blue curve, so

we'll keep this point with probability 0.13,

and now we're flipping a biased coin which with probability

0.13 tells us to accept this point and

with probability 0.87 to reject it.

So we flipped our biased coin, and

in this particular example it told us to reject this point.

Okay, why not?

Another proposal, it asks us to move here.

And this point is a bit less probable than the previous one,

and we will keep it with probability 0.73.

So we'll most likely keep this one.

Again, we are flipping a biased coin, and now it tells us again to reject the point.

Well, why not, it happens sometimes.

It was more probable to accept it, but we happened to reject it.

So again we are staying at the same place we were at the previous iteration.

And finally, if you repeat this process for long enough, you will get a plot like this.

So you will move in your sample space; here we have like 50 iterations, and

sometimes you will stay at the same place, and this plot will become flat.

But generally you will move around, and

you can plot a histogram of the generated points, and

it looks kind of close to the blue curve you want to sample from.

So in this case we didn't get exactly the samples we wanted,

the histogram is not exactly like the blue curve, but it's close, so

it's a reasonable way to sample from the blue curve in this particular case.
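The whole loop described above (propose, compute the acceptance probability, flip a biased coin) can be put together as a short sampler. A minimal sketch, again assuming a hypothetical bimodal target as a stand-in for the blue curve:

```python
import math
import random

def target_density(x):
    # Hypothetical unnormalized bimodal target standing in for the blue curve.
    return math.exp(-0.5 * (x - 2.0) ** 2) + math.exp(-0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n_iters, x0=0.0, proposal_std=1.0, rng=random):
    x = x0
    samples = []
    for _ in range(n_iters):
        # Propose a move from a normal centered at the current point.
        x_new = x + rng.gauss(0.0, proposal_std)
        # Symmetric proposal: the acceptance ratio is just the density ratio.
        alpha = min(1.0, target_density(x_new) / target_density(x))
        # Flip a biased coin: accept with probability alpha, otherwise stay.
        if rng.random() < alpha:
            x = x_new
        samples.append(x)  # on rejection we record the old point again
    return samples

samples = metropolis_hastings(5000)
# A histogram of `samples` should look roughly like the bimodal target.
```

Note that rejected proposals still produce a sample (the old point is repeated), which is exactly why the trace plot becomes flat in places.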

Now the question is what happens if we change our proposal.

So let's say we use a Gaussian with smaller variance.

So we propose always to move just

with tiny little steps around the previous point.

Well, it kind of works, but in [INAUDIBLE] situations it doesn't

manage to move outside the low-density region where it started.

So in 50 iterations it hasn't converged yet;

it will definitely converge at some point, but we don't know when.

And this means that using these small steps is

a much less efficient choice here.

What about large steps?

Well, if we increase the variance of our proposal distribution Q to be 100,

then we'll get large steps which is nice in terms of convergence and

uncorrelated samples.

But it will stay at the same place for really long periods,

because it will often happen that we stand at a nice, highly probable place, and

the Markov chain Q proposes to move far away from it, and we don't like it.

So we stay where we were.

And this means that our actual samples will be kind of correlated,

because we often stay at the same place, and we waste the resources and

capacity of our computers.
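This trade-off between tiny and huge steps shows up directly in the acceptance rate of the chain. A minimal sketch under the same hypothetical bimodal target; the specific variances are my own illustration, not values from the lecture:

```python
import math
import random

def target_density(x):
    # Hypothetical unnormalized bimodal target standing in for the blue curve.
    return math.exp(-0.5 * (x - 2.0) ** 2) + math.exp(-0.5 * (x + 2.0) ** 2)

def acceptance_rate(proposal_std, n_iters=2000, x0=0.0, seed=1):
    # Run a short Metropolis-Hastings chain and report the fraction
    # of proposals that were accepted.
    rng = random.Random(seed)
    x, accepted = x0, 0
    for _ in range(n_iters):
        x_new = x + rng.gauss(0.0, proposal_std)
        if rng.random() < min(1.0, target_density(x_new) / target_density(x)):
            x, accepted = x_new, accepted + 1
    return accepted / n_iters

# Tiny steps are almost always accepted but explore the space very slowly;
# huge steps are mostly rejected, so the chain keeps staying where it was.
print(acceptance_rate(0.1), acceptance_rate(10.0))
```

Neither extreme is efficient: a moderate step size balances moving around the space against not rejecting most proposals.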

I want to share with you one more thing about the Metropolis Hastings approach,

it's a really cool perspective on it which tells us that

Metropolis Hastings can be considered as a correction scheme.

So if you have a slightly wrong version of your sampling scheme,

you can correct it with Metropolis-Hastings.

Let's look at one example, so recall the Gibbs scheme, the Gibbs sampling.

We used it to sample points one dimension at a time, and

the Gibbs scheme is inherently not parallel, so we have to

know all the information from the previous sub-steps to make the next sub-step.

Okay, let's try to make it parallel. [COUGH]

Let's break the Gibbs sampling scheme and

use the information from the previous iteration.

So the sub-steps will not depend on each other.

This way, if we have a million-dimensional space, and if we have a million computers,

we can do all the sub-steps in parallel, and we'll be really, really fast.

But the problem is that we broke our scheme; we no longer have any convergence

guarantees that it will converge to the true distribution which we want.

And it will not, actually.

So it will generate samples from some wrong distribution.

What can we do here?

Well, we can use this thing as a proposal distribution for Metropolis-Hastings

and then correct it with the Metropolis-Hastings acceptance step.

And since this particular proposal distribution is not arbitrary,

it's already almost right,

almost generating points from your desired distribution,

you will not reject too many points,

because this is already almost the thing that you want to do.

So, you will just occasionally reject one point or another.

But generally it may be much, much more efficient overall,

because now you can do the Gibbs sub-steps in parallel.
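The correction itself is just the general Metropolis-Hastings acceptance probability, where the proposal Q is no longer symmetric, so its ratio does not cancel. A minimal sketch with hypothetical names: `p` is the (possibly unnormalized) target density, and `q_density(a, b)` is the density of proposing b when the chain is at a.

```python
import math

def mh_acceptance(p, q_density, x_old, x_new):
    # General Metropolis-Hastings correction: for an asymmetric proposal,
    # the ratio q(x_new -> x_old) / q(x_old -> x_new) does not cancel.
    ratio = (p(x_new) * q_density(x_new, x_old)) / (p(x_old) * q_density(x_old, x_new))
    return min(1.0, ratio)

# Sanity check with a symmetric Gaussian proposal: the Q terms cancel and
# we are back to the plain density ratio from earlier in the lecture.
p = lambda x: math.exp(-0.5 * x * x)            # hypothetical target
q = lambda a, b: math.exp(-0.5 * (b - a) ** 2)  # symmetric Gaussian step
```

Plugging a one-sweep "parallel Gibbs" update in as `q_density` is what turns the broken scheme into a valid sampler: wrong proposals are occasionally rejected, restoring convergence to the true distribution.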

So to summarize.

The Metropolis-Hastings approach is the rejection sampling idea applied to Markov chains.

And the nice thing about this approach is that it gives you a whole

family of Markov Chains and you can choose the one that you like.

Another bright property of this algorithm is that it works for

unnormalized probabilities and densities, just like Gibbs sampling,

and it's also kind of easy to implement.

It is a bit harder than Gibbs sampling maybe, but

anyway, it's like five lines of code.

And the negative feature is that the samples are less correlated, but

still somewhat correlated.

And also, the property that it gives you a whole family of

Markov chains is kind of a double-edged thing.

So first of all, it gives you flexibility to choose the right thing for

your particular problem.

But second of all it forces you to think, it forces you to choose something.

And that may be hard in some cases, and harder to automate.

So if something always works, you don't have to think at any point.

You can just automate the whole process.

And to apply Metropolis-Hastings, you usually have to think a little bit and

understand which proposal will make sense for your particular application.

[MUSIC]