0:03

So, let's look into

the details of the E-step, or expectation step, as people usually call it.

Recall that in the previous video,

we derived the variational lower bound,

which is a lower bound for the log likelihood which

we want to maximize at any given point theta,

and it also depends on the variational parameter Q,

which itself is a distribution.

And on the E-step,

we want to maximize this lower bound with respect to Q,

while fixing theta to be some particular value.

So, we can illustrate this situation like this.

We have our blue curve,

the log likelihood, and we have a family of lower bounds depending on different Q,

and we want to choose the one that has the highest value

at the particular current point theta K,

which basically means that we want to minimize

the gap between the lower bound and the blue curve.
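
In symbols, this step can be sketched as follows. The names here are assumptions made for illustration, not notation from the video: $\mathcal{L}(\theta, q)$ for the variational lower bound, $X$ for the whole data-set, and $\theta^{k}$ for the current point theta K.

```latex
\[
\log p(X \mid \theta) \;\ge\; \mathcal{L}(\theta, q)
\quad \text{for every distribution } q,
\]
\[
q^{k+1}
= \arg\max_{q} \, \mathcal{L}(\theta^{k}, q)
= \arg\min_{q} \, \bigl[ \log p(X \mid \theta^{k}) - \mathcal{L}(\theta^{k}, q) \bigr].
\]
```

The second equality holds because the log likelihood at the fixed point does not depend on Q, so pushing the bound up is the same as shrinking the gap.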

So, let's derive an expression for this gap by using a blackboard.

So, let's look closer into this gap between

the marginal log likelihood and the lower bound.

First of all, we can decompose the marginal log likelihood

with respect to individual objects as we usually do.

So, we assume that the data-set consists

of the objects that are independent given the parameters.

We have this thing,

and then we can recall the definition of the lower bound,

which we have just derived in the previous video.

So, the lower bound is a sum with respect to the objects.

Then a sum with respect to the values of the latent variable,

for example, from one to three,

in the case of three Gaussians.

Then the variational distribution Q of ti equal to the value c,

times the logarithm of the ratio

of the joint distribution P of xi and ti,

given the parameters, to the variational distribution Q.
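
Written out, the gap described here looks roughly like this. The symbol $N$ for the number of objects and $\mathcal{L}(\theta, q)$ for the lower bound are assumed names, and the upper limit 3 follows the three-Gaussians example.

```latex
\[
\log p(X \mid \theta) - \mathcal{L}(\theta, q)
= \sum_{i=1}^{N} \log p(x_i \mid \theta)
\;-\; \sum_{i=1}^{N} \sum_{c=1}^{3} q(t_i = c)\,
\log \frac{p(x_i, t_i = c \mid \theta)}{q(t_i = c)}
\]
```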

So, let's look at how we can simplify this expression.

Well, first of all,

we may notice that the sum with respect to the objects

appears in both terms of this expression,

so we can take the sum outside of the terms.

It'll look like this. So, sum with respect to the objects,

then a big bracket.

Then we want to also push this logarithm

of the marginal likelihood of

an individual object inside the summation with respect to c,

so as to put the logarithms close together.

And to do that, let's multiply this logarithm by one.

Basically, by the sum of the probabilities of the variational distribution Q.

So, this expression is just one because it's a valid probability distribution.

This thing always equals to one.

And now, since the logarithm of the marginal likelihood doesn't depend on c,

we can put this thing inside the summation.

So, we can put it here.

And, this will not change anything.

And finally, we can rewrite the second part, the second term,

the sum with respect to the values,

times again Q, times the logarithm of the ratio.

3:50

And then, we can put this summation with respect to the values of the

latent variable outside of the whole expression,

so we'll have sum with respect to the objects in the data-set,

sum with respect to the values of latent variables from one to three for example,

the weights from the variational distribution times the difference between the logarithms.

So, the logarithm of the marginal likelihood

minus the logarithm of the ratio

of the joint distribution of

xi and ti divided by the variational distribution Q.

And, since the logarithm has the property that

difference between the logarithms is logarithm of the ratio,

we can rewrite this whole expression like this.

So, it equals the sum with respect to the objects,

the sum with respect to the values,

getting weights from the variational distribution Q,

times the logarithm of the ratio.

So, the marginal likelihood P of xi,

given parameters theta, divided by this ratio.

So, divided by the joint distribution P of xi and ti,

given parameters, and this thing should be divided by Q,

but we can put this Q in the numerator,

because it's like dividing twice.
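
The steps so far can be sketched as one chain of equalities. The index ranges are left implicit and the lowercase $q$ and $p$ are assumed notation for the distributions Q and P on the blackboard.

```latex
\begin{align*}
&\sum_{i} \Bigl[ \log p(x_i \mid \theta) \underbrace{\sum_{c} q(t_i = c)}_{=\,1}
 \;-\; \sum_{c} q(t_i = c) \log \frac{p(x_i, t_i = c \mid \theta)}{q(t_i = c)} \Bigr] \\
&\quad= \sum_{i} \sum_{c} q(t_i = c)
 \Bigl[ \log p(x_i \mid \theta) - \log \frac{p(x_i, t_i = c \mid \theta)}{q(t_i = c)} \Bigr] \\
&\quad= \sum_{i} \sum_{c} q(t_i = c)
 \log \frac{p(x_i \mid \theta)\, q(t_i = c)}{p(x_i, t_i = c \mid \theta)}
\end{align*}
```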

So, now to simplify this thing,

we can notice that, by the definition of conditional probability,

this part, the joint distribution, equals the probability of ti equal to c given the data,

given xi and theta,

times the marginal likelihood P

of xi, given theta.

And so, these two terms cancel because they appear in both the numerator and the denominator.

And finally, we have an expression like this.

We have a sum with respect to the objects,

a sum with respect to the values of the latent variable,

the weights of the variational distribution Q, times the logarithm of Q,

divided

by the posterior distribution of ti.

So, the probability of ti equal to c, given the data point xi

and the parameters theta.

So, look closer at this final expression.

This thing exactly equals the Kullback-Leibler divergence between the two distributions.

So, this is a KL-divergence

between Q of ti,

and the posterior distribution P of ti,

given xi and theta.
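
As a sketch of this last step, the joint is factored by the definition of conditional probability, the marginal likelihood cancels, and what remains is a sum of KL-divergences. The symbol $\mathrm{KL}(\cdot \,\|\, \cdot)$ and the lowercase $q$, $p$ are assumed notation.

```latex
\begin{align*}
p(x_i, t_i = c \mid \theta) &= p(t_i = c \mid x_i, \theta)\, p(x_i \mid \theta), \\
\sum_{i} \sum_{c} q(t_i = c)
 \log \frac{p(x_i \mid \theta)\, q(t_i = c)}{p(t_i = c \mid x_i, \theta)\, p(x_i \mid \theta)}
&= \sum_{i} \sum_{c} q(t_i = c) \log \frac{q(t_i = c)}{p(t_i = c \mid x_i, \theta)} \\
&= \sum_{i} \mathrm{KL}\bigl( q(t_i) \,\|\, p(t_i \mid x_i, \theta) \bigr)
\end{align*}
```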

So, to summarize what we have just derived,

the gap between the marginal log likelihood and

the lower bound equals the sum of Kullback-Leibler divergences.

So, this thing is a sum with respect to the objects in

the data-set of KL divergences between Q of ti,

and the posterior distribution.

And, we want to maximize this lower bound with respect to theta.

So, we want to push this lower bound as high as possible,

maximize it with respect to, I'm sorry, not theta but Q.

Maximizing this expression with respect to Q,

is the same as minimizing minus this expression, right?

So, it's the same as minimizing this thing.

And, note that the marginal log likelihood doesn't depend on Q at all.

So, we can as well minimize this difference.

So, maximizing the lower bound is the same as minimizing this whole difference,

and minimizing this difference is the same as minimizing this sum of KL-divergences

because this is what this difference is.

So, maximizing the lower bound is the same as minimizing

the sum of the KL-divergences with respect to Q.
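
The identity can also be checked numerically. Here is a minimal sketch, assuming a toy one-dimensional mixture of three unit-variance Gaussians; the weights, means, data points, and function names are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy model: a 1-D mixture of three unit-variance Gaussians.
WEIGHTS = [0.5, 0.3, 0.2]   # assumed mixing weights p(t = c | theta)
MEANS = [-2.0, 0.0, 3.0]    # assumed component means


def joint(x, c):
    """p(x, t = c | theta) = weight_c * N(x | mean_c, 1)."""
    density = math.exp(-0.5 * (x - MEANS[c]) ** 2) / math.sqrt(2.0 * math.pi)
    return WEIGHTS[c] * density


def log_marginal(x):
    """log p(x | theta), summing the joint over the three components."""
    return math.log(sum(joint(x, c) for c in range(3)))


def lower_bound(xs, qs):
    """Sum over objects i and values c of q_i(c) * log(p(x_i, c) / q_i(c))."""
    return sum(q[c] * math.log(joint(x, c) / q[c])
               for x, q in zip(xs, qs) for c in range(3))


def kl_to_posterior(x, q):
    """KL(q || p(t | x, theta)) for a single object."""
    evidence = sum(joint(x, c) for c in range(3))
    return sum(q[c] * math.log(q[c] / (joint(x, c) / evidence))
               for c in range(3))


xs = [-1.5, 0.2, 2.7]
# Arbitrary (non-optimal) variational distributions, one per object.
qs = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8]]

gap = sum(log_marginal(x) for x in xs) - lower_bound(xs, qs)
kl_sum = sum(kl_to_posterior(x, q) for x, q in zip(xs, qs))
print(gap, kl_sum)  # the two numbers should agree up to rounding
```

Since these Q are not the posteriors, the gap is strictly positive, and it matches the sum of KL-divergences.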

And, recall that the KL-divergence has two main properties.

So, first of all, it is always non-negative,

and second, it equals zero whenever the distributions coincide.

So, whenever these two distributions are the same. This means that,

by setting Q to be the posterior distribution,

so Q of ti equal to the posterior,

we will optimize this,

we'll minimize this sum to zero,

so to the global optimum.

This sum can never be lower than zero.

So, whenever we are at zero,

we have found the global optimum,

which means we maximize the lower bound to the global optimum as well.

So, to solve the problem on the E-step,

we just have to set the variational distribution Q to be

the posterior distribution on the latent variable ti given the data and the parameters.
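
For a one-dimensional Gaussian mixture, this E-step could be sketched like this. The function name `e_step`, its parameters, and the toy numbers are hypothetical; the code just applies the definition of conditional probability to get the posterior over ti for each object.

```python
import math


def e_step(xs, weights, means, stds):
    """Return q[i][c] = p(t_i = c | x_i, theta) for a 1-D Gaussian mixture.

    A plain sketch: a practical version would work in log space
    (log-sum-exp) to avoid underflow for points far from every component.
    """
    posteriors = []
    for x in xs:
        # Joint p(x_i, t_i = c | theta) for every component c.
        joint = [
            w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
            for w, m, s in zip(weights, means, stds)
        ]
        evidence = sum(joint)  # marginal likelihood p(x_i | theta)
        posteriors.append([j / evidence for j in joint])
    return posteriors


# Assumed toy parameters for three Gaussians.
xs = [-1.5, 0.2, 2.7]
q = e_step(xs, weights=[0.5, 0.3, 0.2], means=[-2.0, 0.0, 3.0], stds=[1.0, 1.0, 1.0])
for x, row in zip(xs, q):
    print(x, row)
```

Each row of `q` is a distribution over the three components for one object, so it sums to one.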

So, to summarize, the gap between the log likelihood and the lower bound we

have equals the sum of Kullback-Leibler divergences between the distribution Q

and the posterior distribution P of ti,

of the latent variable given the data we have and the parameters we have.

Which basically means that,

if you want to maximize this lower bound,

it's the same as minimizing minus the lower bound,

and since the log likelihood doesn't depend on Q,

it's the same as minimizing this difference,

the left hand side of the expression,

and finally it's the same as minimizing this sum of Kullback-Leibler divergences.

And as we know, Kullback-Leibler divergences are non-negative,

and they are equal to zero whenever the distributions coincide,

whenever they are the same,

which means that we can minimize this thing to the optimal value by

just setting Q to be the posterior distribution of ti given the data.

So, this is our optimal solution to the E-step.

So, just set Q to be the posterior with the current values of the parameters,

and it minimizes the gap to zero,

because the KL-divergence is now zero,

and so the lower bound becomes exact at the current point.

The gap is zero,

so the value of the lower bound equals the value of the log likelihood.