0:03

So, let's look into the details of the E-step, or expectation step as people usually call it. Recall that in the previous video we derived the variational lower bound, which is a lower bound for the log-likelihood we want to maximize. It is evaluated at a given point theta, and it also depends on the variational parameter q, which is itself a distribution.

On the E-step, we want to maximize this lower bound with respect to q, while fixing theta to some particular value. We can illustrate the situation like this: we have the blue curve, the log-likelihood, and a family of lower bounds depending on different choices of q, and we want to choose the one that has the highest value at the current point theta^k. This basically means that we want to minimize the gap between the lower bound and the blue curve. So, let's derive an expression for this gap on the blackboard.

Let's look closer at this gap between the marginal log-likelihood and the lower bound. First of all, we can decompose the marginal log-likelihood into a sum over the individual objects, as we usually do, since we assume the data set consists of objects that are independent given the parameters. Next, recall the definition of the lower bound, which we derived in the previous video. It is a sum over the objects, then a sum over the values of the latent variable, for example from one to three in the case of three Gaussians, of the variational distribution q of t_i equal to that value, times the logarithm of a ratio: the joint distribution p of x_i and t_i given the parameters, divided by the variational distribution q.
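In the notation used throughout this derivation (N independent objects, three mixture components), the decomposition and the lower bound can be written as:

```latex
\log p(X \mid \theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta)
\;\ge\;
\mathcal{L}(\theta, q) = \sum_{i=1}^{N} \sum_{c=1}^{3}
q(t_i = c) \log \frac{p(x_i, t_i = c \mid \theta)}{q(t_i = c)}
```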

So, let's see how we can simplify this expression. First of all, we may notice that the sum over the objects appears in both terms, so we can take it outside; the expression then becomes a sum over the objects of one big bracket. Next, we want to push the logarithm of the marginal likelihood of an individual object inside the summation over c, to bring the logarithms close together. To do that, let's multiply this logarithm by one, namely by the sum of the probabilities of the variational distribution q. This sum equals one because q is a valid probability distribution. And since the logarithm of the marginal likelihood doesn't depend on c, we can put it inside the summation without changing anything. Finally, we can rewrite the second term: the sum over the values of the latent variable of q times the logarithm of the ratio.
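Written out, the step just described looks like this (the underbrace marks the factor that equals one):

```latex
\log p(X \mid \theta) - \mathcal{L}(\theta, q)
= \sum_{i=1}^{N} \Bigg[
\underbrace{\sum_{c=1}^{3} q(t_i = c)}_{=\,1} \log p(x_i \mid \theta)
- \sum_{c=1}^{3} q(t_i = c)
  \log \frac{p(x_i, t_i = c \mid \theta)}{q(t_i = c)}
\Bigg]
```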

3:50

Then we can put the summation over the values of the latent variable outside the whole expression, so we have a sum over the objects in the data set, then a sum over the values of the latent variable, from one to three for example, of the weights from the variational distribution times a difference of logarithms: the logarithm of the marginal likelihood minus the logarithm of the ratio of the joint distribution of x_i and t_i to the variational distribution q. And since the logarithm has the property that a difference of logarithms is the logarithm of the ratio, we can rewrite this whole expression as follows: it equals a sum over the objects, a sum over the values, of the weights from the variational distribution q, times the logarithm of a ratio. The numerator is the marginal likelihood p of x_i given the parameters theta, and the denominator is the joint distribution p of x_i and t_i given the parameters; this should also be divided by q, but dividing twice by q is the same as putting q in the numerator.
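Combining the two logarithms into a single ratio gives:

```latex
\log p(X \mid \theta) - \mathcal{L}(\theta, q)
= \sum_{i=1}^{N} \sum_{c=1}^{3} q(t_i = c)
\log \frac{p(x_i \mid \theta)\, q(t_i = c)}{p(x_i, t_i = c \mid \theta)}
```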

Now, to simplify this, notice that by the definition of conditional probability, the joint distribution equals the probability that t_i equals c given the data, that is, given x_i and theta, times the marginal distribution p of x_i given theta. These two marginal terms vanish because they appear in both the numerator and the denominator. Finally, we get an expression like this: a sum over the objects, a sum over the values of the latent variable, of the weights of the variational distribution q, times the logarithm of q divided by the distribution of t_i, that is, the probability that t_i equals c given the data point x_i and the parameters theta.
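In formulas, this cancellation reads:

```latex
p(x_i, t_i = c \mid \theta)
= p(t_i = c \mid x_i, \theta)\, p(x_i \mid \theta)
\quad\Longrightarrow\quad
\log p(X \mid \theta) - \mathcal{L}(\theta, q)
= \sum_{i=1}^{N} \sum_{c=1}^{3} q(t_i = c)
\log \frac{q(t_i = c)}{p(t_i = c \mid x_i, \theta)}
```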

Look closer at this final expression: it is exactly the Kullback-Leibler divergence between two distributions. This is the KL divergence between q of t_i and the posterior distribution p of t_i given x_i and theta. So, to summarize what we have just derived: the gap between the marginal log-likelihood and the lower bound equals a sum of Kullback-Leibler divergences, that is, a sum over the objects in the data set of the KL divergences between q of t_i and the posterior distribution.
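Compactly:

```latex
\log p(X \mid \theta) - \mathcal{L}(\theta, q)
= \sum_{i=1}^{N}
\mathrm{KL}\big(q(t_i) \,\big\|\, p(t_i \mid x_i, \theta)\big)
```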

And we want to maximize this lower bound with respect to q, pushing it as high as possible. Maximizing this expression with respect to q is the same as minimizing minus this expression. And note that the marginal log-likelihood doesn't depend on q at all, so we can just as well minimize the difference between the log-likelihood and the lower bound. So, maximizing the lower bound is the same as minimizing this whole difference, and minimizing this difference is the same as minimizing the sum of KL divergences, because that is exactly what this difference is. Therefore, maximizing the lower bound with respect to q is the same as minimizing the sum of the KL divergences with respect to q.

And recall that the KL divergence has two main properties: first, it is always non-negative, and second, it equals zero exactly when the two distributions coincide.
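These two properties are easy to check numerically. Below is a minimal sketch (the helper `kl` and the example distributions are illustrative assumptions, not from the lecture) for discrete distributions:

```python
import numpy as np

def kl(q, p):
    """Kullback-Leibler divergence KL(q || p) between discrete distributions."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

q = [0.5, 0.3, 0.2]
p = [0.2, 0.5, 0.3]

print(kl(q, p))  # positive: the distributions differ
print(kl(q, q))  # 0.0: the distributions coincide
```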

So, whenever these two distributions are the same, the divergence is zero. This means that by setting q to be the posterior distribution, so q of t_i equals the posterior, we minimize this sum to zero, the global optimum: the sum can never be lower than zero, so whenever we reach zero, we have found the global optimum, which means we have also maximized the lower bound to its global optimum.

So, to solve the problem on the E-step, we just have to set the variational distribution q to be the posterior distribution over the latent variable t_i given the data and the parameters. To summarize: the gap between the log-likelihood and the lower bound we derived equals the sum of Kullback-Leibler divergences between the distribution q and the posterior distribution p of t_i, the distribution of the latent variable given the data and the parameters we have.

This basically means that maximizing the lower bound is the same as minimizing minus the lower bound, and since the log-likelihood doesn't depend on q, it is the same as minimizing their difference, the left-hand side of the expression, and finally it is the same as minimizing the sum of Kullback-Leibler divergences. And as we know, KL divergences are non-negative and equal zero exactly when the distributions coincide, which means we can minimize this quantity to its optimal value by simply setting q to be the posterior distribution of t_i given the data.

Â So, this is our optimal solution to the E-step.

Â So just use, just set Q to be posterior with the current values of the parameters,

Â and it minimizes the gap to be zero,

Â because KL distance now is zero,

Â and so the lower bound becomes accurate at the current point.

Â The gap is zero,

Â so the value of the lower bound equals to the value of the log likelihood.
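To make this concrete, here is a minimal sketch of the E-step for a one-dimensional mixture of three Gaussians (the helper functions and all numbers are illustrative assumptions, not from the lecture). Setting q to the posterior makes the lower bound equal to the log-likelihood:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def e_step(x, pi, mu, sigma):
    """Return q(t_i = c) = p(t_i = c | x_i, theta) and the log-likelihood."""
    # joint p(x_i, t_i = c | theta) = pi_c * N(x_i | mu_c, sigma_c), shape (N, 3)
    joint = pi[None, :] * gaussian_pdf(x[:, None], mu[None, :], sigma[None, :])
    evidence = joint.sum(axis=1, keepdims=True)  # marginal p(x_i | theta)
    q = joint / evidence                         # posterior (responsibilities)
    return q, float(np.log(evidence).sum())

def lower_bound(x, pi, mu, sigma, q):
    """Variational lower bound: sum_i sum_c q * log(joint / q)."""
    joint = pi[None, :] * gaussian_pdf(x[:, None], mu[None, :], sigma[None, :])
    return float(np.sum(q * np.log(joint / q)))

# Made-up data and current parameters theta^k (purely illustrative).
x = np.array([-2.0, 0.1, 3.5, 4.0])
pi = np.array([0.3, 0.4, 0.3])
mu = np.array([-2.0, 0.0, 4.0])
sigma = np.array([1.0, 1.0, 1.0])

q, loglik = e_step(x, pi, mu, sigma)
# With q equal to the posterior, the gap is zero: the bound is tight.
print(np.isclose(lower_bound(x, pi, mu, sigma, q), loglik))  # True
```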
