In the previous module,

we derived the general form of Expectation-Maximization Algorithm.

In this module, we'll consider

a few concrete latent variable models and

discuss the details how to apply the Expectation-Maximization Algorithm to them.

To start with, let's revisit the Gaussian Mixture Model and see how

the algorithm which we derived from some kind of

intuitive considerations fits into this general framework of EM.

I recall that in Gaussian Mixture Model,

we have a dataset of points and we want to

model them with a mixture of Gaussians naturally.

So, the density of each data point is

a weighted sum of several different Gaussian densities.

And we have three types of parameters,

the weights, the Gaussian locations,

and the Gaussian covariance matrices.

On the E-step of the Expectation-Maximization Algorithm,

if we apply to this Gaussian Mixture Model,

we'll have to assign q to be the posterior distribution on the latent variable t,

given the data and the parameters.

It turns out that we did the same,

the exactly the same thing in the Gaussian Mixture Model,

just from intuitive considerations.

The E-step is the same as in the first subtype of

each iteration of GMM of EM applied to Gaussian Mixture Model.

Okay, so what about the M-step?

Well, on the M-step in the EM algorithm,

we have to maximize the expected value of

the joint log likelihood with their respective parameters,

while fixing q, while fixing q to be the posterior from the previous iteration.

And in the Gaussian Mixture Model,

we decided to use this kind of formulas so, for example,

mean of the first Gaussian will be a weighted average of

the data points with weights being the posterior distributions,

or q, in limitation of the EM algorithm divided by some normalization.

How does this to collect it?

Well, let's use the blackboard to derive the connection.

To the proof that the maximization of this expected theory of logarithm,

which EM algorithm asks us to do,

is the same as the formulas we kind of intuitively derived for the GMM.

Let's look at how can we apply the general form of

the Expectation-Maximization Algorithm to Gaussian Mixture Model.

On the M-step of the expectation maximization, applied to GMM,

Gaussian Mixture Model, who would like to maximize the full expression?

We want to maximize the sum with respect to objects,

maximize with respect to parameters, by the way.

Sum with respect to objects in the dataset,

then expected value with respect to the Hive operational distribution,

which we found on the E-step,

of the logarithm of the joint distribution,

logarithm of probability or density of x_i and t_i, our given parameters.

Let's look closer at this expression in the case of Gaussian Mixture Model.

This thing equals to the sum with respect to objects,

again, and then we write the expected value by the definitions,

so expected value is just sum with respect to all of the values.

If we have three Gaussians it will be from one to three,

of weights, of probabilities,

from the operational distribution q,

there's this log joint likelihood.

This logarithm of the joint likelihood is actually equals to the x_i given t_i,

which is a Gaussian.

So, we have our normalization for this Gaussian times exponent,

and let's say that we have one dimensional data,

so in this case, we have one dimensional Gaussian,

which looks like this, back sign minus mu_c.

The center of the cluster number c of

the Gaussian number c divided by two times the variance of the Gaussian number c,

and that's the prior,

times the probability of t_i equals to c,

which is by c. It's also a parameter which we won't optimize for.

Let's write this expression even more.

This will be, again,

sum with respect to objects.

Sum with respect to the values of a latent variable,

the weights times the logarithm of pi_c divided by z,

minus the logarithm of exponent of minus something,

which is just this something.

So, it's x_i minus Mu_c, oh I'm sorry,

I forgot the squared here,

squared, divided by two times the variance.

And let's optimize this expression,

for example, with respect to Mu_1.

The gradient of this whole function with respect to Mu_1 will be,

so first of all,

its sum with respect to objects,

as always, so we can push the gradient inside the sum.

Then we'll have sum with respect to the Gaussians,

but note that the only Gaussian that depends on Mu_1 is the first one.

All old terms corresponding to c equals two or c

equals three will have zero gradient with respect to Mu_1.

We'll have, we don't have any summation with respect to c here.

Instead, we assume that t_i equals to one because otherwise,

the gradient will be zero,

and we don't care,

times gradient of this expression with respect to Mu_1, which is zero,

because pi_c doesn't depend on new one,

and the normalization constant of the normal distribution,

of the Gaussian distribution,

also doesn't depend on the location,

on the mean value.

So, this would be zero minus,

well, here we have this expression.

First of all, we'll have the same normalization,

it's a constant, with respect to Mu_1,

and then times the gradient of this expression,

which is two times x_i minus Mu_c,

and times minus one.

It's a gradient of the complete of the, basically the general.

And we want to set this gradient to zero to find the optimal value of Mu_1.

This, two and two one-ish,

we can basically divide by two in numerator and denominator and finally,

we can multiply this whole expression by,

which should be, Mu_1 here and Sigma_1 because we assume that c is one.

We can multiply this whole expression by the variance of the first Gaussian,

which we actually don't know,

but since it's not zero,

we can multiply by it and so it will vanish.

And this way, we'll have finally,

the full length expression of this gradient equals to,

this gradient multiply it by Sigma_1 squared,

equals to the sum of the objects, the weights.

The third term here will be minus x_i so it will be minus x_i,

minus times minus give plus,

so it will be plus x_i,

and minus the second term,

sum with respect to objects, the weights,

so we're basically decomposing these two terms into two separate expressions,

times Mu_1, and this thing equals to zero.

You can note that here,

Mu_1 doesn't depend on i,

so we can put brackets here like this.

And finally, it's a linear expression with respect to Mu_1,

so we can write down that Mu_1 equals to large ratio.

This term, sum with respect to objects,

weights, and x_i's divided by sum normalization,

basically, sum with respect to all objects of all the weights.