On the M-step of

the expectation maximization algorithm for the restricted Gaussian mixture model,

we have to maximize the following expression.

So we have to find the maximum of a sum over the individual objects in our data set,

the training objects, of expected values

with respect to the variational distributions q of t_i of

the joint log-likelihood of x_i and t_i given the parameters,

and here the parameters are just mu,

so given mu, we maximize this expression with respect to mu,

that is, with respect to mu_1 through mu_K, where K is the number of clusters.
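Written out in symbols (with N for the number of training objects and K for the number of clusters, notation assumed here rather than taken from the slide), the expression being maximized is:

```latex
\max_{\mu_1, \dots, \mu_K} \; \sum_{i=1}^{N} \mathbb{E}_{q(t_i)} \big[ \log p(x_i, t_i \mid \mu_1, \dots, \mu_K) \big]
```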

We can maximize this thing explicitly by taking the derivative and

setting it to zero,

but recall that in the previous video,

we maximized the same expression for the Gaussian mixture model.

We can now reuse the results from that video in this one.

Recall that the optimal parameter, for mu number c,

for example, was the following one.

It was the average over all the data points in our training data set:

each data point x_i, weighted by the variational distribution

q of t_i equals c, the weight of that point according to the Gaussian number c,

divided by the normalization,

which is the sum of all the weights q of t_i equals c.
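In symbols (again writing N for the number of training points, an assumed notation), the recalled update for the mean of Gaussian number c is:

```latex
\mu_c = \frac{\sum_{i=1}^{N} q(t_i = c)\, x_i}{\sum_{i=1}^{N} q(t_i = c)}
```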

Recall that in our particular case, on the E-step,

we are using delta functions instead of the full variational distribution q.

Our q looks like this:

q of t_i equals one if t_i equals c_i-star,

which we have already pre-computed on

the E-step, and zero otherwise.

We can continue this expression and

simplify it by using this definition of q of t_i in this particular case.

This thing will again be just a sum over all the objects,

but any object whose c_i-star is not equal to

c will not contribute to the sum, right?

So any object which didn't come from the Gaussian number c

will have its weight q equal to zero,

in both the numerator and the denominator, and so,

it will not contribute to this sum.

Thus, we can write it down like this.

We can say that this thing will be equal to a sum over

all the objects from the Gaussian number c, that is, all i such

that c_i-star equals c: we add up

these data points and divide them by the normalization, the number of such data points.

That is the number of i such that c_i-star equals c,

and this is exactly the centroid update formula from the K-means algorithm.

To summarize, we have just proven the following about

the M-step of the expectation maximization algorithm applied to

this restricted Gaussian mixture model, the one without some of the parameters,

where we keep just the locations mu of the Gaussians.

If we use the restricted q,

that is, the delta function from the restricted E-step,

then on the M-step, we obtain this kind of formula.

So we have to update each mu_c as the center of all the data points

assigned to the cluster number c, exactly as we had in K-means.

So to summarize, K-means is actually

a special case of applying

the expectation maximization algorithm to a Gaussian mixture model,

but first of all,

this Gaussian mixture model has to be a restricted one, where we set

the covariance matrices to be the identity and the prior probability of each Gaussian c to be uniform.

We also have to use

a simplified E-step: instead of considering

the full posterior distributions on the E-step,

we approximate them with delta functions.

This makes the E-step and M-step faster, but it also restricts

the flexibility of the model and the efficiency of the EM algorithm.

But anyway, this kind of restriction still gives you

a valid EM algorithm which still is guaranteed to converge and has nice properties.
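Putting the restricted E-step and M-step together yields the familiar K-means loop. A minimal sketch, assuming Euclidean distance and initialization from random data points (all names here are illustrative):

```python
import numpy as np

def kmeans_em(X, K, n_iter=100, seed=0):
    """K-means as restricted EM: delta-function E-step (assign each point
    to its nearest center), then M-step (recompute each center as a mean)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # init from data
    c_star = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # E-step: c_i* = argmin_c ||x_i - mu_c||^2
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c_star = d.argmin(axis=1)
        # M-step: mu_c = mean of the points assigned to cluster c
        new_mu = np.array([X[c_star == c].mean(axis=0) if (c_star == c).any()
                           else mu[c] for c in range(K)])
        if np.allclose(new_mu, mu):  # converged, as EM guarantees
            break
        mu = new_mu
    return mu, c_star

# Example: two well-separated 2-D clusters.
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.2, 0.1],
              [10.0, 10.0], [10.5, 10.0], [10.2, 10.1]])
mu, c_star = kmeans_em(X, K=2)
```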

It's really cool: we have just taken a usual machine learning algorithm for clustering,

which is not probabilistic in any sense,

and shown that it's

a special case of the EM algorithm applied to some restricted Gaussian mixture model.

Using this knowledge, we can, for example,

change something in K-means, and we

know how to adapt the training algorithm to these changes.

For example, we can introduce missing data.

This way, we can build a missing-data K-means

without heuristics like setting the missing data to zero,

but by just extending

this restricted EM algorithm and handling the missing data naturally.
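As one hedged illustration of what that could look like (this exact scheme is my sketch, not a formula from the lecture): mark missing entries as `np.nan`, compute distances over observed dimensions only, and update each center per dimension from the observed entries of its assigned points.

```python
import numpy as np

def kmeans_missing_step(X, mu):
    """One K-means-style EM iteration tolerating np.nan entries in X.
    Illustrative sketch only: observed entries drive both steps."""
    obs = ~np.isnan(X)                    # mask of observed entries
    Xf = np.where(obs, X, 0.0)            # replace nan by 0 for arithmetic
    # E-step: squared distance to each center over observed dims only
    d = np.stack([(np.where(obs, Xf - m, 0.0) ** 2).sum(axis=1) for m in mu])
    c_star = d.argmin(axis=0)
    # M-step: per-dimension mean of the observed entries in each cluster
    new_mu = mu.copy()
    for c in range(len(mu)):
        sel = c_star == c
        n_obs = obs[sel].sum(axis=0)      # observed count per dimension
        seen = n_obs > 0
        new_mu[c, seen] = Xf[sel].sum(axis=0)[seen] / n_obs[seen]
    return new_mu, c_star

# Tiny example: nan marks a missing coordinate.
X = np.array([[0.0, np.nan], [1.0, 0.0], [10.0, 10.0], [np.nan, 11.0]])
mu, c_star = kmeans_missing_step(X, np.array([[0.0, 0.0], [10.0, 10.0]]))
```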

Actually, this kind of trick, simplifying the E-step by considering

a restricted set of distributions,

is strongly connected to something called variational inference,

which we'll discuss in week four.