Now let's look at the popular algorithm for hard clustering called

K-means and see if we can connect it to the general form of the expectation maximization.

So, it would be really cool to drive,

it is a special case of EM and to feed it in the general framework.

So hard clustering.

We have a data set of points and if we want to assign each data point to

some of the clusters and now we can't say that suddenly the point,

it belongs to several clusters simultaneously.

Now which data point should be assigned to one and just one cluster, okay?

The K-means algorithm suggest

you to solve this hard clustering problem in the following way.

First of all, let's initialize the parameters randomly.

And here are the parameters are just means so the locations of the cluster.

We don't have weights by and we don't have

some shapes for the clusters Sigma, just locations.

And then in iterations,

we're repeating two sub steps until convergence.

So the first sub step is to find for each data point,

find the closest cluster and assign these data points to this cluster.

So each data point is assigned to belong to the closest cluster to it.

And by closest, I mean according to Euclidean distance.

And on the next sub step,

K-means suggest you to update the parameters by finding,

so for example, to update the first Gaussian,

the first cluster centroid Mu_1

we'll have to find all the data points which we are assigned

to the first cluster on the first sub step and

then find their average which will be the center of this cluster,

updated center and then we repeat these two things until convergence.

So if you look carefully at this algorithm,

it looks really similar to the EM, right?

We also have random initializations and then we in iterations repeat

two steps which also look really close to the stuff we had in the EM.

First of all, we compute some property for

each data point and then we update the parameters by using these properties.

Let's see if we can somehow convert our Gaussian Mixture Model and EM applied to it,

so to obtain exactly this K-means algorithm.

So to prove that the K-means is just a special case of

this EM algorithm applied to Gaussian Mixture Model or maybe to some other model.

So first of all, we have to say that we don't have these additional parameters, right?

So let's say that the covariance matrices of the shapes are all

identical matrices which means that all the shapes of each Gaussian is

just uniform circle with fixed radius and the prior weights pi are all uniform also.

So they equal to one divided by the number of clusters.

This way we will have only Mu as the parameter of our Gaussian Mixture Model or kind of

restricted Gaussian Mixture Model and then

the density of each data point given that we know the cluster,

now looks like this.

So, it's some normalization times exponent of

point five times the Euclidean distance between x1 and Mu_c where Mu_c

is the center of the Gaussian number c. You can prove that

this is exactly the formula for multidimensional Gaussian, multivarian Gaussian.

If you have identical covariance matrix sigma,

if it equals to identical matrix.

Now, let's look at the E and M steps of this restricted Gaussian Mixture Model.

On the E -step, the expectation-maximization algorithm suggests you

to find the minimum of the KL divergence

between the variational distribution q

and the posterior distribution p(t) given data and parameters.

And if we are going to do it,

we will find not hard assignments but soft assignments, right?

As we obtained in the Gaussian Mixture Model.

So for each data point this q will say with

which probability it belongs to one cluster or another or one Gaussian or another.

And this is not what we want.

We want to derive K-means from this model.

So we want this step to be hard clustering.

How can we do it? Well, let's find

not the best optimal q but q among some family of simple qs.

So let's look at the q among all delta functions.

Where delta function means that this probability distribution takes

one value of probability one and all the other values with probability zero.

It's really certain about what the outcome will be.

So, let's imagine that the posterior distribution looks like this.

So our data point belongs to the cluster number two with probability

like 0.6 and cluster number one with probability 0.4. What do you think?

What the delta function approximation will be?

So what will be the closest q to it according to

KL distance among the family of delta functions?

Well, it will be a distribution like this.

It will put all the probability mass into the class that

had the highest probability according to our posterior.

So by restricting the set of possible qs on the E-step,

we're actually approximating the actual posterior distribution with delta functions.

And this way the optimal q on the E-step will look like this, right?

It's some delta function.

So it's probability one then t_i equals to some predefined value and zero otherwise.

And what is c_i? What does this predefined value?

Well, it's the most probable configuration according to our posterior distribution.

So, we have to find c_i which maximizes

the posterior probability of latent variable t_i given the data and the parameters theta.

Recall that the posterior distribution itself

is proportional to the full joint distribution which is x,

u and t as p of t and

parameters and we also have some normalization constant but we don't care about it

because we want to maximize this thing with respect to

t_i or with respect to c which like respect to the value of t_i.

So normalization doesn't change anything.

And if you recall,

this thing equals to normalization times the x given t which is exponent of

minus 0.5 times Euclidean distance and times

the prior and we agree that the prior will be just uniform.

So it doesn't depend on c and we also don't care about when we maximize this thing.

So finally, we want to maximize this expression with respect to

the value of t_i and we have normalization one divided by Z and we have.

pi of c which both actually doesn't depend on c. So,

we can throw it away and will not change anything.

And maximizing the exponent of minus something is the same as minimizing something.

So we can just minimize 0.5 times Euclidean distance.

And finally, we have an expression like this.

So our optimal q from the E-step is the same as,

is just a delta function which takes the value of arg minimum of the Euclidean distance.

So the closest cluster center with probability

one and it has probability zero to be anything else.

Okay?

So, this is exactly like we have in K-means,

which means that this restricted Gaussian Mixture Model if you apply EM to it

and if you restrict your possible values

of the variational distribution q on the E- step to be delta functions,

you will get exactly the same first sub step of the iteration.

So now, let's derive the formula for

the M-step on the blackboard and see how it's connected,

how does the formula for the second sub step of K-means and x to

the M-step of the expectation maximization

apply to this restricted Gaussian Mixture Model with

this particular choice of q which is delta function.