
In the previous video we sketched an algorithm

to fit a Gaussian Mixture Model to a data set.

Now let's see an example of how it can look on some particular set of data.

So consider again this one-dimensional data set of points.

Let's try to find two Gaussians which explain this data well.

So let's start with initializing,

initialization here means to choose two random Gaussians and place them somewhere.

So let's say we place one Gaussian around

175 and the other one around 180, and the variances are, say, around 5.

So it's a random initialization.

So on the next step we have to estimate the sources.

We have to estimate the posterior probabilities of the latent variables,

that is, which Gaussian each data point came from. Let's do it here.

So, you can see that for example the rightmost data points,

they are certainly orange.

Because the orange Gaussian explains them

much better than the blue one, right? It makes sense.

The data points in between our randomly initialized Gaussians are kind of not settled.

They are half blue half orange.

We're not sure about their exact assignment to these Gaussians.
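As a rough sketch (not the lecturer's code), the soft assignment just described, the E-step, can be written like this in Python; the mixture weights, means, and standard deviations are assumptions chosen to match the demo's setup:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    """Posterior probability that x came from each component (the E-step)."""
    weighted = [w * gaussian_pdf(x, mu, sigma) for (w, mu, sigma) in components]
    total = sum(weighted)
    return [d / total for d in weighted]

# Two randomly initialized Gaussians, as in the demo: around 175 and around 180,
# both with standard deviation 5 and equal mixture weights (an assumption).
components = [(0.5, 175.0, 5.0), (0.5, 180.0, 5.0)]

# A rightmost point is almost certainly "orange" (the 180 Gaussian)...
r_blue, r_orange = responsibilities(190.0, components)

# ...while a point exactly in between is half blue, half orange.
mid_blue, mid_orange = responsibilities(177.5, components)
```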

The interesting story is about the leftmost point.

So on the one hand,

both Gaussians think that this data point is kind of weird.

Because it has really low density according to both Gaussians,

so neither of them would really expect it to appear here.

But on the other hand, the blue Gaussian's

density is so much higher

that this data point is almost completely blue.

So you can see here that the blue Gaussian's density is around 0.01 here,

which is low, it's almost zero.

But the orange Gaussian's density is around 0.00002,

which is much lower.

So the posterior probability of being blue is around 1,

which means that this data point is almost certainly blue.

So although both Gaussians do not think this data point should exist,

the blue Gaussian explains it much better,

just because it's closer.
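The arithmetic behind that leftmost point, using the rough densities quoted above (equal mixture weights assumed):

```python
# Both densities are tiny, but only their ratio matters for the posterior.
blue_density = 0.01       # low, almost zero
orange_density = 0.00002  # much lower still
posterior_blue = blue_density / (blue_density + orange_density)
# posterior_blue is about 0.998, so the point is colored almost completely blue
```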

Okay, so this is the end of the first substep of the first iteration.

We have this kind of soft assignment:

for each data point we know its latent variable, or at least a distribution over it.

So now let's re-fit our Gaussians,

let's re-estimate our parameters.

Well it will look something like this.

So we fit a Gaussian to all the blue points and then we fit a Gaussian

to all the orange points, and we get some two Gaussians.

Note that the blue Gaussian is now shifted much to the left, just because

it was assigned lots of points on the left-hand side of this plot.

And it just has to explain them, so it shifted to be around these blue data points.
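A sketch of this re-fitting step (the M-step): each Gaussian is fit with every point weighted by its responsibility from the previous step. The heights and responsibilities below are made up for illustration:

```python
def weighted_gaussian_fit(xs, weights):
    """Weighted maximum-likelihood estimates of mean and standard deviation."""
    total = sum(weights)
    mu = sum(w * x for w, x in zip(weights, xs)) / total
    var = sum(w * (x - mu) ** 2 for w, x in zip(weights, xs)) / total
    return mu, var ** 0.5

# Hypothetical heights and the blue Gaussian's responsibilities for them:
# the left points are fully blue, the middle ones only partially.
xs = [160.0, 162.0, 173.0, 176.0, 181.0, 183.0]
blue_resp = [1.0, 1.0, 0.5, 0.4, 0.05, 0.02]

mu_blue, sigma_blue = weighted_gaussian_fit(xs, blue_resp)
# The blue mean is pulled toward the left-hand points it was assigned.
```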

So this is the end of iteration one and it's already a reasonable solution,

so we have not converged yet, but it is already an approximation of the solution.

And from this I think we can already get some reasonable clustering,

so we can now call all the orange points one cluster,

all the blue points the other cluster,

and about all the points in between we can say we're not certain,

so we can't say anything definite about these half-blue half-orange data points.
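One simple way to read such a clustering off the soft assignments, with an arbitrary certainty threshold for the "not sure" points in between:

```python
def cluster_label(resp_blue, threshold=0.8):
    """Hard cluster label from a soft assignment; the 0.8 threshold is arbitrary."""
    if resp_blue >= threshold:
        return "blue"
    if resp_blue <= 1 - threshold:
        return "orange"
    return "uncertain"

# A clearly blue point, an ambiguous one, and a clearly orange one.
labels = [cluster_label(r) for r in [0.99, 0.55, 0.02]]
# labels == ["blue", "uncertain", "orange"]
```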

But anyway let's proceed a few iterations more.

So on the next iteration, we will re-assign.

We'll compute these assignments of latent variables, right?

So the colors of the data points will change.

So look carefully here.

They change just a little bit.

And you can see that

some of the data points which were half blue half orange,

they shifted to be just a little bit of blue and much more orange.

It's because now our blue Gaussian is more to the left, and so

it explains the data points that are around 175 in height a little bit worse.

So these data points became more orange than blue.

But overall the situation didn't change.

So we're really close to convergence.

So on the next substep of the second iteration we estimate the parameters.

And the Gaussians become a little bit more certain about

their location and their variances reduce.

And finally, one more iteration.

So look carefully at the color of the data points,

they shift just a little bit more.

And again we re-estimate the Gaussians.

And we converge to this kind of solution of

our original Gaussian mixture model fitting problem.

So this was an expectation maximisation algorithm

for Gaussian Mixture Model on this one dimensional data.
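Putting the two substeps together, a minimal end-to-end EM loop for a two-component one-dimensional mixture might look like this. Equal mixture weights are assumed for simplicity, and the data and initialization are invented:

```python
import math

def pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em(xs, init, n_iter=15):
    """Alternate E- and M-steps for a two-component 1D GMM (equal weights)."""
    (mu1, s1), (mu2, s2) = init
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each point.
        r1 = []
        for x in xs:
            d1, d2 = pdf(x, mu1, s1), pdf(x, mu2, s2)
            r1.append(d1 / (d1 + d2))

        # M-step: weighted mean and standard deviation for each component.
        def refit(weights):
            total = sum(weights)
            mu = sum(w * x for w, x in zip(weights, xs)) / total
            var = sum(w * (x - mu) ** 2 for w, x in zip(weights, xs)) / total
            return mu, max(math.sqrt(var), 1e-3)  # floor to avoid collapse

        (mu1, s1), (mu2, s2) = refit(r1), refit([1 - r for r in r1])
    return (mu1, s1), (mu2, s2)

# Two clumps of heights, roughly around 163 and around 180.
heights = [161, 162, 163, 164, 165, 178, 179, 180, 181, 182]
g1, g2 = em(heights, init=((175.0, 5.0), (180.0, 5.0)))
# g1 drifts left toward the 163 clump; g2 stays near 180.
```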

Note, however, that this algorithm doesn't give you an optimal solution.

So it doesn't give you the global maximum of your likelihood.

And actually it turns out that the problem of finding the

global maximum of this likelihood is NP-hard.

So you can't expect anyone to give you an optimal solution in reasonable time.

But well it gives you something reasonable right?

But sometimes this expectation maximisation suffers from local maxima a lot.

We can see an example here.

Let's initialize our Gaussians somewhere else.

Not as we initialized them in the first demo but somewhere else.

Let's say we initialized the Gaussians on the left hand side of the plot.

On the first substep of the first iteration,

we estimate the sources,

we estimate the distribution of the latent variables, and it looks like this.

So all the points that are more to the right than 165 in height,

they will be certainly orange, because the orange Gaussian explains them better.

And the others will certainly be blue.

It's not a very good solution but it's not a solution yet right?

It is just the first substep of the first iteration.

It shouldn't hurt us too much.

But it turns out that because of this mistake,

because we assign so many points to be orange,

on the next substep we will have to

estimate the variance of the orange Gaussian to be high like this.

Just because it has so many points both on the right hand side and on the left hand side.

And now we have a problem.

So at the start of iteration two,

look at the leftmost point.

It will become a little bit orange.

Just because now it is explained by the orange Gaussian as well as the blue Gaussian.

Because now the orange Gaussian is high in this region.

This is a disaster.

Because now we'll re-fit the Gaussians, and the orange Gaussian will

think that now it not only has lots of orange points on the left and on the right,

but also it has this one leftmost point.

Well not fully, just a little chunk of this leftmost point,

but anyway it contributes to the variance estimation,

and the variance estimation of the orange Gaussian will be really high.

It thinks that it owns all the points on this plot, and this will make

the orange Gaussian variance estimation stay high and

the blue Gaussian variance estimation will

reduce a little bit because it becomes more certain in its estimation,

and we're stuck in this kind of situation.

So we can do a few more iterations,

and we converge to this local maximum,

which is certainly not optimal.

But the bottom line here is that

the expectation maximisation suffers from local maxima.

And as usual in optimization,

the way to deal with it is to start several times from different random initializations,

and choose the best run.

The best thing here is that you can track the log likelihood,

and you can find the best run

according to the training log likelihood performance after all.

So at least you can compare different solutions and choose the best one.
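A sketch of this restart strategy: run EM several times from random initializations and keep the run with the best training log-likelihood. Everything here, the EM routine, the data, and the number of restarts, is a stand-in; a two-component equal-weight 1D mixture is assumed:

```python
import math
import random

def pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def em(xs, init, n_iter=15):
    """Two-component 1D GMM with equal mixture weights."""
    (mu1, s1), (mu2, s2) = init
    for _ in range(n_iter):
        r1 = []
        for x in xs:
            d1, d2 = pdf(x, mu1, s1), pdf(x, mu2, s2)
            r1.append(d1 / (d1 + d2))
        def refit(weights):
            total = sum(weights)
            mu = sum(w * x for w, x in zip(weights, xs)) / total
            var = sum(w * (x - mu) ** 2 for w, x in zip(weights, xs)) / total
            return mu, max(math.sqrt(var), 1e-3)
        (mu1, s1), (mu2, s2) = refit(r1), refit([1 - r for r in r1])
    return (mu1, s1), (mu2, s2)

def log_likelihood(xs, comps):
    """Training log-likelihood under an equal-weight mixture."""
    return sum(math.log(0.5 * sum(pdf(x, mu, s) for mu, s in comps)) for x in xs)

def best_of_restarts(xs, n_restarts=5, seed=0):
    """Run EM from several random initializations, keep the best run."""
    rng = random.Random(seed)
    best, best_ll = None, -math.inf
    for _ in range(n_restarts):
        init = tuple((rng.uniform(min(xs), max(xs)), 5.0) for _ in range(2))
        comps = em(xs, init)
        ll = log_likelihood(xs, comps)
        if ll > best_ll:
            best, best_ll = comps, ll
    return best, best_ll

heights = [161, 162, 163, 164, 165, 178, 179, 180, 181, 182]
best, best_ll = best_of_restarts(heights)
```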

Summary. The Gaussian Mixture Model is

a flexible probabilistic model which you can fit to your data,

and it allows you to solve the clustering problem,

and also gives you a probabilistic model of your data.

So you can for example sample from this model new data points.
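Sampling a new point from a fitted mixture is just: first sample the latent variable (which Gaussian), then sample from that Gaussian. The parameters below are invented:

```python
import random

def sample_gmm(components, n, seed=0):
    """components: list of (weight, mu, sigma); weights should sum to 1."""
    rng = random.Random(seed)
    weights = [w for (w, _, _) in components]
    samples = []
    for _ in range(n):
        # First draw the latent variable: which component the point comes from.
        _, mu, sigma = rng.choices(components, weights=weights)[0]
        # Then draw the observation from that Gaussian.
        samples.append(rng.gauss(mu, sigma))
    return samples

# A made-up fitted model: half the mass around 165, half around 180.
new_points = sample_gmm([(0.5, 165.0, 3.0), (0.5, 180.0, 3.0)], n=1000)
```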

The Expectation Maximisation algorithm is a way to train this Gaussian Mixture Model

and other latent variable models,

as we will see in the next few videos.

But for now it's just for Gaussian Mixture Model.

And sometimes it can be faster than Stochastic Gradient Descent,

and it also helps you to handle

complicated constraints, like keeping

the covariance matrix positive semi-definite on each iteration.

And finally expectation maximisation suffers from local maxima.

But you know, it's kind of expected, because the overall problem is NP-hard,

and Stochastic Gradient Descent suffers from this too.

So getting the optimal solution in a reasonable time is not possible.