0:00

[MUSIC]

In the previous video we decided to solve the clustering

problem by making a probabilistic model of our data.

But how? How can we model our

data probabilistically?

Well we don't know that many distributions so far.

But we know Gaussians, right?

So in week one we learned how to fit the parameters of the Gaussian

distribution to a data set.

And we can do it here, it's a reasonable thing to do.

But it turns out that a Gaussian is not the best model for this kind of data.

Recall that we decided that our data consist of several clusters or

groups which may be far away from each other.

And even in this simple example, we can see that a Gaussian has

to model all the data points as one big circle, or maybe an ellipse.

And in this case, it just has to assign high probability to the center of

the circle; that's the way a Gaussian works.

And you can see here that the center of the Gaussian falls into

the region in between the clusters, where there are not that many data points.

And this is the problem with modeling this kind of data with one Gaussian.

We have to model everything with one big circle, and

we have some regions in between the clusters where there are few data points.

But the Gaussian just has to assign high probability to these regions.

So, what else can we do?

Are there any better probabilistic models we can use here?

Well, if one Gaussian doesn't work let's use several of them like three for

example.

So in this case we are kind of assuming that each data point came from one of

the three Gaussians, and each Gaussian explains one cluster of the data points.

Or, to put it more formally, the density of each data point

equals the weighted sum of three Gaussian densities, and

the weights pi are some non-negative numbers which sum

up to 1, to make an actual probability distribution.

So the parameters theta here are the weights pi,

and three groups of parameters of the three Gaussians:

their locations, the mu vectors, and their shapes,

their covariance matrices sigma.
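To make that formula concrete, here is a minimal sketch in Python; the specific parameter values are made up purely for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, sigmas):
    # p(x) = sum_k pi_k * N(x | mu_k, sigma_k)
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=sigma)
               for pi, mu, sigma in zip(pis, mus, sigmas))

# Three 2-D Gaussians with illustrative (made-up) parameters.
pis = [0.5, 0.3, 0.2]                      # non-negative, sum to 1
mus = [np.array([0.0, 0.0]),
       np.array([4.0, 0.0]),
       np.array([0.0, 4.0])]
sigmas = [np.eye(2), np.eye(2), np.eye(2)]

print(gmm_density(np.array([0.0, 0.0]), pis, mus, sigmas))
```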

Notice that if we succeed in fitting this kind of model to our data,

we will solve the clustering problem.

Because for each data point,

we may now find which Gaussian this data point came from.

And this is exactly an alternative to finding the cluster index.

So we may say that all the points that came from one

Gaussian are the points of one particular cluster.
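As a sketch of that idea (the function name here is hypothetical), the cluster of a point can be read off as the component with the largest posterior score, pi_k times the component density at that point:

```python
import numpy as np
from scipy.stats import multivariate_normal

def cluster_of(x, pis, mus, sigmas):
    # p(component k | x) is proportional to pi_k * N(x | mu_k, sigma_k);
    # the shared normalizing constant does not change the argmax.
    scores = [pi * multivariate_normal.pdf(x, mean=mu, cov=sigma)
              for pi, mu, sigma in zip(pis, mus, sigmas)]
    return int(np.argmax(scores))

# Two well-separated toy components (illustrative parameters).
pis = [0.5, 0.5]
mus = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
sigmas = [np.eye(2), np.eye(2)]

print(cluster_of(np.array([9.5, 10.2]), pis, mus, sigmas))  # near the second mean
```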

This model is sometimes called the Gaussian Mixture Model, or GMM for short.

So great, we have found a model which we would like to use.

What are some positive and negative features of this model,

compared to the just plain one Gaussian?

Well obviously, a single Gaussian is much less flexible.

So the Gaussian Mixture Model allows us to fit our complicated dataset, and

it actually turns out that you may approximate almost any probability

distribution with a Gaussian Mixture Model to arbitrarily high accuracy.

Well of course, as always happens with this kind of theorem, in the worst case you

may have to use exponentially many Gaussians, so it's not a very practical

theorem, but anyway the Gaussian Mixture Model is a very powerful and flexible model.

The downside is obviously the number of parameters.

So, as you increase the number of Gaussians you use for

your data, you increase the number of parameters by the same factor, right?

Okay great.

We decided to use this kind of Gaussian mixture model.

But how can we fit it?

How can we find its parameters?

So, the pi weights, the mu vectors, and the sigma matrices.

Well the simplest way to fit a probability distribution is to use

maximum likelihood estimation right?

So we want to find the values of parameters which maximise the likelihood,

which is the density of our data set given the parameters.

And we want to maximise this thing with respect to the parameters.

As usual in machine learning, we will assume that the data set consists

of n data points, which are independent given our parameters.

Which basically means that we can factorize the likelihood.

So the likelihood equals the product of likelihood of individual objects.
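In code, that factorized likelihood might look like the following sketch; note that in practice one works with the log-likelihood, since a product of many small densities underflows (the parameter values below are made up for the example):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pis, mus, sigmas):
    # Independence lets the likelihood factorize over data points, so the
    # log-likelihood is a sum of per-point log mixture densities.
    density = np.zeros(len(X))
    for pi, mu, sigma in zip(pis, mus, sigmas):
        density += pi * multivariate_normal.pdf(X, mean=mu, cov=sigma)
    return np.sum(np.log(density))

# Toy data and made-up parameters, just to show the call.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
pis = [0.6, 0.4]
mus = [np.zeros(2), np.array([3.0, 3.0])]
sigmas = [np.eye(2), np.eye(2)]

print(gmm_log_likelihood(X, pis, mus, sigmas))
```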

5:02

Well, one more thing we can do to

simplify this expression, to understand how to maximize it,

is to substitute the definition of the marginal likelihood of x_i given

the parameters, using the definition from a few slides before.

So each data point density is a mixture of Gaussian densities, right?

Okay, so now we have this kind of optimization problem.

And just one more thing we forgot here is, we have some constraints, right?

We have to say that the weights pi are non-negative, and that they sum up to 1.

Because otherwise, it will not be an actual probability distribution.

But now it seems like we're good to go.

Now we may use your favorite stochastic optimization algorithm from TensorFlow,

like Adam or whatever you would like to use,

and we can optimize this thing to find the optimal parameters, right?

Well, it turns out that we kind of forgot one important set of constraints here.

The covariance matrices sigma cannot be arbitrary.

Imagine that your optimization algorithm proposed to use a covariance matrix with all zeros.

It just doesn't work. It doesn't define a proper Gaussian

distribution.

Because in the Gaussian distribution definition you have to invert this matrix,

and you have to compute its determinant, and divide by it.

So if you have a matrix which is all 0s,

you will have lots of problems like division by 0, and stuff like that.
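A quick sketch of this failure mode with NumPy: the all-zero matrix has determinant 0 and no inverse, so the Gaussian density formula cannot be evaluated.

```python
import numpy as np

sigma = np.zeros((2, 2))           # an "all zeros" covariance proposal

print(np.linalg.det(sigma))        # 0.0 -- the density would divide by this

try:
    np.linalg.inv(sigma)           # the density also needs sigma^{-1}
except np.linalg.LinAlgError as err:
    print("cannot invert:", err)   # the matrix is singular
```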

So it's not a good idea to assume that the covariance matrix can be anything.

And actually, the set of valid covariance matrices is the set of positive semidefinite matrices.

Don't worry if you don't know what that is, it's not important right now.

The important part is that it's a really hard constraint to follow,

so it's hard to adapt your favorite stochastic gradient descent to

always respect this kind of constraint.

So as for maintaining the property that the matrices are always positive

semidefinite, I don't know how to do it efficiently, so we have a problem here:

we don't know how to optimize this thing, at least with stochastic gradient descent.

Well, it turns out that even if you get rid of this constraint, so

if you consider a simpler model, for

example if you say that all the covariance matrices are diagonal, which

means that the ellipsoids that correspond to the Gaussians cannot be rotated.

They have to be aligned with the axis.

In this case it's much easier to maintain this constraint.

And you can actually use some stochastic optimization here.

So for example, in this example I used Adam and

I tuned its learning rate to optimize this thing.

And you can see that it's doing a reasonable job here.
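One common way to let an unconstrained optimizer like Adam respect these constraints is to reparameterize. The sketch below (not necessarily the lecturer's exact setup) optimizes unconstrained variables and maps them through softmax and exp, so the weights and diagonal variances are valid by construction:

```python
import numpy as np

# Unconstrained variables that an optimizer like Adam can move freely
# (illustrative starting values).
logits = np.array([0.2, -0.1, 0.5])      # will become the mixture weights
log_vars = np.zeros((3, 2))              # will become the diagonal variances

def to_constrained(logits, log_vars):
    # softmax: weights are automatically non-negative and sum to 1
    w = np.exp(logits - logits.max())
    pis = w / w.sum()
    # exp: each diagonal variance is automatically strictly positive,
    # so every covariance matrix diag(variances[k]) is valid
    variances = np.exp(log_vars)
    return pis, variances

pis, variances = to_constrained(logits, log_vars)
print(pis.sum(), (variances > 0).all())
```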

So the blue curve is the performance of Adam here.

On the x-axis we see epochs, and on the y-axis we see log likelihood,

which we are trying to maximize.

And so, Adam is doing a good job, right?

In like ten epochs it optimizes this thing to something reasonable.

The red line here is the ground truth, which I know because I know

from which probability distribution I generated this data, so

I know the optimal value for the log likelihood.

But it turns out that even in this case, where we don't have these very complicated

constraints, you can do so much better by exploiting the structure of your problem.

And this is something we're going to discuss in the rest of the week:

something called expectation maximization.

And if you apply it here, it just works so much better.

In just a few iterations it found a value

which is better than the ground truth, which probably means overfitting,

but anyway it works well on the test set also.

So to summarize, we may have two reasons not

to use stochastic gradient descent here.

First of all, it may be hard to follow some constraints which you may care about,

like positive semidefinite covariance matrices.

And second of all, the expectation maximization algorithm,

which can exploit the structure of your problem,

is sometimes much faster and more efficient.

So as a general summary, we discussed that the Gaussian mixture model is a flexible

probability distribution, which can solve the clustering problem for

you if you fit this Gaussian mixture model to your data.

And sometimes it's hard to optimize with stochastic gradient descent, but

there is an alternative, which we'll talk about in the next video.