Finally, in this video,

let's talk about the only remaining part in our general diagram of types of clustering,

namely soft or probabilistic clustering methods.

First, let me note that probabilistic clustering methods

are based on a probabilistic model of the data itself.

Probabilistic modeling of data has a number

of advantages over non-probabilistic approaches.

First, it allows you to put confidence bounds on your estimated model parameters,

such as cluster labels.

Second, as it provides a probabilistic model of the data,

we can simulate from this model to handle any missing data.

Let's illustrate probabilistic clustering using Gaussian mixtures.

Gaussian mixtures are an example of mixture distributions,

which are often used to describe complex data,

for example, multimodal data.

For data described by vector y,

the probability of data is modeled using a set of

basis probability distributions defined on possible values of y.

While any basis function can be used in such mixtures,

a mixture of Gaussians is the most popular choice for modeling continuous distributions.

The mixture distribution then takes the form of a sum of

component distributions p of y conditional on theta K, with weights pi K.

Each component is characterized by its mean and variance, which together form a vector of

parameters theta K for that component. The weights pi K are nonnegative and sum up to 1.

Now, we can think of these weights pi K as probabilities for

some hidden or latent discrete variable s to take

the value K. Then we can write the probability of the data y

as the joint distribution of y and s marginalized over the values of s,

as shown in the last formula here.
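As a minimal sketch, such a mixture density can be evaluated as a weighted sum of component densities, which is exactly the marginalization over the latent variable s. The weights, means, and standard deviations below are made-up illustrative values, not values from the lecture:

```python
import numpy as np

def gaussian_pdf(y, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at y."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_density(y, pis, mus, sigmas):
    """p(y) = sum_k pi_k * N(y; mu_k, sigma_k^2), marginalizing the latent s."""
    return sum(pi * gaussian_pdf(y, mu, s) for pi, mu, s in zip(pis, mus, sigmas))

# Two components; the weights are nonnegative and sum to 1 (illustrative values)
pis, mus, sigmas = [0.3, 0.7], [0.0, 4.0], [1.0, 1.0]
p = mixture_density(0.0, pis, mus, sigmas)
```

Each term in the sum is one component density p of y given theta K, scaled by its weight pi K.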

Now, estimation of the model amounts to estimating the parameters mu K, sigma K, and pi K,

as well as inference of the hidden variable s,

and this can be done using the so-called EM or expectation maximization algorithm.

The EM algorithm is a very popular machine learning algorithm

used for many problems that involve hidden variables.

Here, I want to present it in its general form,

which is applicable not only to

Gaussian mixture models but to many other models with hidden variables as well.

To present the algorithm in its general form,

let's assume that we have some observed variables y and hidden variables

x with a joint probability p of x and y that is parameterized by parameters theta.

Then for the log likelihood of p of y,

we can write it as log of integral of p of x and y over all values of x.

Then we can multiply and divide the integrand by

an arbitrary distribution q of x defined on the hidden variables x.

Finally, due to the concavity of the logarithm,

this expression is greater than or equal to the integral

in which the log appears inside the integral.

This last step is known as Jensen's inequality.

This final expression, therefore,

provides a lower bound on the true log likelihood as a function of theta.
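The chain of steps just described can be written out explicitly (here q of x denotes the arbitrary distribution over the hidden variables):

```latex
\log p(y \mid \theta)
  = \log \int p(x, y \mid \theta)\, dx
  = \log \int q(x)\, \frac{p(x, y \mid \theta)}{q(x)}\, dx
  \;\ge\; \int q(x) \log \frac{p(x, y \mid \theta)}{q(x)}\, dx
```

The inequality is Jensen's inequality, applied because the logarithm is concave; the right-hand side is the lower bound that the EM algorithm maximizes.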

The EM algorithm amounts to an iterative maximization of this lower bound on

the log likelihood, first with respect to

the distribution q of x and then with respect to the parameters theta.

Here is how it works in more detail.

In the E-step, we optimize with respect to the distribution q of

x while keeping the parameters theta fixed from the previous iteration.

This gives the next iterate of q of x as the arg max of this expression.

This produces the result that the optimal q of x is

the distribution of x conditional on the observed data y and the current value of theta.

This is followed by the M-step.

In this step, we maximize the lower bound on the log likelihood with respect

to the parameters theta while keeping the distribution q of x fixed.

This provides an explicit equation for

the optimal parameters theta that can be solved numerically.

Now, the E-step and the M-step of the EM algorithm are

repeated until convergence or until a maximum number of iterations is achieved.
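As a minimal sketch of these two alternating steps for a one-dimensional, two-component Gaussian mixture (the synthetic data, initial values, and iteration count are assumptions for illustration, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two well-separated Gaussians (illustrative)
y = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

K = 2
pi = np.full(K, 1.0 / K)      # mixing weights
mu = np.array([-1.0, 1.0])    # initial means (arbitrary starting guess)
sigma = np.ones(K)            # initial standard deviations

def normal_pdf(y, mu, sigma):
    """Per-component Gaussian densities; returns an (n, K) array."""
    return np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

for _ in range(50):
    # E-step: responsibilities q(s = k | y_i), i.e. p(s = k | y_i, theta)
    dens = pi * normal_pdf(y, mu, sigma)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate pi_k, mu_k, sigma_k from the responsibilities
    n_k = r.sum(axis=0)
    pi = n_k / len(y)
    mu = (r * y[:, None]).sum(axis=0) / n_k
    sigma = np.sqrt((r * (y[:, None] - mu) ** 2).sum(axis=0) / n_k)
```

After convergence, the estimated means should sit near the two true cluster centers, and the rows of r give the soft cluster memberships discussed below.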

The EM algorithm can be shown to guarantee that the log likelihood increases or

at least stays constant at each step, and it

converges to a local maximum of the log likelihood.

The EM algorithm can be applied to any model that has hidden variables.

While here we only talked about Gaussian mixture models,

there are other models,

for example, the probabilistic PCA or factor

models that can be estimated using the EM algorithm.

For our particular case of Gaussian mixtures,

the algorithm simply replaces the distribution q of x over the

hidden variables with a discrete distribution of the latent variable s over the Gaussian components.

This distribution provides a probabilistic clustering of the data,

where instead of fixed cluster labels for each point,

as in k-means clustering,

cluster memberships are defined in terms of probabilities.

Therefore, you can use these probabilities to

quantify your confidence in different clustering assignments.
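A small illustration of such soft memberships, with made-up mixture parameters: the posterior probability of each component, computed by Bayes' rule, is near 50/50 for an ambiguous point and near certainty for a point close to one mean.

```python
import numpy as np

def gauss(y, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2) at y."""
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative mixture parameters (assumed, not fitted to any data)
pi = np.array([0.5, 0.5])
mu = np.array([-2.0, 2.0])
sigma = np.array([1.0, 1.0])

def memberships(y):
    """Posterior p(s = k | y): soft cluster memberships rather than hard labels."""
    w = pi * gauss(y, mu, sigma)
    return w / w.sum()

mid = memberships(0.0)    # midway between the means: roughly 50/50
near = memberships(-2.0)  # at the first mean: assigned with high confidence
```

Unlike a hard k-means label, these probabilities directly quantify how confident the model is in each assignment.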

Now, let's turn to our control question for this video.