0:03

In this video, we're going to go through an example of

applying the expectation-maximization algorithm to a particular problem.

So, let's say that you have a data set with the following histogram.

You have discrete data which can take only three values,

one, two, or three.

And you have approximately 30% of ones, 20% of twos,

and 50% of threes, and you want to fit some distribution to this data.
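For concreteness, such a data set can be sketched in a few lines of Python; the sample itself is hypothetical, only the 30/20/50 proportions come from the histogram:

```python
from collections import Counter

# A hypothetical sample of ten points matching the 30% / 20% / 50% histogram
# (the counts themselves are made up; only the proportions come from the example).
data = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]

counts = Counter(data)
histogram = {v: counts[v] / len(data) for v in (1, 2, 3)}
print(histogram)  # {1: 0.3, 2: 0.2, 3: 0.5}
```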

Well obviously, if you want to fit any distribution,

you can just use this histogram as the best fit.

But, since we want to see how the EM algorithm applies to this kind of problem,

let's say that we know that these data came from a mixture,

and we want to find the parameters of this mixture.

So let's say that this data was actually generated

from the following mixture.

The probability of x_i equals a mixture of two distributions,

gamma times P1 plus one minus gamma times P2.

So, P1 and P2 are probability distributions, gamma

and one minus gamma are the weights, and gamma is a parameter.

And here, I will not write the conditioning on the parameters alpha, gamma, et cetera

anywhere, just to write a little bit fewer symbols.

But the conditioning on the parameters is implied everywhere.

And let's say that the mixture components,

P1 and P2, look as follows.

So we have P1, which takes the values one,

two, and three with probabilities alpha,

one minus alpha, and zero.

Why? Well, because if we used some distribution which could take

any of these three values with any probabilities,

then we could just use this histogram as the best fit for P1 and forget about P2.

Let's forbid P1 and P2 to take some of the values, so

they have to cooperate to explain this data well.

Let's say that P2 has one parameter,

beta, and it takes values two and three with some probabilities which we don't know.
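As a sketch, the two components and the mixture can be written as small Python functions. Note that assigning one minus beta to the value two and beta to the value three is an assumption, chosen to be consistent with the conditional probabilities used in the E-step later in this video:

```python
def p1(x, alpha):
    # First component: takes values 1, 2, 3 with probabilities
    # alpha, 1 - alpha, and 0 (it never generates threes).
    return {1: alpha, 2: 1 - alpha, 3: 0.0}[x]

def p2(x, beta):
    # Second component: never generates ones; the assignment of
    # 1 - beta to x = 2 and beta to x = 3 is an assumption.
    return {1: 0.0, 2: 1 - beta, 3: beta}[x]

def mixture(x, alpha, beta, gamma):
    # P(x) = gamma * P1(x) + (1 - gamma) * P2(x)
    return gamma * p1(x, alpha) + (1 - gamma) * p2(x, beta)
```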

Let's try to estimate this alpha, beta,

and gamma by using the expectation maximization algorithm.

So, for expectation maximization algorithm, we need two things.

First of all, we need the initialization of our parameters.

Let's say that we initialized them to be alpha_0 equals

beta_0 equals gamma_0 equals 0.5.

Just for concreteness. And also,

we need to define the latent variable model because

expectation maximization is something that works well,

something that was invented for latent variable models.

So, let's say that we introduced latent variable t_i for

each data point x_i, which will say which of these two components x_i came from.

So, we'll have a picture like this,

4:02

And t_i can take two values,

one or two with some probabilities.

This way, the prior probability of t_i

being one for example, is just gamma.

So, it's the weight of the first component of the

mixture, and the probability of t_i being two is just one minus gamma,

and you can easily define the conditional probability.

The probability of x_i given, for example,

t_i equals two, is

just the probability according to the second component of the mixture, that's P2 of x_i.
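As a small self-contained sketch, this latent variable model is just a prior over t_i and a conditional distribution over x_i given t_i:

```python
def prior_t(t, gamma):
    # Prior over the latent component: P(t = 1) = gamma, P(t = 2) = 1 - gamma.
    return gamma if t == 1 else 1 - gamma

def cond_x_given_t(x, t, alpha, beta):
    # Conditional P(x | t): component 1 uses alpha, component 2 uses beta.
    # The (1 - beta, beta) assignment for component 2 is an assumption.
    if t == 1:
        return {1: alpha, 2: 1 - alpha, 3: 0.0}[x]
    return {1: 0.0, 2: 1 - beta, 3: beta}[x]
```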

Now, we have everything defined and we can

use the expectation maximization algorithm to find the best values of parameters alpha,

beta, and gamma we can find with this kind of algorithm.

Let's start with the E-step or expectation step.

On the E-step, we want to find the posterior distribution on the latent variable t_i.

So we want to find q of t_i, which equals the posterior distribution,

P of t_i equals c, for example,

given the data point x_i and the parameters, but I will

omit the parameters, as usual in this video.

Let's see how can we do it.

Let's start with probability of t_i equals to one,

if we know that x_i equals to one.

So we can find this expression by using Bayes' rule.

This equals the following ratio, so it's

proportional to the joint distribution,

P of t_i and x_i, which equals P of x_i equals one,

given t_i equals one, times the prior distribution,

P of t_i equals one, divided by the same thing,

but for both t_i equals one and t_i equals two.

So, a sum with respect to all the values of the latent variable.

Probability of x_i equals one given t_i equals one,

so the same thing as in the numerator, times the prior, plus the same thing,

but for t_i equals two.

So, P of x_i equals one given,

t_i equals two times the prior,

P of t_i equals two.

And we can compute this thing for our model, since we

know our current values of the parameters alpha, beta, and gamma.

This conditional distribution is just P1 of x_i equals one,

so it's alpha, times the prior, which is gamma,

the probability of t_i equals one, divided by the same thing,

alpha times gamma plus probability that this particular point came from

the second component of the mixture which is zero because

the second component of the mixture never generates ones,

times its prior, one minus gamma.

So the whole expression is just one.

And this totally makes sense, right?

Since we know that the second component of the mixture can never generate the number one,

it just means that if we see the number one in the data set,

it was certainly generated by the first component, t_i equals one.
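This step can be checked numerically; a minimal sketch with the initialization alpha = beta = gamma = 0.5:

```python
alpha = beta = gamma = 0.5  # the initialization from this example

# Bayes' rule for P(t_i = 1 | x_i = 1):
# the numerator is the joint P(t_i = 1, x_i = 1) = alpha * gamma;
# the second component contributes P(x = 1 | t = 2) = 0 to the denominator.
numerator = alpha * gamma
denominator = alpha * gamma + 0.0 * (1 - gamma)
posterior = numerator / denominator
print(posterior)  # 1.0: a one is always explained by the first component
```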

And we can compute the same thing for other values of t_i and x_i.

For example, t_i equals one given x_i equals two.

We'll be using the same reasoning,

the conditional distribution which is one minus

alpha times the prior distribution which is gamma,

divided by the same thing,

one minus alpha times gamma plus

the same thing assuming that

this data point came from the second component of the mixture.

So, one minus beta times one minus gamma,

the prior distribution for the second component of mixture.

And if you substitute here the numbers from our initialization,

this will be just 0.5 times 0.5 divided by

0.5 times 0.5 plus 0.5 times 0.5,

because of our particular initialization values, and this is just 0.5 in total.

So, we've found our posterior distribution of t_i given x_i,

then we can do the same thing for other values of t_i and x_i.

This way, we can compute the E-step of

the expectation-maximization algorithm for this particular model.
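Putting it together, the whole E-step for this model amounts to computing the posterior q(t_i = 1 | x_i) for each possible value of x_i. A self-contained sketch under the same initialization (the beta assignment in the second component is again an assumption):

```python
alpha = beta = gamma = 0.5  # initialization

# Component probabilities over the values 1, 2, 3.
p1 = {1: alpha, 2: 1 - alpha, 3: 0.0}   # first component, never generates threes
p2 = {1: 0.0, 2: 1 - beta, 3: beta}     # second component, never generates ones

# Posterior q(t = 1 | x) via Bayes' rule, for each observable value x.
q = {}
for x in (1, 2, 3):
    joint1 = gamma * p1[x]          # P(t = 1, x)
    joint2 = (1 - gamma) * p2[x]    # P(t = 2, x)
    q[x] = joint1 / (joint1 + joint2)
print(q)  # {1: 1.0, 2: 0.5, 3: 0.0}
```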

In the next video, we will discuss the M-step,

so how to update the values of the parameters by using

these computed conditional distributions on the latent variable.