0:03

In this video, we're going to go through an example of

applying the expectation-maximization algorithm to a particular problem.

So, let's say that you have a data set with the following histogram.

You have discrete data which can take only three values:

one, two, or three.

And you have approximately 30% ones, 20% twos,

and 50% threes, and you want to fit some distribution to this data.

Well, obviously, if you want to fit any distribution,

you can just use this histogram as the best fit.

But, since we want to see how the EM algorithm applies to this kind of problem,

let's say that we know that these data came from a mixture,

and we want to find the parameters of this mixture.

So, let's say that the actual data was generated from the following mixture.

The probability of x_i equals a mixture of two distributions:

gamma times P1 plus one minus gamma times P2.

So, P1 and P2 are probability distributions, gamma

and one minus gamma are the weights, and gamma is a parameter.

And here, I will not write the conditioning on the parameters alpha, beta, gamma, et cetera

anywhere, just to write a little bit fewer symbols.

But, the conditioning on the parameters is implied everywhere.

And let's say that the mixture components,

P1 and P2, look as follows.

So, we have P1, which takes the values one,

two, and three with probabilities alpha,

one minus alpha, and zero.

Why? Well, because if we used some distribution which could take

any of these three values with any probabilities,

then we could just use this histogram as the best fit for P1 and just forget about P2.

Let's forbid P1 and P2 from taking some of the values, so

they have to cooperate to explain these data well.

Let's say that P2 has one parameter,

beta, and it takes the value two with probability beta and three with probability one minus beta.

Let's try to estimate these alpha, beta,

and gamma by using the expectation-maximization algorithm.

So, for the expectation-maximization algorithm, we need two things.

First of all, we need an initialization of our parameters.

Let's say that we initialize them to be alpha0 equals

beta0 equals gamma0 equals 0.5,

just for concreteness. And also,

we need to define the latent variable model, because

expectation maximization was invented for latent variable models.

So, let's say that we introduce a latent variable t_i for

each data point x_i, which says from which of the two components x_i came.

So, we'll have a picture like this,

Â 4:02

And t_i can take two values,

one or two, with some probabilities.

This way, the prior probability of t_i

being one, for example, is just gamma.

So, it's the weight of the first component of the

mixture, and the probability of t_i being two is just one minus gamma,

and you can easily define the conditional probability.

The probability of x_i given, for example,

t_i equals two is

just the probability according to the second component of the mixture, that's P2 of x_i.
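The model defined so far can be written down as a quick sketch (hypothetical Python, not the lecturer's code; the parameterization of P2, with probability beta on the value two, is an assumption stated above):

```python
import numpy as np

# Index 0 holds the probability of value 1, index 1 of value 2, index 2 of value 3.
def p1(alpha):
    # P1 puts probability alpha on 1, (1 - alpha) on 2, and 0 on 3.
    return np.array([alpha, 1.0 - alpha, 0.0])

def p2(beta):
    # P2 puts probability 0 on 1, beta on 2, and (1 - beta) on 3
    # (assumed split; the video only says P2 is parameterized by beta).
    return np.array([0.0, beta, 1.0 - beta])

def mixture(alpha, beta, gamma):
    # p(x) = gamma * P1(x) + (1 - gamma) * P2(x)
    return gamma * p1(alpha) + (1.0 - gamma) * p2(beta)

# Initialization from the video: alpha0 = beta0 = gamma0 = 0.5.
probs = mixture(0.5, 0.5, 0.5)
print(probs)  # probabilities of the values 1, 2, 3: 0.25, 0.5, 0.25
```

The zeros in P1 and P2 are exactly the "forbidden" values from earlier: neither component can explain the histogram alone, so they have to cooperate.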

Now, we have everything defined, and we can

use the expectation-maximization algorithm to find the best values of the parameters alpha,

beta, and gamma we can find with this kind of algorithm.

Let's start with the E-step, or expectation step.

On the E-step, we want to find the posterior distribution of the latent variable t_i.

So, we want to find q of t_i, which equals the posterior distribution:

P of t_i equals c, for example (we can put any value c here),

given the data point x_i and the parameters, but I will

omit the parameters, as usual in this video.

Let's see how we can do it.

Let's start with the probability of t_i equals one,

given that x_i equals one.

We can find this expression by using Bayes' rule.

This thing equals the following ratio: it's

proportional to the joint distribution,

P of t_i and x_i, which equals the probability of x_i equals one

given t_i equals one, times the prior distribution

P of t_i equals one, divided by the same thing,

but summed over both t_i equals one and t_i equals two.

So, a sum with respect to all the values of the latent variable:

the probability of x_i equals one given t_i equals one,

the same thing as in the numerator, times the prior, plus the same thing

but for t_i equals two.

So, P of x_i equals one given

t_i equals two, times the prior,

P of t_i equals two.

And we can compute this thing for our model, since we

know the current values of the parameters alpha, beta, and gamma.

This conditional distribution is just P1 of x_i equals one,

so it's alpha, times the prior, which is gamma,

the probability of t_i equals one, divided by the same thing,

alpha times gamma, plus the probability that this particular point came from

the second component of the mixture, which is zero because

the second component of the mixture never generates ones,

times its prior, one minus gamma.

And the whole thing is just one.

And this totally makes sense, right?

Since we know that the second component of the mixture can never generate the number one,

it just means that if we see the number one in the data set,

it was certainly generated by the first component, t_i equals one.
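That Bayes-rule computation for x_i equals one can be checked numerically. A minimal sketch, assuming the same parameterization and the initial parameters from the video:

```python
import numpy as np

alpha, beta, gamma = 0.5, 0.5, 0.5  # initialization from the video

# Likelihoods P(x_i = 1 | t_i): P1(1) = alpha, P2(1) = 0.
likelihood = np.array([alpha, 0.0])
# Priors P(t_i): gamma for the first component, 1 - gamma for the second.
prior = np.array([gamma, 1.0 - gamma])

joint = likelihood * prior        # numerator of Bayes' rule for each t_i
posterior = joint / joint.sum()   # normalize over t_i = 1, 2
print(posterior)                  # [1. 0.]: a one always comes from P1
```

The zero likelihood under the second component forces the posterior onto t_i equals one, matching the argument above.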

And we can compute the same thing for other values of t_i and x_i.

For example, t_i equals one given x_i equals two.

We'll use the same reasoning:

the conditional probability, which is one minus

alpha, times the prior, which is gamma,

divided by the same thing,

one minus alpha times gamma, plus

the same thing assuming that

this data point came from the second component of the mixture.

So, beta times one minus gamma,

where one minus gamma is the prior weight of the second component of the mixture.

And this thing, if you substitute here the numbers from our initialization,

will be just 0.5 times 0.5, divided by 0.5 times

0.5 plus the same thing,

because of our particular initialization values, and this is just 0.5 in total.

So, we've found our posterior distribution of t_i given x_i for these cases,

and we can do the same thing for the other values of t_i and x_i.

This way, we can compute the E-step of

the expectation-maximization algorithm for this particular model.
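Looping the same Bayes-rule computation over all three values gives the whole E-step at once. Again a hedged sketch under the assumed parameterization of P2, not the lecturer's code:

```python
import numpy as np

alpha, beta, gamma = 0.5, 0.5, 0.5  # initialization from the video

# cond[t, x]: P(x | t). Row 0 is P1, row 1 is P2
# (assumption: P2 puts probability beta on the value two).
cond = np.array([
    [alpha, 1.0 - alpha, 0.0],
    [0.0,   beta,        1.0 - beta],
])
prior = np.array([gamma, 1.0 - gamma])

# E-step: q(t | x) is proportional to P(x | t) * P(t), normalized over t.
joint = cond * prior[:, None]
q = joint / joint.sum(axis=0, keepdims=True)
print(q)
# q(t=1 | x) is 1.0 for x = 1, 0.5 for x = 2, and 0.0 for x = 3.
```

Each column of `q` is the posterior over the two components for one observed value, which is exactly the set of numbers the M-step will reuse in the next video.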

In the next video, we will discuss the M-step:

how to update the values of the parameters by using

these computed conditional distributions of the latent variables.