0:03

In the previous video, we discussed that we want to build an expectation-maximization algorithm for this simple discrete probabilistic fitting problem, and we finished the E-step, which is summarized at the bottom of this slide. So now, let's go ahead and proceed with the M-step.

On the M-step, we want to maximize the expected value with respect to the parameters, in this case alpha, beta, and gamma: the sum over the objects in the data set of the expected value, with respect to q(ti), of the logarithm of the joint distribution p(xi, ti), which we can rewrite as a product of p(xi | ti) times the prior p(ti). Everything here is also given the parameters, but I will omit them everywhere.

We can rewrite this thing by using the definition of the expected value, which is a sum with respect to all possible values of ti, and there are two possible values of ti here. So it will be a sum whose first term corresponds to ti = 1, that is, to data points coming from the first component of the mixture: here will be the weight, the probability q(ti = 1), times the logarithm of p(xi | ti = 1) times the prior, and the prior here is just gamma, the probability that ti = 1. Plus the second term, which is the same thing but for ti = 2: again a sum with respect to all the objects in the data set, of the variational distribution q(ti = 2) times the logarithm of the joint distribution, which is the logarithm of p(xi | ti = 2), just p2(xi), times the prior, which is one minus gamma. And we want to maximize this thing with respect to the parameters: with respect to alpha, beta, and gamma.
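Written out symbolically (my own transcription of what is described here, so the notation is approximate), the M-step objective is:

```latex
\max_{\alpha,\beta,\gamma}\;
\sum_{i=1}^{N} \Big[
  q(t_i{=}1)\,\log\!\big(p(x_i \mid t_i{=}1)\,\gamma\big)
+ q(t_i{=}2)\,\log\!\big(p(x_i \mid t_i{=}2)\,(1-\gamma)\big)
\Big]
```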

Let's rewrite this expression by using the definition of q, the one we derived on the E-step, and also substituting the definitions of p1 and p2. So here, we'll get something like this.

To start with, let's make it more concrete which data we have. Let's say that we observed 30 objects with xi = 1, so N1 = 30 is the number of observations xi that are equal to one; N2 = 20; and N3 = 60. This is a complete definition of our data set: we have 30 ones, et cetera. And now, we can proceed with the fitting algorithm.
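As a small side sketch in Python (not part of the lecture; the variable names are my own), here are the data counts and the E-step responsibilities that the derivation below plugs in:

```python
# Data: 30 ones, 20 twos, 60 threes (N1 = 30, N2 = 20, N3 = 60).
counts = {1: 30, 2: 20, 3: 60}

# E-step results from the previous video: q(t_i = 1 | x_i).
# Points with x = 1 surely come from the first component, points
# with x = 2 are split 50/50, and points with x = 3 never do.
q1 = {1: 1.0, 2: 0.5, 3: 0.0}
q2 = {x: 1.0 - q for x, q in q1.items()}
```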

So now we're going to write this thing out, and instead of writing everywhere a sum with respect to objects, let's decompose it into three terms, the first of them corresponding to xi = 1, et cetera. In the first term there will be 30 terms, equal to each other, that correspond to the xi that are equal to one. Right? And q(ti = 1) in this case will be just p(ti = 1 | xi = 1), since these are the terms corresponding to xi = 1, and this thing is just one, if we look up our results from the E-step. So this is one, times the logarithm of the joint distribution: the logarithm of p1(1), which is alpha, times the prior, which is gamma, the probability that ti = 1. That is the first term here.

The second term will correspond to xi = 2, so there will be 20 of them. The probability q(ti = 1) in this case will be 0.5, from here. So it will be 20 times 0.5 times the logarithm of the joint distribution, which is one minus alpha, that is, p1(2), times the prior.

Plus the third term: there will be 60 objects with xi = 3, and the probability q(ti = 1) here will be zero. So this will be zero times the logarithm of, also, zero, which may look like a problem, but it is usually considered that zero times the logarithm of zero is zero. So even if you do it carefully, it will be okay: you just ignore this term in the maximization problem. This thing is ignored because it's zero.

Plus the terms corresponding to this part, that is, to the xi that came from the second component of the mixture. Again, there will be 30 with xi = 1, times the variational probability q(ti = 2), the probability of the second component of the mixture given that xi is 1, and this is zero, right? We assumed that no ones came from the second component of the mixture. So again, it would be times the logarithm of zero, but anyway, we ignore this part; it is zero. Plus the final two terms. One is 20 times 0.5, which is q, times the logarithm of one minus beta times one minus gamma, where one minus gamma is the prior probability of the second component of the mixture. And the last term is 60 times 1 times the logarithm of beta times the prior probability for the second component, which is one minus gamma.
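Collecting only the nonzero terms, the expression being maximized is (again, my reconstruction from the narration):

```latex
\max_{\alpha,\beta,\gamma}\;
30\log(\alpha\gamma)
+ 20\cdot 0.5\,\log\!\big((1-\alpha)\,\gamma\big)
+ 20\cdot 0.5\,\log\!\big((1-\beta)(1-\gamma)\big)
+ 60\log\!\big(\beta\,(1-\gamma)\big)
```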

So finally, we have this expression, and we want to maximize it with respect to alpha, beta, and gamma. Let's try to simplify it a little bit: get rid of the zeros and things like that that can make things harder for us, and also move all the parts that depend on the same parameter together.

Let's start with maximizing with respect to alpha, for example, and identify the terms that depend on alpha. So this thing will equal the following. This part does depend on alpha: it will be 30 times the logarithm of alpha, plus 30 times the logarithm of gamma, but we don't care about that, because it is constant with respect to alpha, and right now we want to maximize with respect to alpha. Plus 10 times the logarithm of one minus alpha, and there are no alphas anywhere else, so plus a constant with respect to alpha. And we want to maximize this expression.

How can we do it? Well, let's take the derivative and set it to zero, as usual. The derivative of this expression with respect to alpha will be 30 times the derivative of the logarithm of alpha, which is one divided by alpha, plus 10 times the derivative of the logarithm of one minus alpha, which is one divided by one minus alpha times minus one. And this thing equals zero.

And we can solve this equation by writing down this ratio: 30 divided by alpha equals 10 divided by one minus alpha. By multiplying each part by one minus alpha and by alpha, we get that 30 minus 30 alpha equals 10 alpha, which means that alpha equals 30 divided by 40.
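As a quick sanity check (a sketch of my own, not from the course), we can confirm numerically that alpha = 30/40 maximizes the alpha-dependent part, 30 log(alpha) + 10 log(1 - alpha):

```python
import math

def alpha_part(alpha):
    # The terms of the objective that depend on alpha.
    return 30 * math.log(alpha) + 10 * math.log(1 - alpha)

alpha_star = 30 / 40  # closed-form solution derived above

# No point on a fine grid beats the closed-form solution.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=alpha_part)
```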

So, we found our alpha after the M-step; we updated the parameters of our latent variable model. We started from the initialization alpha equals beta equals gamma equals 0.5, and after one step of the expectation-maximization algorithm, alpha became three divided by four.

We can solve the same optimization problem with respect to beta and gamma, and get the following numbers: beta will be six divided by seven, and gamma will be four divided by eleven.
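The beta and gamma solutions follow the same pattern as alpha: each is a ratio of responsibility-weighted counts. Here is a sketch of the arithmetic with exact fractions (my own illustration, not course code):

```python
from fractions import Fraction

counts = {1: 30, 2: 20, 3: 60}
# Responsibilities of the first component from the E-step.
q1 = {1: Fraction(1), 2: Fraction(1, 2), 3: Fraction(0)}

# Effective number of points assigned to each component.
n1 = sum(counts[x] * q1[x] for x in counts)        # 30 + 10 + 0 = 40
n2 = sum(counts[x] * (1 - q1[x]) for x in counts)  # 0 + 10 + 60 = 70

alpha = counts[1] * q1[1] / n1        # 30/40 = 3/4
beta = counts[3] * (1 - q1[3]) / n2   # 60/70 = 6/7
gamma = n1 / (n1 + n2)                # 40/110 = 4/11
```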

So, we found our parameters, and actually, it turns out that if you repeat these expectation-maximization steps a few more times, the parameters will not be updated anymore. So we have already converged after just one iteration of expectation maximization.
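To see the one-step convergence end to end, here is a hedged sketch of the whole loop for this model (my own code and names, not the course's implementation). Component p1 puts probability alpha on ones and 1 - alpha on twos; p2 puts 1 - beta on twos and beta on threes; gamma is the weight of the first component:

```python
counts = {1: 30, 2: 20, 3: 60}

def p1(x, alpha):
    # First mixture component: mass on ones and twos only.
    return {1: alpha, 2: 1 - alpha, 3: 0.0}[x]

def p2(x, beta):
    # Second mixture component: mass on twos and threes only.
    return {1: 0.0, 2: 1 - beta, 3: beta}[x]

alpha, beta, gamma = 0.5, 0.5, 0.5  # initialization from the lecture

for _ in range(5):
    # E-step: responsibility of component 1 for each value x.
    q1 = {x: gamma * p1(x, alpha)
             / (gamma * p1(x, alpha) + (1 - gamma) * p2(x, beta))
          for x in counts}
    # M-step: responsibility-weighted maximum-likelihood updates.
    n1 = sum(counts[x] * q1[x] for x in counts)
    n2 = sum(counts[x] * (1 - q1[x]) for x in counts)
    alpha = counts[1] * q1[1] / n1
    beta = counts[3] * (1 - q1[3]) / n2
    gamma = n1 / (n1 + n2)
```

After the first pass the parameters land on 3/4, 6/7, and 4/11, and the remaining iterations leave them unchanged, which matches the convergence behavior described in the lecture.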

And these parameters kind of make sense. They say that the first component of the mixture mostly focuses on generating ones: the probability of generating a one is three divided by four, which is larger than 0.5. The second component of the mixture focuses on threes. And the prior weight of the first component is a little bit smaller than that of the second one, because there are fewer ones than threes.

So, in this video, we considered in detail one step of the expectation-maximization algorithm for a discrete mixture model.
