Because it's a probability distribution over the missing

variables, not a single assignment to x and u.

With that soft completion we're going to compute what is called the expected

sufficient statistics. The expected sufficient statistics are

very similar to the sufficient statistics that we had before.

when we were computing sufficient statistics, we were counting the number of

times that we saw a particular combination x, u; remember, we had M

of x, u as our sufficient statistics.

Now we have a soft completion, and so we have expected, or soft, sufficient

statistics, which we are going to denote with an M hat,

subscripted by the parameter vector that gave rise to the completion. And M hat

sub theta t, for a particular assignment x, u, is going to be simply the sum

Over all the data instances of the probability that we just computed.

The probability of x, u given the data instance d sub m and the parameters

theta t. And that is the notion of expected counts

or expected sufficient statistics, which, in the context of the M-step, we're now

going to use as if they were real sufficient statistics.

And so we're going to take these expected sufficient statistics and use them to

perform maximum likelihood estimation, using the exact same formula that we would have

used had these been real sufficient statistics.
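As a rough sketch, an E-step (accumulating expected sufficient statistics) and M-step (re-estimating parameters from those soft counts) for a single binary-valued CPD P(X | U) might look like the following; the names `e_step`, `m_step`, and `posterior` are illustrative assumptions, not from the lecture:

```python
# Illustrative sketch of one EM iteration for a multinomial CPD P(X | U)
# with binary X and U. `posterior(x, u, d)` stands in for inference that
# returns P(x, u | d, theta_t) under the current parameters theta_t.
from collections import defaultdict

def e_step(data, posterior):
    """Accumulate the expected sufficient statistics M-hat[x, u]."""
    m_hat = defaultdict(float)
    for d in data:                      # sum over all data instances
        for x in (0, 1):
            for u in (0, 1):
                m_hat[(x, u)] += posterior(x, u, d)
    return m_hat

def m_step(m_hat):
    """MLE with soft counts: theta_{x|u} = M-hat[x, u] / M-hat[u]."""
    theta = {}
    for (x, u), count in list(m_hat.items()):
        m_u = sum(m_hat.get((xp, u), 0.0) for xp in (0, 1))
        theta[(x, u)] = count / m_u
    return theta
```

For example, with a posterior that is uniform (0.25) over the four combinations and three data instances, each soft count comes out to 0.75 and each estimated parameter to 0.5.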

So specifically for example in the context of multinomial Bayesian networks

we would have that the next value of the parameters, theta at iteration t plus

one, of x given u, is M hat sub theta t of

x, u divided by M hat sub theta t of u, which is, using the soft counts, the fraction

of x, u among the u's. Let's consider a concrete, very simple

example, which is that of Bayesian clustering. So what we see here is your

standard naive Bayes model, where we have a single class variable and a bunch of

observed features, and the assumption here is that the class variable is

missing; that is, it's sometimes, or

in most cases never, observed in the data.

And so, effectively, we're trying to identify what are the categories or

classes of instances in this data, under the assumption that, within each

class, the feature values are independent.

That is, the features are conditionally independent given the class, as in the

naive Bayes model. So, instantiating our formula from the

EM algorithm for this setting, we get the following very natural formula.

So the sufficient statistic for the class variable is simply the sum over all

of the data instances of the probability of the class variable given the observed

features for that data instance

and the current value of the parameters.

That is, it's the soft count of how many instances fall into each category when

each instance is allowed to divide itself up across multiple categories, that is, to use a

soft assignment. The second set of sufficient statistics,

which represents the dependency between one of the features and the class

variable, is M hat sub theta of x i, c.

And that once again sums over all the data instances and simply looks at the

probability of a particular class value and a particular feature value

given the observed features

of that particular instance. Now with these expected counts, we can

now go ahead and do maximum likelihood estimation.

And that gives rise to the following very simple equations, which is that theta

of c is equal to M hat sub theta of c, divided by M, which is the total number

of data instances. And theta of little x i given c is going

to be assigned M hat sub theta of x i, c, divided by M hat

sub theta of c, which, again, is the standard formula, but using soft, rather

than hard, counts. So to summarize, the EM algorithm

has in this case the exact same problem that we had in the context of the

gradient ascent algorithm, which is that we need to run inference to compute the probability

of x i, c for each data instance at each iteration.
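The clustering updates described above can be sketched in a few lines, under the assumption of binary features; the function `em_iteration` and its argument names are hypothetical, chosen only for illustration:

```python
# Toy sketch of one EM iteration for Bayesian clustering with a naive
# Bayes model: a hidden class C with K values and binary observed
# features X_1..X_n. Names here are illustrative, not the lecture's.

def em_iteration(data, pi, theta):
    """pi[c] = P(C=c); theta[c][i] = P(X_i=1 | C=c); data = binary rows."""
    K, n, M = len(pi), len(theta[0]), len(data)
    m_hat_c = [0.0] * K                       # soft counts M-hat[c]
    m_hat_xc = [[0.0] * n for _ in range(K)]  # soft counts M-hat[x_i=1, c]
    for x in data:
        # E-step: P(c | x) by Bayes' rule under the naive Bayes factorization.
        joint = []
        for c in range(K):
            p = pi[c]
            for i in range(n):
                p *= theta[c][i] if x[i] else 1.0 - theta[c][i]
            joint.append(p)
        z = sum(joint)
        for c in range(K):
            post = joint[c] / z               # soft assignment of instance x
            m_hat_c[c] += post
            for i in range(n):
                if x[i]:
                    m_hat_xc[c][i] += post
    # M-step: maximum likelihood estimation with the soft counts.
    new_pi = [m_hat_c[c] / M for c in range(K)]
    new_theta = [[m_hat_xc[c][i] / m_hat_c[c] for i in range(n)]
                 for c in range(K)]
    return new_pi, new_theta
```

Iterating this function to convergence from a random initialization is the full clustering procedure; each call performs exactly one E-step and one M-step.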

As for the gradient ascent algorithm, this only requires a single calibration,

because the same family preservation property that was useful there applies

equally here. What are the pros of the EM algorithm?

It's very easy to implement on top of an existing maximum likelihood estimation

procedure that one might have for complete data.

So, having implemented an MLE procedure, one can easily build on top of

that by adding a separate E-step for computing the expected

sufficient statistics. Empirically, it turns out that EM

also makes very rapid progress in terms of converging to a better and

better set of parameters, especially in early iterations.

And we'll see some empirical results for that in a bit.

The con of the EM algorithm is that the convergence generally slows down at later

iterations. And so it sometimes requires that later

iterations use, say, some kind of conjugate gradient algorithm, because it

turns out that that's more effective there.