0:14

So in all the EM algorithms we introduce a hidden variable to help us solve the problem more easily. In our case, the hidden variable is a binary variable for each occurrence of a word, and this binary variable indicates whether the word has been generated from theta sub d or theta sub B. And here we show some possible values of these variables. For example, "the" is from the background, so its z value is one. "Text", on the other hand, is from the topic, so its z value is zero, and so on.

1:06

Now, the idea that we talked about before, for predicting which word distribution has been used to generate a word, is exactly a predictor of the value of this hidden variable. And so the EM algorithm would work as follows. First, we initialize all the parameters with random values. In our case, the parameters are mainly the probabilities of words given by theta sub d. So this is the initialization stage. These initialized values allow us to use Bayes' rule to take a guess at these z values, so we'd guess these values. We can't say for sure whether "text" is from the background or not, but we can have our guess, and it is given by this formula. This is called the E-step. So the algorithm would first use the E-step to guess these z values. After that, it would invoke another step called the M-step. In this step, we simply take advantage of the inferred z values and group together the words that come from the same distribution, like these from the background, including this one as well.

2:36

So let me also illustrate that we can group the words that are believed to have come from theta sub d, and that's "text", "mining", "algorithm", for example, and "clustering".

3:06

Note that before, we just set these parameter values randomly, but with this guess we will have a somewhat improved estimate of them. Of course, we don't know exactly whether z is zero or one, so we're not going to do the split in a hard way; rather, we're going to do a softer split. And this is what happens here.
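This soft split can be sketched as a small Bayes' rule computation. The mixing weight of 0.5 and the word probabilities below are made-up illustrative values, not the numbers on the slide:

```python
# E-step: Bayes' rule gives p(z = 1 | w), the probability that an occurrence
# of word w was generated by the background model rather than theta sub d.
lam = 0.5                           # assumed prior probability of the background
p_bg = {"the": 0.5, "text": 0.1}    # hypothetical background probabilities
p_d = {"the": 0.25, "text": 0.25}   # current (uniform) topic estimates

def p_background(w):
    """Soft assignment: how likely w came from the background model."""
    num = lam * p_bg[w]
    return num / (num + (1 - lam) * p_d[w])
```

A common word like "the" gets a high background probability while "text" gets a low one, but neither word is assigned in a hard, all-or-nothing way.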

3:39

And you can see, where does this come from? Well, this has come from here, right? From the E-step. So the EM algorithm would iteratively improve our initial estimate of the parameters by using the E-step first and then the M-step. The E-step augments the data with additional information, like z. The M-step takes advantage of that additional information to separate the data: it splits the counts and then collects the right counts to re-estimate our parameters. And once we have a new generation of parameters, we're going to repeat this. We do the E-step again to improve our estimate of the hidden variables, and that leads to another generation of re-estimated parameters.
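The whole iteration described above can be sketched as follows; the word counts, background probabilities, and mixing weight are all made-up illustrative values, not the slide's:

```python
# EM for the two-component mixture: the E-step infers the hidden z values,
# and the M-step reallocates word counts to re-estimate theta sub d.
counts = {"the": 4, "paper": 2, "text": 4, "mining": 2}
p_bg = {"the": 0.5, "paper": 0.3, "text": 0.1, "mining": 0.1}  # fixed background
lam = 0.5                                      # equal prior for the two models

p_d = {w: 1.0 / len(counts) for w in counts}   # initialization: uniform theta sub d

for _ in range(10):
    # E-step: p(z = 0 | w), the probability that w came from theta sub d
    pz0 = {w: (1 - lam) * p_d[w] / ((1 - lam) * p_d[w] + lam * p_bg[w])
           for w in counts}
    # M-step: allocate each count to the topic, then normalize
    alloc = {w: counts[w] * pz0[w] for w in counts}
    total = sum(alloc.values())
    p_d = {w: alloc[w] / total for w in counts}
```

After a few iterations, content words such as "text" end up with much higher probability under theta sub d than common words such as "the", which are explained away by the background.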

4:39

Okay, so as I said, the bridge between the two distributions is really the hidden variable z, which indicates how likely this word is from the topic word distribution, theta sub d.

4:56

So this slide has a lot of content, and you may need to pause the video to digest it. But it basically captures the essence of the EM algorithm. We start with initial values that are often just random, and then we invoke the E-step followed by the M-step to get an improved setting of the parameters. Then we repeat this, so this is a hill-climbing algorithm that will gradually improve the estimate of the parameters. As I will explain later, there is some guarantee of reaching a local maximum of the log-likelihood function.

So let's take a look at the computation for a specific case. These formulas are the EM formulas that you saw before, and you can also see there are superscripts here, like n, to indicate the generation of parameters. Here, for example, we have n plus one, which means we have improved: from here to here we have an improvement. In this setting we have assumed that the two models have equal prior probabilities and that the background model is known. So what are the relevant statistics? Well, these are the word counts. So assume we have just four words, and their counts are like this. And this is our background model, which assigns high probabilities to common words like "the".

6:25

In the first iteration, you can picture what will happen. First, we initialize all the values. So here, the probability distribution that we're interested in is initialized as a uniform distribution over all the words.

6:40

Then the E-step gives us a guess of which distribution has been used to generate each word. We can see we have different probabilities for different words. Why? Well, that's because these words have different probabilities under the background model. So even though the two distributions are equally likely, and our initial setting of theta sub d is a uniform distribution, because of the differences in the background distribution we get different guessed probabilities. So these words are believed to be more likely from the topic.

7:20

So once we have these z values, we know that in the M-step these probabilities will be used to adjust the counts. So the count four must be multiplied by this 0.33 in order to get the count allocated to the topic.

7:52

If this probability were one point zero, we would get the full count of this word for this topic; in general it's not going to be one point zero, so we're just going to get some percentage of the count allocated to this topic. Then we simply normalize these counts to get a new generation of parameter estimates. So you can see, compare this with the old one, which is here.
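That count allocation and normalization can be sketched in isolation. The counts and inferred probabilities below are made-up, though the 0.33 echoes the value mentioned above:

```python
# M-step: weight each word's count by the inferred probability that it came
# from the topic, then normalize the weighted counts into a distribution.
counts = {"the": 4, "text": 4, "mining": 2}            # illustrative counts
p_topic = {"the": 0.33, "text": 0.71, "mining": 0.71}  # p(z = 0 | w) from the E-step

alloc = {w: counts[w] * p_topic[w] for w in counts}    # e.g. 4 * 0.33 for "the"
total = sum(alloc.values())
new_p_d = {w: alloc[w] / total for w in alloc}         # new parameter estimates
```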

8:32

And of course, this new generation of parameters allows us to further adjust the inferred latent variable, or hidden variable, values. So we have a new generation of z values from the E-step, based on the new generation of parameters. And these newly inferred values of z will give us another generation of estimates of the word probabilities, and so on and so forth. This is what would actually happen when we compute these probabilities using the EM algorithm. As you can see in the last row, where we show the log-likelihood, the likelihood is increasing as we do the iterations. And note that the log-likelihood is negative, because a probability is between 0 and 1, and when you take the logarithm it becomes a negative value.
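The log-likelihood being tracked can be computed as below; the counts and probabilities here are illustrative stand-ins, not the slide's values:

```python
import math

counts = {"the": 4, "text": 4}            # illustrative word counts
p_bg = {"the": 0.5, "text": 0.1}          # fixed background probabilities
p_d = {"the": 0.2, "text": 0.5}           # current topic estimates
lam = 0.5                                 # mixing weight of the background

# Each word contributes c(w) * log of its mixture probability. Every mixture
# probability is below 1, so each log term, and hence the sum, is negative.
log_likelihood = sum(counts[w] * math.log(lam * p_bg[w] + (1 - lam) * p_d[w])
                     for w in counts)
```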

Now, what's also interesting is if you look at the last column. These are the inferred word splits: the probabilities that a word is believed to have come from one distribution, in this case the topical distribution. And you might wonder whether these are also useful, because our main goal is to estimate the word distributions; that is our primary goal, and we hope to have a more discriminative word distribution. But the last column, a by-product, can also actually be very useful. You can think about it: one use, for example, is to estimate to what extent this document has covered background words. When we add these values up, or take their average, we will roughly know to what extent the document has covered background words versus content words that are not explained well by the background.
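That use of the by-product can be sketched as follows, with hypothetical inferred values (one minus the topic probability in the last column gives the background probability):

```python
# The inferred p(z = 0 | w) values tell us, per word, how likely it came from
# the topic; one minus that value is the background probability. A
# count-weighted average of the background probabilities estimates how much
# of the document consists of background words. All values are illustrative.
counts = {"the": 4, "is": 3, "text": 4, "mining": 2}
p_topic = {"the": 0.1, "is": 0.15, "text": 0.8, "mining": 0.85}

total = sum(counts.values())
background_coverage = sum(counts[w] * (1 - p_topic[w]) for w in counts) / total
```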
