0:21

So this is a slide that you have seen before; here we show how we define the mixture model for text clustering and what the likelihood function looks like. And we can also compute the maximum likelihood estimate to estimate the parameters.

In this lecture, we're going to talk more about how exactly we compute the maximum likelihood estimate. As in most cases, the EM algorithm can be used to solve this problem for mixture models. So here's a detailed description of the EM algorithm for document clustering. Now, if you have understood how the EM algorithm works for topic models like PLSA, then this will look very similar; we just need to adapt it a little bit to this new mixture model. So, as you may recall, the EM algorithm starts with the initialization of all the parameters. This is the same as before for topic models.

Â 1:28

And then we're going to repeat until the likelihood converges, and in each iteration we'll do an E-step and an M-step. In the E-step, we're going to infer which distribution has been used to generate each document. So we have to introduce a hidden variable Z sub d for each document, and this variable can take a value from the range of 1 through k, representing the k different distributions.

Â 1:59

More specifically, we're going to apply Bayes' rule to infer which distribution is more likely to have generated this document, that is, to compute the posterior probability of each distribution given the document.

Â 2:17

And we know it's proportional to the probability of selecting this distribution, p of theta sub i, and the probability of generating this whole document from that distribution, which is the product of the probabilities of the words in this document, as you see here. Now, as usual, it's useful to remember

2:45

the normalizer, or the constraint, on this probability.

So in this case, we know the constraint on this probability in the E-step is that all the probabilities of Z equals i must sum to 1, because the document must have been generated from precisely one of these k topics. So the probabilities of it being generated from each of them should sum to 1. And if you know this constraint, then you can easily compute this distribution as long as you know what it is proportional to. So once you compute the product that you see here, then you simply normalize

3:31

these probabilities to make them sum to 1 over all the topics.

So that's the E-step. After the E-step, we know which distribution is more likely to have generated each document d, and which is unlikely.

Â 3:45

And then in the M-step, we're going to re-estimate all the parameters based on the inferred z values, or the inferred knowledge about which distribution has been used to generate which document. The re-estimation involves two kinds of parameters. One is p of theta sub i, the probability of selecting a particular distribution. Before we observe anything, we don't have any knowledge about which cluster is more likely, but after we have observed these documents, we can collect the evidence to infer which cluster is more likely. And so this is proportional to the sum, over all the documents d sub j, of the probability that Z sub d j is equal to i.

Â 4:54

Now, the other kind of parameters are the probabilities of words in each distribution, in each cluster. This is very similar to the case of PLSA: here we just pool together the counts of the words in documents that are inferred to have been generated from a particular topic theta sub i. This allows us to estimate how many words have actually been generated from theta sub i. And then we normalize these counts into probabilities, so that the probabilities of all the words sum to 1.

Note that it's very important to understand these constraints, as they are precisely the normalizers in all these formulas. And it's also important to know what each distribution is over.

Â 5:54

For example, the probability of theta is over all the k topics; that's why those k probabilities will sum to 1. Whereas the probability of a word given theta is a distribution over all the words, so there are many probabilities, and they have to sum to 1.
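Putting the E-step and M-step together, the procedure described above can be sketched in Python. This is a minimal illustration with my own function and variable names, not the lecturer's code: it runs a fixed number of iterations instead of checking likelihood convergence, and it omits the underflow safeguard discussed later.

```python
import random
from math import prod

def em_cluster(docs, k, vocab, iters=20):
    """Fit a mixture of k unigram language models by EM, where each
    document is generated by exactly one of the k word distributions.
    docs: list of {word: count} dictionaries over `vocab`."""
    rng = random.Random(0)

    # Initialization: uniform cluster priors, random word distributions.
    p_theta = [1.0 / k] * k
    p_w = []
    for _ in range(k):
        raw = [rng.random() for _ in vocab]
        total = sum(raw)
        p_w.append({w: r / total for w, r in zip(vocab, raw)})

    for _ in range(iters):
        # E-step: posterior p(Z_d = i | d) via Bayes' rule, then
        # normalize so the k probabilities per document sum to 1.
        post = []
        for doc in docs:
            joint = [p_theta[i] * prod(p_w[i][w] ** c for w, c in doc.items())
                     for i in range(k)]
            norm = sum(joint)
            post.append([j / norm for j in joint])

        # M-step: re-estimate priors and word distributions from the
        # fractional counts given by the posteriors.
        for i in range(k):
            p_theta[i] = sum(p[i] for p in post) / len(docs)
            counts = {w: 0.0 for w in vocab}
            for p, doc in zip(post, docs):
                for w, c in doc.items():
                    counts[w] += p[i] * c
            total = sum(counts.values())
            p_w[i] = {w: cnt / total for w, cnt in counts.items()}

    return p_theta, p_w
```

After fitting, each document can be assigned to the cluster with the highest posterior, and both kinds of parameters satisfy the sum-to-1 constraints by construction.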

So now, let's take a look at a simple example with two clusters. For the two clusters, I've assumed some initialized values for the two word distributions. And let's assume we randomly initialize the probability of selecting each cluster as 0.5, so they're equally likely. And then let's consider the one document that you see here. There are two occurrences of "text" and two occurrences of "mining", so there are four words altogether, and "medical" and "health" did not occur in this document. So let's think about the hidden variable.

Â 6:50

Now, for each document, we must use a hidden variable. Before, in PLSA, we used one hidden variable for each word, because there a word is the output of the mixture model. In our case, the output of the mixture model, the observation, is a document, not a word. So now we have one hidden variable attached to the whole document, and that hidden variable must tell us which distribution has been used to generate the document. So it's going to take two values, one and two, to indicate the two topics.

Â 7:25

So now, how do we infer which distribution has been used to generate d? Well, we use Bayes' rule, so it looks like this. In order for the first topic, theta 1, to generate the document, two things must happen. First, theta 1 must have been selected, and that's given by p of theta 1. Second, it must also have generated the four words in the document, namely the two occurrences of "text" and the two occurrences of "mining". And that's why you see the numerator is the product of the probability of selecting theta 1 and the probability of generating the document from theta 1. The denominator is just the sum of the two possible ways of generating this document. And you can plug in the numerical values to verify that indeed, in this case, the document is much more likely to have been generated from theta 1 than from theta 2.

So once we have this probability, we can easily compute the probability of Z equals 2 given this document. How? Well, we can use the constraint: it's going to be 1 minus 100 over 101.
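As a concrete check, this Bayes'-rule computation can be done in a few lines of Python. The parameter values below are made up for illustration; they are not the exact numbers on the slide.

```python
# E-step posterior for one document via Bayes' rule, with hypothetical
# initial parameter values (not the ones on the slide).
theta = [
    {"text": 0.5, "mining": 0.4, "medical": 0.05, "health": 0.05},  # theta 1
    {"text": 0.05, "mining": 0.05, "medical": 0.5, "health": 0.4},  # theta 2
]
p_theta = [0.5, 0.5]               # p(theta 1), p(theta 2)
doc = {"text": 2, "mining": 2}     # word counts of the document

# Unnormalized: p(theta_i) * product over words of p(w|theta_i)^count
joint = []
for i in range(2):
    p = p_theta[i]
    for w, c in doc.items():
        p *= theta[i][w] ** c
    joint.append(p)

z = sum(joint)                     # normalizer: the two cases sum to 1
posterior = [p / z for p in joint]
print(posterior)                   # posterior[0] is p(Z=1|d), close to 1 here
```

With these values the document is overwhelmingly more likely to come from theta 1, and p(Z=2|d) is simply 1 minus p(Z=1|d), as the constraint says.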

Now, it's important to note that in such a computation there is a potential problem of underflow. That's because, if you look at the original numerator and denominator, they involve the computation of a product of many small probabilities. Imagine a document with many words: the product is going to be a very small value, and that can cause the problem of underflow.

So to solve the problem, we can use a normalizer. Here you see that we take an average of the two word distributions to compute an average distribution, called theta bar.

Â 9:24

And this average distribution is comparable to each of the two distributions in magnitude. So we can then divide both the numerator and the denominator by this normalizer; basically, this normalizes the probability of generating the document by the probability of generating it using the average word distribution. So you can see the normalizer here. And since we have used exactly the same normalizer for the numerator and the denominator, the value of the whole expression is not changed. But by doing this normalization, we make the numerator and the denominator more manageable, in that neither value is going to be extremely small, and thus we can avoid the underflow problem.
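To see the effect, here is a toy illustration of the underflow and the fix. The per-word probabilities and the document length are invented; the point is only that dividing each per-word factor by the corresponding average-distribution probability keeps the products in a representable range.

```python
# Toy underflow demo: assume every word in a hypothetical 1000-word
# document has the same probabilities under the two distributions.
p1, p2 = 0.2, 0.1              # p(w|theta 1), p(w|theta 2) per word (toy)
n = 1000                       # number of word occurrences in the document

naive1 = 0.5 * p1 ** n         # underflows to exactly 0.0 in a float
naive2 = 0.5 * p2 ** n         # also 0.0 -- the posterior would be 0/0

avg = (p1 + p2) / 2            # p(w|theta bar), the average distribution
norm1 = 0.5 * (p1 / avg) ** n  # each factor is now near 1: representable
norm2 = 0.5 * (p2 / avg) ** n

print(naive1, naive2)          # 0.0 0.0 -- both underflow
print(norm1 / (norm1 + norm2)) # posterior p(Z=1|d), computed safely
```

Because numerator and denominator are divided by the same quantity, the normalized ratio equals the original posterior; only the intermediate magnitudes change.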

Â 10:24

Sometimes we also take the logarithm of the product to convert it into a sum of log probabilities. This can help preserve precision as well, but in this case taking the logarithm alone cannot solve the problem, because there is a sum in the denominator. This kind of normalizer, however, can be effective, so it's a technique that's sometimes useful in other situations as well.

Now let's look at the M-step. From the E-step, we have our estimate of which distribution is more likely to have generated each document d. You can see d1 is more likely from the first topic, whereas d2 is more likely from the second topic, etc.

Now, let's think about what we need to compute in the M-step. Basically, we need to re-estimate all the parameters. First, look at p of theta 1 and p of theta 2. How do we estimate them? Intuitively, you can just pool together these z probabilities from the E-step. So if most of these documents say they're more likely from theta 1, then we intuitively give a higher probability to theta 1. In this case, we can just take an average of the probabilities that you see here, and we obtain 0.6 for theta 1. So theta 1 is more likely than theta 2, and you can see the probability of theta 2 would naturally be 0.4.

What about the word probabilities? Well, we do the same, and the intuition is the same. In order to estimate the probabilities of words in theta 1, we're going to look at which documents have been generated from theta 1, and we're going to pool together the words in those documents and normalize them. So this is basically what I just said.

Â 12:20

More specifically, we're going to use, for example, all the counts of "text" in these documents for estimating the probability of "text" given theta 1. But we're not going to use the raw counts, the total counts. Instead, we discount them by the probability that each document was generated from theta 1. This gives us fractional counts, and these counts are then normalized in order to get the probabilities. Now, how do we normalize them? Well, the probabilities of all these words must sum to 1.
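As a tiny worked illustration of these fractional counts, with all numbers invented rather than taken from the slide, re-estimating the word distribution for theta 1 might look like this:

```python
# Fractional counts for p(w | theta 1): discount each document's raw
# word counts by p(Z_d = 1 | d) from the E-step, then normalize over
# the vocabulary. All numbers here are made up for illustration.
post_theta1 = [0.99, 0.02, 0.80]          # p(Z_d = 1 | d) for d1..d3
counts = [                                # word counts per document
    {"text": 2, "mining": 2},
    {"medical": 3, "health": 1},
    {"text": 1, "mining": 3},
]

vocab = ["text", "mining", "medical", "health"]
frac = {w: sum(p * d.get(w, 0) for p, d in zip(post_theta1, counts))
        for w in vocab}                   # discounted (fractional) counts
total = sum(frac.values())                # normalizer: make them sum to 1
p_w_theta1 = {w: c / total for w, c in frac.items()}
print(p_w_theta1)
```

Documents believed to come from theta 1 contribute almost their full counts, while documents believed to come from the other cluster contribute almost nothing.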

So, to summarize our discussion of generative models for clustering: we've shown that a slight variation of a topic model can be used for clustering documents. This also shows the power of generative models in general. By changing the generation assumption and changing the model slightly, we can achieve different goals, and we can capture different patterns in the data. In this case, each cluster is represented by a unigram language model, a word distribution, and that is similar to a topic model. So here you can see each word distribution actually gives us a term cluster as a by-product. A document is generated by first choosing a unigram language model and then generating all the words in the document using that single language model. And this is very different from a topic model, where we can generate the words in a document using multiple unigram language models.

Â 14:16

If we want to partition documents into disjoint clusters, then we can just force each document into the cluster corresponding to the word distribution that's most likely to have generated the document. We've also shown that the EM algorithm can be used to compute the maximum likelihood estimate, and in this case, we need to use a special normalization technique to avoid underflow.

Â [MUSIC]

Â