0:07

This lecture is about Latent Dirichlet Allocation, or LDA.

In this lecture, we are going to continue talking about topic models.

In particular, we are going to talk about some extensions of PLSA,

one of which is LDA, or Latent Dirichlet Allocation.

So the plan for this lecture is to cover two things.

One is to extend PLSA with prior knowledge, which would give us, in some sense, a user-controlled PLSA:

it doesn't just listen to the data, but also listens to our needs.

The second is to extend PLSA into a fully generative model.

This has led to the development of Latent Dirichlet Allocation, or LDA.

So first, let's talk about PLSA with prior knowledge.

Now in practice, when we apply PLSA to analyze text data, we might have additional knowledge that we want to inject to guide the analysis.

The standard PLSA blindly listens to the data by using maximum likelihood estimation.

We just fit the data as well as we can and get some insight about it.

This is also very useful, but sometimes a user might have some expectations about which topics to analyze.

For example, we might expect to see retrieval models as a topic in information retrieval, or we may be interested in certain aspects,

such as battery and memory, when looking at opinions about a laptop,

because the user is particularly interested in these aspects.

A user may also have knowledge about topic coverage: we may know which topics definitely do or do not cover a particular document.

For example, we might have seen topic tags assigned to documents,

and those tags could be treated as topics.

If we do that, then a document would be generated using only the topics corresponding to the tags already assigned to it.

If the document is not assigned a tag, we're going to say that topic cannot be used to generate the document.

The document must be generated using the topics corresponding to its assigned tags.

So the question is: how can we incorporate such knowledge into PLSA?

It turns out there is a very elegant way of doing that,

which is to incorporate such knowledge as priors on the model parameters.

You may recall that in Bayesian inference, we use a prior together with the data to estimate parameters,

and this is precisely what happens here.

So in this case, we can use the maximum a posteriori estimate,

also called the MAP estimate, and the formula is given here.

Basically, this is to maximize the posterior probability,

which combines the likelihood of the data with the prior.

So what happens is that we get an estimate that listens to the data and also listens to our prior preferences.
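The MAP formula referred to here can be written as follows, using Λ for the full set of PLSA parameters (the notation is mine, since the slide's symbols are not reproduced in the transcript):

```latex
\Lambda^{*}
  = \arg\max_{\Lambda}\, p(\Lambda \mid \text{Data})
  = \arg\max_{\Lambda}\, p(\text{Data} \mid \Lambda)\, p(\Lambda)
```

The second equality holds because the posterior is proportional to the likelihood times the prior, and the normalizing constant does not depend on Λ.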

We can use this prior, denoted p(Λ), to encode all kinds of preferences and constraints.

So for example, we can use it to encode the requirement of having a precise background topic.

This could be encoded as a prior because we can say the prior on the parameters is non-zero

only if the parameters contain one topic that is equivalent to the background language model.

In other words, in all other cases, the prior says it is impossible:

the probability of that kind of model would be zero according to our prior.

We can also, for example, use the prior to force a particular topic choice to have a certain probability.

For example, we can force document d to choose topic one with probability one half,

or we can prevent a topic from being used in generating a document.

So if we say the third topic should not be used in generating document d, we set the corresponding pi to zero for that topic.

We can also use the prior to favor parameter sets with topics that assign high probability to some particular words.

In this case, we are not saying other distributions are impossible;

we just strongly favor certain kinds of distributions, and you will see an example later.

The MAP estimate can be computed using an EM algorithm similar to the one used for the maximum likelihood estimate,

with just some modifications, so that the estimated parameters reflect the prior preferences.

In such an estimate, if we use a special form of the prior called a conjugate prior,

then the functional form of the prior matches that of the data likelihood.

As a result, we can combine the two, and the consequence is

that we can basically convert the influence of the prior

into the effect of having additional pseudo data,

because the two functional forms are the same and can be combined.

So the effect is as if we had more data, which is convenient for computation.

This does not mean a conjugate prior is the best way to define a prior.

So now let's look at a specific example.

Suppose the user is particularly interested in the battery life of a laptop, and we are analyzing reviews.

So the prior says that the model should contain

one word distribution that assigns high probability to "battery" and "life".

So we could imagine a distribution concentrated on "battery" and "life",

and the prior says that one of the topic word distributions should be very similar to it.

Now if we use the MAP estimate with a conjugate prior,

which here is a Dirichlet prior defined based on this preference,

then the only difference in the EM algorithm is that when we re-estimate the word distributions,

we add additional counts to reflect our prior.

So here you can see the pseudo counts are defined based on the probabilities of words in the prior.

"Battery" would obviously have a high pseudo count, and similarly "life" would also have a high pseudo count.

All the other words would have zero pseudo counts because their probabilities are zero in the prior.

We also see this is controlled by a parameter mu:

when we re-estimate the word distribution, we add mu multiplied by the probability of w under the prior distribution to the collected counts.

So this is the only step that changes, and the change happens here.

Before, we just collected the counts of words that we believe have been generated from this topic,

but now we force this distribution to give more probability to these words by adding the pseudo counts.

So in effect, we artificially inflate their probabilities.

To make this a proper distribution, we also need to add the corresponding pseudo counts to the denominator.

This is the total sum of all the pseudo counts we have added over all the words,

which ensures the probabilities sum to one.
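As a sketch (not the instructor's actual code; all variable names are hypothetical), the modified M-step for one topic's word distribution might look like this in Python. Here `counts[w]` is the expected count of word `w` assigned to this topic from the E-step, `prior` is the prior word distribution, and `mu` is the prior strength:

```python
def m_step_word_dist(counts, prior, mu, vocab):
    """Re-estimate one topic's word distribution under a conjugate
    (Dirichlet) prior: add mu * p(w | prior) pseudo counts to the
    expected counts collected in the E-step, then normalize."""
    # Denominator: real counts plus the total pseudo counts (mu,
    # since the prior probabilities sum to one).
    total = sum(counts[w] for w in vocab) + mu
    return {w: (counts[w] + mu * prior.get(w, 0.0)) / total for w in vocab}

# Toy example: a prior concentrated on "battery" and "life".
vocab = ["battery", "life", "screen"]
counts = {"battery": 2.0, "life": 1.0, "screen": 3.0}
prior = {"battery": 0.5, "life": 0.5}

theta = m_step_word_dist(counts, prior, mu=10.0, vocab=vocab)
# "battery" and "life" are pulled up; "screen" gets no pseudo counts.
```

Setting `mu=0.0` here recovers the plain maximum likelihood M-step, which matches the extreme case discussed next in the lecture.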

Now, this is an intuitively very reasonable way of modifying EM, and theoretically speaking,

this works: it computes the MAP estimate.

It is useful to think about two specific extreme cases of mu.

Now, look at the picture.

Think about what would happen if we set mu to zero.

Well, that essentially removes the prior.

So mu, in some sense, indicates the strength of our prior.

Now what would happen if we set mu to positive infinity?

Well, that says the prior is so strong that we are not going to listen to the data at all.

So in the end, you see, in this case we would fix one of the distributions to the prior. Do you see why?

When mu is infinite, we basically let the prior term dominate.

In fact, we set this word distribution to be precisely the prior distribution.

So in this case, it is this distribution.
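The two extreme cases can be read off the smoothed update directly. Using my own notation (the slide's symbols are not in the transcript), let c(w, θ_j) be the expected count of word w for topic j from the E-step and p'(w) the prior word distribution:

```latex
p(w \mid \theta_j)
  = \frac{c(w,\theta_j) + \mu\, p'(w)}
         {\sum_{w'} c(w',\theta_j) + \mu}
```

When μ = 0 this is the plain maximum likelihood update, and as μ → ∞ the pseudo counts dominate and p(w | θ_j) → p'(w), fixing the topic to the prior distribution.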

And that is why we said the background language model is in fact a way to impose a prior:

it forces one distribution to be exactly the one we give,

namely the background distribution.

So in this case, we could even force the distribution to focus entirely on "battery" and "life".

But of course this would not work well, because the distribution could not attract other related words,

which would hurt the accuracy of capturing the topic of battery life.

So in practice, mu is set somewhere in between, of course.

So this is one way to impose a prior.

We can also impose some other constraints.

For example, we can set any parameter to a constant, including zero, as needed.

For example, we may want to set one of the pi's to zero,

which would mean we do not allow that topic to participate in generating that document.

And this is only reasonable, of course, when we have prior knowledge that strongly suggests it.
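The coverage constraint just described can be sketched in a few lines of Python (hypothetical variable names; the transcript does not give code):

```python
# Preventing topic index 2 from generating one document by pinning
# its coverage probability pi_{d,j} to zero.
pi = [0.5, 0.3, 0.2]   # topic coverage for one document
forbidden = 2          # index of the topic to exclude

pi[forbidden] = 0.0
total = sum(pi)
pi = [p / total for p in pi]   # renormalize the remaining coverage
```

A convenient property is that the EM update for pi is multiplicative (each new value is proportional to the old one), so a coverage probability initialized to zero stays zero in every iteration; imposing the constraint once at initialization is enough.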
