0:07

This lecture is about that Latent Dirichlet Allocation or LDA.

In this lecture, we are going to continue talking about topic models.

In particular, we are going to talk about some extension of PLSA,

and one of them is LDA or Latent Dirichlet Allocation.

So the plan for this lecture is to cover two things.

One is to extend the PLSA with prior knowledge and that

would allow us to have in some sense a user-controlled PLSA,

so it doesn't apply to they just listen to data,

but also would listen to our needs.

The second is to extend the PLSA as a generative model,

a fully generative model.

This has led to the development of Latent Dirichlet Allocation or LDA.

So first, let's talk about the PLSA with prior knowledge.

Now in practice, when we apply PLSA to analyze text data,

we might have additional knowledge that we want to inject to guide the analysis.

The standard PLSA is going to blindly listen to the data by using maximum [inaudible].

We are going to just fit data as much as we can and get some insight about data.

This is also very useful,

but sometimes a user might have some expectations about which topics to analyze.

For example, we might expect to see retrieval models as a topic in

information retrieval or we also may be interesting in certain aspects,

such as battery and memory,

when looking at opinions about a laptop because

the user is particularly interested in these aspects.

A user may also have knowledge about topic coverage and we may

know which topic is definitely not covering which document or is covering the document.

For example, we might have seen those tags,

topic tags assigned to documents.

And those tags could be treated as topics.

If we do that then a document account will be generated using

topics corresponding to the tags already assigned to the document.

If the document is not assigned a tag,

we're going to say there is no way for using that topic to generate document.

The document must be generated by using the topics corresponding to that assigned tags.

So question is how can we incorporate such knowledge into PLSA.

It turns out that there is a very elegant way of doing

that and that would incorporate such knowledge as priors on the models.

And you may recall in Bayesian inference,

we use prior together with data

to estimate parameters and this is precisely what would happen.

So in this case,

we can use maximum

a posteriori estimate also called MAP estimate and the formula is given here.

Basically, this is to maximize the posteriori distribution probability.

And this is a combination of the likelihood of data and the prior.

So what would happen is that we are going to have an estimate

that listens to the data and also listens to our prior preferences.

We can use this prior which is denoted as p of lambda

to encode all kinds of preferences and the constraints.

So for example, we can use this to encode

the need of having precise background of the topic.

Now this could be encoded as a prior because we can say the prior for the parameters is

only a non-zero if

the parameters contain one topic that is equivalent to the background language model.

In other words, in other cases if it is not like that,

we are going to say the prior says it is impossible.

So the probability of that kind of models I think would be zero according to our prior.

So now we can also for example use the prior to

force particular choice of topic to have a probability of a certain number.

For example, we can force document D to choose topic one with probability of

one half or we can prevent topic from being used in generating document.

So we can say the third topic should not be used in generating document D,

we will set to the Pi zero for that topic.

We can also use the prior to favor a set of parameters

with topics that assign high probability to some particular words.

In this case, we are not going to say it is impossible but we can just strongly

favor certain kind of distributions and you will see example later.

The MAP can be computed using a similar EM algorithm as we have used

for the maximum likelihood estimate.

With just some modifications,

most of the parameters would reflect the prior preferences and in

such an estimate if we use a special form of the prior code or conjugate the prior,

then the functional form of the prior will be similar to the data.

As a result, we can combine the two and the consequence is

that you can basically convert the inference of the prior

into the inference of having additional pseudo data

because the two functional forms are the same and they can be combined.

So the effect is as if we had more data and this is convenient for computation.

It does not mean conjugate prior is the best way to define prior.

So now let us look at the specific example.

Suppose the user is particularly interested in

battery life of a laptop and we are analyzing reviews.

So the prior says that the distribution should contain

one distribution that would assign high probability to battery and life.

So we could say well there is distribution that is kind of concentrated on

battery life and prior says that one of distributions should be very similar to this.

Now if we use MAP estimate with conjugate prior,

which is the original prior,

the original distribution based on this preference,

then the only difference in the EM is that when we re-estimate words distributions,

we are going to add additional counts to reflect our prior.

So here you can see the pseudo counts

are defined based on the probability of words in a prior.

So battery obviously would have

high pseudo counts and similarly life would have also high pseudo counts.

All the other words would have zero pseudo counts because their probability is

zero in the prior and we see this is also controlled by

a parameter mu and we are going to add a mu much by the probability of W given

prior distribution to the connected accounts when we

re-estimate this word distribution.

So this is the only step that is changed and the change is happening here.

And before we just connect the counts of words that

we believe have been generated from this topic but now we

force this distribution to give more probabilities

to these words by adding them to the pseudo counts.

So in fact we artificially inflated their probabilities.

To make this distribution,

we also need to add this many pseudo counts to the denominator.

This is total sum of all the pseudo counts we have added

for all the words This would make this a gamma distribution.

Now this is intuitively very reasonable way of modifying EM and theoretically speaking,

this works and it computes the MAP estimate.

It is useful to think about the two specific extreme cases of mu.

Now, [inaudible] the picture.

Think about what would happen if we set mu to zero.

Well that essentially to remove this prior.

So mu in some sense indicates our strengths on prior.

Now what would happen if we set mu to positive infinity.

Well that is to say that prior is so

strong that we are not going to listen to the data at all.

So in the end, you see in this case

we are going to make one of the distributions fixed to the prior. You see why?

When mu is infinitive, we basically let this one dominate.

In fact we are going to set this one to precise this distribution.

So in this case, it is this distribution.

And that is why we said

the background language model is in fact a way to impose the prior

because it would force one distribution to be exactly the same as what we give,

that is background distribution.

So in this case, we can even force the distribution to entirely focus on battery life.

But of course this would not work well because it cannot attract other words.

It would affect the accuracy of counting topics about battery life.

So in practice, mu is set somewhere in between of course.

So this is one way to impose a prior.

We can also impose some other constraints.

For example, we can set any parameters that will constantly include zero as needed.

For example, we may want to set one of the Pi's to

zero and this would mean

we do not allow that topic to participate in generating that document.

And this is only reasonable of course when we

have prior analogy that strongly suggests this.