[MUSIC] This lecture is about the mixture of unigram language models. In this lecture we will continue discussing probabilistic topic models. In particular, we will introduce the mixture of unigram language models.

This is a slide that you have seen earlier, where we talked about how to get rid of the background words that appear at the top of the word distribution for one document. If we want to solve this problem, it is useful to think about why we end up with it in the first place. Obviously, it is because these words are very frequent in our data and we are using maximum likelihood to estimate the distribution. The estimate then has to assign high probabilities to these words in order to maximize the likelihood. So, in order to get rid of them, we have to do something differently. In particular, we have to say that this distribution does not have to explain all the words in the text data. What we are going to say is that these common words should not be explained by this distribution.

One natural way to solve the problem is to use another distribution to account for just these common words. The two distributions can then be mixed together to generate the text data, and we let the other model, which we call the background topic model, generate the common words. This way our target topic theta sub d will only generate the content words that characterize the content of the document.

So, how does this work? It is just a small modification of the previous setup, where we had only one distribution. Since we now have two distributions, we have to decide which distribution to use each time we generate a word. Each word is still a sample from one of the two distributions, and the text data is still generated in the same way: we generate one word at a time, and eventually we generate all the words. When we generate a word, however, we first decide which of the two distributions to use. This choice is controlled by another probability: the probability of theta sub d and the probability of theta sub B. The first is the probability of selecting the topic word distribution, and the second is the probability of selecting the background word distribution, denoted by theta sub B. In this example I set both to 0.5, so you would basically flip a fair coin to decide which one to use. In general, though, these probabilities do not have to be equal; you might bias toward using one distribution more than the other.

The process of generating a word is therefore: first we flip a coin, choosing each model based on these probabilities. If, say, the coin shows heads, meaning we use the topic word distribution, then we use that distribution to generate the word. Otherwise we follow the other path and use the background word distribution to generate the word. In such a case, we have a model with some uncertainty about which word distribution is used, but we can still think of it as a model for generating text data. Such a model is called a mixture model.

So now let's see: in this case, what is the probability of observing a word w? Here I show some example words, like "the" and "text". As in all cases, once we set up a model we are interested in computing the likelihood function, and the basic question is: what is the probability of observing a specific word here?
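To make the generative process concrete, here is a minimal sketch in Python. It is not part of the lecture; the specific word distributions and the 0.5/0.5 mixing weights are illustrative assumptions only.

```python
import random

# Hypothetical word distributions (illustrative values, not from the lecture).
theta_d = {"text": 0.4, "mining": 0.3, "model": 0.2, "the": 0.1}  # topic of document d
theta_B = {"the": 0.5, "is": 0.3, "we": 0.2}                      # background topic

p_theta_d = 0.5   # probability of selecting the topic word distribution
p_theta_B = 0.5   # probability of selecting the background word distribution

def sample_word():
    """Generate one word: first choose a distribution, then sample a word from it."""
    dist = theta_d if random.random() < p_theta_d else theta_B
    words, probs = zip(*dist.items())
    return random.choices(words, weights=probs, k=1)[0]

# Generate a small "document" word by word.
doc = [sample_word() for _ in range(10)]
print(doc)
```

Each word is produced by the same two-step process described above: flip a (possibly biased) coin to pick a component, then sample a word from that component's distribution.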
Now, we know that the word can be generated from either of the two distributions, so we have to consider two cases, and the probability is a sum over these two cases. The first case is to use the topic word distribution to generate the word. In that case the probability is the probability of theta sub d, which is the probability of choosing that model, multiplied by the probability of actually observing the word from that model. Both events must happen in order to observe the word: we must first choose the topic theta sub d, and then we must actually sample the word "the" from that distribution. Similarly, the second part accounts for the other way of generating the word, from the background distribution.

The probability of "text" has the same form, of course. Again there are two ways of generating the word, and in each case it is the probability of choosing a particular distribution multiplied by the probability of observing the word from that distribution. As you will see, this is actually a general form, so you might want to make sure that you have really understood this expression, and you should convince yourself that this is indeed the probability of observing "text".

To summarize what we observed here: the probability of a word under a mixture model is in general a sum over the different ways of generating the word, and in each case it is the product of the probability of selecting that component model and the probability of actually observing the data point from that component model. This is quite general, and you will see it occurring often later.

So the basic idea of a mixture model is just to put these two distributions together as one model. I used a box to bring all these components together. If you view the whole box as one model, it is just like any other generative model: it gives us the probability of a word. But the way this probability is determined is quite different from when we have just one distribution. It is a more complicated model than a single distribution, and it is called a mixture model.

As I just said, we can treat this as a generative model, and it is often useful to think of it in terms of its likelihood function. The illustration that you have seen before, which is dimmed now, is just an illustration of this generative model. Mathematically, this model does nothing more than define the following generative model, where the probability of a word is a sum over the two ways of generating the word. The form you are seeing now is more general than what you saw in the calculation earlier; I just use the symbol w to denote any word, but you can still see that it is first a sum, and this sum is due to the fact that the word can be generated in multiple ways, two ways in this case. Inside the sum, each term is a product of two terms: first, the probability of selecting a component, such as theta sub d, and second, the probability of actually observing the word from that component model. This is a very general description of all mixture models. I want to make sure that you understand it, because it is really the basis for understanding all kinds of topic models.

So now, once we have set up the model, we can write down the likelihood function as we see here. The next question is: how can we estimate the parameters, or what can we do with the parameters, given the data?
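Written as an equation, in the notation the lecture uses, the word probability just described is

$$
p(w) \;=\; p(\theta_d)\,p(w \mid \theta_d) \;+\; p(\theta_B)\,p(w \mid \theta_B),
$$

so for the two example words,

$$
p(\text{"the"}) = p(\theta_d)\,p(\text{"the"} \mid \theta_d) + p(\theta_B)\,p(\text{"the"} \mid \theta_B),
\qquad
p(\text{"text"}) = p(\theta_d)\,p(\text{"text"} \mid \theta_d) + p(\theta_B)\,p(\text{"text"} \mid \theta_B).
$$

Each term is exactly the product described above: the probability of selecting a component times the probability of the word under that component.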
Well, in general, we can use the text data to estimate the model parameters, and this estimation allows us to discover interesting knowledge about the text. So in this case, what do we discover? These discoveries are represented by our parameters, and we have two kinds of parameters. One kind is the two word distributions, which represent the topics; the other is the coverage of each topic, which is determined by the probability of theta sub d and the probability of theta sub B, and these two sum to one.

Now, it is also interesting to think about special cases, such as setting one of these probabilities to one. What would happen? Well, the other would be zero, right? And if you look at the likelihood function, it would then degenerate to the special case of just one distribution. You can easily verify this by assuming one of the two is 1.0 and the other is zero. In this sense the mixture model is more general than the previous model, where we had just one distribution; it covers that model as a special case.

To summarize, we talked about the mixture of two unigram language models. The data we are considering here is just one document, and the model is a mixture model with two components, two unigram language models: theta sub d, which is intended to denote the topic of document d, and theta sub B, which represents a background topic that we set up to attract the common words, because common words would be assigned high probabilities in this model. The parameters can be collectively called Lambda, which I show here. You can again think about the question of exactly how many parameters we are talking about; this is usually a good exercise, because it allows you to look at the model in depth and to have a complete understanding of what is going on in the model. And of course we also have the mixing weights.

So what does the likelihood function look like? Well, it looks very similar to what we had before. For the document, it is first a product over all the words in the document, exactly as before. The only difference is that inside we now have a sum instead of a single term. You may recall that before we just had one term there; now we have this sum because of the mixture model, and because of the mixture model we also have to introduce the probability of choosing each particular component distribution. The second line is just another way of writing the same likelihood, using a product over all the unique words in our vocabulary instead of a product over all the positions in the document. This form, where we look at the distinct, unique words, is a convenient form for computing the maximum likelihood estimate later. The maximum likelihood estimator is, as usual, just the set of parameters that maximizes the likelihood function. The constraints here are of course of two kinds: the word probabilities in each word distribution must sum to one, and the probabilities of choosing the two distributions must also sum to one. [MUSIC]
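As a compact restatement of the likelihood and the constrained maximization just described (with V denoting the vocabulary and c(w, d) the count of word w in document d), the two equivalent forms of the document likelihood are

$$
p(d \mid \Lambda)
= \prod_{i=1}^{|d|} \Big[ p(\theta_d)\,p(w_i \mid \theta_d) + p(\theta_B)\,p(w_i \mid \theta_B) \Big]
= \prod_{w \in V} \Big[ p(\theta_d)\,p(w \mid \theta_d) + p(\theta_B)\,p(w \mid \theta_B) \Big]^{c(w,d)},
$$

and the maximum likelihood estimate is

$$
\Lambda^{*} = \arg\max_{\Lambda} \; p(d \mid \Lambda)
\quad \text{subject to} \quad
\sum_{w \in V} p(w \mid \theta_d) = 1, \;\;
\sum_{w \in V} p(w \mid \theta_B) = 1, \;\;
p(\theta_d) + p(\theta_B) = 1.
$$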