This lecture is about feedback in the language modeling approach. We will continue the discussion of feedback in text retrieval, and in particular talk about how to do feedback in the language modeling approach.

We derived the query likelihood ranking function by making various assumptions, and as basic retrieval functions, those formulas work well. But if we think about feedback, it is a little bit awkward to use query likelihood to perform feedback, because feedback gives additional information about the query, while in the query likelihood method we assume the query is generated by sampling words from a document language model; there is no natural way to fit feedback documents into that generative picture. As a result, researchers proposed a way to generalize the query likelihood function, called the Kullback-Leibler (KL) divergence retrieval model. This model actually brings the query likelihood retrieval function much closer to the vector space model. Yet it can be regarded as a generalization of query likelihood, in the sense that it covers query likelihood as a special case, and with it, feedback can be achieved simply through query model estimation or updating. This is very similar to Rocchio, which updates the query vector.

So let's see what this KL-divergence retrieval model is. On the top of the slide is the query likelihood retrieval function. The KL-divergence retrieval model, also called the cross entropy retrieval model, basically generalizes the query term frequency in that function into a language model: the count of a query word is replaced by a probabilistic model that characterizes what the user is looking for. This change allows us to plug in many different ways of estimating the query model, including using feedback information. It is called KL-divergence because the score can be interpreted as the negative KL-divergence between two distributions: the query language model and the document language model, smoothed with a collection language model, of course. We are not going to cover the details of that; you can find them in the references. It is also called cross entropy because if we drop some terms of the KL-divergence that do not affect ranking, we end up with cross entropy. Both are terms from information theory.

For our purposes, you can just see that the two formulas look almost identical, except that here we have the probability of a word given by the query language model, and the sum is over all words that are in the document and have nonzero probability under the query model. That is, again, a generalization of summing over all the matching query words. You can also easily see that we can recover the query likelihood retrieval function by simply setting the query model to the relative frequency of a word in the query: once you plug that in, the query length drops out as a constant, and you get exactly the query likelihood function. So you can see the equivalence, and that is also why the KL-divergence model can be regarded as a generalization of query likelihood. But it also allows us to do much more than that.
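To make this concrete, here is a minimal sketch of the two scoring functions in the notation commonly used in the language modeling literature; c(w, Q) denotes the count of word w in query Q, θ_D the smoothed document language model, and θ_Q the query language model (the exact notation on the slides may differ slightly):

```latex
% Query likelihood scoring, in log form:
\log p(Q \mid D) = \sum_{w} c(w, Q)\, \log p(w \mid \theta_D)

% KL-divergence (cross-entropy) scoring: the raw count c(w, Q) is
% generalized into a query language model p(w \mid \theta_Q):
\operatorname{score}(Q, D) = \sum_{w} p(w \mid \theta_Q)\, \log p(w \mid \theta_D)

% Setting p(w \mid \theta_Q) = c(w, Q) / |Q| recovers query likelihood up to
% the constant factor 1/|Q|, which does not affect the ranking of documents.
```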
So how can we use the KL-divergence model to do feedback? The picture shows that we first estimate a document language model, then estimate a query language model, and then compute the KL-divergence between them, often denoted by D here. This is exactly like the vector space model: we compute a representation for the document, compute another for the query, and then compute a distance between them. It's just that these "vectors" have a special form; they are probability distributions. We then get the retrieval results, from which we can obtain some feedback documents. Let's assume they are mostly positive documents, although we could also consider both kinds of documents.

What we can do then is, as in Rocchio, compute another language model, called the feedback language model. This is going to be another "vector", just like computing the centroid vector in Rocchio. This feedback model can then be combined with the original query model using linear interpolation (sketched in the formulas below), which gives us an updated query model, again just as in Rocchio. The interpolation parameter alpha controls the amount of feedback: if it is set to zero, there is essentially no feedback; if it is set to one, we do full feedback and ignore the original query. That is generally not desirable, unless you are absolutely sure you have seen a lot of relevant documents and the original query terms no longer matter.

Of course, the main question is: how do we compute this feedback model θ_F? That is the big question, and once we can do that, the rest is easy. Here we will talk about one approach; there are many approaches, of course. This approach is based on a generative model, specifically a generative mixture model, and I'm going to show you how it works.

The picture shows the feedback model that we want to estimate, and the basis for estimating it is the feedback documents. Let's say we are observing positive documents: documents clicked on by users, documents judged relevant by users, or simply top-ranked documents that we assume to be relevant. Now imagine how we can compute a "centroid" of these documents using a language model. One approach is simply to assume these documents are all generated from a single language model. As we did before, we would just normalize the word frequencies in the documents, and we would get a word distribution. Now, the question is whether this distribution is good for feedback. Well, what do you think the top-ranked words would be? As we always see with such a language model, the top-ranked words are common words like "the", "a", etc. So this is not very good for feedback, because we would be adding a lot of such words to our query when we interpolate this model with the original query model.

So we need to do something, in particular, to get rid of those common words. We have actually seen one way to do that, using a background language model, when we discussed learning word associations (finding the words related to the word "computer"). We could do that here as well, but instead we are going to talk about another, more principled approach. The starting observation is that there are common words in these feedback documents that should not belong to the topic model.
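Here is a minimal sketch, in standard notation, of the two estimation steps just discussed: the naive maximum likelihood "centroid" of the feedback document set F, and the interpolation update with feedback coefficient α (the exact slide notation may differ):

```latex
% Naive estimate: pooled, normalized word counts over the feedback
% documents F. Its top words are dominated by common words like "the".
p(w \mid \theta_F) = \frac{\sum_{d \in F} c(w, d)}{\sum_{w'} \sum_{d \in F} c(w', d)}

% Interpolation update of the query model; \alpha \in [0, 1] controls the
% amount of feedback (\alpha = 0: no feedback; \alpha = 1: original query ignored).
p(w \mid \theta_{Q'}) = (1 - \alpha)\, p(w \mid \theta_Q) + \alpha\, p(w \mid \theta_F)
```

The mixture model described next replaces this naive estimate of θ_F with one that discounts the common words.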
So now what we can do is assume that those common words are generated from a background language model, which will readily generate words like "the". Note that under maximum likelihood estimation, if all the words must be generated from the topic model alone, then the topic model is forced to assign high probability to a word like "the", because it occurs so frequently. In order to reduce its probability in the topic model, we need another model to help explain occurrences of "the", and the background language model is appropriate for this purpose precisely because it assigns high probabilities to such common words.

So in this approach, we assume the machine generating these words works as follows. There is a switch, controlled by a coin flip, that decides which distribution to use. With probability lambda, the coin shows heads and we use the background language model, sampling a word from that model. With probability 1 minus lambda, we use the unknown topic model, the one we would like to estimate, and generate a word from it. With this assumption, the whole thing becomes just one model, and we call it a mixture model because two distributions are mixed together; we do not actually know which distribution was used for each word. Again, think of the whole thing as one model: we can still ask it for words, and it will still give us a word in a random manner. Which word shows up depends on both distributions, and also on lambda: if lambda is very high, the background distribution is used almost always, and you will get different words than if lambda is very small. All of these are parameters of the model.

If we think of it this way, we can do exactly the same as before: use the maximum likelihood estimator to set the parameters of this model so that we can best explain all the observed data. The difference is that we are no longer asking the topic model alone to explain the data; we are asking the whole mixture model to explain it. Because the topic model gets help from the background model, it does not have to assign high probabilities to words like "the". As a result, it will assign higher probabilities to words that are common in the feedback documents but do not already have high probability in the background model. Words common in the feedback documents must receive high probability somewhere, according to the maximum likelihood criterion; if they are also common in the background, the background model explains them, but if they are rare in the background, the topic model gets little help and must assign them high probability itself. So the high-probability words under the topic model will be those that are common in the feedback documents but rare in the background. This is a bit like a term-weighting heuristic, favoring words frequent in the feedback documents but rare in the collection, and it lets us achieve the effect of removing the common words that are meaningless for feedback.

Mathematically, what we do is compute the likelihood, or rather the log-likelihood, of the feedback documents. Note that we also have the other parameter, lambda, which denotes the proportion of background noise in the feedback documents.
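Here is the log-likelihood sketched in generic form; F is the set of feedback documents and p(w | C) is the collection (background) language model (the slide's form may differ cosmetically):

```latex
% Each word occurrence is drawn from the background model p(w \mid C) with
% probability \lambda, or from the unknown topic model \theta_F with
% probability 1 - \lambda; word occurrences are assumed independent.
\log p(F \mid \theta_F) = \sum_{d \in F} \sum_{w} c(w, d)\,
    \log\bigl[\, (1 - \lambda)\, p(w \mid \theta_F) + \lambda\, p(w \mid C) \,\bigr]
```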
So we are going to set lambda to some value; let's say we assume 50% of the words are noise, or 90% are noise, and we then treat lambda as fixed. If lambda is fixed, the only parameters are the word probabilities of the topic model, just like in the simple unigram language model: we have n parameters, where n is the number of words. The likelihood function then looks like the one above. It is very similar to the likelihood functions we have seen before, except that inside the logarithm there is a sum. This sum appears because we consider two distributions, and which one is used depends on lambda; that is why the function has this form. Mathematically, this is just a function with θ_F as the unknown variables; all the other values are known. So we can choose the probability distribution that maximizes this log-likelihood, the same idea as maximum likelihood estimation. As a mathematical problem, we just have to solve this optimization problem: essentially, we search over theta values until we find the one that gives the whole expression the maximum probability. It is a well-defined math problem, and in practice it is typically solved with the EM algorithm (see the sketch at the end of this discussion). Once we have done that, we obtain the θ_F that can be interpolated with the original query model to perform feedback.

Here are some examples of feedback models learned from a web document collection. We do pseudo feedback: we just use the top ten documents and fit this mixture model to them. The query is "airport security". We first retrieve ten documents from the web database (this is, of course, pseudo feedback), and then we fit the mixture model to this ten-document set. The tables show the words learned with this approach, that is, the probability of each word given the feedback model, for two settings of lambda. In both cases, the highest-probability words include words very relevant to the query. The query words "airport" and "security" still show up with high probabilities in each case, naturally, because they occur frequently in the top-ranked documents. But we also see "beverage", "alcohol", "bomb", "terrorist", etc. These are relevant to the topic, and, combined with the original query, they can help us match documents much more accurately. They can also help us bring up documents that mention only some of these other words, maybe, for example, just "airport" and "bomb". So this is how pseudo feedback works, and it shows that this model really does pick up words related to the query.

What is also interesting is to compare the two tables. When lambda is set to a small value, we see some common words in the feedback model. That is because we are not using the background model often; remember, lambda controls the probability of using the background model to generate the text. If we do not rely much on the background model, the topic model still has to account for the common words. Whereas if we set lambda to a very high value, the background model is used very often to explain those words, and there is no burden on the topic model to explain the common words in the feedback documents. As a result, the topic model becomes very discriminative: it contains the relevant words without the common words, and it can be added to the original query to achieve feedback.
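Before closing, here is a minimal Python sketch of the EM procedure mentioned above for this fixed-lambda mixture; the function and variable names are illustrative, not from the lecture, and the toy data is made up for demonstration:

```python
from collections import Counter

def estimate_feedback_model(feedback_docs, background, lam=0.9, iters=50):
    """EM for the two-component mixture: fixed background model + topic model.

    feedback_docs: list of documents, each a list of word tokens
    background:    dict mapping word -> p(w | C), the collection model
    lam:           fixed probability of using the background model
    Returns a dict mapping word -> p(w | theta_F).
    """
    # Pool word counts over all feedback documents.
    counts = Counter(w for doc in feedback_docs for w in doc)
    vocab = list(counts)

    # Initialize theta_F uniformly over the feedback vocabulary.
    theta = {w: 1.0 / len(vocab) for w in vocab}

    for _ in range(iters):
        # E-step: for each word type, the probability that an occurrence
        # of it came from the topic model rather than the background.
        z = {}
        for w in vocab:
            p_topic = (1.0 - lam) * theta[w]
            p_bg = lam * background.get(w, 1e-12)  # tiny floor for unseen words
            z[w] = p_topic / (p_topic + p_bg)

        # M-step: re-estimate theta_F from the fractional counts
        # attributed to the topic model, then normalize.
        weighted = {w: counts[w] * z[w] for w in vocab}
        total = sum(weighted.values())
        theta = {w: weighted[w] / total for w in vocab}

    return theta

# Toy usage (hypothetical counts and background probabilities):
docs = [["airport", "security", "the", "the", "bomb"],
        ["the", "airport", "alcohol", "screening", "the"]]
bg = {"the": 0.2, "airport": 0.01, "security": 0.005,
      "bomb": 0.001, "alcohol": 0.002, "screening": 0.001}
theta_F = estimate_feedback_model(docs, bg, lam=0.9)
print(sorted(theta_F.items(), key=lambda kv: -kv[1])[:5])
```

With a high lambda, occurrences of "the" are mostly attributed to the background model, so θ_F concentrates its probability mass on the content words, matching the behavior described above.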
To summarize, in this lecture we have talked about feedback in the language modeling approach. In general, feedback means learning from examples. The examples can be real examples, such as documents judged relevant by users; pseudo examples, like the top ten documents that are simply assumed to be relevant; or implicit examples based on user interactions, such as clickthroughs. We talked about the three major feedback scenarios: relevance feedback, pseudo feedback, and implicit feedback. We talked about how to use Rocchio to do feedback in the vector space model, and how to use query model estimation for feedback in the language modeling approach, and we briefly talked about the mixture model and its basic idea. There are many other methods; for example, the relevance model is a very effective method for estimating the query model. You can read more about these methods in the references listed at the end of this lecture.

There are two additional readings here. The first is a book that gives a systematic review and discussion of language models for information retrieval. The second is an important research paper about relevance-based language models, which is a very effective way of computing the query model.