This lecture is about the query likelihood probabilistic retrieval model. In this lecture, we continue the discussion of probabilistic retrieval models. In particular, we're going to talk about the query likelihood retrieval function.

In the query likelihood retrieval model, the idea is to model how likely it is that a user who likes a document would pose a particular query. So in this case, you can imagine that if a user likes this particular document about presidential campaign news, we assume the user would use this document as a basis to pose a query to try to retrieve it. So imagine a generation process that works as follows, where we assume the query is generated by sampling words from the document. For example, the user might pick a word like presidential from this document and use it as a query word, and then pick another word like campaign as the second query word.

Now, this is of course an assumption that we have made about how a user would pose a query. Whether a user actually follows this process is a different question, but this assumption allows us to formally characterize this conditional probability. It also allows us to avoid relying on the big table that I showed you earlier, where we would use empirical data to estimate this probability. And this is why we can use this idea to further derive a retrieval function that we can implement in a programming language. So as you see, the assumptions we have made here are that each query word is sampled independently, and that each word is obtained from the document.

Now let's see how this works exactly. Since we are computing the query likelihood, the probability here is just the probability of this particular query, which is a sequence of words. We make the assumption that each word is generated independently, so the probability of the query is just the product of the probabilities of the individual query words. How do we compute the probability of each query word? Based on the assumption that a word is picked from the document the user has in mind, the probability of each word is just the relative frequency of that word in the document. For example, the probability of presidential given the document would be the count of presidential in the document divided by the total number of words in the document, that is, the document length.

So with these assumptions, we now have a simple formula for retrieval, and we can use it to rank documents. Does this model work? Let's take a look. Here are some example documents that you have seen before. Suppose the query is "presidential campaign", and we see the formula here at the top. How do we score these documents? Well, it's very simple: we just count how many times we have seen presidential and how many times we have seen campaign. In d4, we've seen presidential twice, so that's 2 over the length of d4, multiplied by 1 over the length of d4 for the probability of campaign. Similarly, we can get the probabilities for the other two documents. Now, if you look at these formulas for scoring the documents, they seem to make sense: if we assume d3 and d4 have about the same length, then the model would rank d4 above d3, which is above d2. As we would expect, it looks like the model captures the TF (term frequency) heuristic, so this seems to work well.
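As a concrete illustration, here is a minimal sketch of this scoring idea in Python. The toy document texts are hypothetical stand-ins for the slide's d2, d3, and d4; only the relative counts (presidential twice and campaign once in d4, one of each in d3, neither in d2) are taken from the lecture:

```python
from collections import Counter

def query_likelihood(query, doc):
    """p(q|d) under the simplest assumption: every query word is drawn
    independently from the document itself, so p(w|d) is the word's
    relative frequency c(w,d) / |d|."""
    words = doc.split()
    counts = Counter(words)
    score = 1.0
    for w in query.split():
        score *= counts[w] / len(words)  # 0 if w never occurs in doc
    return score

# Hypothetical stand-ins for the slide's documents d2, d3, d4.
d2 = "stock market news about trading"
d3 = "news about the presidential campaign"
d4 = "presidential campaign news presidential candidate"

for name, d in [("d2", d2), ("d3", d3), ("d4", d4)]:
    print(name, query_likelihood("presidential campaign", d))
# Ranks d4 (2/5 * 1/5) above d3 (1/5 * 1/5) above d2 (0), as in the lecture.
```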
However, if we try a different query like "presidential campaign update", we might see a problem. What problem? Well, think about the word update. None of these documents mentions update. So according to our assumption that the user picks words from a document to generate the query, what would the probability of obtaining the word update be? It would be zero. That causes a problem, because it makes all these documents have zero probability of generating this query. Now, while it's fine to have zero probability for d2, which is non-relevant, it's not okay to have zero for d3 and d4, because we can no longer distinguish them. What's worse, we can't even distinguish them from d2. That's obviously not desirable.

Now, when we see such a result, we should think about what has caused the problem. We have to examine what assumptions were made as we derived this ranking function. If you examine those assumptions carefully, you will realize what has caused this problem. So take a moment to think about it: what do you think is the reason why update has zero probability, and how do we fix it?

If you think about it for a moment, you will realize that it's because we made the assumption that every query word must be drawn from the document in the user's mind. So in order to fix this, we have to assume that the user could have drawn a word not necessarily from the document. That's the improved model. The improvement is to say that, instead of drawing a word from the document, let's imagine the user draws a word from a document model. So I show a model here, and we assume that this document was generated using this unigram language model. Now, this model doesn't necessarily assign zero probability to update; in fact, we can assume this model does not assign zero probability to any word.

If we think this way, then the generation process is a little different. The user has this model in mind instead of this particular document, although the model has to be estimated based on the document. The user can again generate the query using a similar process: pick a word, for example presidential, and another word, campaign. The difference is that this time we can also pick a word like update, even though update doesn't occur in the document. So the query word update would no longer have zero probability. This fixes our problem. It's also reasonable, because we are now thinking of what the user is looking for in a more general way, using a language model instead of a fixed document.

So how do we compute this query likelihood? With this assumption, it involves two steps. The first is to compute this model, which we call the document language model; for example, I've shown two such models here, estimated based on two documents. Then, given a query like "data mining algorithms", we just compute the likelihood of this query. By making the independence assumption, we can write this probability as the product of the probabilities of the individual query words. We do this for both documents, and then we can score these two documents and rank them. So that's the basic idea of the query likelihood retrieval function.
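Here is a minimal sketch of such a document language model in Python. The lecture at this point only requires a unigram model, estimated from the document, that assigns nonzero probability to every word; the specific smoothing choice below (interpolating with a background collection model, often called Jelinek-Mercer smoothing) and the lambda value are illustrative assumptions, not the lecture's prescribed method:

```python
from collections import Counter

def smoothed_doc_model(doc_tokens, collection_tokens, lam=0.1):
    """Build p(w|d) with no zero probabilities by mixing the document's
    relative frequencies with a background collection model.  The
    smoothing method and lambda=0.1 are illustrative assumptions."""
    doc_counts, doc_len = Counter(doc_tokens), len(doc_tokens)
    col_counts, col_len = Counter(collection_tokens), len(collection_tokens)

    def p(w):
        p_doc = doc_counts[w] / doc_len          # relative frequency in d
        p_col = col_counts[w] / col_len          # background probability
        return (1 - lam) * p_doc + lam * p_col   # nonzero if w occurs anywhere in the collection

    return p

# 'update' never occurs in d4, yet it now gets a small nonzero
# probability from the background model, so the query
# "presidential campaign update" no longer zeroes out d4's score.
d4 = "presidential campaign news presidential candidate".split()
collection = ("presidential campaign news update stock market "
              "data mining algorithms").split()
p = smoothed_doc_model(d4, collection)
score = 1.0
for w in "presidential campaign update".split():
    score *= p(w)
print(score)  # small but positive
```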
More generally, the ranking function looks like the following. Here we assume that the query has n words, w1 through wn. The ranking function is the probability that we observe this query given that the user is thinking of this document, and under the independence assumption this is the product of the probabilities of the individual words.

Now, we often score a document for a query by using the log of the query likelihood, as shown on the second line. We do this to avoid multiplying many small probabilities together, which could cause underflow and lose precision. By transforming the values with the logarithm, a monotonic function, we maintain the order of the documents yet avoid the underflow problem. When we take the logarithm, the product becomes a sum, as you see on the second line: a sum over all the query words, where inside the sum we have the log of the probability of each word given the document.

We can then further rewrite this sum in a different form. In the first sum, we sum over the query words; in the rewritten sum, we sum over all the possible words in the vocabulary, but we put in a count of each word in the query. Essentially, we are still only considering the words in the query, because if a word is not in the query, its count will be zero. So we are still considering only these n words, but we are using a different form, as if we were summing over all the words in the vocabulary. Of course, a word might occur multiple times in the query, and that's why we have a count here. And then the other part is the log of the probability of the word given the document language model.

So you can see that in this retrieval function, we already know the count of each word in the query. The only thing we don't know is the document language model. Therefore, we have converted the retrieval problem into the problem of estimating this document language model, so that we can compute the probability of each query word given the document. Different estimation methods will lead to different ranking functions. This is just like the vector space model, where different ways of placing a document in the vector space lead to different ranking functions; here, different ways of estimating the document language model will lead to different query likelihood ranking functions.
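To make the narrated formula concrete, here is a reconstruction of the scoring function in standard notation, where q = w1 w2 ... wn, c(w, q) is the count of word w in the query, and V is the vocabulary:

```latex
\log p(q \mid d)
  = \sum_{i=1}^{n} \log p(w_i \mid d)
  = \sum_{w \in V} c(w, q)\,\log p(w \mid d)
```

The two sums are equal because any word with c(w, q) = 0 contributes nothing to the second sum; the rewrite simply groups repeated query words together, leaving p(w | d), the document language model, as the only quantity still to be estimated.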