This lecture continues our discussion of evaluation. We're going to look at how to evaluate a text retrieval system when we have multiple levels of judgements. So far we have talked about binary judgements, meaning a document is judged as either relevant or non-relevant. But earlier we also talked about relevance as a matter of degree. We can often distinguish highly relevant documents, which are very useful, from marginally relevant documents, which are perhaps somewhat useful, and further distinguish both from non-relevant documents, which are not useful. So imagine you have ratings for these pages; then you would have multiple levels of ratings. For example, here I show three levels: 3 for very relevant, 2 for marginally relevant, and 1 for non-relevant.

Now, how do we evaluate a search engine using these judgements? Obviously MAP doesn't work, average precision doesn't work, and precision and recall don't work, because they all rely on binary judgements. So let's look at some top-ranked results scored with these judgements. Imagine the user mostly cares about the top ten results, and we mark the rating levels, or relevance levels, of these documents as shown here: 3, 2, 1, 1, 3, etcetera. We call these values gains, and the reason we call them gains is that the measure we are introducing is called NDCG, Normalized Discounted Cumulative Gain. The gain basically measures how much relevant information a user obtains by looking at each document. Looking at the first document, the user gains 3 points. Looking at a non-relevant document, the user gains only 1 point. Looking at a marginally relevant document, the user gains 2 points, and so on. So the gain at each position measures the utility of that document from the user's perspective.

If we assume the user stops after the tenth document, so we look at a cutoff of 10, we can compute the user's total gain. And what's that? Well, it's simply the sum of these gains, and we call it the Cumulative Gain. If the user stops after position 1, that's just 3. If the user looks at one more document, that's 3 + 2. The more documents the user examines, the higher the cumulative gain, though of course this comes at the cost of spending more time examining the list. So cumulative gain gives us some idea of how much total gain the user would obtain by examining all these documents.

Now, NDCG also has another letter, D, for discounted cumulative gain. Why do we want to do discounting? Well, if you look at the cumulative gain, there is one deficiency: it does not consider the rank positions of these documents. Looking only at this sum, we know there is one highly relevant document, one marginally relevant document, and two non-relevant documents, but we don't know where they are ranked. Ideally, we want the relevant ones to be ranked at the top, which is the case here. But how can we capture that intuition? Well, we have to say that a 3 ranked lower in the list is not as good as a 3 at the top. That means the gain contributed from each position has to be weighted by that position, and this is basically the idea of discounting.
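To make cumulative gain concrete, here is a minimal Python sketch. The first five ratings (3, 2, 1, 1, 3) come from the lecture's example; the remaining five are made up for illustration, and the function name is mine:

```python
from typing import List

def cumulative_gain(gains: List[int], k: int) -> int:
    """Cumulative Gain at cutoff k: the sum of the relevance
    gains of the top k documents, ignoring where they are ranked."""
    return sum(gains[:k])

# Ratings of the top 10 results: 3 = very relevant,
# 2 = marginally relevant, 1 = non-relevant.
# Only the first five values are from the lecture; the rest are
# hypothetical.
gains = [3, 2, 1, 1, 3, 1, 1, 2, 1, 1]

print(cumulative_gain(gains, 1))   # 3   (user stops after position 1)
print(cumulative_gain(gains, 2))   # 5   (3 + 2)
print(cumulative_gain(gains, 10))  # 16  (total gain at cutoff 10)
```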
So we're going to say the first document does not need to be discounted, because we can assume the user will always see it. The second one will be discounted a little bit, because there is some possibility that the user won't notice it. So we divide its gain by a weight based on the position: the log of 2, where 2 is the rank position of the document. When we go to the third position, we discount even more, because the normalizer is the log of 3, and so on and so forth. When we take the sum this way, a lower-ranked document does not contribute as much as a highly ranked one. That means if you switch two positions, say you move a very relevant document from the top to a lower position, its gain would be discounted more heavily, so the list would not score as well as when the 3 stays at the top. This is the idea of discounting.

Okay, so at this point we have the discounted cumulative gain for measuring the utility of a ranked list with multiple levels of judgements. Are we happy with this? Well, we can use it to rank systems, but we still need to do a little more to make the measure comparable across different topics. That's the last step. By the way, here we are just showing DCG at 10, the discounted sum over all ten documents. The last step is the N, normalization; once we do that, we get the Normalized DCG. How do we do that? Well, the idea is to normalize DCG by the ideal DCG at the same cutoff. What is the ideal DCG? It's the DCG of an ideal ranking. Imagine the whole collection contains nine documents rated 3. Then an ideal ranked list would put all nine of those documents at the very top, so the first nine positions would all be 3, followed by a 2, because that's the best we can do after we have run out of 3s. That would be our ideal ranked list. We then compute the DCG of this ideal list using the same formula, and this ideal DCG is used as the normalizer.

So you can imagine, normalization is essentially comparing the actual DCG with the best DCG you could possibly get for this topic. Now, why do we want to do this? Well, by doing it we map the DCG values into a range of 0 to 1. The best, or highest, value for every query would be 1; that's when your ranked list is in fact the ideal list, but otherwise you will generally be lower than 1. Now, what if we don't do this? Well, you can see that this normalization doesn't affect the relative comparison of systems on a single topic, because the ideal DCG is the same for all systems, so the ranking of systems based on DCG alone would be exactly the same as if you ranked them based on normalized DCG. The difference, however, shows up when we have multiple topics. If we don't normalize, different topics will have different scales of DCG. For a topic like this one, with nine highly relevant documents, the DCG can get really high; but imagine another case where there are only two very relevant documents in total in the whole collection.
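Here is a hedged Python sketch of DCG and NDCG following the lecture's formulation: position 1 is undiscounted, and position i is divided by the log of i. The lecture does not pin down the log base, so the base-2 choice below is an assumption (it is the common convention); the gain vectors repeat the earlier example:

```python
import math
from typing import List

def dcg(gains: List[int], k: int) -> float:
    """Discounted Cumulative Gain at cutoff k: position 1 keeps its
    full gain, position i >= 2 contributes gain / log(i).
    Base-2 logs are assumed here, as is conventional."""
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains[:k], start=1))

def ndcg(gains: List[int], ideal_gains: List[int], k: int) -> float:
    """Normalized DCG: the system's DCG@k divided by the DCG@k of
    the ideal ranking for the same topic."""
    return dcg(gains, k) / dcg(ideal_gains, k)

# System ranking from before (first five ratings from the lecture,
# the rest hypothetical).
gains = [3, 2, 1, 1, 3, 1, 1, 2, 1, 1]

# Ideal ranking for the lecture's scenario: the collection holds
# nine documents rated 3, so the best top 10 is nine 3s then a 2.
ideal = [3] * 9 + [2]

print(round(dcg(gains, 10), 3))          # raw DCG@10
print(round(ndcg(gains, ideal, 10), 3))  # in (0, 1]; 1 only for the ideal list
```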
Then the highest DCG any system could achieve for such a topic would not be very high. So again, we face the problem of different scales of DCG values, and when we take an average, we don't want it to be dominated by the high values from what are, again, the easy queries. By doing the normalization, we avoid this problem and make all the queries contribute equally to the average.

So this is the idea of NDCG. It is used for measuring a ranked list based on multiple levels of relevance judgements. More generally, it is a measure that can be applied to any ranking task with multiple levels of judgements, and the scale of the judgements can be more than binary; there can be many levels, for example a scale of 1 to 5, or even more, depending on your application. The main idea of the measure, just to summarize, is to measure the total utility of the top k documents: you always choose a cutoff and then measure the total utility. It discounts the contribution from a lower-ranked document, and finally it does normalization to ensure comparability across queries.
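To see why the normalization matters when averaging across topics, here is a small self-contained sketch with entirely hypothetical per-topic data: an "easy" topic with many documents rated 3 and a "hard" one with only two. The raw DCG scales differ widely, but the NDCG values are comparable:

```python
import math

def dcg(gains, k):
    """DCG@k with the lecture's discount: gain / log2(rank) for rank >= 2."""
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains[:k], start=1))

def ndcg(gains, ideal, k):
    return dcg(gains, k) / dcg(ideal, k)

# Hypothetical topics: "easy" has at least five documents rated 3
# in the collection, "hard" has only two, so their ideal rankings
# (and raw DCG scales) differ.
topics = {
    "easy": ([3, 3, 3, 2, 1], [3] * 5),          # ideal top 5 is all 3s
    "hard": ([1, 3, 1, 1, 3], [3, 3, 1, 1, 1]),  # only two 3s exist
}

k = 5
for name, (ranked, ideal) in topics.items():
    print(name, "DCG:", round(dcg(ranked, k), 2),
          "NDCG:", round(ndcg(ranked, ideal, k), 2))

# Averaging NDCG (not raw DCG) keeps the easy topic from dominating.
mean_ndcg = sum(ndcg(r, i, k) for r, i in topics.values()) / len(topics)
print("mean NDCG@5:", round(mean_ndcg, 2))
```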