[SOUND] This lecture is about document length normalization in the vector space model. In this lecture, we will continue the discussion of the vector space model. In particular, we're going to discuss the issue of document length normalization.

So far in the lectures about the vector space model, we have used various signals from the document to assess the matching of the document with a query. In particular, we have considered the term frequency, the count of a term in a document. We have also considered its global statistics, such as IDF, the inverse document frequency. But we have not considered document length.

So here I show two example documents. d4 is much shorter, with only 100 words; d6, on the other hand, has 5,000 words. If you look at the matching of the query words, we see that in d6 there are more matches of the query words. But one might reason that d6 may have matched these query words in a scattered manner, so maybe the topic of d6 is not really the topic of the query. The discussion of the campaign at the beginning of the document may have nothing to do with the mention of presidential at the end. In general, if you think about long documents, they have a higher chance of matching any query. In fact, if you generate a long document randomly by assembling words from a distribution of words, then eventually you will probably match any query. So in this sense, we should penalize long documents, because they naturally have a better chance of matching any query, and this is the idea of document length normalization.

We also need to be careful to avoid over-penalizing long documents. On the one hand, we want to penalize long documents; on the other hand, we don't want to over-penalize them. The reason is that a document may be long for different reasons. In one case, the document may be long because it uses more words. For example, think about the full-text article of a research paper: it would use more words than the corresponding abstract. This is a case where we probably should penalize the matching of a long document such as a full paper. When we compare the matching of words in such a long document with the matching of words in the short abstract, the long paper in general has a higher chance of matching query words, and therefore we should penalize it. However, there is another case when the document is long, and that is when the document simply has more content. Consider a long document formed by simply concatenating the abstracts of many different papers. In such a case, obviously, we don't want to over-penalize such a long document. Indeed, we probably don't want to penalize such a document just because it's long. That's why we need to be careful about using the right degree of penalization.

A method that has been shown to work well is called pivoted length normalization. The idea is to use the average document length as a pivot, as a reference point. That means we assume that for a document of average length, the score is about right, so the normalizer would be 1. But if the document is longer than the average document length, then there will be some penalization, whereas if it's shorter, there is even some reward. This is illustrated on this slide: on the x-axis you see the length of the document, and on the y-axis we show the normalizer.
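To make this concrete, here is a minimal Python sketch of the pivot idea just described, assuming the standard form of the normalizer, 1 - b + b * |d| / avdl; the function and variable names are illustrative, not from the slide.

```python
def pivoted_length_normalizer(doc_length, avg_doc_length, b):
    """Pivoted length normalizer: equals 1 at the average document length,
    grows above 1 for longer documents (penalty) and shrinks below 1 for
    shorter documents (reward). b in [0, 1] controls the strength."""
    return 1 - b + b * (doc_length / avg_doc_length)

# With b = 0 there is no length normalization: the normalizer is always 1.
print(pivoted_length_normalizer(5000, 1000, b=0.0))   # 1.0
# With b > 0, a 5,000-word document (average length 1,000) is penalized...
print(pivoted_length_normalizer(5000, 1000, b=0.75))  # 4.0
# ...and a 100-word document is rewarded.
print(pivoted_length_normalizer(100, 1000, b=0.75))   # 0.325
```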
In this case, the pivoted length normalization formula for the normalizer is seen to be an interpolation of 1 and the normalized document length, controlled by a parameter b. You can see that when we divide the length of the document by the average document length, this not only gives us a sense of how this document compares with the average document, but also gives us the benefit of not worrying about the unit of length: we can measure the length by words or by characters.

This normalizer has some interesting properties. First, if we set the parameter b to 0, then the value is always 1, so there is no length normalization at all. In this sense, b controls the length normalization. If we set b to a nonzero value, then the normalizer looks like this: the value is higher for documents that are longer than the average document length, whereas the value is smaller for shorter documents. So there is a penalization for long documents and a reward for short documents. The degree of penalization is controlled by b: if we set b to a larger value, then the normalizer looks like this, with even more penalization for long documents and more reward for short documents. By adjusting b, which varies from 0 to 1, we can control the degree of length normalization.

So if we plug this length normalization factor into the vector space model ranking functions that we have already examined, we end up with the following formulas. These are in fact state-of-the-art vector space model formulas. Let's take a look at each of them.

The first one is called the pivoted length normalization vector space model, and there is a reference [INAUDIBLE] for the derivation of this model. Here we see that it's basically the TF-IDF weighting that we have discussed; the IDF component should be very familiar to you. There is also a query term frequency component here. In the middle there is the normalized TF, and in this case we use the double logarithm, as we discussed before, to achieve a sublinear transformation. But we also put the document length normalizer in the denominator. This causes penalization for long documents, because the larger the denominator is, the smaller the term weight is. And this is, of course, controlled by the parameter b here; again, if b is set to 0, then there is no length normalization. So this is one of the two most effective vector space model formulas.

The next one, called BM25 or Okapi, is similar in that it also has an IDF component here and a query term frequency component here. But in the middle, the normalization is a little bit different. As we explained, there is the Okapi TF transformation here, which is a sublinear transformation with an upper bound. In this case, we have put the length normalization factor here; it is applied to the parameter k, but it achieves a similar effect, because we put the normalizer in the denominator. Therefore, again, if a document is longer, then the term weight will be smaller.

So you can see that after we have gone through all the enhancements that we talked about, we have in the end reached basically the state-of-the-art functions. So far, we have talked mainly about how to place the document vector in the vector space, and this has played an important role in determining the effectiveness of the ranking function.
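As a concrete summary of the two formulas just described, here is a minimal Python sketch, assuming their usual forms: the double-logarithm TF divided by the pivoted length normalizer for the pivoted length normalization model, and the Okapi TF transformation with the normalizer applied to k for BM25, each weighted by query term frequency and IDF. The names and default parameter values are illustrative, not taken from the slides.

```python
from math import log

def idf(N, df):
    # IDF weighting: rarer terms (smaller document frequency df in a
    # collection of N documents) get higher weight.
    return log((N + 1) / df)

def pivoted_score(query_counts, doc_counts, doc_len, avg_doc_len, N, df, b=0.2):
    """Pivoted length normalization VSM: double-log (sublinear) TF divided by
    the pivoted length normalizer, weighted by query TF and IDF."""
    score = 0.0
    for w, qtf in query_counts.items():
        tf = doc_counts.get(w, 0)
        if tf == 0:
            continue
        norm = 1 - b + b * doc_len / avg_doc_len   # pivoted length normalizer
        score += qtf * (1 + log(1 + log(tf))) / norm * idf(N, df[w])
    return score

def bm25_score(query_counts, doc_counts, doc_len, avg_doc_len, N, df, k=1.2, b=0.75):
    """BM25/Okapi: sublinear TF bounded above by k + 1; the length normalizer
    multiplies k in the denominator, achieving a similar penalization effect."""
    score = 0.0
    for w, qtf in query_counts.items():
        tf = doc_counts.get(w, 0)
        if tf == 0:
            continue
        norm = 1 - b + b * doc_len / avg_doc_len
        score += qtf * (k + 1) * tf / (tf + k * norm) * idf(N, df[w])
    return score
```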
But there are also other dimensions that we did not really examine in detail. For example, can we further improve the instantiation of the dimensions of the vector space model? We have just assumed the bag-of-words representation, where each dimension is a word, but obviously there are many other choices. For example, we can use stemmed words, words that have been transformed into the same root form, so that "computation" and "computing" both become the same and can be matched. We can do stop word removal, which removes very common words that don't carry much content, like "the" and "of". We can use phrases to define dimensions. We can even use latent semantic analysis, which finds clusters of words that represent a latent concept, and use such a cluster as one dimension. We can also use smaller units, like character n-grams, sequences of n characters, as dimensions. However, in practice, people have found that the bag-of-words representation with phrases is still the most effective, and it's also efficient. So this is still, so far, the most popular way to instantiate the dimensions, and it's used in all major search engines.

I should also mention that sometimes we need to do language-specific and domain-specific tokenization. This is actually very important, as we might have variations of terms that prevent us from matching them with each other, even when they mean the same thing. In some languages, like Chinese, there is also the challenge of segmenting text to obtain word boundaries, because the text is just a sequence of characters: a word might correspond to one character, two characters, or even three characters. It's easier in English, where we have spaces to separate the words. In some other languages, we may need to do additional processing to figure out what the word boundaries are.

There is also the possibility of improving the similarity function. So far we have used the dot product, but one can imagine other measures. For example, we can measure the cosine of the angle between two vectors, or we can use the Euclidean distance. These are all possible, but the dot product still seems the best, and one reason is that it's very general. In fact, it's sufficiently general if you consider the possibilities of doing term weighting in different ways. For example, the cosine measure can be thought of as the dot product of two normalized vectors: we first normalize each vector and then take the dot product, and that is equivalent to the cosine measure.

I just mentioned that BM25 seems to be one of the most effective formulas, but there have also been further developments to improve BM25, although none of this work has changed BM25 fundamentally. In one line of work, people have derived BM25F. Here, F stands for "field", and this is BM25 for documents with structure. For example, you might consider the title field, the abstract, or the body of a research article, or even the anchor text of a web page, the text that describes links to other pages. These can all be combined, with appropriate weights on the different fields, to help improve the scoring of documents. When we use BM25 for such a document, the obvious choice is to apply BM25 to each field and then combine the scores. But the idea of BM25F is instead to first combine the frequency counts of terms across all the fields, and then apply BM25, as sketched below.
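Here is a minimal sketch of this combine-then-score idea; the field weights, example counts, and helper names are hypothetical, and the resulting pseudo-document counts would then be fed to a BM25 scorer like the one sketched earlier.

```python
def combined_field_counts(field_counts, field_weights):
    """BM25F idea: instead of scoring each field separately and adding the
    scores, first combine the per-field term counts using field weights
    (e.g. title counted more heavily than body), then apply BM25 once."""
    combined = {}
    for field, counts in field_counts.items():
        w = field_weights.get(field, 1.0)
        for term, count in counts.items():
            combined[term] = combined.get(term, 0.0) + w * count
    return combined

# Hypothetical example: a document with a title field and a body field.
doc_fields = {
    "title": {"presidential": 1, "campaign": 1},
    "body":  {"presidential": 3, "campaign": 2, "news": 1},
}
weights = {"title": 2.0, "body": 1.0}   # assumed field weights, not from the lecture
pseudo_doc = combined_field_counts(doc_fields, weights)
# pseudo_doc can now be scored with a single BM25 call, so the sublinear TF
# transformation is applied only once per term across all fields.
```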
Now, this has the advantage of avoiding over-counting the first occurrence of a term. Remember that in the sublinear transformation of TF, the first occurrence is very important and contributes a large weight. If we did the transformation separately for every field, the same term would gain that first-occurrence advantage in every field. But when we combine the term frequencies first, we apply the transformation only once, so the extra occurrences are not counted as fresh first occurrences. This method has been shown to work very well for scoring structured documents.

The other line of extension is called BM25+. This line of work addresses the problem of over-penalization of long documents by BM25. The fix is actually quite simple: we can simply add a small constant to the TF normalization formula (a small sketch of this fix appears after the readings below). What's interesting is that we can analytically prove that, by making such a small modification, we fix the problem of over-penalization of long documents in the original BM25. So the new formula, called BM25+, has been shown both empirically and analytically to be better than BM25.

To summarize what we have said about the vector space model, here are the major take-away points. First, in such a model, we use similarity to model relevance, assuming that the relevance of a document with respect to a query is basically proportional to the similarity between the query and the document. Naturally, that implies that the query and the document must be represented in the same way, and in this case we represent them as vectors in a high-dimensional vector space, where the dimensions are defined by words, or concepts, or terms in general. We generally need to use a lot of heuristics to design the ranking function. We used some examples to show the need for several heuristics, including TF weighting and transformation, IDF weighting, and document length normalization. These are the most important heuristics for ensuring that such a general ranking function works well for all kinds of text. And finally, BM25 and pivoted length normalization seem to be the most effective formulas out of the vector space model.

Now, I have to say that I put BM25 in the category of vector space models, but in fact BM25 was derived using a probabilistic model. The reason I put it in the vector space model category is, first, that the ranking function has a nice interpretation in the vector space model: we can easily see that it looks very much like a vector space model with a special weighting function. The second reason is that the original BM25 has a somewhat different form of IDF, and that form of IDF, after the [INAUDIBLE], doesn't work as well as the standard IDF that you have seen here. So as an effective retrieval function, BM25 should probably use a heuristic modification of the IDF, which makes it look even more like a vector space model.

There are some additional readings. The first is a paper about pivoted length normalization; it's an excellent example of using empirical data analysis to suggest the need for length normalization and then further derive the length normalization formula. The second is the original paper where BM25 was proposed. The third paper has a thorough discussion of BM25 and its extensions, particularly BM25F. And finally, the last paper has a discussion of improving BM25 to correct the over-penalization of long documents. [MUSIC]
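As referenced above, here is a minimal sketch of the BM25+ modification, assuming the fix takes the form of adding a small constant delta to the BM25 TF component for terms that actually occur in the document; the parameter values are illustrative, not prescribed by the lecture.

```python
from math import log

def bm25_plus_term_weight(tf, doc_len, avg_doc_len, N, df,
                          k=1.2, b=0.75, delta=1.0):
    """BM25+ term weight: the original BM25 TF normalization plus a small
    constant delta, which puts a lower bound on a matched term's contribution
    and keeps very long documents from being over-penalized."""
    if tf == 0:
        return 0.0                       # delta applies only to matched terms
    norm = 1 - b + b * doc_len / avg_doc_len
    bm25_tf = (k + 1) * tf / (tf + k * norm)
    return (bm25_tf + delta) * log((N + 1) / df)
```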