Let us now look at a very interesting example of the Hadoop MapReduce framework: computing what are known as TF-IDF weights for information retrieval in documents. The idea is this. Suppose you have a document D0. You can abstract it as a set of terms, which are usually words but could be other kinds of terms depending on the nature of the document, and you want to find the most similar documents across a whole corpus of big data, a large number of documents. Let's say these documents are D1, D2, and so on up to DN, and you have a number of terms that could appear in these documents: term 1, term 2, and so on. One thing we can compute is the term frequency TF11, TF12, and so on, where TF12 is the number of times term 1 occurs in document 2. Once this table is filled out, the idea is to use term frequency as a way of figuring out which documents are most similar to document D0. You may be tempted to say that the terms with the largest frequencies in a document are the ones that should get more weight, but people working in the text mining and information retrieval area have noticed a problem with that: some commonly occurring words like "the" have large frequencies across all documents and are not very helpful for indicating which documents are similar. So that is the TF part of the TF-IDF weight; the other part relates to inverse document frequency, and for that we need to define document frequency. You have DF1, DF2, and so on, where DF1 is the number of documents that term 1 occurs in, counting each document only once. So it is just counting the number of documents, not paying attention to the term frequencies. What we really want is the IDF, the inverse document frequency, which is proportional to the reciprocal of the document frequency DF. The idea is to use this as a weight to give more prominence to words that occur infrequently.
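To make these definitions concrete, here is a minimal Python sketch over a toy corpus. The corpus, document names, and variable names are invented for illustration, not taken from the lecture:

```python
import math

# A toy corpus: three documents, each a list of terms (hypothetical data).
docs = {
    "D1": ["the", "cat", "sat", "the"],
    "D2": ["the", "dog", "ran"],
    "D3": ["the", "aardvark", "sat"],
}
N = len(docs)

# Term frequency tf[term][doc]: how many times the term occurs in the doc.
tf = {}
for d, terms in docs.items():
    for t in terms:
        tf.setdefault(t, {}).setdefault(d, 0)
        tf[t][d] += 1

# Document frequency df[term]: how many documents the term appears in,
# counting each document once regardless of term frequency.
df = {t: len(per_doc) for t, per_doc in tf.items()}

# "the" appears in every document, so N/DF = 1 and log(N/DF) = 0;
# "aardvark" appears in only one document, so it gets a larger IDF.
idf = {t: math.log(N / df[t]) for t in df}

print(tf["the"])                   # {'D1': 2, 'D2': 1, 'D3': 1}
print(df["the"], df["aardvark"])   # 3 1
print(idf["the"])                  # 0.0
```

Note how the common word "the" ends up with an IDF of zero, exactly the dampening effect the lecture describes.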
The word "the", for example, might occur in all documents, and then N over DF for "the" would just be 1. A word like "aardvark", however, may appear in very few documents, so it could be more useful in identifying similarity. What researchers in this area have concluded is that the best weight to use is something like this: the weight of term i in document j should be TF(i, j), the term frequency, multiplied by log(N / DF(i)), where DF(i) is the document frequency of term i. The log factor is used, based on experience, to dampen the effect of the inverse document frequency. That is where the abbreviations TF and IDF come from: term frequency and inverse document frequency. Now, why is this interesting from the viewpoint of Hadoop MapReduce? It turns out that if you want to compute these weights across a large corpus of documents, you can do it very easily with MapReduce. To compute the term frequency TF(i, j), your input is really pairs of a document and its set of terms, because that is what each document is, a collection of terms. You can do a map over this input that counts each occurrence of a term in a document: the (document, term) pair becomes your new key, and you emit a one for each occurrence. Then a reduce sums the ones for each (document, term) key, giving you the term frequency. So there you see how map and reduce give you the term frequencies. To get to the inverse document frequencies, what we really need is the document frequency, which is just the number of documents a term appears in. You can take a very similar approach: starting from the (document, term) counts, map each entry to (term, 1). Since no entries will have a term frequency of zero, you can just say: whenever the term frequency is greater than 0, emit a one. Then a reduce sums these to give you (term, document frequency).
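The two jobs described above can be sketched as plain Python, using a small in-memory stand-in for the MapReduce runtime. The `run` helper, the toy corpus, and all names here are illustrative assumptions, not Hadoop's actual API:

```python
import math
from collections import defaultdict

# Hypothetical input: (document_id, list_of_terms) pairs, as in the lecture.
corpus = [
    ("D1", ["the", "cat", "sat", "the"]),
    ("D2", ["the", "dog", "ran"]),
    ("D3", ["the", "aardvark", "sat"]),
]
N = len(corpus)

def run(mapper, reducer, records):
    """Minimal in-memory stand-in for a MapReduce job: map every record,
    shuffle values by key, then reduce each key's list of values."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Job 1: term frequency. Map each occurrence to ((doc, term), 1);
# reduce sums the ones, giving TF(term, doc).
tf = run(lambda rec: (((rec[0], t), 1) for t in rec[1]),
         lambda key, ones: sum(ones),
         corpus)

# Job 2: document frequency. Every TF entry has a count > 0, so map each
# (doc, term) key to (term, 1); reduce sums, giving DF(term).
df = run(lambda item: [(item[0][1], 1)],
         lambda term, ones: sum(ones),
         tf.items())

# Final weights: TF(i, j) * log(N / DF(i)).
weights = {(doc, term): count * math.log(N / df[term])
           for (doc, term), count in tf.items()}

print(tf[("D1", "the")])       # 2
print(df["aardvark"])          # 1
print(weights[("D1", "the")])  # 0.0  (log(3/3) = 0)
```

In a real Hadoop job the mapper and reducer would run on different machines over partitions of the data, with the framework handling the shuffle; the key/value flow, however, is exactly the one sketched here.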
Now, once you can specify your code using map and reduce, as you've seen, the Hadoop framework will take care of everything else in the distributed computation. It can go through terabytes of data, perform the first map and reduce and then the second map and reduce, and get you the weights you need, which can then be used in information retrieval. This shows how the very simple but powerful constructs of map and reduce can enable distributed computing on large volumes of data.