This lecture is about web indexing. In this lecture, we will continue talking about web search, and we're going to talk about how to create a web-scale index. So once we crawl the web, we've got a lot of web pages. The next step is to use the indexer to create the inverted index. In general, we can use the same information retrieval techniques for creating an index that we talked about in previous lectures, but there are new challenges that we have to solve for web-scale indexing. The two main challenges are scalability and efficiency. The index would be so large that it cannot fit on any single machine or single disk, so we have to store the data on multiple machines. Also, because the data is so large, it's beneficial to process the data in parallel so that we can produce the index quickly.

Now, to address these challenges, Google has made a number of innovations. One is the Google File System (GFS), a distributed file system that helps programmers manage files stored on a cluster of machines. The second is MapReduce, a general software framework for supporting parallel computation. Hadoop is the most well-known open-source implementation of MapReduce and is now used in many applications.

So, this is the architecture of the Google File System. It uses a very simple centralized management mechanism to manage the specific locations of files: it maintains the file namespace and a lookup table recording exactly where each file is stored. The application client first talks to this GFS master to obtain the specific locations of the files it wants to process. Once the client has obtained those locations, it can talk directly to the specific chunk servers where the data actually sits, so it avoids involving other nodes in the network.

When this file system stores files on machines, it breaks them into chunks of a fixed size. Each chunk is 64 MB, so it's pretty big, and that's appropriate for large-scale data processing. These chunks are replicated to ensure reliability. This is something the programmer doesn't have to worry about; it's all taken care of by the file system. From the application perspective, the programmer sees the data as if it were a normal file, doesn't have to know where exactly it is stored, and can just invoke high-level operators to process the file. Another feature is that the data transfer happens directly between the application and the chunk servers, so it's efficient in this sense.

On top of the Google File System, Google also proposed MapReduce as a general framework for parallel programming. Now, this is very useful for supporting a task like building an inverted index. This framework hides a lot of low-level details from the programmer. As a result, the programmer can make a minimal effort to create an application that runs on a large cluster in parallel. The low-level details hidden by the framework include the network communication, the load balancing, and where each task is executed; all these details are hidden from the programmer. There is also a nice feature, which is built-in fault tolerance. If one server breaks down and some tasks are not finished, the MapReduce framework will know that those tasks have not been done, so it automatically dispatches them to other servers that can do the job.
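To make the GFS read path described above concrete, here is a toy single-process sketch. This is not the real GFS interface; the class and function names (Master, ChunkServer, client_read) are made up for illustration. The point it shows is that the client asks the single master only for metadata, then transfers the actual data directly from a chunk server.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS splits files into fixed 64 MB chunks

class Master:
    """Toy master: keeps only metadata (file namespace + chunk locations)."""
    def __init__(self, chunk_locations):
        # (file_name, chunk_index) -> list of chunk-server names holding a replica
        self.chunk_locations = chunk_locations

    def locate(self, file_name, byte_offset):
        chunk_index = byte_offset // CHUNK_SIZE
        return chunk_index, self.chunk_locations[(file_name, chunk_index)]

class ChunkServer:
    """Toy chunk server: stores the actual chunk data."""
    def __init__(self, chunks):
        self.chunks = chunks  # (file_name, chunk_index) -> bytes

    def read(self, file_name, chunk_index, offset_in_chunk, length):
        data = self.chunks[(file_name, chunk_index)]
        return data[offset_in_chunk:offset_in_chunk + length]

def client_read(master, servers, file_name, byte_offset, length):
    # Step 1: ask the master only for metadata (which chunk, which servers).
    chunk_index, replicas = master.locate(file_name, byte_offset)
    # Step 2: move the data directly between client and chunk server,
    # so the single master never carries the bulk data traffic.
    return servers[replicas[0]].read(file_name, chunk_index,
                                     byte_offset % CHUNK_SIZE, length)
```

In the real system the master also handles replication and failure recovery, which is exactly what the programmer does not have to think about.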
And therefore, again, the programmer doesn't have to worry about that.

So here's how MapReduce works. The input data is separated into a number of key-value pairs. Now, what exactly is in the key and the value depends on the data; it's actually a fairly general framework that allows you to partition the data into different parts, and each part can then be processed in parallel. Each key-value pair is sent to a Map function, which the programmer writes, of course. The Map function processes this key-value pair and then generates a number of other key-value pairs; the new key is usually different from the key that was given to Map as input. These key-value pairs are the output of the Map function. The outputs of all the Map functions are then collected, and there is a sorting step based on the key. The result is that all the values associated with the same key are grouped together. So now we've got a pair consisting of a key and the set of values attached to this key. This would then be sent to a Reduce function. Now, each Reduce function handles a different key, so these grouped values are sent to multiple Reduce functions, each handling a unique key. A Reduce function then processes its input, which is a key and a set of values, to produce another set of key-value pairs as its output. These output pairs are then collected together to form the final output.

And so, this is the general framework of MapReduce. The programmer only needs to write the Map function and the Reduce function; everything else is taken care of by the MapReduce framework. So you can see the programmer really only needs to do minimal work. With such a framework, the input data can be partitioned into multiple parts, which are first processed in parallel by Map; then, after we reach the Reduce stage, the framework can also process the different keys and their associated values in parallel. So it achieves the purpose of parallel processing of a large data set.

So let's take a look at a simple example, and that's word counting. The input is text containing words, and the output we want to generate is the number of occurrences of each word, the word count. We know this kind of counting is useful, for example, to assess the popularity of a word in a large collection, and this is useful for computing a factor like the IDF weighting used in search. So how can we solve this problem? Well, one natural thought is that this task can be done in parallel by simply counting different parts of the file in parallel, and then in the end we just combine all the counts. And that's precisely the idea of what we can do with MapReduce: we can parallelize over the lines of the input file. More specifically, we can assume the input to each Map function is a key-value pair representing the line number and the string on that line. The first line, for example, has a key of 1, and the value is the text of that line, just the four words on it. This key-value pair would be sent to a Map function. The Map function then simply counts the words on this line. In this case, of course, there are only four words, so each word gets a count of one, and these pairs are the output from this Map function that you see here on the slide.
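To make this flow concrete, here is a minimal single-machine sketch of what the framework itself does. The names run_mapreduce, map_fn, and reduce_fn are illustrative, not the actual Hadoop or Google API: the programmer supplies only the two functions, while splitting the input, sorting by key, and grouping values are the framework's job, and in a real system each Map and Reduce call could run on a different machine.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy stand-in for the MapReduce framework on a single machine.
    records is a list of (key, value) input pairs."""
    # Map phase: each input pair could be handled by a different machine;
    # here we simply loop over them.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle/sort: collect all intermediate (key, value) pairs and group
    # together the values that share the same key -- the framework's job.
    intermediate.sort(key=itemgetter(0))
    grouped = [(k, [v for _, v in group])
               for k, group in groupby(intermediate, key=itemgetter(0))]

    # Reduce phase: each distinct key could again go to a different machine.
    output = []
    for key, values in grouped:
        output.extend(reduce_fn(key, values))
    return output
```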
So the Map function is really very simple. If you look at the pseudocode on the right side of the slide, you see it simply iterates over all the words on this line and then calls the collect function, which sends each word and its count to the collector. The collector then sorts all these key-value pairs coming from the different Map functions. So the function is very simple, and the programmer specifies this function as the way to process each part of the data. Of course, the second line will be handled by a different Map function, which will produce its own output in the same way.

Okay, now the output from the Map functions is then sent to a collector, and the collector does the internal grouping and sorting. So at this stage, you can see, we have collected a number of pairs; each pair is a word and its count in one line. Once we see all these pairs, we can sort them based on the key, which is the word. So we collect all the counts of a word, like "Bye" here, together, and similarly we do that for the other words, like "Hadoop", "Hello", etc. Each word is now attached to a number of values, a number of counts, and these counts represent the occurrences of this word on different lines. So now we have got a new pair of a key and a set of values, and this pair will then be fed into a Reduce function. The Reduce function now has to finish the job of counting the total occurrences of this word. It has already got all these partial counts, so all it needs to do is simply add them up. So the Reduce function here is very simple as well: you have a counter, you iterate over all the values in this array, you accumulate the counts, and finally you output the key and the total count. And that's precisely what we want as the output of this whole program.

So you can see, this is already very similar to building an inverted index. If you think about it, the output here is an index of sorts: we have already got a dictionary, basically, and we have got the counts. But what's missing is the documents, the specific frequency counts of words in those documents. So we can modify this slightly to actually index in parallel, and here's one way to do that. In this case, we can assume the input to a Map function is a pair where the key denotes the document ID and the value denotes the text string of that document, so it contains all the words in that document. The Map function then does something very similar to what we have seen in the word-counting example: it simply groups all the counts of each word in this document together, and it generates a set of key-value pairs, where each key is a word and the value is the count of this word in this document plus the document ID. Now, you can easily see why we need to add the document ID here: later, in the inverted index, we would like to keep this information, so the Map function should keep track of it, and it can then be sent to the Reduce function later. Similarly, another document, D2, can be processed in the same way. So in the end, again, there is a sorting mechanism that groups the pairs together, and then we have a key, like "java", associated with all the documents that match this key, or all the documents where "java" occurred, together with the counts of "java" in those documents. All of this is collected together and then fed into the Reduce function.
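Pulling the word-count example together, the two functions the programmer writes might look like the following sketch. It plugs into the run_mapreduce skeleton shown earlier, and the input lines in the usage comment are only illustrative.

```python
def map_word_count(line_number, line):
    """Map: input key = line number, value = the line's text.
    Emit (word, 1) for every word occurrence on the line."""
    for word in line.split():
        yield (word, 1)

def reduce_word_count(word, counts):
    """Reduce: input key = a word, values = all its partial counts.
    Emit the word together with its total count."""
    yield (word, sum(counts))

# Example usage with the run_mapreduce skeleton from earlier
# (the two input lines are illustrative):
# run_mapreduce([(1, "Hello World Bye World"),
#                (2, "Hello Hadoop Goodbye Hadoop")],
#               map_word_count, reduce_word_count)
# -> [('Bye', 1), ('Goodbye', 1), ('Hadoop', 2), ('Hello', 2), ('World', 2)]
```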
So now you can see the Reduce function has already got input that looks like an inverted index entry: it's just the word, all the documents that contain the word, and the frequencies of the word in those documents. So all it needs to do is simply concatenate them into one contiguous chunk of data, and this can then be written to the file system. So basically the Reduce function does very minimal work.

And so, this is the pseudocode for inverted index construction. Here we see two functions, procedure Map and procedure Reduce, and a programmer would specify these two functions to program on top of MapReduce. You can see they are basically doing what I just described. In the case of Map, it counts the occurrences of each word using an AssociativeArray, and it outputs all the counts together with the document ID. The Reduce function, on the other hand, simply concatenates all the input it has been given and puts it together as one single entry for this key. So this is a very simple pair of MapReduce functions, yet it allows us to construct an inverted index at very large scale, with the data processed by different machines, and the programmer doesn't have to take care of the details. So this is how we can do parallel index construction for web search.

To summarize, web-scale indexing requires some new techniques that go beyond the standard, traditional indexing techniques. Mainly, we have to store the index on multiple machines, and this is usually done through a distributed file system like the Google File System. Secondly, it requires creating the index in parallel, because the collection is so large that it takes a long time to create an index for all the documents; if we can do it in parallel, it will be much faster, and this is done by using the MapReduce framework. Note that both GFS and MapReduce are very general frameworks, so they can also support many other applications.
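To connect this back to the index-construction pseudocode discussed above, here is a minimal runnable sketch in the same style. It again plugs into the run_mapreduce skeleton from earlier; the document IDs and texts in the usage comment are only illustrative.

```python
from collections import defaultdict

def map_index(doc_id, text):
    """Map: input key = document ID, value = the document's text.
    Count term frequencies within this document (the 'AssociativeArray'
    on the slide), then emit (word, (doc_id, tf)) for each distinct word."""
    term_freq = defaultdict(int)
    for word in text.split():
        term_freq[word] += 1
    for word, tf in term_freq.items():
        yield (word, (doc_id, tf))

def reduce_index(word, postings):
    """Reduce: input key = a word, values = all its (doc_id, tf) pairs.
    Concatenate them into a single postings list, i.e. one index entry."""
    yield (word, sorted(postings))

# Example usage with the run_mapreduce skeleton from earlier
# (document IDs and texts are illustrative):
# run_mapreduce([("D1", "java is a programming language java"),
#                ("D2", "hadoop is written in java")],
#               map_index, reduce_index)
# -> ..., ('java', [('D1', 2), ('D2', 1)]), ...
```

In a real deployment, each Reduce output pair would be written back to the distributed file system as one inverted-index entry for that word.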