Hello. The world of efficient MapReduce rests on three whales: the Combiner, the Partitioner, and the Comparator. In the previous video we discussed the Combiner in detail. In this video we will talk about another huge beast, or more precisely whale, called the Partitioner.

You are already familiar with the WordCount MapReduce application, and you might have got tired of it, so let me show you another view of this problem. A collocation is a sequence of words that occur together unusually often. For instance, "the United States" and "New York" are collocations. If you would like to find collocations of size two in a dataset, for example a Wikipedia sample, then you need to count bigrams. It would be the same WordCount MapReduce application; the only difference is that you count bigrams of words instead of the independent words we counted last time. The following mapper emits a sequence of bigrams, followed by aggregation during the reduce phase (a sketch of such a mapper is given below).

If you call this script without changes, the Hadoop MapReduce framework will distribute and sort the data by the first word, because everything before the first tab character is considered the key. Since you have no guarantee about the order of the values, the data on a reducer will not be sorted by the second word. Of course, you can update reducer.py to count all bigrams for the corresponding first word in memory, exactly as you see on the slide, but it would be memory-consuming. On this slide you can also see the output of this MapReduce application, which validates that the bigram "New York" is a collocation.

In addition to the unnecessary memory consumption, there would be an uneven load on the reducers. You know that there are some words which occur far more frequently than others. For instance, one of the most popular words in the English language is the article "the". The benefit of MapReduce is that it provides the functionality to parallelize work. In the default scenario, there would be far more load on the reducer that is busy processing the article "the". But you have no need to send all of the bigrams starting with "the" to one reducer, as you do the calculations for each pair of words independently.

You could change the output of the mapper and substitute the first tab character with a special symbol, which would solve your problem, but it would be more difficult for a user to differentiate between the words in the bigram visually. As you could have already guessed, here the partitioner comes into play. With command-line arguments, you can specify the way to split the streaming mapper or reducer output into a key-value pair. In this case you would like to split the line into key and value by the second tab character, so that the whole bigram becomes the key. This slide shows the API which helps you do it from the CLI (a job invocation in this spirit is sketched below). If you call it again, the MapReduce job should complete faster due to better parallelism. And if you list the files in the output directory, you will see that bigrams starting with the same word, such as "the", are allocated to different files.

Let me show you a few more useful flags, just to close the loop on the subject of data partitioning in streaming scripts. Imagine you are working with a collection of IPv4 network addresses. If you have not seen them before, an IPv4 address contains four numbers, called octets, delimited by dots. The name "octet" comes from the fact that these numbers are bounded by two to the power of eight. You can specify what the delimiter is and set the number of fields related to the key. The MapReduce framework will substitute this particular delimiter between fields num and num+1 with a tab character, without any changes in the streaming scripts. In this tiny example, I told the MapReduce framework that I would like to split the output of the streaming mapper by the first dot, and in the reducer's streaming output I substituted the second dot with the key-value MapReduce delimiter, which is a tab character.

There are even more handy tricks that you are sure to like. For instance, if for some reason you would like to partition IPv4 addresses by the second character of the first octet, then you can do it with a simple CLI call with the following arguments: you specify the field index and the starting character index in the start position, and you specify the field index and the character index in the end position. In this case the data will be partitioned by the letter "a" or "b" in the first octet. As a side note, you will never see letters in IPv4 addresses; I used them only to highlight the point of partitioning on the slides. The API for these partitioner flags is equivalent to the UNIX CLI sort key definition (sort -k). As you could have noticed, I had to set a special partitioner called KeyFieldBasedPartitioner, which is a Java class that ships with Hadoop. The four sketches below illustrate the bigram mapper, the bigram job invocation, the IPv4 delimiter flags, and this partitioner, in that order.
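Here is a minimal sketch of a streaming mapper in that spirit; the exact script is on the slides, so the file name mapper.py and the word1-tab-word2-tab-count output layout are my assumptions, not the original code.

```python
#!/usr/bin/env python
"""mapper.py -- a sketch of a streaming bigram mapper (assumed layout)."""
import sys

for line in sys.stdin:
    words = line.strip().split()
    # Every pair of adjacent words forms one bigram; emit it with a
    # count of 1 so the reduce phase can sum the occurrences.
    for first, second in zip(words, words[1:]):
        print("{}\t{}\t{}".format(first, second, 1))
```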
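And this is roughly how the split by the second tab character can be requested from the CLI. The stream.num.map.output.key.fields option is a real Hadoop Streaming flag; the jar path, the directory names, and reducer.py are placeholders for this sketch.

```bash
# Treat the first two tab-separated fields (the whole bigram) as the key,
# so the default hash partitioner spreads bigrams starting with "the"
# across the reducers instead of sending them all to one.
hadoop jar hadoop-streaming.jar \
    -D stream.num.map.output.key.fields=2 \
    -files mapper.py,reducer.py \
    -input wikipedia_sample \
    -output bigram_counts \
    -mapper mapper.py \
    -reducer reducer.py
```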
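For the IPv4 example, the delimiter flags could be sketched as follows. The four stream.* options are documented Hadoop Streaming flags; using /bin/cat as an identity mapper and reducer, and the directory names, are assumptions made for brevity.

```bash
# Split the mapper output by the first dot, so the key is the first octet;
# in the reducer output, turn the second dot into the key-value tab,
# so the first two octets end up in the key.
hadoop jar hadoop-streaming.jar \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=1 \
    -D stream.reduce.output.field.separator=. \
    -D stream.num.reduce.output.key.fields=2 \
    -input ipv4_addresses \
    -output ipv4_split \
    -mapper /bin/cat \
    -reducer /bin/cat
```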
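Finally, a sketch of partitioning by the second character of the first octet with KeyFieldBasedPartitioner. The -k1.2,1.2 key specification follows the UNIX sort -k convention (field 1, character 2, in both the start and the end position); as before, paths and directory names are placeholders.

```bash
# Keep all four dot-separated octets in the key, but partition only by
# field 1, character 2 -- the second character of the first octet.
hadoop jar hadoop-streaming.jar \
    -D stream.map.output.field.separator=. \
    -D stream.num.map.output.key.fields=4 \
    -D map.output.key.field.separator=. \
    -D mapreduce.partition.keypartitioner.options=-k1.2,1.2 \
    -input ipv4_addresses \
    -output ipv4_by_second_char \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
```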
Occasionally, some partitioning functionality is only possible to implement in Java, and you can write your own Java class to do the partitioning. But there is already a collection of auxiliary classes that you can use to tune your streaming MapReduce application, and from my personal experience, knowing how to work with this one will be enough for the vast majority of the streaming applications you will write.

Moving back to the bigger picture, people use the following diagram to understand the whole pipeline of MapReduce application execution. You have mappers at the top, then the data goes through combiners, then it is distributed by the partitioner, and finally there is the reduce phase, and that is it. In reality, the functionality of the combiner and the partitioner is spread across a number of stages, starting with in-memory calculations during the map phase. For instance, Hadoop applies the combiner in quite a number of places.

Summing up, in this video you have learned what a partitioner is and how to specify it for a streaming MapReduce application. You have also learned how to count bigrams in MapReduce, and how to spread the load over the reducers with the help of the partitioner.