We will now look at the PageRank algorithm as an example of using the Spark framework for distributed computing. The idea behind PageRank, which dates back a couple of decades, is that you have a collection of web pages A, B, and so on, with links from one page to another, and the goal is to figure out which pages are more important than others. What PageRank says is that the rank of a page B should be the sum, over all predecessors A of B, that is, over all pages that link to B, of the rank of A divided by the destination count of A. The intuition is that page B is going to be important if many pages point to it, and in particular if many important pages point to it. Dividing by the destination count means that if a page has many outgoing links, each destination gets relatively less weight, because it receives only a fraction of the source's rank. This is a mathematical formula that was worked out based on a probabilistic understanding of websites and their links. The question is how we can compute it. For a sequential algorithm, we could iterate, because there is a recursive dependency here: the rank of a page gets updated based on the ranks of its predecessors. So we could have an iterative computation, say for iter = 1 to n, with two steps in each iteration. First, for each page A, and for each link from A to a page B, we compute a contribution to B, which is the rank of A divided by the destination count of A. Second, for each page B, we compute the rank of B as 0.15 + 0.85 times the sum of the contributions that arrive at page B. This business about 0.15 and 0.85 is a weighting factor that has been determined empirically to be helpful for getting reasonable approximations.
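The sequential iteration described above can be sketched in plain Python. This is only an illustration: the three-page link structure is a made-up example graph, not anything from the original discussion.

```python
# Sequential sketch of iterative PageRank.
# The link structure is a hypothetical example graph: each page maps
# to the list of pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

# Step 0: initialize every page's rank to 1.
ranks = {page: 1.0 for page in links}

for _ in range(10):  # 10-20 iterations usually suffice in practice
    # Step 1: each page A sends rank(A) / destinationCount(A)
    # along every outgoing link.
    contribs = {page: 0.0 for page in links}
    for src, dests in links.items():
        for dest in dests:
            contribs[dest] += ranks[src] / len(dests)
    # Step 2: each page B folds in its contributions with the
    # empirical 0.15 / 0.85 weighting.
    ranks = {page: 0.15 + 0.85 * c for page, c in contribs.items()}

print(ranks)
```

Note that page B, which only one page links to, ends up with a lower rank than A and C, matching the intuition that pages with more (and more important) predecessors rank higher.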
So, what we are doing is repeatedly updating the ranks. We start by initializing them all to 1, and based on the contributions that come into page B from its different predecessors we update the rank of B, which in turn updates the contributions to the successors of B. This is a well-understood approach to computing PageRank. The question is how we can do it in Spark. It turns out that this iterative computation is very well suited to Spark, because we can cache the values of the ranks and the contributions as we make repeated iterations, which means the computation can stay in memory rather than going to storage such as disk. To compute the contributions in Spark, we do something like contribs = links.join(ranks), which collects together the outgoing links and the current rank for each source page. Then we do a map, which in Spark will be a flatMapToPair: for every destination, we emit the pair of the destination and the new contribution, which is the rank of the source divided by the destination count. In this way we collect the contributions using the standard Spark operations we saw earlier, join and map, and we get a collection of (dest, contribution) pairs for all the destinations. The second step is relatively easy. We need to compute the new value of the ranks from the contributions by doing a reduce, because we are summing up the contributions for every destination page. So we call reduceByKey, and we have to specify the reduce operator, which is a sum. Then we do a map to apply the proper weighting of 0.15 and 0.85: we map the values so that each sum goes to 0.15 + 0.85 times the sum.
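To make the Spark dataflow concrete without requiring a cluster, here is a pure-Python model of the loop body. The helper functions `join` and `reduce_by_key` are stand-ins I wrote to mimic the pair-RDD operations the text names (join, flatMapToPair, reduceByKey, mapValues); the example graph is hypothetical.

```python
# Pure-Python model of the Spark PageRank loop body.
# Hypothetical example graph: (page, list of outgoing links),
# shaped like the key-value pairs a Spark pair RDD would hold.
links = [("A", ["B", "C"]), ("B", ["C"]), ("C", ["A"])]
ranks = [(page, 1.0) for page, _ in links]

def join(left, right):
    """Model of a pair-RDD join: (k, v) with (k, w) -> (k, (v, w))."""
    right_dict = dict(right)
    return [(k, (v, right_dict[k])) for k, v in left if k in right_dict]

def reduce_by_key(pairs, f):
    """Model of reduceByKey: combine all values sharing a key with f."""
    acc = {}
    for k, v in pairs:
        acc[k] = f(acc[k], v) if k in acc else v
    return list(acc.items())

for _ in range(10):
    # contribs = links.join(ranks) followed by a flatMapToPair:
    # each source emits (dest, rank / destinationCount) per link.
    joined = join(links, ranks)
    contribs = [(dest, rank / len(dests))
                for _, (dests, rank) in joined
                for dest in dests]
    # ranks = contribs.reduceByKey(sum).mapValues(0.15 + 0.85 * sum)
    sums = reduce_by_key(contribs, lambda a, b: a + b)
    ranks = [(page, 0.15 + 0.85 * s) for page, s in sums]

print(dict(ranks))
```

In real Spark the same pipeline runs distributed across partitions, and caching `links` and `ranks` keeps the repeated iterations in memory.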
So, what this shows is that once you get the hang of functional programming in Spark, you can take the mathematical expression for an algorithm, work out the pseudocode, and then map each data-parallel step inside the loop to a Spark statement. The outer for-iter loop exists in Spark as well, and that is where you get the benefit of in-memory caching. The number of iterations varies depending on how approximate or precise a solution you need. In practice, 10 to 20 iterations suffice; you can do more iterations if you want tighter convergence. So, there you have it. You can start with mathematical formulas, convert them to pseudocode to make sure you have the details of the formula right, and then map the pseudocode into Spark statements. You will have a distributed computing algorithm that can keep iterating over data and perform much better than Hadoop, which in this case might need to go to storage in every iteration. With this, you are on your way to trying many more examples using the Spark framework for distributed computing.