We're going to move on now to look at graph processing and the infrastructure you need to support it. We've seen the need for it, but how do you actually provide that support? There are two ways to go. You can have a supercomputer do your graph processing, with lots of memory and very high-performance machines. Or, if you want to scale to large graphs at normal cost, you're going to have to look for a cloud solution with distributed processing. In general this is going to be expensive to build: unless we can find a way to map graph processing onto cheap cloud infrastructure, we're left with the supercomputing approach, which will be more expensive.

So a natural question is: can we just use MapReduce, and would that work? Well, it tends to be inefficient, because the graph state must be stored and reloaded at each stage of the graph algorithm, and each computational stage produces a lot of communication between stages. If you run a MapReduce over all the vertices, what happens is that you have to visit all of the data structures on every pass, and that's very expensive even if your algorithm doesn't actually change all of them, and we'll see later that some algorithms don't.

The single-computer, library-based approach isn't very scalable. It works on a supercomputer up to some large extent, but once you run out of memory, once you can't fit your graph in memory, and as we've seen there are lots of graph processing problems where you can't, then you have trouble. If you're doing a shared-memory implementation on a supercomputer, you also have to worry about fault tolerance: the computation runs for a long while, and there's a chance that machines die, so how are you going to make sure the computation continues? One of the advantages of distributed computing is that machines can die and, with backup copies of the pieces, you can keep going even in those circumstances.

So we're running into an area where we may not be able to do the computations we want if we can't figure out how to map them onto a distributed machine, and maybe we don't need them to be maximally efficient. Maybe what we will do is just use the massively distributed approach to make the computation work, provide the fault tolerance, and get the answer, even though it won't be as efficient as we would really desire.

So we are going to look at a parallel model again. It's still going to be Valiant's BSP model, and that's still the direction we're going to take. We're going to work with graphs, and often we're going to be oriented towards directed graphs. I'm going to show you Pregel. Pregel was one of the first systems to do graph processing on a distributed system; it was built to show that you could actually solve the problem this way. As you will have heard in the first lecture, it is vertex-centric: we organise the computation around each vertex, and each vertex is the unit of processing. You move the computation forward by processing a particular vertex, and you keep repeating over all the vertices that need to be processed until every vertex votes to say, okay, I have done what I needed to do on this stage of the graph processing. The vertices that vote at a given stage may not be all the vertices; it could be a subset.
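To make that vertex-centric, vote-to-halt loop concrete before we get into Pregel's actual API, here is a small, self-contained C++ sketch of the idea on a single machine. Everything in it (the ToyVertex struct, the inbox vectors, the maximum-value example) is made up for illustration and is not Pregel code: the real system partitions the vertices across many machines and delivers the messages over the network.

```cpp
#include <cstdio>
#include <vector>

// Toy, single-machine illustration of the vertex-centric superstep loop:
// process every vertex that is active or has pending messages, let it send
// messages for the next superstep or vote to halt, and stop when nothing is
// left to do.

struct ToyVertex {
  int value;                   // per-vertex state
  std::vector<int> out_edges;  // indices of this vertex's out-neighbours
  bool active;                 // false once the vertex has voted to halt
};

int main() {
  // A small directed cycle: 0 -> 1 -> 2 -> 3 -> 0, with initial values.
  std::vector<ToyVertex> g = {
      {3, {1}, true}, {6, {2}, true}, {2, {3}, true}, {1, {0}, true}};

  // inbox[i] holds the messages vertex i receives this superstep;
  // next_inbox collects the messages being sent for the next one.
  std::vector<std::vector<int>> inbox(g.size()), next_inbox(g.size());

  bool any_active = true;
  for (int superstep = 0; any_active; ++superstep) {
    any_active = false;
    for (std::size_t i = 0; i < g.size(); ++i) {
      // A halted vertex with no incoming messages is skipped;
      // an incoming message wakes a halted vertex up again.
      if (!g[i].active && inbox[i].empty()) continue;

      // "Compute": adopt the largest value seen so far.  If it changed
      // (or this is the first superstep), tell the out-neighbours;
      // otherwise vote to halt.
      int old = g[i].value;
      for (int m : inbox[i])
        if (m > g[i].value) g[i].value = m;
      if (superstep == 0 || g[i].value != old) {
        for (int dst : g[i].out_edges) next_inbox[dst].push_back(g[i].value);
        g[i].active = true;
        any_active = true;
      } else {
        g[i].active = false;  // vote to halt
      }
    }
    // Messages sent at superstep S are delivered at superstep S+1.
    inbox.swap(next_inbox);
    for (auto& box : next_inbox) box.clear();
    for (const auto& box : inbox)
      if (!box.empty()) any_active = true;
  }

  for (std::size_t i = 0; i < g.size(); ++i)
    std::printf("vertex %zu -> %d\n", i, g[i].value);  // every vertex ends at 6
  return 0;
}
```

Running this, every vertex ends up with the largest value in the graph, and the loop stops in the superstep where no vertex changes and no messages are left in flight, which is exactly the "everyone has voted to halt" condition described above.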
Once those vertices have voted, we return to the directed graph and allow yet another iteration. Those iterations are the supersteps of our method.

The primitives you will find in Pregel deal with vertices and treat them as first-class entities, meaning we program in terms of vertices: with this vertex, we're going to do this, this and this. The edges aren't first class; they're incidental. When we visit a vertex we talk about its edges, and we explore what's on them: what are the edges labelled, what are they connected to? We also allow both vertices and edges to be created and destroyed during the computation if we need to.

Pregel is based around C++; it has a C++ API. The computation moves ahead as a series of supersteps. In a superstep you look at the vertices. The application code, written in C++ in an object-oriented style, subclasses an abstract Vertex class and writes a Compute method for that vertex, and in that Compute method you say: taking into account all of the edges, what do I want to do with this vertex? You can set the vertex value, you can set and get the values on the outgoing edges, and you can send and receive messages. Generally, at superstep S you read the messages sent to this vertex in the previous superstep, superstep S-1, and you can look at the edges; you can send messages to other vertices that will be received at superstep S+1. That's what lets you go from step to step, computation to computation. In each superstep you can modify the state of a vertex V and its outgoing edges, and those changes will be available to the next superstep.

So the C++ API provides messaging with guaranteed delivery, a bit like TCP: messages are delivered exactly once, so you don't have to worry about messages failing to get through, and you can send a message to any vertex in the system at all. If you want to wrap around, or you want a broadcast, or whatever, that's fine. If the destination doesn't exist, a user-supplied function will be called to tell you that you've made an error and let you fix the problem.
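For reference, the vertex interface the lecture is describing looks roughly like the template class sketched in the Pregel paper. The version below is paraphrased and simplified so that it stands alone: the MessageIterator and OutEdgeIterator types are left as opaque stubs, and the exact signatures should be treated as approximate rather than as the actual header.

```cpp
#include <cstdint>
#include <string>

// Paraphrased sketch of a Pregel-style vertex class; types and signatures
// are simplified for illustration.
template <typename VertexValue, typename EdgeValue, typename MessageValue>
class Vertex {
 public:
  class MessageIterator;   // iterates over messages sent to this vertex
  class OutEdgeIterator;   // iterates over this vertex's outgoing edges

  virtual ~Vertex() {}

  // User code overrides Compute(); the framework calls it once per superstep
  // with the messages that were sent to this vertex in superstep S-1.
  virtual void Compute(MessageIterator* msgs) = 0;

  const std::string& vertex_id() const;
  std::int64_t superstep() const;        // the current superstep number S

  const VertexValue& GetValue();         // read this vertex's value
  VertexValue* MutableValue();           // modify it; the change persists
                                         // into later supersteps
  OutEdgeIterator GetOutEdgeIterator();  // inspect or modify outgoing edges

  // Send a message to any vertex by id; it is delivered exactly once,
  // at superstep S+1.
  void SendMessageTo(const std::string& dest_vertex,
                     const MessageValue& message);

  void VoteToHalt();  // this vertex has finished its work for now
};
```

User code subclasses this, fills in Compute, and the framework takes care of running it on every active vertex in each superstep and shipping the messages to the right place for the next one.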