Now we can adapt our numerical integration example to run on a cluster: for the numerical integration problem, we can also distribute the work across multiple processes. The compute function is executed by every process. Each process knows its rank (a numerical ID: 0, 1, 2, 3, ...) and the total number of processes in the calculation, so it can compute how many integration steps it is responsible for. That count is the same for every process; what differs are iStart and iEnd: the first value of i that this particular rank is responsible for, and one past the last. In the parallel loop, each rank runs from iStart up to iEnd, and together the processes cover the entire iteration space. Inside the loop there is still vectorization and multithreading, but we end up with a partial result stored in the memory of each process.

This is where we stopped with the stencil example. With the integral example, however, it makes no sense to keep partial results, and we need to aggregate them. Thankfully, the result is just a single scalar variable, so the messaging between processes is going to be quick. The function we use to aggregate the results is MPI_Allreduce. We already understand that this is a reduction with the operator plus over this value, and MPI has reduction built in; the "all-reduce" variant aggregates the partial results from the multiple processes and then propagates the final result back to all of them, so each process knows the full final answer. In the code calling this function, nothing looks different except for the usual MPI initialization and setup. We also use a barrier before we start the timing.
We don't need a barrier before we stop the timing, because by the time we exit the compute-integral function, the results have already been communicated across all processes, so there is an implicit barrier there. Let's see how well this works. The code is compiled with the MPI wrapper over the Intel C++ compiler, and it is executed using mpirun with the machine file produced by the cluster; we are going to compute on four nodes. The single-node computation time was around 80 milliseconds. Now the execution time is 20 milliseconds, so we have a clean speedup of 4 on four nodes, and despite the communication overhead we still observe linear speedup. In summary, with the integral calculation we went from serial code, to vector code, to multithreaded code, to cluster computation, and in this way we have used all three layers of parallelism available in modern processors in this numerical integration example.
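The compile-and-launch steps described above look roughly like this. This is a sketch: the source and binary names and the machine-file name are placeholders, mpiicpc is Intel's MPI wrapper over its C++ compiler, and a different cluster may use mpicxx and mpiexec instead.

```shell
# Compile with the MPI wrapper over the Intel C++ compiler
mpiicpc -o integral integral.cc

# Launch across the cluster: 4 processes, one per node listed
# in the machine file produced by the cluster
mpirun -np 4 -machinefile hosts ./integral
```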