Our stencil code can run on a cluster with the help of MPI.
Our stencil code is now well optimized to run on a single computer.
We have taken advantage of parallelism in the form of vectorization and
multi-threading, and we have optimized its memory access pattern.
The next logical step would be to try to further parallelize this code,
to run on multiple independent computers in a cluster.
Clearly, there is vast parallelism in this application.
Different rows can be processed independently,
and we already take advantage of that when we
parallelize the processing of this image across threads:
different rows go to different threads.
We can further split up the work.
Now, we will split the work up between multiple independent computers.
Each computer will be responsible for processing
only its own part of the image with a stencil operator.
So this is process 0, 1, 2, 3,
and each process will produce its own piece of the output.
In this example, we will just keep this piece of the output on each respective processor.
We are not going to try to aggregate these results on a single machine.
At the end of the video, I will talk about what it means,
when it makes sense,
and how it applies to real world problems.
Inside the code, when I want to
distribute the image processing across multiple processors,
I can use my rank,
my numerical identifier to compute which part of the image I'm responsible for.
Remember in MPI, usually applications are designed in
such a way that every processor runs the same code,
but they will have different numerical IDs, ranks.
Using the rank, I can compute
the first row and the last row
that this processor is responsible for.
The way that this formula is set up,
the entire range of rows will be covered by all of my processes,
but each will process only a fraction of the rows.
I use my first row and my last row in the loop over the rows.
So, instead of going from one to height,
I go from my first row to my last row,
and if I'm using four processors,
each will process only a quarter of the rows,
and each processor will know only a quarter of the final answer.
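As a rough sketch, that row partitioning could look like the code below; the names myRank, nRanks, height, and ComputeMyRows are assumptions for illustration, not necessarily the names used in the actual code.

#include <algorithm>

// Illustrative sketch: split the interior rows [1, height-1) evenly across nRanks processes.
void ComputeMyRows(const int myRank, const int nRanks, const int height,
                   int& myFirstRow, int& myLastRow) {
  const int rowsPerRank = (height - 2 + nRanks - 1) / nRanks;   // ceiling division
  myFirstRow = 1 + rowsPerRank*myRank;
  myLastRow  = std::min(myFirstRow + rowsPerRank, height - 1);
  // The stencil loop then becomes
  //   for (int i = myFirstRow; i < myLastRow; i++) { ... }
  // instead of
  //   for (int i = 1; i < height - 1; i++) { ... }
}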
Well, how do I get the rank and the number of ranks?
Now, this is done using the usual MPI initialization in the main function.
Here, I include the MPI header file,
and I call MPI_Init, to propagate
the command line arguments from process number 0 to all others.
After that, I can query the rank of the current process,
and the total number of processes in the MPI application.
Once again, MPI applications usually
are designed in such a way that from the very beginning,
you are running in parallel.
Each processor is going to execute every line in this block of code.
And the ranks, I use them when I start the stencil calculation.
Here, I pass my rank and the number of ranks to the stencil function.
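In outline, that initialization and hand-off looks roughly like this; myRank, nRanks, and the placeholder comment describe assumptions about names, not necessarily the original code.

#include <mpi.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);                    // initialize MPI; command line arguments become available to all processes

  int myRank, nRanks;
  MPI_Comm_rank(MPI_COMM_WORLD, &myRank);    // numerical ID of this process
  MPI_Comm_size(MPI_COMM_WORLD, &nRanks);    // total number of processes in the application

  // ... read the input image, then call the stencil function with myRank and nRanks ...

  MPI_Finalize();
  return 0;
}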
To make the timing reliable,
I make sure that every process has
reached the point where it's ready to begin the calculation.
I insert a barrier for that purpose.
And also, at the end of the calculation,
to make sure that all processes have finished their calculations,
I put an MPI barrier again.
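Continuing the sketch above inside main(), the timed region might look roughly like this; ApplyStencil and the printed message are placeholders, and printf requires <cstdio>.

  MPI_Barrier(MPI_COMM_WORLD);               // wait until every process is ready to start
  const double t0 = MPI_Wtime();

  ApplyStencil(in, out, width, height, myRank, nRanks);   // each rank processes only its own rows

  MPI_Barrier(MPI_COMM_WORLD);               // wait until every process has finished
  const double t1 = MPI_Wtime();
  if (myRank == 0)                           // only rank 0 reports, to avoid redundant output
    printf("Stencil time: %.3f s\n", t1 - t0);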
As you can see, I had to tweak the code of
the application so that I don't get redundant output: only rank zero prints.
And when I write the output file,
I will only be writing a fraction of the result.
So inside image.cc, I actually tweak the file name
and I prepend to it the number of
the processor that was executing this part of the calculation.
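One possible way to do that, as a sketch with a hypothetical helper (the actual change inside image.cc may look different):

#include <string>

// Hypothetical helper: prefix the output file name with the rank,
// so that each process writes only its own fragment of the image.
std::string RankedFileName(const int myRank, const std::string& baseName) {
  return std::to_string(myRank) + "-" + baseName;
}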
When I compile, I use the MPI wrapper around the Intel C++ compiler,
and this allows me to link my code to the MPI library.
And when I queue the calculation for execution,
I request four nodes,
and their host names will be placed in a special node file
produced by our resource manager.
So I tell mpirun
where to find the list of nodes using the machine file argument.
The results should come out quickly,
and here they are.
As you can see, we are achieving 3.2 trillion operations per second,
which is a factor of three faster than we could do with one processor.
So, four processors give us a speed up by a factor of three.
This is not an ideal speed up,
and this is somewhat expected because,
the problem size is not large enough.
The output files are written out by ranks 0, 1, 2,
and 3, and I have copied these files to my local machine so that I can view them,
and they look like this.
The output of rank 0 contains only the top part of the image,
the output of rank 1 contains the second quarter,
and ranks 2 and 3 contain the third and fourth quarters.
So, each processor knows only a fraction of the total answer, and,
if you zoom in you will be able to see the edges
detected in the image by our stencil application.
Now, for image processing,
it probably doesn't make sense to keep parts of the result on each processor,
because if you want to present the image to the user,
you want to aggregate these results.
And, you know, if we wanted to do that,
then distributed computing would not be a good fit for the task.
Why? Well, that's because the rate at which we are
processing the image is around 100 gigabytes per second, in terms of bandwidth.
If we wanted to aggregate the different parts of the image on one processor,
we would have to go through the network,
and even if we use a specialized high-bandwidth fabric,
such as the Intel Omni-Path fabric,
the bandwidth that we can expect
between different machines is on the order of 10 or 12 gigabytes per second.
So, in memory we are processing the image at 100 gigabytes per second, and then
all of the data that we processed would have to be sent
across the network at 10 gigabytes per second,
an order of magnitude slower.
So, clustering would not pay off in this particular example.
This doesn't mean that stencil operations are not suitable for clustering.
In fact, if the stencil operation is only a part of your workload,
and each piece of the result can be processed further in a pipeline,
then it makes sense to keep the images on independent nodes.
Also, if your stencil is a part of a computational fluid dynamics workflow,
then this is what will happen.
In a computational fluid dynamics workflow,
you will have a similar distribution of work.
Each processor will be responsible for its own part of the simulation domain.
And after each time step,
after each stencil application,
you will have to exchange information with neighbors,
but you will only exchange the boundary cells.
Instead of sending this entire piece of the image,
you will only send a few cells around the boundary,
a few rows of cells around the boundary.
And if you have a large enough problem,
this exchange of boundaries takes a negligible amount of time, compared to processing.
And this is when stencil computation is scalable across the cluster.
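A sketch of such a boundary (halo) exchange, assuming one ghost row on each side and a float image stored row by row (all names here are illustrative, not the actual code):

#include <mpi.h>

// Illustrative halo exchange: swap one boundary row with the neighboring ranks.
// Edge ranks use MPI_PROC_NULL, which turns the corresponding transfer into a no-op.
void ExchangeBoundaries(float* img, const int width,
                        const int myFirstRow, const int myLastRow,
                        const int myRank, const int nRanks) {
  const int up   = (myRank > 0)          ? myRank - 1 : MPI_PROC_NULL;
  const int down = (myRank < nRanks - 1) ? myRank + 1 : MPI_PROC_NULL;

  // Send my first owned row up; receive the ghost row below my last owned row from below.
  MPI_Sendrecv(&img[ myFirstRow      * width], width, MPI_FLOAT, up,   0,
               &img[ myLastRow       * width], width, MPI_FLOAT, down, 0,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
  // Send my last owned row down; receive the ghost row above my first owned row from above.
  MPI_Sendrecv(&img[(myLastRow - 1)  * width], width, MPI_FLOAT, down, 1,
               &img[(myFirstRow - 1) * width], width, MPI_FLOAT, up,   1,
               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}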
And of course, it is a more complicated communication pattern,
and this is a subject for discussion in a more advanced MPI class.
We have taken the stencil code a long way from where we started.
We started at 2.3 billion operations per second,
and we ended up with three trillion operations
per second through vectorization, multi-threading,
improvements in memory access and clustering.
This is how parallelism allows you to unlock
the performance potential of your modern computing system.