SPEAKER: Let us revisit the vector addition example, and now pay attention to how many tasks will be created. If you recall, it looks something like this: forall i from 0 to n minus 1, A[i] = B[i] + C[i]. This is a very elegant way of specifying the parallelism, but we see that it creates n tasks, which can be really inefficient because the array could be millions in size, or even a billion elements, and you may be running on only 8 or 16 cores. You really don't need a million or a billion tasks if you're only using 8 or 16 cores, and in fact creating so many tasks is really going to slow down your program due to the overheads involved. So how do we fix this problem? Well, the answer is something referred to as chunking, or iteration grouping. We make the observation that even though all these iterations can run in parallel, they don't all have to run in parallel with each other. So we can have two levels of loops: a forall with the variable g, which we refer to as the group ID, that goes from 0 to NG minus 1, where NG is the number of groups; and then, within a group, a sequential for loop over i that just goes through the iterations assigned to group g. So you need g, you need NG, and you need the original range, 0 to n minus 1. Once you have this for loop, you can do the computation as before with i. Algorithmically, it's still the same amount of work, but what you've done is tune down the degree of parallelism: now you only have NG tasks. The best value of NG will depend on the program you're running as well as the computer you're running on. It could equal the number of cores, it may be twice the number of cores, but it can be much, much smaller than n, the size of the array. So this is a very simple strategy that uses the fact that we can group together iterations of a forall into groups that execute sequentially, in a way that there's still parallelism across the groups.
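The two-level loop described above can be sketched in Python as follows. This is a minimal sketch, not the lecture's own code: the function name `add_chunked` and the use of a thread pool to play the role of the outer forall are my own choices, and the chunk boundaries implement the block-style assignment of consecutive iterations to a group.

```python
from concurrent.futures import ThreadPoolExecutor

def add_chunked(a, b, ng):
    """Chunked vector addition: NG parallel tasks instead of n."""
    n = len(a)
    out = [0] * n

    def work(g):
        # Inner sequential loop: group g gets the consecutive iterations
        # [g*n//ng, (g+1)*n//ng), which also handles n not divisible by ng.
        lo = g * n // ng
        hi = (g + 1) * n // ng
        for i in range(lo, hi):
            out[i] = a[i] + b[i]

    # Outer "forall" over group IDs: one task per group, run in parallel.
    with ThreadPoolExecutor(max_workers=ng) as ex:
        list(ex.map(work, range(ng)))
    return out
```

The work per element is the same as in the original forall; only the number of tasks changes, from n down to NG.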
Which raises the question: how do we determine the group for a given iteration? Well, if you think of the iterations as points on a line, i = 0, 1, and so on, the most common way of doing the grouping is what's called block grouping, or a block distribution: you take consecutive iterations and map them to the same group. So you take your iteration range and divide it up into chunks. If your loop body has perfectly balanced work, like vector addition, the chunks could be of equal size, but you could also choose to make the chunks different sizes. The first chunk of iterations would go to group g = 0, the next to g = 1, and so on. And the math for doing that division is pretty simple and can be buried in a helper function that computes the group for a block distribution. Now, there could be other approaches. Though block works well in practice, if the load is imbalanced, particularly if there's more work to be done earlier or later in the iteration space, then after the grouping you may get an uneven workload across the tasks, and some tasks might be idle waiting for others to complete at the end of the forall. So another distribution, or way of grouping, that's used quite often in practice is called cyclic. Let me try and explain how this works. It's kind of like how a dealer might deal out cards among the groups. We take the iteration space again and pick iterations for group g = 0 with a stride equal to the number of groups: i = 0, i = NG, and so on. Then group g = 1 gets i = 1, i = NG + 1, and so on. In fact, you can see that the group ID here is simply i mod NG, and that's how you get the group for a given iteration i. So what we've seen is that the forall construct is a very elegant and convenient way of specifying parallelism, but there is an important practical consideration, which is that you want to reduce the number of tasks created.
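The cyclic distribution can be sketched the same way. Again this is my own illustration, not the lecture's code: `add_cyclic` deals iterations out with a stride of NG, and `my_group_cyclic` is a hypothetical helper showing the i mod NG rule for recovering an iteration's group ID.

```python
from concurrent.futures import ThreadPoolExecutor

def add_cyclic(a, b, ng):
    """Cyclic distribution: group g handles every iteration i with i % ng == g."""
    n = len(a)
    out = [0] * n

    def work(g):
        # Deal iterations out like cards: g, g + ng, g + 2*ng, ...
        for i in range(g, n, ng):
            out[i] = a[i] + b[i]

    with ThreadPoolExecutor(max_workers=ng) as ex:
        list(ex.map(work, range(ng)))
    return out

def my_group_cyclic(i, ng):
    # The group ID of iteration i under a cyclic distribution is simply i mod NG.
    return i % ng
```

Interleaving the iterations this way spreads an uneven workload across the groups, at the possible cost of poorer locality, since each group no longer touches a contiguous block of the arrays.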
The way to do that is grouping, or chunking: grouping together iterations. And we have a very general schema for how to do that grouping, where you have an outer forall that goes across groups with the group ID g, and then a sequential for loop that goes through the iterations in the group. The iterations in a group could be consecutive, as in a block distribution, or strided, as in a cyclic distribution. Either way, you will get through all the iterations and execute the parallel loop with a smaller number of tasks. So now you're well on your way to using loop parallelism in a very practical way on the real multicore processors we have available to us today.