Hi! And welcome to this new lesson. During the last one we saw an example where unrolling by a factor of 2 provided a straight 2x reduction in the latency of the loop, at the cost of 2x extra resources for its implementation. This was sort of an ideal case; in fact, it might not always be possible to achieve such an ideal latency improvement. When performing loop optimizations, there are two potential issues that need to be considered:
- Constraints on the number of available memory ports and available hardware resources
- Loop-carried dependencies

To understand these potential issues, let's consider some more examples. What loop latency would you expect by unrolling the loop 4 times instead of 2? Well, if we assume that the loop body latency remains constant at 10 cycles and the number of loop iterations reduces by a factor of 4, we would get a latency of 10 x 256, which is equal to 2560, which is 4 times smaller than the latency of the original loop. Well, this is not totally correct. Indeed, by looking at the synthesis report, we can see that the iteration latency of the loop is now 11 cycles, and not 10 as expected! Hence our loop latency is 2816 cycles instead of 2560. But why do we get this extra cycle, considering that all the loop iterations should be able to be performed in parallel? The issue resides in the load and store operations. As we can see from the analysis report, two read operations on array local_a are scheduled in the first cycle of the loop, while the remaining two read operations are scheduled in the next cycle. The same happens for the read operations on array local_b and for the write operations on array local_res. The reason is that local arrays are stored on BRAM resources on the FPGA, and each BRAM provides up to two memory ports that can be used to perform read and write operations.
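As a reference, here is a minimal C sketch of the kind of vector-sum kernel discussed above. The array names local_a, local_b, and local_res come from the lesson; the function name, the array size of 1024, and the loop structure are assumptions. The UNROLL pragma is standard Vivado HLS syntax; with each local array mapped to a dual-port BRAM, only 2 of the 4 reads per array can be scheduled in the same cycle, which is what produces the extra cycle in the iteration latency.

```c
#define N 1024

// Hypothetical vector-sum kernel, loosely modeled on the lesson's example.
// With factor=4, Vivado HLS replicates the loop body 4 times per iteration,
// but each dual-port BRAM can only serve 2 reads per cycle, so the other
// 2 reads on each array slip to the following cycle.
void vec_sum(const float local_a[N], const float local_b[N],
             float local_res[N]) {
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=4
        local_res[i] = local_a[i] + local_b[i];
    }
}
```

When compiled as plain C, the pragma is simply ignored; it only takes effect during HLS synthesis.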
In other words, it is not possible to load more than 2 elements per cycle from the same BRAM, and hence Vivado HLS schedules the extra load operations in different cycles. This delays 2 out of the 4 floating-point additions by one cycle, and the final loop iteration latency becomes 11 cycles. One way to overcome this issue is to store different elements of the arrays in different local memories, in order to increase the number of elements that can be read in parallel. However, this optimization will be discussed when dealing with the array partitioning optimization. So, for now, just remember that we have to pay these extra cycles.

So far, we have considered an embarrassingly parallel code, where all the iterations of the loop are completely independent of each other. Let's now consider a different code, namely a simple kernel that performs the dot product of two vectors, also known as the scalar product. By looking at "prod_loop", which performs the actual scalar product, we can see that each iteration of the loop depends on the previous one. Indeed, the variable called "product" holds the accumulation of the products of the components of the two vectors local_a and local_b. In other words, at iteration i the variable adds the product of the two vector components to its previous value. Looking at this annotated analysis report, we can see that we have a loop-carried dependency: the result of the floating-point adder (FADD) is sent to a multiplexer that provides the input value to the same floating-point adder in the next iteration of the loop. The multiplexer is needed because we have to distinguish between the first iteration of the loop, in which the value of the accumulation is 0, and the following iterations, in which the value of the accumulation is the one from the previous loop iteration.
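The dot-product kernel described above might look like this minimal sketch. The loop label prod_loop, the names local_a, local_b, and product, and the trip count of 1024 come from the lesson; the rest is an assumption.

```c
#define N 1024

// Sketch of the lesson's dot-product kernel. The update of "product"
// creates a loop-carried dependency: iteration i cannot start its
// floating-point addition before iteration i-1 has produced its result.
float dot_product(const float local_a[N], const float local_b[N]) {
    float product = 0.0f;
prod_loop:
    for (int i = 0; i < N; i++) {
        product += local_a[i] * local_b[i];
    }
    return product;
}
```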
Notice that the loop also contains the read operations that get the values from the two vectors, and a floating-point multiplier that computes the product of the vector components, which is provided as input to the floating-point adder. Overall, the iteration latency of the loop is 13 and the loop is iterated 1024 times, which yields a total loop latency of 13312 cycles. What iteration latency and trip count would you expect by unrolling the loop by a factor of 2? Well, let's try it out! The trip count has halved as expected, but the iteration latency has increased from 13 to 20 cycles. Overall, the loop latency is now 10240 cycles, which is only 23% less than the original 13312 cycles. This is far from the 50% reduction that we achieved for our vector sum example.

To understand why this is the case, let's look at an annotated analysis report side by side with a manually unrolled version of the code. After the unrolling, we get an extra floating-point addition and an extra floating-point multiplication to schedule. Nevertheless, the two floating-point additions cannot be scheduled in parallel, since the second addition requires the result from the first one. Overall, Vivado HLS manages to hide the execution time of the second floating-point multiplication; however, the length of the loop-carried dependency cycle has increased by one floating-point addition, which accounts for the 7 extra cycles in our loop iteration latency. In this situation, the loop-carried dependency severely limits the performance that we can extract from our loop simply by applying the unrolling optimization. Indeed, even if we unrolled the loop 4 times, we would get a loop-carried dependency path consisting of 4 floating-point additions, and the loop iteration latency would increase accordingly.
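A manually unrolled (factor 2) version of prod_loop makes the dependency chain explicit. This is a sketch of what the tool effectively schedules; the function name is hypothetical.

```c
#define N 1024

// Hand-unrolled dot product, factor 2. The second addition consumes the
// result of the first, so the two FADDs sit on the same dependency chain
// and cannot execute in parallel; only the second multiplication can be
// overlapped with other work.
float dot_product_unroll2(const float local_a[N], const float local_b[N]) {
    float product = 0.0f;
    for (int i = 0; i < N; i += 2) {
        product = product + local_a[i] * local_b[i];         // FADD #1
        product = product + local_a[i + 1] * local_b[i + 1]; // FADD #2 waits for #1
    }
    return product;
}
```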
For this particular example, one possible solution to reduce the impact of loop-carried dependencies is to manually unroll the code and perform the additions as a tree reduction, in order to expose more instruction-level parallelism.
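The tree-reduction idea can be sketched as follows, here with a manual unroll by 4 (the function and variable names are hypothetical). The partial sums s0 and s1 are independent, so their additions can be scheduled in parallel, and only the final accumulation into product stays on the loop-carried path, instead of a chain of 4 additions. Note that reassociating floating-point additions can change the result slightly compared to the sequential loop.

```c
#define N 1024

// Hand-unrolled dot product, factor 4, with the additions reorganized as
// a tree reduction. s0 and s1 have no dependency on each other, so their
// two FADDs can run concurrently; the loop-carried path is reduced to the
// single accumulation "product += ...".
float dot_product_tree4(const float local_a[N], const float local_b[N]) {
    float product = 0.0f;
    for (int i = 0; i < N; i += 4) {
        float s0 = local_a[i] * local_b[i] + local_a[i + 1] * local_b[i + 1];
        float s1 = local_a[i + 2] * local_b[i + 2] + local_a[i + 3] * local_b[i + 3];
        product += s0 + s1;
    }
    return product;
}
```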