Hello. What I want to talk about now is how to speed up execution time if you have a worst-case execution time that's just taking too long. In addition to the compiler optimizations and vector instructions that we just looked at in the previous segment, and as we discussed in Course 1 and Course 2, you can certainly use concurrent and parallel programming methods: shared-memory multiprocessing. You probably don't want to use more exotic distributed-memory multiprocessing like cluster computing, but you can also use distributed AMP scaling, which we've talked about before. Right now, I just want to remind you of basic concurrent and parallel programming principles for speed-up. This fits into the current course because some of the architectures are useful for highly available, highly reliable systems, and it's also useful when you just need to speed up your execution to close timing for feasibility and to create margin for safety, which is very important.

Let's quickly review parallel systems. One of the best ways to orient ourselves is with Flynn's taxonomy for parallel systems. You've got SISD, single instruction, single data; MISD, multiple instruction, single data; SIMD, single instruction, multiple data; and MIMD, multiple instruction, multiple data. Let's look at examples. SISD is the vanilla, boring, traditional uniprocessor with no vector instructions. These are even hard to find today, but you might find one on a really simple MCU; some of the simpler microcontroller series from Texas Instruments probably fit that description. That may be all you need for a cyclic executive in an anti-lock braking solution.

SIMD is ideal for large bitwise integer and floating-point vector math. Our Raspberry Pis are definitely SIMD capable: they can execute NEON instructions, which are SIMD, and so can almost every Intel system with SSE. These are instructions that can take very long data words, maybe four or eight words at a time, and process them with operations like multiply-accumulate and XOR, all interesting instructions. We'll see how that can pay off here. We actually saw it in the previous segment, even just using compiler-generated vector instructions on a single core, and it can certainly pay off multi-core as well. Not only is the Raspberry Pi SIMD, it really is MIMD, because it's multi-core and each core can execute NEON vector instructions.

The Jetson is SPMD, single program, multiple data. You can have a single program running on the Jetson system, and it can generate workload for the 128 synergistic coprocessors in the array it has for the GPGPU. That's called SPMD, single program, multiple data. But it's also MPMD, in that it has four cores, and each of them can generate workload for that 128-core coprocessor array. So it really is MPMD.

The only thing we haven't covered is MISD. A good example of MISD would be TI Hercules. TI Hercules can have multiple cores running in lockstep on the same data. This is actually fairly relevant to our current course on high-availability, high-reliability mission-critical systems, because this processor is very often used for mission-critical systems, where the potential for corruption or a glitch can be caught by two processors processing the same data: if they disagree, we know something is seriously wrong. At that point, what we do has to be designed as a recovery scenario or a fail-safe.
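To make the SIMD payoff concrete before we go on, here is a minimal sketch, not the course code, of the kind of loop a compiler can turn into NEON (or SSE) instructions just from optimization flags. The function name and the exact flags are illustrative and vary by toolchain and target; on a 32-bit Raspberry Pi toolchain it would be something like gcc -O3 -mfpu=neon -ffast-math, where -ffast-math is what lets GCC vectorize a floating-point reduction like this one.

```c
#include <stddef.h>

/* Multiply-accumulate over two float arrays. With -O3 and vector
 * instructions enabled, the compiler can process several elements per
 * instruction instead of one, which is the SIMD payoff described above.
 * No source changes are required; only compiler flags. */
float dot_product(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];   /* candidate for a fused multiply-accumulate */
    return sum;
}
```

Now, back to the MISD lockstep case and what happens when the cores disagree.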
We might just fail safe if we have two processor cores that disagree, because we might not know which one is right. But we might also have some way of trying to determine which one is right and recovering. We'll talk about that in the HA/HR segment coming up to complete this course soon. The other interesting thing, which I mentioned in Course 1 and Course 2, is the BeagleBone, which we used for this course for years prior to the Raspberry Pi and the Jetson. It has a new board out called the BeagleBone AI. It has two Arm A-series cores, so it's multi-core, and they can issue NEON vector instructions, so it's really MIMD. Then it has two M-series processors for low-level I/O processing, and two Texas Instruments VLIW DSP cores. That's really capable SIMD with the VLIW, and since there are two DSP cores, it's MIMD there as well. There are some really exotic hybrid architectures out there that are fascinating when you need speed-up these days.

So let's go back to sharpen, which I showed speed-up for in the first two courses, but now we're going to look at it with an eye toward the methods for applying parallel programming for speed-up. We're going to compare it to erast, the Sieve of Eratosthenes, a simple algorithm. It's maybe not the best algorithm, but we just want to see how fast we can make it run. It finds the prime numbers between two values, and sequentially it's a really simple program. It looks like it should be easy to speed up. The problem is that it has test-and-set operations on a bitwise matrix, where you need to test whether a number is still prime and then set or clear a bit. So you have readers and writers of this array, and when you have multiple readers and writers, as you learned in OS, you need to protect that shared memory, so I put locks in; I'll sketch what that inner loop looks like at the end of this section. The question is: is that necessary? It's unlikely that there will be corruption, because these are single bits after all, not complicated state information, even though they're shared. But could it cause us to come up with the wrong number of primes? What I'm going to show you is that with locks, not only do I get no speed-up, I actually slow down and consume more CPU. That's why I like this example. It would be interesting to see if we could make it lockless, or modify the algorithm so it could be lockless. Sometimes you'll hear people say they have a lockless algorithm for fill-in-the-blank, and this is what they're getting at: the locks can slow you down quite a bit, so if you can come up with a lockless algorithm, that's better.

Speed-up is, what? At best, linear, right? Can it be better than linear? No, not according to Amdahl's Law, not unless we break the law. And can it be worse? Absolutely, as we'll see with erast. Let's talk about what we're going to see with both erast and sharpen before we go look at it, and then we'll come back to the notes. We've looked at compiler optimization; we know about level three, -O3. We know about vector instructions; those helped. Compiler optimization is usually a pretty large win, something like 3x for -O3. How much vector instructions speed you up depends on what you have, but it's at least a fractional speed-up: in other words, the slide should say 1.f times, not 0.f times. It won't slow you down, because it just generates better instructions. I'm sorry for that slide bug; I'll fix it in the slides I post. It can never hurt to turn on SIMD, but it might not help.
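Here is that sieve inner-loop sketch. To be clear, this is not the course's erast source, just an assumed shape of it: the array name, the helper, and the single global mutex are illustrative. It shows where the shared readers-and-writers problem comes from and exactly what the lock is protecting, which is what you would remove (and then validate against the known prime count) in a lockless experiment.

```c
#include <pthread.h>
#include <stdint.h>

#define MAX_N     100000000ULL   /* candidate range: primes below 100 million */
#define WORD_BITS 32ULL

/* Packed bit array: bit n == 1 means n is still considered prime.
 * Initialize to all ones (and clear bits 0 and 1) before sieving. */
static uint32_t prime_bits[MAX_N / WORD_BITS + 1];
static pthread_mutex_t bit_lock = PTHREAD_MUTEX_INITIALIZER;

/* Clear the "prime" bit for n. The mutex protects the word-sized
 * read-modify-write shared by all threads; dropping it is the lockless
 * experiment, validated against the known count of 5,761,455 primes. */
static void mark_composite(uint64_t n)
{
    pthread_mutex_lock(&bit_lock);
    prime_bits[n / WORD_BITS] &= ~(1U << (n % WORD_BITS));
    pthread_mutex_unlock(&bit_lock);
}

/* One thread's share of the work: clear all multiples of each factor in
 * [first, last]. Factor ranges are disjoint, but the bit array is shared. */
void sieve_factors(uint64_t first, uint64_t last)
{
    for (uint64_t p = (first < 2 ? 2 : first); p <= last; p++)
        for (uint64_t m = p * p; m < MAX_N; m += p)
            mark_composite(m);
}
```

Because every call to mark_composite takes and releases the mutex, the lock and the kernel time it drags in can easily dominate the one-instruction bit clear it protects, which is consistent with the slowdown we are about to measure.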
Using multiple cores almost always helps, unless you have a situation like erast. Here's the deal: sharpen is embarrassingly parallel, and erast is not. Erast has data dependencies and shared memory, and that's going to make it hard to parallelize efficiently. Sharpen, on the other hand, we can grid across threads, and the threads have nothing to do with each other even though they share the same memory, so we can just let them rip; there's a sketch of that gridding at the end of this section. We can combine numbers one, two, and three to get significant speed-up. We got 3.2x, and you can see that I was able to saturate the cores here, and then I got the most dramatic speed-up with MPMD with CUDA on my Jetson, 70x. That's not hard to believe: I have 128 coprocessor cores, so I could probably get it even faster if I worked harder on my CUDA code.

Actually, let's see. We'll come back and talk about this, but let's go ahead and do some experiments on the system now. Let's look at erast first. If I run erast just with the threads — excuse me, I was going to run the simple one, but I ran the threaded one first. If I go over and look at top, what you see is that it goes from sequential to parallel there: you just saw this burst, boom, all the threads got launched, even though we have locks. Now the question is, how much is the lock overhead? We see the utilization declining here; that could be because of the locks. Then we see it drop off a lot, and it's done. Well, it got done. Basically, it did four million numbers per second. Let's time it. I put some timestamps in there, but it looks like they're not working the way I expected; I'll fix that in the code as well. You can use the Linux built-in time command if your timestamps aren't functioning correctly. It says "Start test" at the beginning, started the test at zero, and then that second line is supposed to say "Stop test"; that's the only bug there. It took 24 seconds, and sure enough, the Linux time agrees, so that's what's going on.

Now let's try just regular erast, single-threaded — keep in mind, wrong one, I keep getting the names mixed up; maybe they should be called erast_grid for the threaded one and erast_seq for the sequential one. The sequential one starts at 0.01 and it only took 8.5 seconds compared to 24, so the threaded version is a significant slowdown. Let's use time as well and see if we can figure out what's going on. Note this: 4.3 seconds of kernel time in the threaded run. Those locks are expensive. Here, sequentially, we have 0.7. The culprit is the locks; maybe we can just take them out. I'm going to let you experiment with that, and you can decide whether we still get the right number of primes. Keep in mind that we can look this up: there are websites that will tell you how many primes there are between zero and 100 million, which I believe is what we're running here, so you can check whether it's right. The single-threaded version is going to be right: 5,761,455. You can get it to print the primes as well, but I don't think you want to do that.

Now let's run the threaded version of sharpen. Before we leave this, let me just show you that I was running with vector instructions and -O3 there; I'm doing that over here too, so I won't bother showing it again. If I do a make here — I've already made it — I'm going to run sharpen grid. There we go, and I'm going to time it too. What we see is that this is blazingly fast, right? Let me do that for you again. What is that, 2.27 seconds? We're doing 12-megapixel frames. That's fast. It's only 0.6 frames per second, but these are 12-megapixel frames; these are ridiculously high-resolution frames.
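Here is the gridding sketch I mentioned. It's not the course's sharpen_grid source; the 3x3 PSF weights, frame dimensions, and thread count are illustrative assumptions. The point it illustrates is why this workload is embarrassingly parallel: every thread reads the shared input frame, but each one writes a disjoint band of output rows, so no locks are needed anywhere.

```c
#include <pthread.h>
#include <stdint.h>

#define WIDTH    4000          /* illustrative ~12 megapixel frame */
#define HEIGHT   3000
#define NTHREADS 4             /* one band per Raspberry Pi core */

static uint8_t in_img[HEIGHT][WIDTH];
static uint8_t out_img[HEIGHT][WIDTH];

/* Simple 3x3 sharpen point-spread function (illustrative weights). */
static const float K[3][3] = {
    { -1, -1, -1 },
    { -1,  9, -1 },
    { -1, -1, -1 },
};

typedef struct { int row_start, row_end; } band_t;

/* Each thread sharpens its own band of rows: shared reads of in_img,
 * disjoint writes to out_img, so no mutual exclusion is required. */
static void *sharpen_band(void *arg)
{
    band_t *b = arg;
    for (int y = b->row_start; y < b->row_end; y++) {
        if (y == 0 || y == HEIGHT - 1) continue;   /* skip borders */
        for (int x = 1; x < WIDTH - 1; x++) {
            float acc = 0.0f;
            for (int i = -1; i <= 1; i++)
                for (int j = -1; j <= 1; j++)
                    acc += K[i + 1][j + 1] * in_img[y + i][x + j];
            out_img[y][x] = acc < 0 ? 0 : (acc > 255 ? 255 : (uint8_t)acc);
        }
    }
    return NULL;
}

/* Launch one thread per row band and wait for all of them. */
int sharpen_grid(void)
{
    pthread_t tid[NTHREADS];
    band_t band[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        band[t].row_start = t * (HEIGHT / NTHREADS);
        band[t].row_end   = (t + 1) * (HEIGHT / NTHREADS);
        pthread_create(&tid[t], NULL, sharpen_band, &band[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}
```

Because there is no shared-write state, the only sequential pieces left are reading the frame in, spawning and joining the threads, and writing the result out, which is why this workload scales close to the core count.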
This is just a Raspberry Pi; we could definitely get this to go faster on the Jetson, where, as you know, we saw the 70x speedup. By comparison, sharpen completed in 1.87 seconds by its own timestamps, and basically the time output agreed with that: the real time was 2.27 seconds. The user time looks larger than the real time, which seems odd at first, but remember that with multiple threads the user time is summed across all cores, so it can exceed the wall-clock time. I like putting my own timestamps in. Let's do this sharpen without any threading, just to remind you. That takes us, at best, about 2.6 seconds per frame, compared to a much faster rate for the grid version over the same number of frames. It took about seven seconds, as opposed to just a few seconds up here: completed the test at 1.8 versus seven seconds. Sometimes threading pays off, sometimes it doesn't. That's the moral of this story.

Let's go back to the slides and talk about that. Here we see that the coprocessors definitely pay off; coprocessors like FPGAs and GPUs can pay off very well. Threading can pay off when things are embarrassingly parallel, but when they're not, it can actually cause a regression. Why is that? Well, it's Amdahl's Law. By Amdahl's Law, the best I can do with four cores is a 4x speed-up, and I got 3.2x. With the 128 coprocessor cores, in theory I could get a 128x speed-up; I'm nowhere near linear, but at least I'm getting 70x. This is a plot of Amdahl's Law, where the top curve is the maximum speed-up with an infinite number of cores. Thirty-two cores is down here, so 128 is close enough to infinite that I've just compared against that line, but we could plot 128 specifically.

What is Amdahl's Law? It's a really simple equation: max speed-up = 1 / ((1 - P) + P/S), where P is the parallel portion and S is the number of processors. Not the sequential portion, don't fall for that: it's the parallel portion divided by S, the time for that part divided by the number of processors, and 1 - P is the sequential portion. P is really a fraction: the parallel portion might be 80 percent, the sequential portion 20 percent, and 1 minus 0.8 gives you that 20 percent. The infinite-processor curve just says S is very large, so P/S goes to zero and the speed-up is bounded by 1 / (1 - P); there's a quick worked example at the end of this section. The point is that you've got to be careful and think about how much of your code you can actually make parallel. I'll let you in on a little secret: I already spotted an optimization for sharpen grid, by the way. I'll let you look for it, but with some pretty simple modifications I could move more of it into the parallel section and leave less sequential.

Embarrassingly parallel is good. Image processing often is embarrassingly parallel, but not always; sometimes we have algorithms that depend on processing from previous frames or from other parts of a frame, and so forth. Things like searching often have data dependencies, like the prime number hunting, and that makes them harder to speed up; that's one reason things like quantum computers are interesting. Overall, we've looked at this many times from the real-time viewpoint, but now let's look at it a little differently, from the speed-up viewpoint. What I really want to do is saturate my resources: I want to get closer to being CPU bound, I/O bound, and memory bound if my point is to go fast. In that sense, speeding things up is a little bit at odds with having margin, so coprocessors are a nice solution.
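Here is that quick worked example of the Amdahl's Law bound. The 90 percent figure is purely illustrative, not measured from these demos:

```latex
S_{\max} = \frac{1}{(1 - P) + \dfrac{P}{S}}
% Illustrative numbers, assuming P = 0.90 of the runtime is parallelizable:
%   S = 4 cores:    1 / (0.10 + 0.90/4) = 1 / 0.325 \approx 3.1x
%   S \to \infty:   1 / (1 - P) = 1 / 0.10 = 10x
```

Even with an infinite number of cores, that remaining 10 percent of sequential code caps you at 10x, which is why shrinking the sequential section can matter more than adding cores.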
Go ahead and saturate the coprocessors and give the real-time processors much more margin. On the Spitzer Space Telescope, that's exactly what we did. We had an instrumentation processor, and the whole point was to be up here, as high as we could in this box, up in this range over here, where the goal is to saturate that CPU: 95 percent on this axis, 95 percent here, and 95 percent on this axis. That's basically what the instrumentation processor was. The spacecraft processor, on the other hand, is down here, because we want it to have high margin. We don't want to lose the whole spacecraft, but we want to get as much data as we can while we're out there. The three-space view is interesting: for HRT, we often want to be here, and for a coprocessor, the whole point is to get up here and saturate those coprocessing cores, getting as much as we can out of MIMD architectures or an FPGA. We always want to ask ourselves: what's our CPU margin? What's our I/O latency and bandwidth? What's our memory capacity? If we're in the upper-right front corner, that's bad for HRT, but it's good for a coprocessor. I'm just trying to emphasize that being CPU plus I/O plus memory bound is a bad day for a real-time system, for HRT, and a good day for high-performance computing. That's interesting. On that note, we'll end, and I'll let you explore those two programs and see what you can do with them. Play around with them, and if you've got a Jetson, I've got the GPU code as well; I actually posted that in Course 1 and 2 for those of you who are running the Jetson, but I can dig that up and post it again. Well, thank you very much.