Previously we talked about the problem of single event upsets, solar wind and trapped charged particle radiation and so forth, and why we need error correcting code memory. We want to now talk about the theory of operation and use of ECC memory that meets the SECDED design goals: single error correction, double error detection. We talked before about the Earth's geomagnetic field, and the fact that if you're, say, operating in a satellite or aircraft at altitude, you have more proton and electron flux, so you have a higher probability of an SEU. We talked about how an SBE is likely, especially at altitude, and even possible on the surface of the earth, [inaudible]. So enterprise systems often include ECC, but a DBE, a double-bit error, or multi-bit errors are less likely, and I gave you the chain-link fence analogy with golf balls: the probability of hitting the same link in the fence twice is lower, especially if you're able to repair the link before it gets hit again. Let's talk about that for a moment. The idea was, the probability is lower for the charged particles to hit the same memory location twice, and if we can detect that there's an error and correct it before a second particle interacts with the same byte or word that's protected, then we basically have a very reliable system. We're certainly single fault tolerant: for a single fault we can correct, and for a double fault we can fail safe. For the single fault, since we can correct: we have memory out here, and we have, say, a word, which I'll draw as a location in memory, and we have a bit flip here, so a one becomes a zero. When we read memory, we go through the ECC block before we go to the CPU. If a bit error is detected, the ECC will interrupt the CPU with an SBE interrupt, and the CPU at that point, it depends on exactly how the ECC works, but the CPU will get the corrected data, so it'll get the one that was originally there before it was flipped to a zero. It will also get the interrupt. What it can do is take that corrected data and write it back through the ECC, and that gets written back to this location, and in fact the original one gets restored. We correct on read, and we write back the correct value. Now, you could imagine that the ECC, when it corrects on the read through the error correction code digital logic, could immediately just write this back itself instead of relying on software to do that. Whether it does that or not is the design of the ECC; some will, some won't. But the basic idea is that with the Hamming code that I'll show you, we can always detect and correct a single-bit error, and we can either do that by writing it back through the ECC, or the ECC can automatically correct it. Either way, it'll let the CPU know with an interrupt. This could be maskable. We wouldn't want to mask it if we need to write it back to correct it. But if the ECC just automatically corrects it, then we don't necessarily need to know unless we just want to do something like bump a counter to count the number of SEUs we get. It turns out, if you think about it, if you go from top to bottom of memory and read all of your memory more frequently, the more often you find SBEs before they turn into DBEs. In other words, you would just go through and scrub the memory.
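As a rough illustration of that software write-back, here is a minimal sketch assuming a hypothetical memory-mapped ECC controller that latches the address of the failing word in a register; the register name, address, and counter are invented for illustration and would differ on any real part.

    /* Minimal sketch of an SBE interrupt handler (hypothetical ECC controller).
     * ECC_SBE_ADDR is an assumed, invented register that holds the address of
     * the word where the single-bit error was detected. Reading that word goes
     * through the ECC logic and returns corrected data; writing it back
     * re-encodes the word and restores the flipped bit in the memory array. */
    #include <stdint.h>

    #define ECC_SBE_ADDR (*(volatile uint32_t *)(uintptr_t)0x40001000u) /* assumed */

    static volatile uint32_t g_sbe_count; /* count SEUs for statistics */

    void ecc_sbe_isr(void)
    {
        volatile uint32_t *word = (volatile uint32_t *)(uintptr_t)ECC_SBE_ADDR;
        uint32_t corrected = *word; /* read is corrected on the fly by the ECC */
        *word = corrected;          /* write-back scrubs the stored copy */
        g_sbe_count++;              /* bump a counter, as discussed */
    }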
So we want a scrubber if we have time for it, which would be perfect for something like a slack stealing service or a best effort service that runs in the background and just goes through and reads memory. That wouldn't cause any harm; it's just a little bit of housecleaning. It prevents the SBEs from getting a second hit in the same location, which is a low probability anyway. You're more likely to get a hit in a new location, like up here, where a one becomes a zero, and not in exactly the same location. But if you just leave the errant data there, uncorrected, then of course there's more chance that it will become a double-bit error in the same word, where this zero will become a one as well. The correction occurs on the read. The ECC doesn't walk memory automatically; it normally is driven by the CPU making read requests. Therefore, it makes sense to have a scrubber task. Now, I'm sure it's possible that we could make an ECC that just walks memory, but that's probably something we don't want to do; we probably want the memory reads to be under the control of the CPU. Let's summarize what we want out of this, the theory of operation for error correcting code memory. We want SECDED. Let's talk more specifically about what we mean by that. We want error detecting and correcting memory, which typically uses extended parity encoding of the data. Simple parity, I'm sure you're familiar with. We have odd or even parity. We just count the number of ones; if it's even, and our policy is even parity, we set a one for "yes, we have even parity." If the count of ones is not even, we would not set it, meaning that it was odd for our even parity scheme. We could also do the same thing, but with odd parity. Hamming and SECDED go way beyond that. It's derived from encoding that was first suggested by Hamming and requires additional bits for every word in memory. Typically, you can implement a perfectly good SECDED code by having one extra byte lane for every 32-bit word, so 40 bits for every 32 bits, and that's very common. So it's basically 25 percent overhead. Of course, you have a much more reliable system, especially if you're operating in an environment where you might have bit flips or single event upsets. It detects and corrects a single-bit error; that's the SEC, single error correction. The DED detects, but does not correct, double-bit errors. It really can't be trusted for anything beyond a double-bit error. The double-bit error, as we said, is going to fail safe, or we're going to do something like reboot and boot a completely different, redundant computer. The single-bit error is going to correct. We really don't know what's going to happen beyond a double-bit error; we really can't trust that. The single-bit error is probable, I guess, is what I would say; in other words, we're likely to see SBEs. The double-bit error is unlikely, and we would fail safe to recover from that, and anything beyond that is almost never. As I think we've already discussed, this would be a single fault, this would be a double fault, and this would be a triple fault. So it meets our general design criteria for highly reliable, highly available systems that we're going to talk about for memory. This helps us quite a bit in terms of coming up with a mission critical system design, as well as a real-time predictable response design. There is a special set of bits, beyond the extra parity bits that I've mentioned (the 32 data bits plus the parity bits): bits that are more like register bits, computed from the parity bits which are distributed in with the data.
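Here is a minimal sketch of the kind of scrubber task being described, under the assumption that reads pass through the ECC (or that the SBE handler above writes corrected data back); the region arguments and the task hookup are placeholders, and on a real system this would run as a low-priority, best-effort or slack-stealing service.

    /* Minimal scrubber sketch: touch every word in a region so that any
     * latent single-bit error is detected and corrected on the read path
     * before a second hit can turn it into a double-bit error. */
    #include <stddef.h>
    #include <stdint.h>

    void scrub_region(volatile const uint32_t *base, size_t words)
    {
        for (size_t i = 0; i < words; i++) {
            (void)base[i]; /* volatile read forces a real memory access */
        }
    }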
Every time you read the data, the syndrome is recomputed, and if a bit has flipped, it's going to give me the position of the bit that flipped. What I need to know is which case I have. Do I have an SBE? In that case, I'm also going to have a position so I can correct the bit, flip it back, essentially. Or do I have a DBE? In which case I need to fail safe: basically, stop what I'm doing and allow the hardware watchdog to reset me and come back up, say, on a second duplicate computer that hasn't been affected, a failover as it's called. We shouldn't really ever get beyond the double-bit error because of probability and the fact that we're scrubbing memory, so we're not going to allow SBEs to become double-bit errors, and the only way we're going to get a double-bit error is to be very unlucky, with a second hit in a very small area: an area the size of the flip-flop gate array or the SRAM cell that has one word encoded. Depending on the exact device physics of the devices we use, of course, this will vary, but it's going to be low probability, so beyond that would be a triple fault. We're just going to say that should never happen, and if it does, we probably have lost our system. But it's like anything in life: a triple fault is something that we normally just say will result in failure, but we need to be able to recover from a single fault because that actually is likely to happen. We need to be able to fail safe and recover over a longer period of time, or maybe with intervention, in the case of a double fault, which is highly unlikely but still not out of the realm of possibility. The syndrome allows us to correct, and the SECDED encoding allows us to fail safe. We're going to do scrubbing to try to lower the probability of SBEs becoming DBEs because the error just sits there. That's the basic theory of operation. The software interface to the hardware, first of all, goes through the ECC as I've drawn; it automatically corrects SBEs on read and implements SECDED. This may not correct the actual memory location, so the software may need to write back, as we've discussed; it depends on the design of the ECC. It typically raises an interrupt on an SBE; maybe you just want to count, plus-plus, how many you're getting for stats, even if it does automatically correct it. It allows the firmware to correct the location with a write-back, if that's necessary. The location is stored in a status register or a log for the counter. It raises an exception, a non-maskable interrupt, on a double-bit error, and the system halts; the reason for that is the very next thing you do might cause you to lose your entire spacecraft or aircraft, because you would be executing bad code or code with bad data. Then what? Well, really, probably what we're going to do is fail safe, as we've said, and let the hardware watchdog bail us out by rebooting this computer and coming back up through a redundant computer. Each memory in the hierarchy must have extended parity for real reliability, or at least simple parity. The main working memory is large, it's going to be megabytes or gigabytes, so it's much more likely for us to get an SEU in the main working memory. The chances of getting an SEU in a register or cache are lower, because it's smaller. We might just have simple parity there and take our chances. If there's a single bit flip, at least we'll detect it, and we can reload the cache because the cache isn't our original copy. In fact, we just mark it invalid and it'll get reloaded, and we won't have a problem.
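For the simple parity scheme mentioned for caches and registers, a minimal sketch might look like the following, using the common convention where the stored bit makes the total number of ones even; the function names are just illustrative, and this only detects an odd number of flips, it cannot locate or correct anything.

    /* Even parity over a 32-bit word: the parity bit is chosen so that the
     * total number of ones (data plus parity) is even. On read-back, a
     * mismatch means an odd number of bits flipped somewhere in the word. */
    #include <stdbool.h>
    #include <stdint.h>

    static uint32_t even_parity32(uint32_t w)
    {
        w ^= w >> 16; /* fold the word down to a single parity bit */
        w ^= w >> 8;
        w ^= w >> 4;
        w ^= w >> 2;
        w ^= w >> 1;
        return w & 1u; /* 1 if the data has an odd number of ones */
    }

    static bool parity_ok(uint32_t data, uint32_t stored_parity_bit)
    {
        return even_parity32(data) == stored_parity_bit;
    }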
Simple parity actually works pretty well for cache, and we're also accessing cache very frequently, so that helps a lot too. That's what I've seen used as a solution for cache and registers: just a simple parity bit. We're going to get through the Hamming encoding theory, but I think to really understand it, what we have to do is work some examples of writes and read after write: read-after-write scenarios for each of the possible errors we can get. This is actually a diagram that's typically used to show what Hamming is. This is the overall parity, and this is parity associated with a specific data bit, namely d_1, and we have different parity associated with, I believe that one's d_4. We have p_3, so it's hard to tell exactly what it is; this is a conceptual diagram. We'll see exactly what it is, but in other words, the data bits have parity distributed inside them, and then we have parity for the whole word. That's for the whole thing, and that's parity for just d_1. This is parity just for d_2, this is parity just for d_3, and I guess it's probably also parity for d_4. It turns out that we don't need one parity bit per data bit. That would be really high overhead; that would basically be mirroring. Mirroring would be another possibility, and that would be one-to-one. The nice thing about Hamming is that we have a much higher information rate of 82.05 percent: it basically takes 39 bits to protect 32 bits SECDED, and 32/39 is about 82 percent. We would probably actually use 40 bits for 32, so while the theoretical information rate is 82.05 percent, it might be lower in practice. The overall word parity, pW, would be this here, and the distributed parity bits would be these interior sets here. So this is a conceptual thing. We compute a parity bit for each bit sub-field, and parity errors indicate the failed bit location based on the intersection of the sub-fields, as indicated here; that locates the bit to fix. Let's see exactly how that works. It turns out we have four basic rules. Rule one: if our check bits, or syndrome, are zero, and the parity of our encoded data, basically the distributed parity bits and the data, is equal to the parity that we recompute, then we have no error. We don't see any indication of flipped bits; we have nothing to do. That's going to be the normal case. We're going to write in data, we're going to read it back, and when the check bits are zero and the parity we originally computed and the parity we recompute are the same, there's no error. The next scenario is that the check bits, the syndrome (I just call them check bits), are not equal to zero, and our parity has changed. Well, if parity has changed, we know that's an odd number of bit flips. Furthermore, if parity has changed and the check bits are not zero, then that's an SBE, because it's an odd number of bit flips, namely one, and the check bits, based on the design, are going to tell us which bit flipped. That's very different than an MBE. The interesting thing about an MBE is that if it's an even number of flips, the parity could actually show no error, and if it's odd, we would see a parity disagreement for the entire word and the check bits would most often be greater than the maximum valid bit position, although they could also just be misleading, honestly. But remember, this is extremely low probability, so it's probably never going to happen. There's a lot of things that fall in that category that could go wrong. The MBE is the triple fault, very unlikely. It's more likely that this is case number two than it is number five in all of these cases.
Rule number three is the check bits are zero, and the parity computed for the word does not equal the original parity. This is a pW error, and we just recompute parity for the whole encoded word, distributed parity bits and data bits included. The final case is the check bits are not equal to zero, and the parity is the same. If the parity is the same, but we have an indication of an error, then we know that was an even number of bit flips. Again, it could be the multi-bit case down here, but we know that is so unlikely that it's more likely that it's the DBE. We have very definite cases for rule one and rule three. For rule two and rule four, there's a little question mark that it could be case number five in each one, but the assumption is it's going to be number two or number four. This is almost a perfect detector, in the sense that the multi-bit case is so improbable that it's unlikely there's going to be a false positive or a false negative. So we just assume that it's one of the four cases. We can certainly show that those cases are definite; in other words, when we introduce those errors, we definitely fire one of these four rules. The only problem we have is if we go and make three flips, four flips, five flips, we may falsely fire one of these rules, in other words get a false positive. The false positive really would do no harm, because we've had a fairly catastrophic failure anyway; we've had three or four bits flip. Really, we would probably just go into a check stop, a fail safe. That's the basic theory. What we want to do now is see how the Hamming code is actually put together, how it's formulated. We'll do that next. Here's the first example, but we'll take a break, and then jump into that.
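Before moving on, here is a sketch of those four rules in code form, assuming the decoder hands us the recomputed check bits XORed against the stored ones (the syndrome) and a flag saying whether the overall word parity pW changed; the names are illustrative, not from any particular ECC controller.

    /* The four SECDED classification rules described above.
     * syndrome       : stored check bits XOR recomputed check bits (0 = agree)
     * parity_changed : true if the recomputed overall parity pW differs from
     *                  the stored pW, i.e. an odd number of bits flipped. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { ECC_NO_ERROR, ECC_SBE, ECC_PW_ERROR, ECC_DBE } ecc_case_t;

    ecc_case_t classify(uint32_t syndrome, bool parity_changed)
    {
        if (syndrome == 0 && !parity_changed)
            return ECC_NO_ERROR; /* rule 1: nothing to do */
        if (syndrome != 0 && parity_changed)
            return ECC_SBE;      /* rule 2: syndrome locates the bit to flip back */
        if (syndrome == 0 && parity_changed)
            return ECC_PW_ERROR; /* rule 3: the overall parity bit itself flipped */
        return ECC_DBE;          /* rule 4: even number of flips, fail safe */
    }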