In this video, we're going to prove our first performance guarantee on the union-find data structure with path compression. This is the bound first established by Hopcroft and Ullman. Let me briefly remind you about the guarantee. Consider a union-find data structure where you're using lazy unions, you're doing union by rank, and you're using the path compression optimization. Consider an arbitrary sequence, say of m union and find operations. The guarantee is that over the course of this sequence of operations, the total work that you do is at most m, the number of operations, times the glacially growing function log* of n. Remember, log* n is defined as the number of times you have to apply the logarithm function to n before you get a result which is at most 1. Remember, for 2 raised to the 65,536, a totally absurd number, log* of that number is merely five.

So the theorem is true no matter what m is, no matter how many operations you do, whether you do very few or a whole lot of them. I'm going to focus on the interesting case where m is at least asymptotically as big as n, where m is Omega of n. I'll let you think through in the privacy of your own home why the arguments that we're about to see imply the theorem no matter what m is.

So one final comment about this guarantee before we turn to its proof. Notice what the theorem is not saying. The theorem is not saying that every single one of these find and union operations runs in big O of log* n time. And that's because that stronger statement is, in general, false. There are, in general, going to be operations that take more than log* n time. Now on the one hand, we know no operation is going to be slower than logarithmic time. Even when we didn't have path compression, no operation was worse than log n time, and we're not going to do any worse with path compression. So the worst-case time bound is log n, and some operations may indeed run that slowly. But over a sequence of m operations, the average amount of work we do on a per-operation basis is only log* of n. This is exactly the sort of so-called amortized analysis that we did in our very first union-find implementation with eager unions. We agreed that particular unions might take linear time, but over a sequence of unions we spent only logarithmic time per union on average. The same thing is going on here, with the exception that we're getting a radically better on-average running time bound of log* n over a sequence.

So before I launch into the proof details, let's just spend one slide talking a little bit about the proof plan, and in particular, what is the intuition behind the performance gain that we're trying to make precise, that we're trying to encode in a mathematical way in the proof that's about to follow? Well, if we're hoping to prove a bound better than log n, which is what we were stuck with without path compression, it had better be the case that installing all these shortcuts really qualitatively speeds up finds and unions. And on some level it's sort of clear that things have to be sped up: we're replacing a whole chain of pointers with a single pointer, so you can only go faster. But how do we keep track of this progress? How do we compile this intuition and make it into a rigorous guarantee?
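Before we get into the proof idea, here is a minimal Python sketch, my own and not the lecture's code, of the data structure the theorem is about: lazy unions, union by rank, and path compression, together with the log* function that appears in the bound. The names UnionFind and log_star are just my own choices for this illustration.

```python
import math

def log_star(n):
    """Number of times log2 must be applied to n before the result is at most 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))   # each object starts as its own root
        self.rank = [0] * n            # all ranks start at zero

    def find(self, x):
        # Walk up to the root, then rewire every object on the path to point
        # directly to the root (path compression).
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        # Union by rank: the root of smaller rank points to the root of larger
        # rank; a rank only changes when the two roots are tied.
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1

print(log_star(2 ** 65536))  # 5, the lecture's example of how slowly log* grows
```

Again, the guarantee is about the total work of a sequence of m operations on this structure, O(m log* n), not about any single call to find or union.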
So here's the idea. Let's focus on an object x which is no longer a root. At this point its parent is somebody other than itself. And one thing I want you to remember from our union by rank analysis is that once an object is not a root, its rank is frozen forevermore. Now, we proved that in the context where there was no path compression. But again, remember, we manipulate ranks in exactly the same way when there is path compression, so that's still true: if you're an object and you're no longer a root, there is no way your rank will ever change again.

Now what we're certainly hoping is true is that finds that originate at, say, this object x are running fast. Not only that, they should be getting faster and faster as time goes on, because we're doing more and more path compression. So here's the big idea. The way we're going to reason about the worst-case running time of the find operation, or equivalently, the longest sequence of parent pointers we might have to traverse to get to a root, is we're going to think about the sequence of ranks that we observe as we traverse these parent pointers from an object x up to the root.

Let me give you an example. What would be the worst-case situation? Well, the worst case would be if we have a data structure, and let's say the maximum rank is something like 100. The worst-case sequence of ranks, the longest sequence we would ever see, would be if we start a find operation at an object with rank zero, we traverse a parent pointer, we get to its parent, it has rank one. We traverse its parent pointer, it has rank two, then three, then four, and so on, all the way up to 100. Now remember, ranks have to strictly increase whenever we traverse a parent pointer; as we discussed, that was true with or without path compression. So in the worst-case situation, to go from 0 to 100, you have to traverse 100 pointers. So that'd be a bummer. Wouldn't it be nice if every time we traversed a parent pointer, the rank increased not just by one, but by a much bigger number? So for example, if we went from 0 to 10, to 20, to 30, to 40, and so on, that would guarantee that we'd get to the max-rank node, 100, in only ten steps. So again, the bottom line is: if we can have a better, larger lower bound on the gap in ranks between an object and its parent, that implies more rapid progress through the possible ranks of the nodes that we can see, and it translates to faster finds, to fewer parent pointer traversals.

So with that idea in mind, that big gaps between ranks imply rapid progress, I want to propose as a progress measure, for a given non-root object x, the gap between x's rank, which, again, remember is frozen forevermore, and the rank of its current parent. This progress measure is a good one for two reasons. First, as we just discussed, if you have a handle on this gap, if you can lower bound it, then that gives you an upper bound on the search time. Secondly, this gap allows us to quantify the benefit of path compression, of installing these shortcuts. Specifically, whenever you install a new shortcut, you rewire an object's parent pointer to point higher up in the tree. Its new parent is going to have rank strictly bigger than its old one. That means this gap is only going to get bigger. Summarizing, path compression improves this progress measure. That is, if an object x previously had a parent p and then has its parent pointer rewired to a different node p prime, the rank of p prime is bigger than the rank of p.
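Here is a small, hand-built illustration of that progress measure, again my own sketch rather than anything from the lecture. The parent and rank arrays below are simply assigned by hand to mirror the worst-case picture described above, a chain whose ranks go 0, 1, 2, and so on.

```python
parent = [1, 2, 3, 4, 4]   # object i's parent; object 4 is the root
rank   = [0, 1, 2, 3, 4]   # ranks strictly increase as we walk upward

def ranks_on_find_path(x):
    """Ranks observed while traversing parent pointers from x up to the root."""
    seq = [rank[x]]
    while parent[x] != x:
        x = parent[x]
        seq.append(rank[x])
    return seq

def find_with_compression(x):
    """Find x's root, then rewire every visited object to point directly to it."""
    path = []
    while parent[x] != x:
        path.append(x)
        x = parent[x]
    for y in path:              # path compression installs the shortcuts
        parent[y] = x
    return x

print(ranks_on_find_path(0))    # [0, 1, 2, 3, 4]: a rank gap of only 1 at every hop
find_with_compression(0)
print(ranks_on_find_path(0))    # [0, 4]: object 0's rank gap jumped from 1 to 4
```

After a single find, object 0's gap to its parent's rank grows from one to four, which is exactly the kind of progress the proof is going to exploit.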
And just to make sure this is crystal clear, let's draw a couple of cartoon examples. First, just in the abstract, think about an object x. Suppose it has some parent p, and suppose the root of the tree is some p prime, some ancestor strictly further up in the tree. Remember, ranks always increase as you go up the tree, as you follow the parent pointers. So that means p's rank is strictly bigger than x's, and p prime's rank, and this is the important thing, is strictly bigger than p's. So when you rewire x's parent pointer to point no longer to p but rather to p prime, it acquires a new parent, and its new parent is an ancestor of its old one, ergo it must have strictly bigger rank. For that reason, the gap between x's rank, which is fixed forevermore, and the rank of its new parent is bigger than the gap between its rank and the rank of its old parent.

You can also see this effect in action with the running seven-object example we were using in the last video. I've shown that example, that tree in pink, both before and after path compression. And again, before path compression, I just defined the ranks as with union by rank, so the rank of each node is equal to the length of the longest path from a leaf to that node, and of course we don't change the ranks when we apply path compression. And what do we observe? Exactly two of the objects had their parent pointers rewired: namely, objects one and four had their parent pointers rewired to point directly to seven. And as a consequence, the gap in rank between object one and its parent has leaped from merely one, the difference between its rank and the rank of four, up to three, the difference between its rank and that of its new parent, seven. Similarly for object four: its rank gap has jumped from one to two. Its rank is only one less than that of its old parent, six, but it's two less than that of its new parent, seven.

So believe it or not, we actually have the two crucial building blocks for the Hopcroft-Ullman analysis. Building block number one is the rank lemma that we discussed a couple of videos ago: the fact that, with or without path compression, you can't have too many objects with a given rank r; you can only have at most n over 2 to the r. The second building block is the one we just covered: every time you update a parent pointer under path compression, the gap between the rank of that object and the rank of its new parent has to grow. The rest of the proof is just an optimal exploitation of those two building blocks. Let me now show you the details.
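As a quick aside, here is a hypothetical sanity check of that first building block, again just my own Python sketch and not part of the lecture: after an arbitrary mix of unions by rank and finds with path compression, no rank r should ever be held by more than n over 2 to the r objects.

```python
import random
from collections import Counter

n = 1024
parent = list(range(n))   # every object starts as its own root
rank = [0] * n            # all ranks start at zero

def find(x):
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:              # path compression
        parent[x], x = root, parent[x]
    return root

def union(x, y):
    rx, ry = find(x), find(y)
    if rx == ry:
        return
    if rank[rx] < rank[ry]:               # union by rank
        rx, ry = ry, rx
    parent[ry] = rx
    if rank[rx] == rank[ry]:
        rank[rx] += 1

for _ in range(5 * n):                    # an arbitrary mix of operations
    union(random.randrange(n), random.randrange(n))
    find(random.randrange(n))

counts = Counter(rank)                    # ranks are frozen once an object is a non-root
for r, c in sorted(counts.items()):
    assert c <= n / 2 ** r                # the rank lemma's bound
    print(f"rank {r}: {c} objects (bound {n // 2 ** r})")
```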