The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

Loading...

来自 斯坦福大学 的课程

Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming

401 评分

The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

从本节课中

Week 2

Kruskal's MST algorithm and applications to clustering; advanced union-find (optional).

- Tim RoughgardenProfessor

Computer Science

In this video we'll provide for the first time concrete evidence that the lazy

union approach to the Union-Find data structure is viable.

Specifically, we'll prove that the worst case running time of both the find and

the union operation is logarithmic in n, the number of objects stored in the data

structure. We are going to do even better later

once, we introduce a second optimization known as path impression.

But an important stepping stone is to understand just is why, just union by

rank, already gets us to a reasonable algorithmic run-time.

So a quick review of the lazy union approach to implementing the union fine

data structure. So, with each note we're going to

maintain a parent pointer. And it's no longer the case that we

insist the parent pointer point directly to the leader of a group.

Rather we just insist that the collection Collection of parent pointers, induces a

collection of directed trees. The root of each tree, that is, the node

which is it's own parent, we're going to define as the leader of that group.

So, given any old object, x, how do you implement find? How do you, figure out

what the leader vertex is? Well, you just traverse parent pointers, up until you

get to the root of that particular group. So for this implementation of the Find

operation, the worst case running time is just going to be the longest path of

parent pointers that you ever have to traverse, to get from an object to some

root. So the way we're going to quantify that

is using these ranks. So this is again, a field that we

maintain for each object. And for now, this will break down later,

but for now before we have path -pression, we're going to maintain the

invariant that the rank of an object x is exactly the largest number of pointers

you have to traverse, from some leaf, to get to x.

As a consequence, the biggest rank of any object is the longest path from any leaf

to any root. And that's going to be an upper bound on

the worst case running time of the find op- Operation.

So let's move on to the union operation. So here, given 2 objects, x and y, you

need to fuse their 2 trees, their 2 groups, so you find the roots of the 2

trees, so you call a find on x, you call a find on y.

That gives you their 2 respective roots, and now you install 1 as a new child.

Of the other. Now we saw in a quiz in the last video,

if you're not careful about which root you install as a child under the other,

you can wind up with these long chains. And be stuck with a linear worse case

time for both find and union. So instead we have this union by rank

optimization. Which says, well, we want to keep our

trees from getting scraggly. And the way we're going to do that is.

When we have a shallow tree and a deep tree we make the shallow tree shall under

the root of the deep one, that prevents the tree from getting even deeper.

Now there is a situation where the two trees have exactly the same depths that

is where the two roots have exactly the same rank, in that case we just proceed

arbitrarily. Then when we merge two trees that both

had a common rank r, its important that in the new tree, the rank is gone up to

r+1. So we need the update, we need the

incremental rank of the new root to reflect that increase.

So that's where we've already been. Where are we going to next? Well the plan

for this video is to show that, with the union by rank, optimization.

The maximum rank of any node, is always, bounded above, by log, base 2 (n).

Where n is the number of objects in the data structure.

Now we just said, the worst case running time of find, is governed, by the maximum

rank. So the logarithmic maximum rank means

logarithm run time of find. that also carries over to the union

operation. Remember union is just 2 finds plus

constant work to rewire 1 pointer, so that's going to give us algorithm time

value on both operations. So let's see why that's true.

So let's begin the analyses with a few simple, but useful properties that follow

immediately from our invariant. From the way that we change the ranks of

objects as we do finds and as we do unions.

So the first simple property is focus on your favorite object.

x. And, watch this objects rank change, over

the course of the data structure, as we do finds and unions.

How can it change? Well, when we do a find we don't change anything, all the

ranks stay the same. When we do a union, all the ranks stay

the same. Well, except there is 1 case in the

union, where the rank, of a single node, gets bumped up by 1, gets increased.

So ranks only go up, over time, for all of the Objects, that's property one.

So the second property is again pretty much trivial, but really, really useful.

So what is the situation in which somebody's rank gets bumped up by 1?

We're going to take the union of two trees that have a common rank.

And then whichever of the two roots that we pick to be the root of the new bigger

tree That's the object whose rank gets bumped up by 1.

So new roots of this fused tree. So in particular, the only type of

objects that can ever get a rank increase is a root.

If you're not a root, your rank will not go up.

Furthermore, once you're not a root in this data structure, you will never be a

root again in the future. There is no process by which, you shed

your parent. Once you have a parent other than

yourself, you will always have exactly that parent.

Putting those two observations together we find that, as soon as an object X.

Becomes a non root but as soon as it has a in parent other than itself it rank is

frozen for the rest of time forever more. The third and final simple property

follows from a formula we mentioned in the last video about computing ranks.

So remember the rank of a node in general is going to be one more than the maximum

rank of any of its children. So if you have a child and there is some

path from a leaf to that child, it takes 5 hops.

The path to you from that child is going to take 6 hops.

As a consequence as you go from the leaf up to the root you will see a strictly

increasing sequence of ranks. The rank of a parent is always strictly

more than the rank of all of those children.

So that's it for the immediate properties.

Let's go to a property which is a little less immediate.

But still this next lemma, which I'm going to call the rank lemma, it's the

best kind of lemma. So on the one hand, it's just not that

hard to prove. I'll give you a full proof in the

following 2 slides. On the other hand, it's really powerful.

It's going to play a crucial role in the analysis were doing right now.

A logarithmic run time bound, with a union by rank optimization, and we'll

keep using it again as a workforce, once we introduce path compression, and prove

better bounds on the operations. So what's the rank limit say? Well it

controls the population size of objects, that have a given (no period) Rank, so we

want it to apply at all intermediate stages of our data structure, so we're

going to consider an arbitrary sequence of unions.

You can throw in some finds as well. I don't care.

Finds don't change the data structure, so they're totally irrelevant, so think

about a sequence of unions, a sequence of mergers.

The claim is, for every non-negative integer, r.

The number of objects that have rank exactly r at this time is at must n.

the total number of objects divided by 2 to the r.

So for example, if our rank is 0. It says that at must n objects have rank

0, so it is a trivial statement because only n objects.

But at any given time the number of objects that have rank 1 is at most n

over 2, the number of objects that have rank 2 is at most no over 4 and so on.

And if you think about it, if we succeed in proving the rank Lemma, we're pretty

much done, showing the efficacy of the union by rank optimization.

So in particular, once you take r, the, in this, key Lemma, to be log base 2 (n),

it says that there's at most 1 object. That has rank log2(n).

And there can't be any objects that have rank strictly larger.

That is, this limit implies that the maximum rank at all times is bounded

above by log2(n). And remember, the maximum rank is the

longest path of pointers, traversals, you ever need to get from a leaf to a root.

And that means the most amount of work we'll ever do in a find, and therefore,

in a union is O (log n) Okay, so I've now teased you with the consequences of the

rank lemma, assuming that it's true, but why is it true? Let's turn to the proof.

I'm going to break the proof down into two claims, claim one and claim two.

We'll see that the two claims easily imply the rank Rank Lemma.

So claim 1 asks you to consider 2 objects, X and Y, that have exactly the

same rank R. And the claim asserts that the sub-trees

of these 2 objects have to be disjoint. They have no objects in common.

And here by the sub-tree of an object, I just mean the other objects that can

reach this one by following a sequence. Of, parent pointers.

So that is the subtree at x, is the objects from which you can reach x.

The subtree at y is the objects from which you can reach y.

The second claim, is that, if you look at any object that has rank r, and you look

at it's subtree, that is, if you look at the number of objects that can reach,

this object x by following pointers, there have to be a lot of them.

There have to be at least 2 raised to that objects rank Are, objects in it's

subtree? Notice that, if we prove claim 1 and claim 2, then the Rank Lemma, follows

easily. Why? Well, fix a value for, r.

2, 10, I don't care what. Look at all the nodes that have this rank

R. By claim 2, each of them has at least 2

to the R objects that could reach them. And by claim 1, these have to be disjoint

sets of objects. Well, there's only N objects to go

around, and if each of these disjoint sets has at least 2 to the R of them,

there are going to be at most N over 2 to the R such groups.

That is at most N over 2 to the R nodes, objects with this rank R.

So we've reduced the proof of the rank Lemma to proving claims 1 and 2, I will

do them in turn. So for claim 1 let me go via the contra

positive, that is, I will assume that the conclusion is false, and I will show that

the hypothesis must then also be false. So, lets assume that we have 2 no,

objects x and y, and their subtrees are not, disjoint.

That is, there exists an object z, from which You can reach X and from the same

object Z, you can also reach Y by a sequence parent pointers.

Well now let's use the fact that we're dealing with the directed tree, right, so

if you start with an object Z. There's only a unique parent point, or 2,

follow each time. So that is, all of the objects reachable

from z, they form a directed path, leading up to the root of z's group.

So the only way for both x and y to be reachable from z, they have to both be on

this path. If they're both on this path, then 1 has

to be an ancestor of the other. So now we're going to use the third of

our simple properties that we observed. That is, on every path to the root, ranks

strictly go up, each time. So, whichever of x or y is an ancestor of

the other, that has strictly higher rank. Therefore x and y do not have the same

rank. That completes the proof, of claim 1.

So lets move on to claim 2. Remember, claim 2 is search that, an

object of rank r, necessarily has 2 ^ r objects, or more, in its subtree.

That's how many objects can actually reach this object x, by following Parent

pointers. So for this proof we're going to proceed

by induction on the number of operations, and again remember fine operations have

no effect on the data structure, so we can ignore them.

So it's just by induction on the number of union operations that happen.

So for the base case, when, before we've done any unions whatsoever, we're doing

just fine. Every object has a rank of 0 and the

sub-tree size of every object is equal to 1.

That object itself, also known as 2 to the 0.

Zero. Now for the inductive step, there's an

easy case and a hard case. The easy case is where nobody's rank

changes, where we do a union, and everybody's rank stays exactly the same.

In this case, we're golden. Why? Well, when you do a union, sub-tree

sizes only go up. There's only more.

Pointers so there's only more objects that can reach any given other objects.

So sub-tree sizes go up, ranks stay the same.

If we had this inequality of sub-tree size as being at least 2 ^ r before, we

have it equally well now. Now.

So the interesting case is when somebody's ranked actually changes.

How can that happen? Well it happens in only one particular way that we

understand well. Looking at a union operation between

objects X and Y. Suppose the roots of these objects are S1

and S2 respectively. It's only when these.

Two roots have the same rank, let's call that common rank R, that somebodies rank

gets changed. In particular, we're going to break ties

as we did in the previous video. S2 will be the root of the fused tree, S1

will become a child of it. And, in that case, S2s rank gets bumped

up by 1. It goes from R To r + 1.

Now notice, in this case, we do have something to prove.

What are we trying to establish? We're trying to establish that every subtree

size is big, as a function of the rank. So, s2's rank has gone up, and therefore

the lowerbound, the bar that we have to meet, for the subtree size, has also Gone

up, it's doubled. So in this case we actually have to

scrutinize s2's new sub-tree. So what is its new sub-tree? Well, it's

really just composed from its old sub-tree, and it inherits s1 and all of

its sub-trees. Well, in that case, we know that s2's new

subtree size, the nubmer of no objects that can reach it, is just, it's old

subtree size, plus the old subtree size, of s1.

But then w're in good shape because we have the inductive hypothesis to rescue

us. So remember, before this union, by the

inductive hypothesis, for every object with a given rank, say r, it had at least

2^r objects in its sub tree. So S1, and S2, both had rank r before

this. Unions, before this union, both of their

subtree were at least two to the r. So as two subtree sizes bounded below by

two to the r plus two to the r, a quantity also known as two raised to the

r plus one. Quite conveniently, r plus one is S two's

new rank, so S two's new bigger rank, its subtree size is still meeting the lower

bound, meeting the target of two raised to, New rank, 2 ^ r + 1.

So that completes the inductive step, therefore it completes the proof of claim

2, that objects of rank r have subtree sizes at least 2 ^ r.

Therefore completes the proof of the rank Lemma, that for every rank r, there's at

most n / 2 ^ r nodes of rank r. And remember the rank Lemma, implies that

the maximum rank, at all times, is bounded by log base 2 (n), as long as

you're using union by rank. And that implies, that with this first

optimization, the worst case running time of union, and find, are both O (log n),

where n is the number of objects, in the data structure.