The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

Loading...

来自 斯坦福大学 的课程

Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming

409 评分

The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

从本节课中

Week 2

Kruskal's MST algorithm and applications to clustering; advanced union-find (optional).

- Tim RoughgardenProfessor

Computer Science

So, in this video, we're going to prove Tarjan's inverse Ackermann bound in the

performance of the union find data structure with union by rank and path

compression. Frankly, I hope you're kind of excited.

This is one of the crown jewels in the entire history of algorithms and data

structures. Let me remind you what the bound says.

You've got a union-find data structure, we're doing lazy unions, we're doing

union by rank, we're doing path compression.

And for an arbitrary sequence of m find a union operations. The total amount of

work you do over the sequence of operations is bounded above by m times

alpha of n, where n is the number of objects in the data structure and alpha

is the inverse Ackermann function that we defined in the previous video.

I want to give a shout out here to Dexter Kozen's beautiful book the Design and

Analysis of Algorithms. It is his analysis that I will be

following closely here. So, it's true that merely stating,

Tarjan's bound is non-trivial. We have this entire video defining the Ackermann

function and it's inverse so we could make sense of the statement.

That said, we're well positioned to approve Tarjan's bound.

In fact, the template of the proof is going to very much mirror what we already

did for the log star analysis by Hopcroft and Ullman.

So in that spirit, let's review what were the two basic workhorses, the two

building blocks that drove forward the Hopcroft-Ullman analysis.

So the first building block is the rank Lemma, which dates back all the way to

the videos pre-path compression. And remember the rank Lemma gives us

control on how many objects can exist with a given rank.

And in particular, that upper bound is decreasing exponentially with the rank,

it's n/2^r objects at most with the given rank r.

But, of course, we need a second building block to exploit the facts that we're

using the path compression optimization. So, how do we quantify the progress made

by path compression? Well, we argue that every time a node has

its parent pointer updated in path compression, it acquires a new parent

with strictly larger rank. So, you can think of the Hopcroft-Ullman

analysis as a clever and essentially optimal exploitation of these two

building blocks. How did we use them? Well, we define the

notion of rank blocks and each rank block has size to raise to the size of the

previous rank block. Why were rank blocks useful? Well, they

allowed us to measure progress as we followed parent pointers.

We defined an object to be good, if following it's parent, catapulted you

into a subsequent ranked block. Because there's only a log star number of

ranked blocks, you can only visit log star good nodes on any given find

operation. So, that gave us a per operation bound on

the log star for visits to good nodes. And then, for visits to bad nodes, we use

the second building block to argue that every time you visit a bad node, the rank

of its parent increases. And that can only happen so many times

before the rank of this object's parent is so much bigger than this object's rank

itself that the node has to be good. You have to make a lot of progress by

following its parent pointer. And the point is, if we want to do better

than the Hopcroft-Ullman analysis, we're not going to do better by taking these

same two building blocks and exploiting them in a better way, that can't be done.

So, what we need to do is have a better version of one of the two building

blocks. So, the key idea in Tarjan's analysis is

to have a stronger version of the second building block.

We're going to argue that path compression doesn't merely increase the

rank of nodes parents by one, but in fact, it increases typically the rank of

an object's parents by a lot. And what's kind of amazing is the link of

the proof we're about to do is really basically the same as that in the

Hopcroft-Ullman analysis. And the steps match up almost perfectly.

So, the bound is even better, the argument is even more ingenious.

It's even harder to imagine how one would come up with this idea oneself. But, as

far as checking it, as far as understanding it, the complexity level is

roughly the same as what we've already done in the log star analysis, so here we

go. So, in this slide, I'm going to give you

a definition. And the point of the definition is to

measure how much progress we make in terms of ranks when we follow a node's

parent pointer. So, the definition here is going to play exactly the same role

that the definition of rank blocks played in the Hopcroft-Ullman analysis.

So, what I'm going to define is a number, a statistic measuring the difference, the

gap, between the rank of an object and the rank of its parent. So, this

definition is only going to make sense for non-root objects x.

One thing I want you to remember from previous videos is that once an object is

not a root, it will never again be a root, and it's

rank will never again change. So, non-root objects have rank fixed

forevermore. So, we're going to use the notation delta

of x for the statistic. And the definition of delta of x is the largest

value of k such that a sub k, this is Ackermann function remember, such that a

sub k applied to the rank of this object x, is bounded above by the rank of x's

parent. So, the bigger the gap between the rank

of x's parent and x's rank, the bigger the value of delta of x.

So, let me mention, talk through some simple examples to make sure this makes

sense, and also review the Ackermann function a little bit.

So, first of all, let me just point out the delta of x is always non-negative for

any non-root object x. So, why is this true? Well remember, and

this goes all the way back to our union by rank discussion, the rank of an

object's parent is always at least one more than the rank of that object.

And the function a sub 0, recall we defined as the successor function.

So, for every object that's not a root, it's always true that it's parent has

rank at least one more than it. And that means we can always at least

take k to be zero and have the inequality be satisfied.

Now, when is it going to be true that an object x has a delta value at least equal

to one? Well, that is equivalent to stating we can take k to be 1 in this

inequality, and the inequality holds. So, let's remember what is the function a

sub one. In the previous video, we discovered that

was just the doubling function. So, we can take k=1 and have this

inequality satisfied if and only if the rank of the node's parent is at least

double the rank of that object. Similarly, an object x has a delta value

equal to at least 2 if and only if we can take k=2 on this right-hand side of the

definition and have this inequality be satisfied.

So, let's recall what the function a sub 2 is.

a sub 2 of r is just r*2^r. So, an object x has delta value equal to

2 if and only if its parent's rank is substantially bigger than its own rank,

in the sense that if it has rank r, its parent's ranks has to be at least r*2^r.

And in general, the bigger the value of delta, the bigger the gap between the

rank of the object x and the rank of its parent.

And, of course, because the Ackermann function grows so ridiculously quickly,

the gap between this object's rank and its parent's rank is growing massively as

you take delta to be larger and larger. So, now that we understand this better,

one thing that should be clear is that the delta value of an object x can only

go up over time, right? So remember, we're talking only

about non-root objects x, so the x's rank itself, is frozen forevermore.

The only thing that's changing is it's acquiring potentially new parents,

through path compression, and each new parents rank is bigger than the previous

one. So, the gap in the ranks is only going

up, and therefore the delta value of this object x can only go up, over time.

So one final comment, you recall on the Hopcroft-Ullman analysis, the way we

defined rank blocks ensured that there weren't too many of them, there were only

log star. Similarly here, the way we're defining

the statistic delta, guarantees that there are not too many distinct delta

values that objects can take on. So, we've already argued that delta is a,

integer that's at least zero, and we can also bound it from above.

At least, for any object x that has a rank of at least 2, which will be good

enough for our purposes. If an object x has rank at least 2, its

value of delta can not be any larger than the inverse acromyn of n, than alpha of

n. And this is really just by the way we

define the inverse Ackermann function alpha, we defined it as the smallest

value of k such that if you apply a sub k to 2, that catapult to beyond end.

And since ranks of objects are bounded above by n, actually they're even bounded

above by log n but in particular by n that means you will never see a delta

value bigger that alpha of n for any object x has the rank at least two.

So, that completes the definition of the statistic data of x.

And, once again, the point of this statistic is to quantify progress through

rank space as we traverse a parent pointer, from an object.

So, in the Hopcroft-Ullman analysis, we used rank blocks for that purpose.

Here, we're using the statistic delta of x.

And now, we have to redefine good and bad objects to reflect this different

measurement for progress, for gaps between ranks of objects, and ranks of

their parents. So, that's what I'm going to provide for

you on this slide, the definition of good and bad nodes.

And, this definition will play exactly the same role as it did in the

Hopcroft-Ullman analysis. For the good nodes, we'll be able to just

get a per operation bound for how many visits we make to them. And then, we'll

be stuck with, and this will be the main part of the proof,

a global analysis arguing that the total number of visits to bad nodes over a

sequence of operations cannot be too big. So, the bad objects will be those that

meet the following set of four criteria. If an object fails to meet any of these

four criteria, it will be called good. So, the first two 2 criteria will be

familiar from the Hopcroft-Ullman analysis.

So, first of all, roots get a pass, and direct descendants of roots also get a

pass. So, to be bad, it must be the case you're

neither a root nor are you the child of a root.

So, what's the motivation for these two criteria? Well, it's exactly the same as

it was in the Hopcroft-Ullman analysis. It's just to ensure that bad nodes, after

they're visited on a find, get their parent pointer updated.

So remember, when you do path compression, after you've done a find

from an object x all the way up to its corresponding root, everybody's parent

pointer gets updated except for the root and except for the child of the root.

Those two nodes keep the same parent pointer.

So, to make sure we have progress, we want to exclude the root and the directed

sentence of the root from being bad, we'll acount for them separately.

The third criterion is not conceptually important, its just for technical

convenience sort of an out of factor the way I've defined the inverse arithmetic

function. We're going to give nodes that have rank

zero or one also with free paths. So, in order to bad, we're going to worry

about the case where the rank is at least 2.

So, the final criterion is really the ingenious one and it's the one that

ensures that when you do path compression typically objects parents ranks will

increase extremely rapidly, not just by 1.

And this criterion says, for an object x to be bad, it has to be the case that is

has some ancestor y. So, an object you get to by traversing

parent pointers from x, that has exactly the same value of delta.

So, if an object x has delta of x equal to 2, for it to be bad, there has to be

some ancestor object y that also has a delta value equal to 2.

And again, if x fails to meet any of these four conditions, then x gets a

pass, x is called a good object.

Now, I said that the role of this definition is going to be to split the

work done into two groups and that we're going to be able to easily bound, even on

a per operation basis, the number of visits to a good node.

So, the next quiz asks you to think about that carefully.

Alright. So, the correct answer is B. It's theta of the inverse Ackermann

function of n. And this is really one of the points, of

how we define good and bad nodes. We guarantee that there can not be too

many good nodes whenever we do a fine operation.

So, to see why this is true, let's just count up the number of nodes along a path

that can fail to meet each of the four criteria.

Well, so the first criterion was we give roots a free pass.

So there's at most one node, that's good because it's a root.

Similarly, on a given path there's at most one object on that path, it's a

direct descendant of the root. Because ranks always strictly go up when

you follow parent pointers that can be at most one object of rank zero and at most

one object of rank one. So, we, and most two objects get a free

pascals of the third criterion. Well, such thing about the interesting

case nodes that good be because they failed to meet the fourth criterion.

How many can that be? Well, let's focus on a particular value of delta.

So, let's say, think about all of the objects along a path that have delta of x

equal to 3. There may well be lots of objects on this

path with delta of x equal to 3, may be there's like 17 of them.

Call the highest to the closest to the root, such object z. Of these 17 objects

with delta of x equal to 3, z is going be the only one that is good.

The other 16 objects with delta of x equal to 3 are going to be bad.

Why? Well, any object other than z has an ancestor, namely z, with exactly the same

delta value. So that, those 16 objects do meet the

fourth criterion they will be classified as bad.

Summarizing for each of the alpha of n possible values, of delta, only the

upper-most object, the object closest to the root with that particular delta

value, can possibly be classified as good.

So, that bounds at alpha of n, the number of objects that fail to meet this fourth

criterion. That gives us the total bound of the

number of good nodes on any path.