The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

Loading...

来自 斯坦福大学 的课程

Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming

401 评分

The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

从本节课中

Week 2

Kruskal's MST algorithm and applications to clustering; advanced union-find (optional).

- Tim RoughgardenProfessor

Computer Science

So now we understand why Kruskal's algorithm is correct, why it always

computes a Minimum cost-spanning Tree. In this video, we'll turn our attention

to implementation issues. We'll begin with a straightforward

implementation of Kruskal's algorithm. That'll give us a polynomial run time

bound which is good but we'd like to do better.

So then we'll show how deploying a suitable data structure, something you

haven't seen before, the Union-Find data structure, allows us to speed up

Kruskal's Algorithm to be competitive with Prim's Algorithm.

That is we'll get a near linear running time bound of MlogN.

So let's just briefly review the very elegant pseudocode for Kruskal's

Algorithm, so it's a greedy algorithm.

It considers the cheapest edges first, all the way up to the most expensive.

So we begin with a sorting pre-processing step to put the edges in sorted order for

notational convenience let's just rename the edges. So, that one is the cheapest

edge, and, all the way up to M being the most expensive edge.

We then have our single linear scan in this for loop, and we just grab edges

whenever we can, okay?

So we maintain this evolving set capital T, which at the end of the algorithm will

be our spanning tree. Now, what forces us to exclude an edge

from this set capital T? Well if it creates a cycle with the edges

was already chosen, obviously that's a no go.

We can't have cycles in our final output, but as long as we don't have a cycle from

including an edge, we go ahead and optimistically include it.

And as we've seen, this is a correct algorithm, it always outputs the minimum

cost spanning tree. So, what would be the running time of

this algorithm? If we just straightforwardly implement

the pseudocode on this slide? Well, let's just considered the algorithm

step by step. In the first step, we sort the edges,

so that's going to take M log N time. Now don't forget whenever we're speaking

about graphs, we have the convention that M denotes the number of edges and N

denotes the number of vertices. So, you might justifiably wonder why I

wrote M log N for the running time of the sorting step instead of M log M, since

after all what it is we're sorting are the edges and there's M of them.

Well, what I'm using here is that, in this context we can switch log N and log

M interchangeably with each other inside a big-O notation.

Why is that true? Well recall when we first discussed

graphs in part one, we noticed that there can't be too many edges.

So the number of edges M, is the most quadratic of the number of vertices,

it's the most big-O of N squared. So if M is at most M squared, then log M

is at most two log N And the two is suppressed in the big-O notation.

So log M and log M are interchangeable in this context.

Notice that for the minimum cost spanning tree problem you may as well assume that

there's no parallel edges. You may as well assume that the graphs

are simple. If you have a bunch of parallel edges

between a given vertices, you can just throw out all but the

cheapest one. That's the only one you'll ever need.

So, moving on to the main loop, pretty obvious how many iterations we have

there. We have M iterations.

So all we need to figure out is how much work we have to do in each iteration.

So what is it each iteration needs to accomplish?

It needs to check, whether or not adding the current edge to the edges we've

already chosen creates a cycle or not. So I claim that can be done in time

linear in the number of vertices. That is it can be done in the big-O of N

time. So how do we accomplish this?

Well, we need two quick observations. First of all, and this is something we've

seen in arguments in the previous videos, checking whether or not this new edge is

going to create a cycle. Say the edge has end points U and V.

Checking for a cycle boils down to checking whether or not there's already a

path between the end points U and V, and the edges capital T that was chosen so

far. If there is a UV path already, adding

this edge will close the loop and create a cycle.

If there currently is no UV path, then adding this edge will not create a cycle.

So the second observation is well, how do we check if there's a path from U to V,

in the edges we've already chosen? Well, we already know how to do that

just. Using graph search.

So you can use breadth for a search, you can use depth for a search.

It doesn't matter. You just start at the vertex U and you

see if you have a reach of V or not. If you reach it, there's a path.

If you don't reach it, there's not a path.

Breadth-first-search, depth-first-search, whatever.

It takes time linear, in the graph that you're searching.

And since we only need to search for the edges that are in capital T.

And there's going to be, at most, N minus one of them.

Linear time in this context means O of N. O of the number of vertices,

because that also bounds the number of edges that might be in capital T.

The edges that we're searching for are pathing.

So, adding up all of this work, what do we have?

We have this sorting pre-processing step. That takes time big-O of M log N,

then we have these M iterations of the four loop like this is takes O of N times

factor. the latter term dominates, so the overall

running time is big-O of M times N. So this, coincidentally, is the same

running time we got from the straightforward implementation of Prim's

algorithm, and I'll make the same comments here.

This is a reasonable running time, it's polynomial on the input size.

It's way better than checking all exponentially many spanning trees that

the graph might have. But we certainly would like to do better.

We'd certainly love to have implementation of Kruskal's algorithm

that gets us to a near linear running time bound, and that's the plan.

How are we going to do it? Well, really the work that we're doing

here over and over again, which is kind of a bummer, is these cycle checks.

And every single iteration, we're spending time linear in the number of

vertices to check for a cycle. And the question is,

can we speed that up? And the Union-Find data structure will

actually, believe it or not, allow us to check for a cycle in constant time.

So if we actually had a data structure that could implement constant time cycle

checks. Then we'd have to spend only constant

time for each iteration of this while loop.

So the loop overall would take only linear time in the number of edges, O of

M edges. If we got that, then believe it or not,

the sorting pre-processing step would become the bottle neck in the running

time of Kruskal's algorithm. Our running time would drop from N times

N down to near linear, down O of N log N.

So let me now tell you a little bit about this magical data structure that's going

to give us constant time cycle checks. I'm just going to give you the high level

picture, and how it connects to Kruskal's algorithm on this slide.

We'll look at the details of the data structure, in the next video.

I also want to warn you that I'm not going to discuss, in this pair of videos,

the state of the art for Union-Find Data Structures.

I'm going to give you a fairly primitive version,

but that is nevertheless, sufficient to give us our desired M Log N running time

of Kruskal's algorithm. So if you're interested, there is some

optional material about different implementations of Union-Find that use

some super cool ideas, like union by rank and path compression.

That give you different, and in some senses, better operation times.

But the quick and dirty version of Union-Find that I'm going to discuss

here, is sufficient for our present needs.

And so the Raison d'être of a Union-Find Data Structure is to maintain a partition

of a set of objects. So in this picture, the overall rectangle

is meant to denote a set of objects and then C1, C2, C3, and C4 are disjointed

subsets that together form the union of the entire set.

So that's what I mean by a partition of a group of objects.

We're not going to ask too much of this data structure.

We're only going to ask it to support two operations.

No prizes to guess what those two operations are called.

So the find operation, we give it an object from this universe and we ask the

data structure to return to us the name of the group to which that object

belongs. So for example, if we handed it something

in the middle of this rectangle, an object, we'd expect it to return to us

the name c3. The union operation by contrast, takes as

input, the names of two groups. And what we want the data structure to do

is to fuse those two groups together. That is, the objects in the first group,

and the objects in the second group should all coalesce, and be, now, in one

sole group. So why might such a data structure be

useful for speeding up Kruskal's Algorithm?

To see the connection, think of Kruskal's algorithm as working conceptually in the

following way. So initially, when the algorithm starts

in the Set capital T is empty, each of the vertices is by it's own,

it's on its own isolated component. And then each time Kruskal's adds a new

edge to the set capital T. What that does is it takes two current

connected components and fuses them into a single connected component.

So for example, toward the end of Kruskal's algorithm that maybe it's

included enough edges, that now the Tree capital T that is constructed so far has

only four different connected components. And maybe it's about to add a new edge U

comma V, where of course U and V should be in different connected components with

respect to the edges chosen for far. So this new edge addition at this

iteration of Kruskal is going to fuse the connecting components of U and V into a

single one. So that corresponds to taking the union

of the groups to which U and V respectively belong.

So to be a little more precise about it. So what are going to be the objects

contained by the Union-Find Data Structure in Kruskal's algorithm?

We're going to correspond to vertices. It's the vertices coalescing that we want

to keep track of. So what are going to be the groups in the

partition that we maintain? They're just going to correspond to

connected components with respect to the edges that Kruskal's algorithm has

already committed to. And with these semantics, it's clear that

every time Kruskal adds a new edge to its set capital T, we have to invoke the

union operation to fuse two connected components into one.