So now we understand why Kruskal's algorithm is correct, why it always

computes a Minimum cost-spanning Tree. In this video, we'll turn our attention

to implementation issues. We'll begin with a straightforward

implementation of Kruskal's algorithm. That'll give us a polynomial run time

bound which is good but we'd like to do better.

So then we'll show how deploying a suitable data structure, something you

haven't seen before, the Union-Find data structure, allows us to speed up

Kruskal's Algorithm to be competitive with Prim's Algorithm.

That is we'll get a near linear running time bound of MlogN.

So let's just briefly review the very elegant pseudocode for Kruskal's

Algorithm, so it's a greedy algorithm.

It considers the cheapest edges first, all the way up to the most expensive.

So we begin with a sorting pre-processing step to put the edges in sorted order for

notational convenience let's just rename the edges. So, that one is the cheapest

edge, and, all the way up to M being the most expensive edge.

We then have our single linear scan in this for loop, and we just grab edges

whenever we can, okay?

So we maintain this evolving set capital T, which at the end of the algorithm will

be our spanning tree. Now, what forces us to exclude an edge

from this set capital T? Well if it creates a cycle with the edges

was already chosen, obviously that's a no go.

We can't have cycles in our final output, but as long as we don't have a cycle from

including an edge, we go ahead and optimistically include it.

And as we've seen, this is a correct algorithm, it always outputs the minimum

cost spanning tree. So, what would be the running time of

this algorithm? If we just straightforwardly implement

the pseudocode on this slide? Well, let's just considered the algorithm

step by step. In the first step, we sort the edges,

so that's going to take M log N time. Now don't forget whenever we're speaking

about graphs, we have the convention that M denotes the number of edges and N

denotes the number of vertices. So, you might justifiably wonder why I

wrote M log N for the running time of the sorting step instead of M log M, since

after all what it is we're sorting are the edges and there's M of them.

Well, what I'm using here is that, in this context we can switch log N and log

M interchangeably with each other inside a big-O notation.

Why is that true? Well recall when we first discussed

graphs in part one, we noticed that there can't be too many edges.

So the number of edges M, is the most quadratic of the number of vertices,

it's the most big-O of N squared. So if M is at most M squared, then log M

is at most two log N And the two is suppressed in the big-O notation.

So log M and log M are interchangeable in this context.

Notice that for the minimum cost spanning tree problem you may as well assume that

there's no parallel edges. You may as well assume that the graphs

are simple. If you have a bunch of parallel edges

between a given vertices, you can just throw out all but the

cheapest one. That's the only one you'll ever need.

So, moving on to the main loop, pretty obvious how many iterations we have

there. We have M iterations.

So all we need to figure out is how much work we have to do in each iteration.

So what is it each iteration needs to accomplish?

It needs to check, whether or not adding the current edge to the edges we've

already chosen creates a cycle or not. So I claim that can be done in time

linear in the number of vertices. That is it can be done in the big-O of N

time. So how do we accomplish this?

Well, we need two quick observations. First of all, and this is something we've

seen in arguments in the previous videos, checking whether or not this new edge is

going to create a cycle. Say the edge has end points U and V.

Checking for a cycle boils down to checking whether or not there's already a

path between the end points U and V, and the edges capital T that was chosen so

far. If there is a UV path already, adding

this edge will close the loop and create a cycle.

If there currently is no UV path, then adding this edge will not create a cycle.

So the second observation is well, how do we check if there's a path from U to V,

in the edges we've already chosen? Well, we already know how to do that

just. Using graph search.

So you can use breadth for a search, you can use depth for a search.

It doesn't matter. You just start at the vertex U and you

see if you have a reach of V or not. If you reach it, there's a path.

If you don't reach it, there's not a path.

Breadth-first-search, depth-first-search, whatever.

It takes time linear, in the graph that you're searching.

And since we only need to search for the edges that are in capital T.

And there's going to be, at most, N minus one of them.

Linear time in this context means O of N. O of the number of vertices,

because that also bounds the number of edges that might be in capital T.

The edges that we're searching for are pathing.