0:00

So the goal of this video is to prove the correctness of Kasaraju two-pass,

depth-first-search based, linear time algorithm that computes the strongly

connected components of a directed graph. So I've given you the full specification

of the algorithm. I've also given you a plausibility argument of why it might

work, in that at least it does something sensible on an example. Namely, it first

does a pass of depth first search on the reverse graph. It computes this magical

ordering. And what's so special about this ordering is then when we do a depth first

search using this ordering on the forward graph; it seems to do exactly what we

want. Every indication of depth first search to some new node discovers exactly

the nodes of the strong component and no extra stuff. Remember that was our first

observation, but that was unclear whether depth for search would be useful or not

for computing strong components. If you call depth first search from just the

right place, you're gonna get exactly the nodes of an SCC and nothing more. If you

call it from the wrong place, you might get all of the nodes of the graph, and get

no information at all about the structure of the strong components and at least in

this example this first pass with the finishing time seems to be accomplishing

seems to be leading us to invoking DFS from exactly the right

places. So remember how this worked in the example so in the top graph, I have

shown you the graph with the arch reversed. This is where we first invoked

DFS loop with the loop over the nodes going from the highest node name nine all

the way down to the node name one. And here we compute finishing time that's the

bookkeeping that we do in the first pass so we just keep the running count of how

many nodes we've finished processing. That is how many we've both explored that node

as well as explore all of the outgoing arches and so that gives us these numbers

in the red, these finishing times between one and nine for the various

nodes. Those became the new node names in the second graph and then we reverse the

arches again and get the original graphs back and then we saw that every time we

invoked DFS in our second pass we uncovered exactly the nodes of an SCC. So

when we invoked it from the node 9 we discovered that 9, 8 and 7 those

have a leader vortex 9. Then when we next invoked DFS from 6, we discovered

6, 5, and 1, and nothing else. And then finally we invoked it from 4, and

we discovered 2, 3, and 4, and nothing else. And those are exactly the

three, SCCs of this graph. So let's now understand why this works in any directed

graph, not just this, in this one example. So let's begin with a simple observation

about directed graphs, which is actually interesting in its own right. The claim is

that every directed graph has two levels of granularity. If you squint, if you sort

of zoom out, then what you see is a directed acyclic graph, of course

comprising its strongly connective components. And if you want, you can zoom

in and focus on the fine grain structure with one SCC. A little bit more precisely.

The claim is of the strongly connected components of a directed graph induce in

a natural way an acyclic metagraph. So what is this metagraph? What are the nodes

and what are the arcs, what are the edges? Well, the metanodes are just the SCCs, so

we think of every strong connected component as being a single node in this

metagraph. So call them say 'C1' up to 'Ck'. So what are the arcs in this metagraph?

Well, they're basically just the ones corresponding to the arcs between SCCs in

the original graph. That is, we include in the meta graph an arc from the strong

component 'C' to 'C-hat' if and only if there's an arc from a node

in 'C' to a node in 'C-hat' in the original graph 'G'. So for example if this is your 'C'

and so the triangle is your 'C-hat' and you have one or maybe multiple edges going

from 'C' to 'C-hat', then in the corresponding metagraph your just gonna have a node for

'C', a node for 'C-hat' and the directed arch from 'C' to 'C-hat'. So if we go back to

some of the directed graphs that we've used as running examples so we go back to

the beginning of the previous video and it's look maybe something

like this the corresponding directed acyclic graph has four nodes and four

arches. And for the running example we used to illustrate Kosaraju's algorithm with

the three triangles, the corresponding metagraph would just be a path with three

nodes. So why is this meta-graph guaranteed to be acyclic? Well, remember

metanodes correspond to strong components, and in a strong component you

can get from anywhere to anywhere else. So, if you had a cycle that involved two

different metanodes, that is two different strong connected components,

remember on a directed cycle you can also get from anywhere to anywhere else. So if

you had two supposedly distinct SCCs, that you could get from the one to the

other and vice versa, they would collapse into a single SCC. You can get from

anywhere to anywhere in one, anywhere from anywhere in the other one, and you can

also go between them at will, so you can get from

anywhere in this union to anywhere in the union. So not just in this context of

competing strong components but also just more generally, this is a useful fact to

know about directed graphs. On the one hand, they can have very complex structure

within a strong components. You have paths going from everywhere to everywhere else,

and it may be sort of complicated looking. But at a higher level, if you abstract out

to the level of SCCs, you are guaranteed to have this simple DAG, this simple

directed acyclic graph structure. So, to reinforce these concepts, and also segue

into thinking about Kosaraju's algorithm in particular, let me ask you a question

about how reversing arcs affects the strong components of a directed graph. So

the correct answer to this quiz is the fourth one. The strong components are

exactly the same as they were before, in fact the relation that we described is

exactly the same as it was before so therefore the equivalence classes of the

strong components is exactly the same. So if two nodes were related in

the original graph, that is a path from U to V and a path from V to U, that's still

true after you reverse all the arcs, you just use the reversal of the two paths

that you had before. Similarly if the two nodes weren't related before, for example

because you could not get from U to V, well that after you reverse everything,

then you can't get from V to U, so again you don't have this relation holding, so

the SCCs are exactly the same in the forward or the backward graph. So in

particular in Kazarogi's algorithm, the strong component structure is exactly the

same in the first pass of DFS and in the second pass of DFS. So now that we understand

how every directed graph has a meta graph with the nodes correspond to a strong

connected components, and you have an arch from one SCC to another if there's any

arch from any node in that SCC to the other SCC in the original graph, I'm in a

position to state what's the key lemma. That drives the correctness of Kosaraju's

two pass algorithm for computing the strong connected component of a directed graph.

So here's the lemma statement. It considers two strongly connecting

components that are adjacent, in the sense that there's an arc from one node in one

of them to one node in the other one. So let's say we have one SCC - 'C1', with a

node I, and another, SCC 'C2' with a node J, and that in G, in the graph, there's an

arc directly from I to J. So in this sense we say that these SCCs are adjacent, with

the second one being in some sense after the first one. Now let's suppose we've

already run the first pass of the DFS loop subroutine. And remember that works on the

reverse graph. So we've invoked it on the reverse graph. We've computed these

finishing times. As usual we'll let f(v) denote the finishing times computed in

that depth first search subroutine on the reverse graph. The lemma then asserts the

following. It says first, amongst all the nodes in 'C1' look at the one with the

largest finishing time. Similarly amongst all nodes in 'C2' look at the one with the

biggest finishing time. Amongst all of these the claim is that the biggest

finishing time will be in 'C2' not in 'C1'. So what I wanna do next is I wanna assume that

this lemma is true temporarily. I wanna explore the consequences of that

assumption and in particular what I wanna show you is that if this lemma holds, then

we can complete the proof of correctness of Kosaraju's two-pass SCC computation

algorithm. Okay, so if the lemma is true then after... I'll give you the argument

about why we're done. About why we just peel off the SCC one at

a time with the second pass of depth first search. Now of course a proof with a hole

in it, isn't a proof. So at the end of the lecture I'm gonna fill in the hole. That

is, I'm gonna supply a proof of this key lemma. But for now, as a working

hypothesis, let's assume that it's true. Let's begin with a corollary, that is a

statement which follows essentially immediately, from the statement of a lema.

So for the corollary, let's forget about just trying to find the maximum

maximum finishing time in a single SCC. Let's think about the maximum finishing

time in the entire graph. Now, why do we care about the maximum finishing time in

the entire graph? Well, notice that's exactly where the second pass of DFS

is going to begin. Right, so it processes nodes in order from largest

finishing time to smallest finishing time. So equivalently, let's think about the

node at which the second pass of depth first search is going to begin, i.e., the

node with the maximum finishing time. Where could it be? Well, the corollary is

that it has to be in what I'm gonna call a sink, a strongly connected component, that

is a strongly connected component without any outgoing arcs. So for example

let's go back to the, meta graph of SCCs for the very first directed graph we

looked at. You recall in the very first direc ted graph we looked at in when we

started talking about this algorithm there were four SCCs. So there was a 'C1', a 'C2',

a 'C3', and a 'C4'. And of course within each of these components, there could be

multiple nodes but they are all strongly connected to each other. Now, let's use

F1, F2, F3 and F4 to denote the maximum finishing time in each of these SCCs. So

we have F1, F2, F3 and F4. So, now we have four different opportunities to apply

this lemma. Right? Those four different pairs of adjacent SCCs. And so, what do

we find? We find that well, comparing F1 and F2, because C2 comes after C1,

that is there's an arc from C1 to C2, the max finishing time in C2 has

to be bigger than that in C1. That is F2 is bigger than F1. For the same

reasoning F3 has to be bigger than F1. Symmetrically we can apply the limit to

the pair C2, C4 and C3, C4 and we get that F4 has to dominate both of them. Now

notice we actually have no idea whether F2 or F3 is bigger. So that pair we can't

resolver. But we do know these relationships. Okay F1 is the smallest and

F4 is the smallest [biggest!!]. And you also notice that C4 is a sink SCC and the sink has no

outgoing arches and you think about it that's a totally general consequence of

this lema. So in a simple group of contradiction will go as follows. Consider

this SCC with the maximum F value. Suppose it was not a

sink SCC that it has an outgoing arch, follow that outgoing arch to get some

other SCC by the lema the SCC you've got into has even bigger maximum finishing

time. So that contradicts the fact that you started in the SCC with a maximum

finishing time. Okay. So just like in this cartoon, where the unique sink SCC has

to have the largest finishing time, that's totally general. As another sanity check,

we might return to the nine node graph, where we actually ran Kasaraja's

algorithm and looking at the ford version of the graph, which is the one on

the bottom, we see that the maximum finishing times in the three FCC are 4,6

and 9. And it turns out that the same as the leader nodes which is not an

accident if you think about it for a little while and again you'll observe the

maximum finishing time in this graph namely 9 is indeed in the left most SCC

which is the only SCC with no outgoing arks. Okay but it's totally general

basically you can keep following arks and you keep seeing bigger and bigger

finishing times so the biggest one of all it has to be somewhere where you get stuck

where you can't go forward but there's no outgoing arks and that's what I'm calling

a sink SCC. Okay. So assuming the lemma is true we know that the corollary

is true. Now using this corollary let's finish the proof of correctness, of

Kasaraja's algorithm, module over proof of the key lima. So I'm not going to do

this super rigorously, although everything I say is correct and can be made, made

rigorous. And if you want a more rigorous version I'll post some notes on the course

website which you can consult for more details. So what the previous corollary

accomplished, it allows us to locate, the node with maximum finishing time. We can

locate it in somewhere in some sink SCC. Let me remind you about the

discussion we had at the very beginning of talking about computing strong components.

We're tryna understand depth-first search would be a useful workhorse for finding

the strong components. And the key observation was that it depends, where you

begin that depth-first search. So for example in this, graph with four SCC's

shown in blue on the right. A really bad place to start. DFS called Depth For

Search would be somewhere in C1. Somewhere in this source SCC, so this is a bad DFS.

Why is it bad? Well remember what Depth For Search does; it finds everything

findable from its starting point. And from C1 you can get to the entire world, you

can get to all the nodes in the entire graph. So you can discover everything. And

this is totally useless because we wanted to discover much more fine-grain

structure. We wanted to discover C1, C2, C3 and C4 individually. So that would be

an disaster if we invoked depth first search somewhere from C1. Fortunately

that's not what's going to happen, right? We computed this magical ordering in the

first pass to insure that we look at the node with the maximum finishing time and,

by the corollary, the maximum finishing time is going to be somewhere in C4.

That's gonna be a good DFS, in the sense that, when we start exploring from

anywhere in C4, there's no outgoing arcs. So, of course, we're gonna find everything

in C4. Everything in C4's strongly connected to each other. But we can't get

out. We will not have the option of trespassing on other strong components,

and we're not gonna find'em. So we're only gonna find C4, nothing more. Now, here's

where I'm gonna be a little informal. Although, again, everything I'm gonna say

is gonna be correct. So what happens now, once we've discovered everything in C4?

Well, all the nodes in C4 get marked as explored, as we're doing depth first

search. And then they're basically dead to us, right? The rest of our depth first

search loop will never explore them again. They're already marked as explored. If we

ever see'em, we don't even go there. So the way to think about that is when we

proceed with the rest of our for loop in DFS loop it's as if we're starting

afresh. We're doing depth first search from scratch on a smaller graph, on the

residual graph. The graph G with this newly discovered strong component 'C<i>'</i>

deleted. So in this example on the right, all of the nodes in C4 are dead to us and

it's as if we run DFS anew, just on the graph containing the strong components C1,

C2 and C3. So in particular, where is the next indication of depth first search

going to come from? It's going to come from some sink SCC in the residual graph,

right? It's going to start at the node that remains and that has the largest

finishing time left. So there's some ambiguity in this picture. Again recall we

don't know whether F2 is bigger or F3 is bigger. It could be either one. Maybe F2

is the largest remaining finishing time in which case the next DFS indication's gonna

begin somewhere more from C2. Again, the only things outgoing from C2 are these

already explored nodes. Their effectively deleted. We're not gonna go there again.

So this is essentially a sink FCC. We discover, we newly discover the nodes in

C2 and nothing else. Those are now effectively deleted. Now, the next

indication of DFS will come from somewhere in F3, somewhere in C3. That's the only

remaining sink SCC in the residual graph. So the third call, the DFS will discover

this stuff. And now, of course, we're left only with C1. And so the final indication

of DFS will emerge from and discover the nodes in C1. And in this sense because

we've ordered the nodes by finishing times when DFS was reverse graph, that ordering

has this incredible property that when you process the nodes in the second pass we'll

just peel off the strongly connected components one at a time. If you think

about it, it's in reverse topological order with respect to the directed asypric

graph of the strongly connected components. So we've constructed a proof

of correctness of Kosaraju's, algorithm for computing strongly connected

components. But again, there's a hole in it. So we completed the argument assuming

a statement that we haven't proved. So let's fill in that last gap in the proof,

and we'll we done. So what we need to do is prove the key lemma. Let me remind you

what it says. It says if you have two adjacent SCCs, C1 and C2 and is an arc from a

node in C1, call it 'I' to a node in C2, say J. Then the max finishing time in C2 is

bigger than the max finishing time in C1. Where, as always, these finishing

times are computed in that first pass of depth-first search loop in the reversed

graph. All right, now the finishing times are computed in the reversed graph, so

let's actually reverse all the arcs and reason about what's happening there. We

still have C1. It still contains the node I. We still have C2, which still

contains the node J. But now of course the orientation of the arc has reversed. So

the arc now points from J to I. Recall we had a quiz which said, asked you to

understand the effect of reversing all arcs on the SCC's and in particular there

is no effect. So the SCC's in the reverse graph are exactly the same as in the

forward graph. So now we're going to have two cases in this proof and the cases

correspond to where we first encounter a node of C1 and union C2. Now remember,

when we do this DFS loop, this second pass, because we have this outer four loop

that iterates over all of the nodes we're guaranteed to explore every single node of

the graph at some point. So in particular we're gonna have to explore at some point

every node in C1 and C2. What I want you to do is pause the algorithm. When it

first, for the first time, explores some node that's in either C1 or C2. There's

going to be two cases, of course, because that node might be in C1, you might see

that first. Or it might be in C2, you might see something from C2 first. So our

case one is going to be when the first node that we see from either one happens

to lie in C1. And the second case is where the first node V that we see happens to

lie in C2. So clearly exactly one of these will occur. So let's think about case one.

When we see a node of C1 before we see any nodes of C2. So in this case where we

encounter a node in C1 before we encounter any node in C2, the claim is that we're

going to explore everything in C1 before we ever see anything in C2. Why is that

true? The reason is there cannot be a path that starts somewhere in C1, for example,

like the vertex V, and reaches C2. This is where we are using the fact that the

meta-graph on SCC is a cyclic. Right C1 is strong

connected, C2 is strong connected, you can get from C2 to C1 and, if you can also get

from C1 back to C2 this all collapses into a single strongly connected component. But

that would be a contraction, we're assuming C1 and C2 are distinct strongly

connected components, therefore you can't have paths in both directions. We already

have a path from right to left, via JI, so there's no path from left to right. That's

why if you originate a depth first search from somewhere inside C1 like this vertex

V, you would finish exploring all of C1 before you ever are going to see C2,

you're. Only gonna see C2 at some later point in the outer for loop. So, what's

the consequence that you completely finish with C1 before you ever see C2? Well it

means every single finishing time in C1 is going to be smaller than every single

finishing time in C2. So that's even stronger that what we're claiming, we're

just claiming that the biggest thing in C2 is bigger than the biggest of C1. But

actually finishing times in C2 totally dominate those in C1, because you finish

C1 before you ever see C2. So let's now have a look at case one actually in

action. Let's return to the nine-node graph, the one that we actually ran

Kosaraju's algorithm to completion. So if we go back to this graph which has the

three connected components, then remember that the bottom version is the forward

version, the top version is the reversed version. So if, if you think about the

middle SCC as being c1, pulling the row of c1 and the left most. Scc playing the role

of C2, then what we have exactly is case one of the key lemma. So, which was

the first of these six vertices visited during the DFS loop in the reversed graph?

Well that would just be the node with the highest name, so the node nine. So this

was the first of these six vertices that depth first search looked at in the first

pass, that lies in what we're calling C1. And indeed everything in C1 was discovered

in that pass before anything in C2 and that's why all of the finishing times in

C2, the 7,8,9 are bigger than all of the finishing times in C1 -

the 1,5, and 6. So we're good to go in case two. We've proven sorry, in case one,

we've proven the lemma. When it's the case that, amongst the vertices in C1

and C2, depth first search in the first pass sees something from C1 first. So

now, let's look at this other case, this grey case, which could also happen,

totally possible. Well, the first thing we see when depth first searching in the

first pass, is something from C2. And here now is where we truly use the fact that

we're using a depth first search rather than some other graph search algorithm

like breadth first search. There's a lot of places in this algorithm you could swap

in breadth first search but in this case two, you'll see why it's important we're

using depth first search to compute the finishing times. And what's the key point?

The key point is that, when we invoke depth first search beginning from this

node V, which is now assuming the line C2. Remember depth first search will not

complete. We won't be done with V until we've found everything there is to find

from it, right? So we recursively explore all of the outgoing arcs. They recursively

explore all of the outgoing arcs, and so on. It's only when all paths going out of

V have been totally explored and exhausted, that we finally backtrack all

the way to V, and we consider ourselves done with it. That is, depth first search.

In the reverse graph initiated at v. Won't finish until everything findable has been

completely explored. Because there's an arc from C2 to C1, obviously everything to

C2 is findable from V, that's strongly connected. We can from C2 to C1 just using

this arc from J to I. C1 being strongly connected we can then find all of that.

Maybe we can find other strongly connected components, as well, but for sure

depth-first search starting from V will find everything in C1 to C2, maybe some

other things. And we won't finish with V until we finish with everything else,

that's the depth-first search property. For that reason the finishing time of this

vertex V will be the largest of anything reachable from it. So in particular it'll

be larger than everything in C two but more to the point, it'll be larger than

everything in C1 which is what we are trying to prove. Again let's just see

this quickly in action in the nine node network on which we traced through

Kosaraju's algorithm. So to show the rule that case two is playing in this concrete

example let's think of the right most strongly connected component as being C1.

And let's think of the middle SCC as being C.2. Now

the last time. We called the middle one C1 and the leftmost one C2. Now we're calling

the rightmost one C1 and the middle one C2. So again, we have to ask the question,

you know, of the six nodes in C1 and in C2, what is the first one encountered in

the depth first search that we do in the first pass. And then again, is the node

nine? The, the node which is originally labeled not. So it's the same node that

was relevant in the previous case, but now with this relabeling of the components,

nine appears in the strongly connected component C-2, not in the one labeled C-1.

So that's the reason now we're in case two, not in case one. And what you'll see

is, what is the finishing time that this originally labeled nine node gets. It gets

the finishing time six. And you'll notice six is bigger than any of the other

finishing times of any of the other nodes in C1 or C2. All, the other five nodes

have the finishing times one through five. And that's exactly because when we ran

depth first search in the first pass, and we started it at the node originally

labeled nine, it discovered these other five nodes and finished exploring them

first before finally back tracking all the way back to nine, and deeming nine fully

explored. And it was only at that point that nine got its finishing time after

everything reachable from it had gotten there lower finishing times. So that

wraps it up. We had two cases depending on whether in these two adjacent SCC's, the

first vertex encountered was in the C1, or in C2. Either way it doesn't matter, the

largest finishing time has to be in C2. Sometimes it's bigger than everything,

sometimes it's just bigger than the biggest in C-1, but it's all the same to

us. And to re cap how the rest of the proof goes, we have a corollary based on

this lemma, which is maximum finishing time have to lie in sink SCC

And that's exactly where we want our depth first search to

initiate. If you're initiated in a strong component with no outgoing arcs, you do

DFS. The stuff you find is just the stuff and that strongly connected component. You

do not have any avenues by which to trespass on other strong components. So

you find exactly one SCC. In effect, you can peel that off and recurse on the rest

of the graph. And our slick way of implementing this recursion is to just do

the single second, DFS pass, where you just treat the nodes in decreasing order

of finishing times, that in effect, unveil, unveils all of the SCCs in

reverse topological ordering. So that's it, Kosaraju's algorithm, and the complete

proof of correctness. A blazingly fast graph primitive that, in any directed

graph, will tell you its strong components.