0:00

So the goal of this video is to prove the correctness of Kosaraju's two-pass, depth-first-search-based, linear-time algorithm that computes the strongly connected components of a directed graph. I've given you the full specification of the algorithm, and I've also given you a plausibility argument for why it might work, in that it at least does something sensible on an example. Namely, it first does a pass of depth-first search on the reversed graph and computes this magical ordering. What's so special about this ordering is that when we then do a depth-first search using it on the forward graph, the algorithm seems to do exactly what we want: every invocation of depth-first search at some new node discovers exactly the nodes of one strong component and no extra stuff. Remember our first observation: it was unclear whether depth-first search would even be useful for computing strong components. If you call depth-first search from just the right place, you get exactly the nodes of an SCC and nothing more. If you call it from the wrong place, you might get all of the nodes of the graph, and no information at all about the structure of the strong components. At least in this example, the first pass with the finishing times seems to be leading us to invoke DFS from exactly the right places.

So remember how this worked in the example. In the top graph, I've shown you the graph with the arcs reversed. This is where we first invoked DFS-Loop, with the loop over the nodes going from the highest node name, nine, all the way down to the node named one. Here we compute finishing times; that's the bookkeeping we do in the first pass. We just keep a running count of how many nodes we've finished processing, that is, how many nodes we've both explored and whose outgoing arcs we've also fully explored. That gives us these numbers in red, the finishing times between one and nine for the various nodes. Those became the new node names in the second graph. Then we reversed the arcs again to get the original graph back, and we saw that every time we invoked DFS in our second pass, we uncovered exactly the nodes of an SCC. When we invoked it from the node 9, we discovered 9, 8, and 7; those have leader vertex 9. When we next invoked DFS from 6, we discovered 6, 5, and 1, and nothing else. And finally we invoked it from 4, and we discovered 2, 3, and 4, and nothing else. Those are exactly the three SCCs of this graph.

So let's now understand why this works in any directed graph, not just in this one example. Let's begin with a simple observation about directed graphs, which is actually interesting in its own right. The claim is that every directed graph can be viewed at two levels of granularity. If you squint, if you sort of zoom out, what you see is a directed acyclic graph whose nodes are its strongly connected components. And if you want, you can zoom in and focus on the fine-grained structure within one SCC. A little more precisely:

The claim is that the strongly connected components of a directed graph induce, in a natural way, an acyclic metagraph. So what is this metagraph? What are the nodes and what are the arcs? Well, the meta-nodes are just the SCCs: we think of every strongly connected component as a single node of the metagraph, and we call them, say, C1 up to Ck. And what are the arcs of the metagraph? They're basically just the ones corresponding to the arcs between SCCs in the original graph. That is, we include in the metagraph an arc from the strong component C to C-hat if and only if there is an arc from a node in C to a node in C-hat in the original graph G. So, for example, if this is your C and the triangle is your C-hat, and you have one or maybe multiple arcs going from C to C-hat, then in the corresponding metagraph you're just going to have a node for C, a node for C-hat, and a directed arc from C to C-hat. If we go back to one of the directed graphs we've used as a running example, the one from the beginning of the previous video, which looked something like this, the corresponding directed acyclic graph has four nodes and four arcs. And for the running example we used to illustrate Kosaraju's algorithm, the one with the three triangles, the corresponding metagraph is just a path with three nodes.

So why is this metagraph guaranteed to be acyclic? Well, remember that meta-nodes correspond to strong components, and within a strong component you can get from anywhere to anywhere else. Likewise, on a directed cycle you can get from anywhere to anywhere else. So if you had a cycle involving two different meta-nodes, that is, two supposedly distinct strongly connected components, each reachable from the other, they would collapse into a single SCC: you can get from anywhere to anywhere within one, from anywhere to anywhere within the other, and you can also travel between them at will, so you can get from anywhere in the union to anywhere in the union. Not just in this context of computing strong components, but more generally, this is a useful fact to know about directed graphs. On the one hand, they can have very complex structure within a strong component: you have paths going from everywhere to everywhere else, and it may look rather complicated. But at a higher level, if you abstract out to the level of SCCs, you are guaranteed to have this simple DAG, this simple directed acyclic graph structure.

So, to reinforce these concepts, and also to segue into thinking about Kosaraju's algorithm in particular, let me ask you a question about how reversing arcs affects the strong components of a directed graph. The correct answer to this quiz is the fourth one: the strong components are exactly the same as they were before. In fact, the relation that we described is exactly the same as before, so its equivalence classes, the strong components, are exactly the same. If two nodes were related in the original graph, that is, there is a path from u to v and a path from v to u, that's still true after you reverse all the arcs; you just use the reversals of the two paths you had before. Similarly, if two nodes weren't related before, say because you could not get from u to v, then after you reverse everything you can't get from v to u, so again the relation fails to hold. So the SCCs are exactly the same in the forward and the backward graph. In particular, in Kosaraju's algorithm, the strong component structure is exactly the same in the first pass of DFS and in the second pass of DFS.

So now that we understand how every directed graph has a metagraph, whose nodes correspond to its strongly connected components, with an arc from one SCC to another if there's any arc from any node of the first SCC to the second SCC in the original graph, I'm in a position to state the key lemma that drives the correctness of Kosaraju's two-pass algorithm for computing the strongly connected components of a directed graph.
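As a brief concrete aside (not from the lecture itself), the metagraph just described, often called the condensation, is easy to build once every node carries an SCC label. The function name, the toy graph, and the labels below are all made-up illustrations:

```python
# Sketch (assumed helper, not the lecture's code): build the acyclic
# "metagraph" (condensation) of a directed graph, given an SCC label
# for each node. Meta-nodes are SCC labels; there is a meta-arc C -> C'
# whenever some original arc goes from a node labeled C to one labeled C'.

def condensation(edges, scc_of):
    """edges: list of (u, v) arcs; scc_of: dict node -> SCC label."""
    meta_arcs = set()
    for u, v in edges:
        cu, cv = scc_of[u], scc_of[v]
        if cu != cv:                  # arcs inside one SCC don't appear
            meta_arcs.add((cu, cv))
    return meta_arcs

# Hypothetical 6-node graph: two triangles joined by a single arc.
edges = [(1, 2), (2, 3), (3, 1),      # SCC "A"
         (4, 5), (5, 6), (6, 4),      # SCC "B"
         (3, 4)]                      # the only arc between the SCCs
scc_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
print(condensation(edges, scc_of))    # {('A', 'B')} -- a 2-node DAG
```

Arcs inside a single SCC are discarded, and parallel arcs between the same pair of SCCs collapse into one meta-arc, which matches the definition above.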

So here's the lemma statement. It considers two strongly connected components that are adjacent, in the sense that there's an arc from one node in one of them to one node in the other. Say we have one SCC, C1, containing a node i, and another SCC, C2, containing a node j, and that in the graph G there's an arc directly from i to j. In this sense we say these SCCs are adjacent, with the second one in some sense coming after the first. Now suppose we've already run the first pass, the DFS-Loop subroutine; remember, that works on the reversed graph. So we've invoked it on the reversed graph and computed the finishing times. As usual, we'll let f(v) denote the finishing times computed by that depth-first search subroutine on the reversed graph. The lemma then asserts the following. Amongst all the nodes in C1, look at the one with the largest finishing time; similarly, amongst all the nodes in C2, look at the one with the largest finishing time. The claim is that, amongst all of these, the biggest finishing time will be in C2, not in C1.

What I want to do next is temporarily assume that this lemma is true, explore the consequences of that assumption, and in particular show you that if the lemma holds, then we can complete the proof of correctness of Kosaraju's two-pass SCC algorithm. So, if the lemma is true, I'll give you the argument for why we're done, why the second pass of depth-first search just peels off the SCCs one at a time. Now, of course, a proof with a hole in it isn't a proof, so at the end of the lecture I'm going to fill in the hole; that is, I'm going to supply a proof of this key lemma. But for now, as a working hypothesis, let's assume it's true.

Let's begin with a corollary, that is, a statement which follows essentially immediately from the statement of the lemma.

For the corollary, let's forget about the maximum finishing time within a single SCC and think instead about the maximum finishing time in the entire graph. Now, why do we care about the maximum finishing time in the entire graph? Well, notice that's exactly where the second pass of DFS begins: it processes nodes in order from largest finishing time to smallest. So, equivalently, let's think about the node at which the second pass of depth-first search begins, i.e., the node with the maximum finishing time. Where could it be? The corollary is that it has to be in what I'm going to call a sink SCC, that is, a strongly connected component without any outgoing arcs.

For example, let's go back to the metagraph of SCCs for the very first directed graph we looked at when we started talking about this algorithm. Recall there were four SCCs: a C1, a C2, a C3, and a C4. Of course, within each of these components there could be multiple nodes, but they are all strongly connected to each other. Now, let's use F1, F2, F3, and F4 to denote the maximum finishing time in each of these SCCs. We now have four different opportunities to apply the lemma, right? Four different pairs of adjacent SCCs. And what do we find? Comparing F1 and F2: because C2 comes after C1, that is, there's an arc from C1 to C2, the max finishing time in C2 has to be bigger than that in C1; that is, F2 is bigger than F1. By the same reasoning, F3 has to be bigger than F1. Symmetrically, we can apply the lemma to the pairs C2, C4 and C3, C4, and we get that F4 has to dominate both of them. Notice we actually have no idea whether F2 or F3 is bigger; that pair we can't resolve. But we do know these relationships: F1 is the smallest and F4 is the biggest. You'll also notice that C4 is a sink SCC, a sink having no outgoing arcs, and if you think about it, that's a totally general consequence of the lemma. A simple proof by contradiction goes as follows. Consider the SCC with the maximum F-value, and suppose it were not a sink SCC, so it has an outgoing arc. Follow that outgoing arc to some other SCC; by the lemma, the SCC you've arrived in has an even bigger maximum finishing time, contradicting the fact that you started in the SCC with the maximum finishing time. Okay, so just like in this cartoon, where the unique sink SCC has to have the largest finishing time, that's totally general.

As another sanity check, we can return to the nine-node graph on which we actually ran Kosaraju's algorithm. Looking at the forward version of the graph, the one on the bottom, we see that the maximum finishing times in the three SCCs are 4, 6, and 9. It turns out these are the same as the leader nodes, which is not an accident if you think about it for a little while. And again, you'll observe that the maximum finishing time in this graph, namely 9, is indeed in the leftmost SCC, which is the only SCC with no outgoing arcs. But it's totally general: basically, you can keep following arcs and you keep seeing bigger and bigger finishing times, so the biggest one of all has to be somewhere you get stuck, where you can't go forward because there are no outgoing arcs, and that's what I'm calling a sink SCC. Okay, so assuming the lemma is true, we know that the corollary is true.

Now, using this corollary, let's finish the proof of correctness of Kosaraju's algorithm, modulo the proof of the key lemma. I'm not going to do this super-rigorously, although everything I say is correct and can be made rigorous; if you want a more rigorous version, I'll post some notes on the course website which you can consult for more details. What the previous corollary accomplishes is that it lets us locate the node with the maximum finishing time: it's somewhere in some sink SCC. Let me remind you of the discussion we had at the very beginning of talking about computing strong components.

We were trying to understand whether depth-first search would be a useful workhorse for finding the strong components, and the key observation was that it depends where you begin that depth-first search. For example, in this graph with four SCCs, shown in blue on the right, a really bad place to start a call to depth-first search would be somewhere in C1, somewhere in this source SCC; this is a bad DFS. Why is it bad? Well, remember what depth-first search does: it finds everything findable from its starting point. And from C1 you can get to the entire world, all the nodes of the entire graph, so you discover everything. That is totally useless, because we wanted to discover much more fine-grained structure; we wanted to discover C1, C2, C3, and C4 individually. So it would be a disaster if we invoked depth-first search from somewhere in C1. Fortunately, that's not what's going to happen, right? We computed this magical ordering in the first pass to ensure that we first look at the node with the maximum finishing time and, by the corollary, the maximum finishing time is going to be somewhere in C4.
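As a quick aside, the first pass that produces this magical ordering can be sketched in code. This is a minimal recursive rendition, not the lecture's pseudocode, and the six-node example graph (two triangles joined by one arc) is made up for illustration:

```python
# Sketch of the first pass: DFS-Loop on the REVERSED graph, recording
# the order in which nodes finish. Recursive, like the lecture's
# pseudocode; fine for tiny graphs. In the made-up example, the triangle
# {1,2,3} points into the triangle {4,5,6} via the arc 3 -> 4, so
# {4,5,6} is the sink SCC of the forward graph.

def first_pass_order(n, edges):
    """Return nodes in increasing order of finishing time, with the
    outer loop running from node n down to node 1."""
    rev = {v: [] for v in range(1, n + 1)}
    for u, v in edges:
        rev[v].append(u)                  # reverse each arc
    explored, order = set(), []
    def dfs(v):
        explored.add(v)
        for w in rev[v]:
            if w not in explored:
                dfs(w)
        order.append(v)                   # v finishes after its descendants
    for s in range(n, 0, -1):             # outer loop: n down to 1
        if s not in explored:
            dfs(s)
    return order

edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6), (6, 4)]
order = first_pass_order(6, edges)
print(order)        # [1, 2, 3, 4, 5, 6]
print(order[-1])    # 6 -- the max finishing time lies in the sink SCC {4,5,6}
```

On this toy input, the node with the largest finishing time does land in the sink SCC of the forward graph, consistent with the corollary.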

Starting there is going to be a good DFS, in the sense that when we begin exploring from anywhere in C4, there are no outgoing arcs. So of course we're going to find everything in C4, since everything in C4 is strongly connected to everything else there, but we can't get out. We have no option of trespassing on other strong components, so we're not going to find them; we find C4 and nothing more. Now, here's where I'm going to be a little informal, although, again, everything I'm going to say is correct. So what happens once we've discovered everything in C4?

Well, all the nodes in C4 get marked explored as we do the depth-first search, and then they're basically dead to us, right? The rest of our DFS-Loop will never explore them again; they're already marked explored, and if we ever see them, we don't even go there. So the way to think about it is that when we proceed with the rest of the for loop in DFS-Loop, it's as if we're starting afresh, doing depth-first search from scratch on a smaller graph: the residual graph, the graph G with the newly discovered strong component deleted. In this example on the right, all of the nodes of C4 are dead to us, and it's as if we run DFS anew on just the graph containing the strong components C1, C2, and C3. So in particular, where is the next invocation of depth-first search going to come from? It's going to come from some sink SCC of the residual graph, right? It starts at the remaining node with the largest finishing time. Now, there's some ambiguity in this picture: again, recall we don't know whether F2 or F3 is bigger; it could be either one. Maybe F2 is the largest remaining finishing time, in which case the next DFS invocation begins somewhere in C2. Again, the only arcs outgoing from C2 lead to already-explored nodes; they're effectively deleted, and we're not going to go there again.

So C2 is now essentially a sink SCC. We newly discover the nodes of C2 and nothing else, and those are now effectively deleted as well. The next invocation of DFS will then come from somewhere in C3, the only remaining sink SCC of the residual graph, so the third call to DFS discovers that stuff. And now, of course, we're left only with C1, so the final invocation of DFS will start from, and discover, the nodes of C1. In this sense, because we ordered the nodes by the finishing times of a DFS on the reversed graph, that ordering has this incredible property that when we process the nodes in the second pass, we just peel off the strongly connected components one at a time. If you think about it, we peel them off in reverse topological order with respect to the directed acyclic graph of the strongly connected components.

So we've constructed a proof of correctness of Kosaraju's algorithm for computing strongly connected components. But again, there's a hole in it: we completed the argument assuming a statement that we haven't proved. So let's fill in that last gap in the proof, and we'll be done. What we need to do is prove the key lemma. Let me remind you what it says: if you have two adjacent SCCs, C1 and C2, with an arc from a node i in C1 to a node j in C2, then the max finishing time in C2 is bigger than the max finishing time in C1, where, as always, these finishing times are computed in the first pass, the depth-first search loop on the reversed graph.

All right, the finishing times are computed in the reversed graph, so let's actually reverse all the arcs and reason about what happens there. We still have C1, which still contains the node i; we still have C2, which still contains the node j. But now, of course, the orientation of the arc is reversed, so the arc points from j to i. Recall we had a quiz which asked you to understand the effect of reversing all arcs on the SCCs, and in particular there is no effect: the SCCs of the reversed graph are exactly the same as those of the forward graph.

Now, we're going to have two cases in this proof, and the cases correspond to where we first encounter a node of the union of C1 and C2. Remember, when we do DFS-Loop in the first pass, because we have that outer for loop iterating over all of the nodes, we're guaranteed to explore every single node of the graph at some point; in particular, at some point we'll explore every node of C1 and C2. What I want you to do is pause the algorithm when, for the first time, it explores some node that's in either C1 or C2. There are going to be two cases, of course, because that node might be in C1, or it might be in C2. Case one is when the first node we see from either component happens to lie in C1, and case two is when the first node v that we see happens to lie in C2. Clearly exactly one of these will occur. So let's think about case one.

When we see a node of C1 before we see any node of C2: in this case, the claim is that we're going to explore everything in C1 before we ever see anything in C2. Why is that true? The reason is that there cannot be a path that starts somewhere in C1, for example at the vertex v, and reaches C2. This is where we use the fact that the metagraph on the SCCs is acyclic. Right: C1 is strongly connected, C2 is strongly connected, and in the reversed graph you can get from C2 to C1; if you could also get from C1 back to C2, this would all collapse into a single strongly connected component. But that would be a contradiction, since we're assuming C1 and C2 are distinct strongly connected components; therefore you can't have paths in both directions. We already have a path from right to left, via the arc from j to i, so there's no path from left to right. That's why, if you originate a depth-first search from somewhere inside C1, like this vertex v, you finish exploring all of C1 before you ever see C2; you'll only see C2 at some later point in the outer for loop. So what's the consequence of completely finishing with C1 before you ever see C2? It means every single finishing time in C1 is going to be smaller than every single finishing time in C2. That's even stronger than what we're claiming; we're only claiming that the biggest finishing time in C2 is bigger than the biggest in C1, but actually the finishing times in C2 totally dominate those in C1, because you finish C1 before you ever see C2.

So let's now see case one in action. Let's return to the nine-node graph, the one on which we actually ran Kosaraju's algorithm to completion. Going back to this graph with its three strongly connected components, remember that the bottom version is the forward graph and the top version is the reversed graph. If you think of the middle SCC as playing the role of C1, and the leftmost SCC as playing the role of C2, then what we have is exactly case one of the key lemma. So which was the first of these six vertices visited during DFS-Loop in the reversed graph?

Well, that would just be the node with the highest name, the node nine. So this was the first of these six vertices that depth-first search looked at in the first pass, and it lies in what we're calling C1. And indeed, everything in C1 was discovered in that pass before anything in C2, and that's why all of the finishing times in C2, the 7, 8, and 9, are bigger than all of the finishing times in C1, the 1, 5, and 6. So we're good to go in case one: we've proven the lemma when, amongst the vertices of C1 and C2, depth-first search in the first pass sees something from C1 first.

Now let's look at the other case, which could also happen, totally possible: the first thing we see when depth-first searching in the first pass is something from C2. And here is where we truly use the fact that we're running depth-first search rather than some other graph-search algorithm like breadth-first search. There are a lot of places in this algorithm where you could swap in breadth-first search, but in this case two you'll see why it's important that we use depth-first search to compute the finishing times. And what's the key point?

The key point is that when we invoke depth-first search beginning from this node v, which we're now assuming lies in C2, depth-first search will not be done with v until we've found everything there is to find from it, right? We recursively explore all of its outgoing arcs; those recursively explore all of their outgoing arcs, and so on. Only when all paths going out of v have been totally explored and exhausted do we finally backtrack all the way to v and consider ourselves done with it. That is, depth-first search in the reversed graph, initiated at v, won't finish until everything findable from v has been completely explored. Because C2 is strongly connected, everything in C2 is findable from v; and because in the reversed graph there's an arc from C2 to C1, the arc from j to i, we can get from C2 into C1, and, C1 being strongly connected, we can then find all of it.

Maybe we also find other strongly connected components, but for sure, depth-first search starting from v will find everything in C1 and C2, and maybe some other things. And we won't finish with v until we've finished with everything else; that's the depth-first search property. For that reason, the finishing time of this vertex v will be the largest of anything reachable from it. In particular, it'll be larger than everything in C2, but more to the point, it'll be larger than everything in C1, which is what we're trying to prove.

Again, let's quickly see this in action in the nine-node network on which we traced through Kosaraju's algorithm. To show the role that case two plays in this concrete example, let's now think of the rightmost strongly connected component as being C1, and the middle SCC as being C2. (Last time we called the middle one C1 and the leftmost one C2; now we're calling the rightmost one C1 and the middle one C2.) So again we have to ask: of the six nodes in C1 and C2, which is the first one encountered in the depth-first search that we do in the first pass? Again, it's the node originally labeled nine. So it's the same node that was relevant in the previous case, but with this relabeling of the components, nine now lies in the strongly connected component labeled C2, not in the one labeled C1.

So that's the reason we're now in case two, not case one. And what you'll see is: what finishing time does the node originally labeled nine get? It gets the finishing time six. And you'll notice six is bigger than all of the finishing times of the other nodes in C1 and C2; the other five nodes have the finishing times one through five. That's exactly because, when we ran depth-first search in the first pass and started it at the node originally labeled nine, it discovered these other five nodes and finished exploring them first, before finally backtracking all the way to nine and deeming nine fully explored. Only at that point did nine get its finishing time, after everything reachable from it had gotten its smaller finishing time.

So that wraps it up. We had two cases, depending on whether, of these two adjacent SCCs, the first vertex encountered was in C1 or in C2. Either way, it doesn't matter: the largest finishing time has to be in C2. Sometimes it's bigger than everything in both, sometimes it's just bigger than the biggest in C1, but it's all the same to us. And to recap how the rest of the proof goes: we have a corollary based on this lemma, which says the maximum finishing time has to lie in a sink SCC, and that's exactly where we want our depth-first search to initiate. If you initiate it in a strong component with no outgoing arcs and do DFS, the stuff you find is exactly the stuff in that strongly connected component; you have no avenues by which to trespass on other strong components, so you find exactly one SCC. In effect, you can peel it off and recurse on the rest of the graph. And our slick way of implementing this recursion is to just do the single second DFS pass, in which you treat the nodes in decreasing order of finishing times; that, in effect, unveils all of the SCCs in reverse topological order. So that's it: Kosaraju's algorithm, and the complete proof of correctness. A blazingly fast graph primitive that, in any directed graph, will tell you its strong components.
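To make the two passes concrete, here's a compact sketch of the whole algorithm as described in the lecture: pass one computes finishing times on the reversed graph, and pass two runs DFS on the forward graph in decreasing order of finishing time, with each outer call peeling off one SCC. The helper names are my own, the DFS is made iterative to sidestep recursion-depth limits, and the six-node example graph (two triangles joined by one arc) is made up for illustration rather than the lecture's nine-node graph:

```python
# Sketch of Kosaraju's two-pass SCC algorithm (iterative DFS).
# Pass 1: DFS-Loop on the reversed graph, recording finishing order.
# Pass 2: DFS-Loop on the forward graph, from largest finishing time
# to smallest; each outer call discovers exactly one SCC.

from collections import defaultdict

def dfs_collect(adj, start, explored):
    """Iterative DFS; returns the newly explored nodes in the order
    they finish (i.e., once all their out-arcs are exhausted)."""
    finished, stack = [], [(start, False)]
    while stack:
        v, done = stack.pop()
        if done:
            finished.append(v)
        elif v not in explored:
            explored.add(v)
            stack.append((v, True))   # revisit v after its descendants
            stack.extend((w, False) for w in adj[v] if w not in explored)
    return finished

def kosaraju_sccs(nodes, edges):
    fwd, rev = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)
    # Pass 1: finishing order on the reversed graph.
    explored, order = set(), []
    for s in sorted(nodes, reverse=True):
        if s not in explored:
            order += dfs_collect(rev, s, explored)
    # Pass 2: forward DFS in decreasing finishing time; one SCC per call.
    explored, sccs = set(), []
    for s in reversed(order):
        if s not in explored:
            sccs.append(set(dfs_collect(fwd, s, explored)))
    return sccs

# Hypothetical example: triangle {1,2,3} -> triangle {4,5,6} via arc 3->4.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6), (6, 4)]
print(kosaraju_sccs(range(1, 7), edges))   # [{4, 5, 6}, {1, 2, 3}]
```

Note that the SCCs come out sink-first, i.e., in reverse topological order of the metagraph, exactly as the proof predicts.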
