0:00

Having mastered computing the connected components of an undirected graph in linear time, let's now turn our attention to directed graphs, which also arise in all kinds of different applications. Now, the good news is that we'll be able to get just a blazingly fast primitive for computing connectivity information for directed graphs. The bad news, if you want to call it that, is we'll have to think a little bit harder; it won't be as obvious how to do it. But by the end of this lecture you will know a linear-time algorithm, a very cool one, really just based on depth-first search, for computing all of the pieces of a directed graph. In fact, it's not even so obvious how to define the pieces, how to define connected components, in a directed graph. It's certainly not as obvious as it was with undirected graphs. So, to see what I mean, let's consider the following four-node directed graph.

0:45

So on the one hand, this graph is, in some sense, in one piece. If this was an actual physical object, made, say, of a bunch of strings connected to each other, and we picked it up with our hands, it wouldn't fall apart into two pieces; it would hang together in one piece. On the other hand, when you think about moving around this network, it's not connected in the sense that we might want: you cannot get from any one point a to any other point b. For example, if you started at the right-most node in this graph, certainly there's no directed path that will get you to the left-most node. So what's typically studied, and most useful, for directed graphs is what's called strong connectivity. A graph is strongly connected if you can get from any one point to any other point, and vice versa. The components, then, informally, are the maximal portions of this graph, the maximal regions, which are internally strongly connected; that is, the maximal regions within which you can get from any one point a to any other point b along a directed path. For the record, let me go ahead and give you a formal definition, although this intuition is perfectly valid: just regions where you can get from anywhere to anywhere else.
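To make the definition concrete, here is a small sketch of how you might test strong connectivity directly from it (the adjacency-list-dictionary representation and the function names are just assumptions for illustration): a graph is strongly connected exactly when some node s can reach every node and every node can reach s, which we can check with one search forward and one search backward.

```python
def reachable(adj, s):
    """Return the set of nodes reachable from s via directed paths."""
    seen, stack = {s}, [s]
    while stack:
        u = stack.pop()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def reverse(adj):
    """Build the adjacency lists of the graph with every arc flipped."""
    rev = {u: [] for u in adj}
    for u, vs in adj.items():
        for v in vs:
            rev.setdefault(v, []).append(u)
    return rev

def is_strongly_connected(adj):
    """True iff you can get from any one point to any other and back."""
    nodes = set(adj)
    s = next(iter(nodes))
    # s reaches everyone, and everyone reaches s (i.e. s reaches everyone
    # in the reversed graph) => every pair is mutually reachable.
    return reachable(adj, s) == nodes and reachable(reverse(adj), s) == nodes
```

For instance, a hypothetical graph like `{'a': ['b'], 'b': [], 'c': ['b', 'd'], 'd': []}` hangs together in "one piece" in the physical-strings sense, yet `is_strongly_connected` rightly rejects it, while a directed cycle passes.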

1:55

And as in the undirected case, we're going to give a somewhat slick definition: rather than talking about maximal regions satisfying some property, we're going to talk about the equivalence classes of a particular equivalence relation. But really it means the same thing; this is just the mathematically more mature way of writing it down. The equivalence relation we're going to use is on the nodes of the graph, and we'll say that a node u is related to a node v if you can get from u to v via a directed path, and also from v back to u via some other directed path.

2:32

So remember what it means to be an equivalence relation. First, it's reflexive, that is, everybody's related to itself; and of course there is a path from every node to itself. It also has to be symmetric: if u is related to v, then v is related to u. Well, by definition we're saying that the vertices are mutually reachable from each other, so the relation is clearly symmetric by definition. And then it has to be transitive, and the way you prove it's transitive is you just paste paths together, and it works exactly the way you'd expect it to. So let's illustrate this concretely with a somewhat more complicated directed graph.

3:21

So what I hope is pretty clear is that each of these circled regions is indeed strongly connected. That is, if you start from one node in one of these circled regions, you have a directed path to any other node in it. That holds because on a directed cycle you can get from any one starting point to any other place, and all of these regions contain directed cycles. And then there's the one strong component that has just a single node, which is trivially strongly connected. What I also claim is that all of these regions are maximal, subject to being strongly connected. That's why they're the strongly connected components; that's why they're the equivalence classes of the equivalence relation we just defined. So if you take any pair of nodes which lie in two different circles, either you won't have a directed path from the first one to the second, or you won't have a directed path from the second one back to the first. In fact, the structure of the strong components in this black graph exactly mirrors the directed acyclic graph that we started with in red. In the same way that in the red four-node graph you can't move from right to left, here in this bigger graph you can't move from any of the circled SCCs on the right to any of the circled SCCs on the left. So, for example, from the right-most nodes there are no directed paths to the left-most nodes.

4:26

So that's a recap of the definition of the strongly connected components. I've motivated in a separate video some reasons why you might care about computing strongly connected components, in particular on extremely large graphs, which motivates the need for blazingly fast subroutines, for-free primitives that let you compute connectivity information. You'd love to be able to just know the pieces of a directed graph. Maybe you don't even know why they're going to be useful, but you just compute them because, why not? It's a for-free primitive. So that's what I'm going to give you in this lecture. The algorithm that we're going to present is based on depth-first search, and that's the main reason it's going to be so blazingly fast: because depth-first search is blazingly fast. Now, you might be wondering what on earth graph search has to do with computing components; they don't seem obviously related. So let's return to the same directed graph that I showed you on the previous slide.

5:17

And to see why something like depth-first search might conceivably have some use for computing strong components, suppose we called depth-first search from this red node as a starting point. What would it explore? Remember the guarantee of something like depth-first search, or breadth-first search for that matter: you find everything that's findable, but naturally nothing else. So what is findable from this red node, where by findable I mean you can reach it via a directed path emanating from this red node? Well, there's not much you can do. From here you can explore this arc, and you can explore this arc, and then you can go backward. And so, if you do DFS or BFS from this node, you're going to find precisely the nodes in this triangle. All of the other arcs involved go the wrong direction, and they won't be traversed by, say, a depth-first search call. So, why is that interesting? What's interesting is that if we invoke DFS from this red node, or any of the three nodes in this triangle, then it's going to discover precisely this strongly connected component, precisely the three nodes in this circled SCC. So that seems really cool; it seems like maybe we just do a DFS and, boom, we get an SCC. So maybe if we do that over and over again, we'll get all the SCCs. That's a good initial intuition, but something can go wrong. Suppose that instead of initiating DFS from one of these three nodes on the triangle, we, say, initiated it from this bottom-most node, in green.

6:44

So remember, what is the guarantee of a graph search subroutine like DFS? It will find everything findable, but of course nothing more. So what's findable from this green node? Well, naturally everything in its own SCC, right? So it'll certainly find the four nodes here. On the other hand, if we start from this green node, since there are arcs that go from this bottom-most SCC to the right-most SCC, not only will this DFS call find the four nodes in the green node's strong component, but it will also traverse these blue arcs and discover the three nodes in the red triangle. So, if we call DFS from this green node, we'll capture all seven of these nodes. The point is, if we call DFS, it looks like we're going to get a union of possibly multiple SCCs. In fact, in the worst case, if we invoke DFS from the left-most node, what's it going to discover? It's going to discover the entire graph, and that gives us no insight into the strong component structure at all. So what's the takeaway point? The takeaway point is, if you call DFS from just the right place, you'll actually uncover an SCC; if you call it from the wrong place, it will give you no information at all. So the magic of the algorithm that we're going to discuss next is that, with a super-slick preprocessing step, which is itself, ironically, a depth-first search, we can in linear time compute exactly where we want to start the subsequent depth-first searches from, so that each invocation gets us exactly one strongly connected component and nothing more.

9:37

Now, the naive way to implement this would be to literally construct a new copy of the input graph with all of the arcs in the reverse direction, and then just run depth-first search on it. Of course, the obvious optimization would be to just run DFS on the original graph, but traversing arcs backward. I'll let you think through the details of how you'd do that, but it just works: you run DFS, and instead of going forward along edges, you go backward along edges, and that simulates depth-first search on the reversed graph. Now, I've written here DFS-Loop, and that just means the usual check to make sure that you see all of the nodes of the graph even if it's disconnected: you have an outer loop where you try each starting point separately, and if you haven't already seen it, you run DFS from that given node. I'll be more detailed on the next slide.
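As a sketch of that optimization (hypothetical helper names; the adjacency-list-dictionary representation is an assumption), here is a DFS-Loop that simulates search on the reversed graph by keeping, for each node, the list of arcs pointing into it, and having DFS follow those instead of the outgoing ones. The in-lists are built in one linear pass; in practice you might store both directions up front.

```python
def dfs_loop_reversed(out_adj):
    """DFS over every node, traversing arcs backward, without ever
    materializing a reversed copy of the graph."""
    # in_adj[v] lists every u with an arc u -> v in the original graph;
    # following in_adj is exactly following arcs in reverse.
    in_adj = {u: [] for u in out_adj}
    for u, vs in out_adj.items():
        for v in vs:
            in_adj.setdefault(v, []).append(u)

    explored = set()
    order_found = []               # just to observe what each DFS visits

    def dfs(u):
        explored.add(u)
        order_found.append(u)
        for w in in_adj[u]:        # backward arc: w -> u in the original
            if w not in explored:
                dfs(w)

    # outer loop: try every starting point, so even a disconnected
    # graph has all of its nodes seen
    for u in out_adj:
        if u not in explored:
            dfs(u)
    return order_found
```

The outer loop is the "DFS-Loop" wrapper from the slide: skip nodes already seen, otherwise launch a fresh DFS from them.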

10:27

Now, at this point you should be thinking that I'm totally crazy, right? What are we trying to do? We're trying to compute the strongly connected components, to actually compute real objects, these maximal regions, and all I'm doing is searching the graph. I do it once forward, I do it once backward; it doesn't seem like I'm computing anything. So here's the catch, and it's a very minor catch: we're going to do a little bit of bookkeeping. It's going to be very little overhead, so we'll still have a blazingly fast algorithm. But with that little bit of bookkeeping, here's what's going to happen: the second depth-first search, which searches the forward graph, will in its search process discover the components one at a time, in a very natural way. That will be really obvious when we do an example, which we'll do in just a second. Now, for the second depth-first search to work in this magical way, where it just discovers the connected components one at a time, it's really important that it executes its depth-first searches in a particular order, that it goes through the nodes of the graph in a particular order. And that is exactly the job of the first pass: the depth-first search on the reversed graph is going to compute an ordering of the nodes such that, when the second depth-first search goes through them in that order, it will just discover the SCCs one at a time. In linear time.

12:41

So that's the algorithm at a high level. It's really just two passes of DFS, with some bookkeeping, but this is underspecified; you shouldn't understand how to implement the algorithm just yet. So what do I owe you? I owe you exactly what I mean by DFS-Loop, although we've seen this more or less in the past: it's just a loop over all the vertices of the graph, invoking DFS from each starting point you haven't yet seen. I need to tell you what finishing times are and how they get computed; they're just going to be the integers 1 to n, recording when depth-first search finishes with each of the nodes. And then I need to tell you how you compute these leaders. So let me tell you all three of those things on the next slide.

14:46

In our first depth-first search, the node labeling is totally arbitrary; the labels are basically just the names of the nodes, or their positions in the node array, whatever, and you just process them in some arbitrary order. Now, the second time we run DFS-Loop, as indicated on the previous slide, we're going to use the finishing times as the labeling. As we'll see, the finishing times are indeed the numbers between 1 and n, so what we do is just iterate through the nodes in decreasing order of finishing time.

16:42

Now, once this for loop has completed, once we've examined every outgoing arc from i, then for each such node j, either we had already seen it in the past, or we recursively explored from j and have since returned. At that point, we call ourselves done with node i; there are no more outgoing arcs to explore. We think of i as being finished. Remember, t is the global variable keeping track of how many nodes we're done with, so we increment t, because now we're done with i, and we also remember that i was the t-th vertex to finish.

17:14

That is, we set i's finishing time to be t. Because depth-first search is guaranteed to visit every node exactly once, and therefore to finish with every node exactly once, this global counter t will have value 1 when the first node finishes, value 2 when the next node finishes, then value 3, and so on; when the final node finishes, it'll have value n. So the finishing times of the nodes are going to be exactly the integers from 1 to n. Let's make this much more concrete by going through a careful example.
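The bookkeeping just described can be sketched in a few lines (a minimal recursive version; the function name is hypothetical, and nodes are assumed to be named 1 through n):

```python
def finishing_times(adj, n):
    """First pass of the algorithm, run on the already-reversed graph:
    label the nodes 1..n by the order in which DFS finishes with them.
    adj[i] lists node i's outgoing arcs."""
    explored = set()
    finish = {}
    t = 0                          # global counter: how many nodes are done

    def dfs(i):
        nonlocal t
        explored.add(i)
        for j in adj.get(i, []):
            if j not in explored:
                dfs(j)
        # every outgoing arc of i has been examined, so i is finished
        t += 1
        finish[i] = t

    for i in range(n, 0, -1):      # outer DFS-Loop: from n down to 1
        if i not in explored:
            dfs(i)
    return finish
```

On a tiny 3-cycle 1→2→3→1, for example, the loop starts at 3, and the first node to exhaust its outgoing arcs gets finishing time 1, exactly as the counter t dictates.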

17:45

In fact, I think it'll be better for everybody if you, yourself, trace through part of this algorithm on a concrete example. So let me draw a nine-node graph for you. To be clear, let's assume that we've already executed step one of the algorithm; that is, this blue graph that I've drawn on the slide is the reversal, with the arcs already reversed. Moreover, the nodes are labeled in some arbitrary way from 1 to 9; just assume this is how they show up in the node array, for example. And remember, in the DFS-Loop routine, you're supposed to process the nodes from

18:20

top to bottom, from n down to 1. So my question for you, then, is: in the second step of the algorithm, when we run DFS-Loop and process the nodes from the highest name, 9, in order down to the lowest name, 1, what are the finishing times that we're going to compute as we run DFS-Loop?

Now, it is true that you can get different finishing times depending on the different choices DFS-Loop makes about which outgoing arc to explore next. But I've given you four options for what the finishing times of the nodes 1, 2, 3, all the way up to 9, respectively, might be, and only one of these four could conceivably be an output of DFS-Loop on this graph. So which one is it? All right, the answer is the fourth option; that is the only one of these four sets of finishing times that you might see computed by DFS-Loop on this blue graph. So let's go ahead and trace through DFS-Loop and see how we might get this set of finishing times. Remember, in the main loop we start from the highest node, 9, and then we descend down to the lowest node, 1. So we start by invoking DFS from the node 9.

19:30

So we mark 9 as explored, and from here there's only one outgoing arc, so there's only one place we can go: we go to 6, and we mark 6 as explored. Now there are two places we can go next: we can either go to 3 or we can go to 8, and in general DFS could do either one. To generate this fourth set of finishing times, I'm going to need to assume that we go to 3 first, okay? So again, what we're assuming DFS does: it starts at 9, it has to go to 6, it marks those as explored, then it goes to 3; it does not go to 8 first. Now, from 3 there's only one outgoing arc, which goes to 9, but 9 we've already marked as explored. So it's not going to re-explore 9; it's going to skip that arc. Since that was 3's only outgoing arc, its for loop completes, and 3 is the first node to finish. So when we finish with 3, we increment t, which started at 0 and is now 1, and we set the finishing time of 3 to be 1, just like we said it was in the example. So now we backtrack to 6. We have another outgoing arc from 6 to explore, so now we go to 8. From 8 we have to go to 2, from 2 we have to go to 5, and from 5 we have to go to 8. 8 we've already seen, so then we're done with 5, because that was its only outgoing arc. So we increment t, which is now 2, and the finishing time of 5 is going to be 2, as promised.

21:28

Now, if we were computing leaders, all of these nodes would get the leader 9, but the leaders are only relevant for the second pass, so we're going to ignore them as we do this trace and just keep track of finishing times. Now, we're not done; all we did was finish the DFS that was invoked from the node 9, and we found six of the nodes total in that depth-first search. So now we return to the outer for loop and we decrement i. It started at 9, we're done with that, now we go down to 8. We ask, have we already seen 8? Yes, 8's already explored, so we skip it. We decrement i down to 7 and ask, have we already seen node 7? No, we have not; 7 is not yet explored. So we invoke DFS from node 7. 7 has two outgoing arcs: it can either go to 4 or it can go to 9. Let's say it checks the outgoing arc to 9 first. Now, 9 we already explored. Granted, that was in an earlier part of the for loop, but we remember that; we keep track of who got explored on previous iterations of the for loop, so we don't bother to re-explore 9, and we skip that arc. So from 7 we have to go to 4, from 4 we have to go to 1, and from 1 we have to go back to 7. 7's already been explored, so we backtrack, and now we're done with 1. So 1 is the next node we complete, and the finishing time of 1 is going to be 7, as promised. We backtrack to 4; there are no more outgoing arcs from 4 to explore, so 4 is going to be the eighth node to finish.
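You can reproduce this trace mechanically. The snippet below encodes the reversed nine-node graph as described, with adjacency lists ordered to match the choices assumed in the trace (6 tries 3 before 8; 7 tries 9 before 4), and runs the finishing-time DFS-Loop over it:

```python
# The reversed (blue) graph from the example; list order fixes which
# outgoing arc DFS explores first, matching the trace above.
grev = {9: [6], 6: [3, 8], 3: [9], 8: [2], 2: [5], 5: [8],
        7: [9, 4], 4: [1], 1: [7]}

def finishing_times(adj, n):
    """DFS-Loop from n down to 1, recording when each node finishes."""
    explored, finish = set(), {}
    t = 0                          # how many nodes are done so far

    def dfs(i):
        nonlocal t
        explored.add(i)
        for j in adj[i]:
            if j not in explored:
                dfs(j)
        t += 1                     # i has no arcs left: it finishes now
        finish[i] = t

    for i in range(n, 0, -1):
        if i not in explored:
            dfs(i)
    return finish
```

Running `finishing_times(grev, 9)` reproduces the quiz's fourth option: nodes 3, 5, 2, 8, 6, 9, 1, 4, 7 finish in that order, so their finishing times are 1 through 9, respectively.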

22:49

As promised. And the last node to finish is poor node 7; it is last. So that would be an example of how the DFS-Loop subroutine computes finishing times on a reversed graph. Now let's work through the second pass, on the forward version of the graph, using the same example. Remember, the point of the first pass is to compute a magical ordering, and the magical ordering is these finishing times. So now we're going to throw out the original node names, and we're going to replace the node names, in blue, by the finishing times, in red. We're also going to work with the original graph, which means we have to reverse the arcs back to where they were originally. So those are the two changes you're going to see when I redraw this graph: first, all the arcs reverse orientation; second, all of the nodes change names from their original ones to the finishing times we just computed.

23:43

So here's our new graph, with the new node names and all of the arcs with their orientations reversed. And now we run DFS again on this graph. Again, we're going to process the nodes in order, from the highest label, 9, down to the lowest label, 1. Moreover, we don't need to compute finishing times in the second pass; we only need those in the first pass. In the second pass, we instead keep track of the leaders, and remember, the leader of a vertex is the vertex from which the DFS call that first discovered it was invoked. All right, so what's going to happen? Well, in the outer for loop, again we start with i equal to 9, and we invoke DFS from the node 9.

24:22

So 9 is going to be the current leader, because that's where the current DFS got initiated. Now, from 9 there's only one choice: we have to go to 7. From 7 there's only one choice: we have to go to 8. From 8 there's only one choice: we have to go back to 9. And then 9's already been seen, so we backtrack: we go back to 8, back to 7, back to 9, and that's it. So when we invoke DFS from node 9, the only things that we encounter are the nodes 7, 8, and 9, and these are all going to be given the leader vertex 9. You will notice that this is indeed one of the strongly connected components of the graph; we just found it with this invocation of DFS from the node 9.

So now we go back to the outer for loop. We say, okay, let's go to node 8; have we already seen 8? Yes. What about 7, have we already seen 7? Yes. What about 6? We have not yet discovered 6, so we invoke DFS from node 6, and we reset the global source vertex s to 6. From 6, we can go to 9 or we can go to 1. Let's say we explore 9 first. Well, we already saw 9 in an earlier iteration of the for loop, so we don't explore it again; we don't discover 9 now, so we backtrack to 6. We go to 1, from 1 we have to go to 5, from 5 we have to go to 6, and then we start backtracking again. So the only new nodes that we encounter when we invoke DFS from the node 6 are the vertices 6, 1, and 5, and all of these will have a leader vertex of 6, because that's where we called DFS from when we first discovered these three nodes. And you'll notice this is another SCC of this directed graph. So we invoked DFS again, now from the node 6, and the new nodes it discovered were exactly an SCC of the graph. Nothing more, nothing less.

So now we return to the outer for loop. We go to node 5; have we already seen 5? Yes. Have we already seen 4? No, we haven't seen 4 yet, so now we invoke DFS from 4. Again, we could try to explore 5, but we've seen that before, so we're not going to explore it again. From 4, then, we have to go to 2, from 2 we have to go to 3, from 3 we have to go back to 4, and then, after all the backtracking, we're done. So the final call to DFS will be from the node 4.

26:40

And that DFS will newly discover precisely the nodes 2, 3, and 4. They will all have the leader vertex 4, because that was where this DFS was called from. It's true we'll go back to the for loop and check: have we seen 3 yet? Yes. Have we seen 2 yet? Yes. Have we seen 1 yet? Yes. And then the whole thing completes. And what we see is that, using the finishing times computed in that first depth-first search pass, somehow the strongly connected components of this graph just showed up and presented themselves to us, one at a time, on a silver platter. Every time we invoked DFS, the nodes we newly discovered were precisely one of the SCCs, nothing more, nothing less. And that's really what's going on in this algorithm; it turns out this is true in general. The first pass, DFS on the reversed graph, computes finishing times such that, if you then process the nodes in decreasing order of finishing time in the second pass, each invocation of DFS will discover exactly one new SCC. So they'll just present themselves to you, one per DFS call, in that second pass's for loop.
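Tracking leaders takes only a couple of extra lines. Here is a sketch of the second-pass DFS-Loop (hypothetical function name), followed by a run on the relabeled forward graph from this example, with arcs exactly as described in the walkthrough:

```python
def second_pass(adj, n):
    """Second pass, on the forward graph whose nodes have been renamed by
    finishing time: scan from n down to 1, tagging every newly discovered
    node with the 'leader' whose DFS call first reached it."""
    explored = set()
    leader = {}

    def dfs(i, s):
        explored.add(i)
        leader[i] = s              # s is the node this DFS was invoked from
        for j in adj.get(i, []):
            if j not in explored:
                dfs(j, s)

    for i in range(n, 0, -1):
        if i not in explored:
            dfs(i, s=i)            # i becomes the leader of a fresh SCC
    return leader

# The forward graph from the example, nodes renamed by finishing time.
g = {9: [7], 7: [8], 8: [9], 6: [9, 1], 1: [5], 5: [6],
     4: [5, 2], 2: [3], 3: [4]}
```

Grouping the nodes of `second_pass(g, 9)` by leader yields exactly the three SCCs from the walkthrough: {7, 8, 9} with leader 9, {1, 5, 6} with leader 6, and {2, 3, 4} with leader 4.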

27:49

This is, of course, merely an example; you should not take a single example as proof that this algorithm always works. I will give you a general argument in the next video. But hopefully there is at least a plausibility argument here: no longer does this three-step algorithm seem totally insane, and maybe you can imagine that, perhaps, it works. At least there are some principles in play, where you first compute the right order in which to process the nodes, and then the second pass peels off the SCCs one at a time, like layers from an onion. One thing that I hope is pretty clear is that this algorithm, correct or not, is blazingly fast.

28:26

Pretty much all you do is two depth-first searches, and since depth-first search, as we've seen in the past, runs in time linear in the size of the graph, so does Kosaraju's two-pass algorithm. There are a couple of subtleties, and I encourage you to think about them; you'll be forced to think about them in the programming project for week four. For example, in the second pass, how do you process the nodes in decreasing order of finishing time? You don't want to sort the nodes by their finishing times, because that would take n log n time. So you need to make sure that, in the first pass, you remember the nodes in a way that lets you do just a linear scan through them in the second pass. So there are some details, but if your intuition is that this is really just double DFS, properly implemented, that's pretty much exactly right.
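Putting the pieces together, here is one way the whole two-pass algorithm might be implemented so that the decreasing-finishing-time order falls out of a linear scan: instead of numeric finishing times, the first pass appends each node to a list the moment it finishes, and the second pass simply walks that list backward. This is a recursive sketch with hypothetical names; on huge graphs you would want an explicit stack instead of recursion.

```python
def kosaraju(adj):
    """Kosaraju's two-pass SCC algorithm, sketched. adj maps every node
    to the list of its outgoing arcs (every node appears as a key)."""
    # Pass 1: DFS on the reversed graph, recording finish order.
    rev = {u: [] for u in adj}
    for u, vs in adj.items():
        for v in vs:
            rev[v].append(u)

    order, explored = [], set()

    def dfs_rev(u):
        explored.add(u)
        for w in rev[u]:
            if w not in explored:
                dfs_rev(w)
        order.append(u)            # u finishes: remember it; no sorting needed

    for u in adj:
        if u not in explored:
            dfs_rev(u)

    # Pass 2: DFS on the forward graph, in decreasing finish order.
    explored.clear()
    sccs = []

    def dfs_fwd(u, comp):
        explored.add(u)
        comp.append(u)
        for w in adj[u]:
            if w not in explored:
                dfs_fwd(w, comp)

    for u in reversed(order):      # highest finishing time first
        if u not in explored:
            comp = []
            dfs_fwd(u, comp)       # newly found nodes = exactly one SCC
            sccs.append(comp)
    return sccs
```

The `order` list is the linear-scan trick mentioned above: reading it back to front is exactly processing nodes in decreasing order of finishing time, with no sort anywhere.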
