0:01

So what's the problem?

Well, I did say that most of the stuff about graph search really doesn't depend on whether the graph is undirected or directed; it's pretty much cosmetic changes.

But the big exception is when you're computing connectivity, when you're computing the pieces of the graph.

So right now, I'm only going to talk about undirected graphs.

The directed case we can get very efficient algorithms for, but it's quite a bit harder work, so that's going to be covered in detail in a separate video.

0:56

But we do want to set up a more formal definition, something mathematical that we could say is true or false about a given graph.

And roughly, we define the connected components of an undirected graph as the maximal regions that are connected, in the sense that you can get from any vertex in the region to any other vertex in the region using a path.

So, maximal connected regions in that sense.

Now the slick way to do this is using an equivalence relation.

And I'm going to do this here in part because it's really the right way to think about the directed graph case, which we'll talk about in some detail later.

1:57

So I'll leave it for you to do the simple check that this squiggle is indeed an equivalence relation.

I'm going to remind you what equivalence relations are.

This is something you generally learn in your first class on proofs or your first class in discrete math.

So a relation is just something which may or may not be true about pairs of objects.

An equivalence relation has to satisfy three properties.

First, it has to be reflexive, meaning everything has to be related to itself, and indeed in a graph there is a path from any node to itself, namely the empty path.

Second, an equivalence relation has to be symmetric, meaning that if u and v are related, then v and u are related.

Because this is an undirected graph, it's clear that this relation is symmetric: if there's a path from u to v in the graph, there's also a path from v to u, so no problem there.

Finally, an equivalence relation has to be transitive, meaning that if u and v are related and so are v and w, then so are u and w.

That is, if u and v have a path, and v and w have a path, then so do u and w, and you can prove transitivity just by pasting the two paths together.
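The pasting argument can be sketched in a few lines of code; the paths here are represented as lists of vertices, and the concrete example paths are hypothetical, just to illustrate the construction.

```python
# A sketch of the pasting argument for transitivity. Paths are lists of
# vertices; the concrete paths below are hypothetical examples.
def paste(path_uv, path_vw):
    """Paste a u-v path onto a v-w path to get a u-w walk."""
    assert path_uv[-1] == path_vw[0]  # both paths must meet at v
    return path_uv + path_vw[1:]      # drop the duplicated copy of v

p = [1, 3, 5]       # a path from u = 1 to v = 5
q = [5, 9]          # a path from v = 5 to w = 9
print(paste(p, q))  # [1, 3, 5, 9], a walk from 1 to 9
```

Strictly speaking, pasting yields a walk, which may repeat vertices; a genuine path can always be extracted by shortcutting, and in any case a walk from u to w is all that connectivity requires.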

And so the upshot is, when you want to say something like "the maximal subset of objects where everything is the same," the right way to make that mathematical is using equivalence relations.

So over in this blue graph, we want to say that 1, 3, 5, 7, and 9 are all the same, in the sense that they're mutually connected, and that's exactly what this relation is making precise.

All five of those nodes are related to each other.

2 and 4 are related to each other.

6, 8, and 10: all pairs of them are related.

So equivalence relations always have equivalence classes, the maximal sets of mutually related objects.

And in this graph context, the equivalence classes are exactly these connected components, exactly what you want.

So what I want to show you is that you can use breadth-first search, wrapped in an outer loop over the vertices, to compute, to identify, all of the connected components of a graph, in time linear in the size of the graph, in m plus n time.

Now you might be wondering why you would want to do that.

4:02

And that boils down to just understanding whether the graph that represents your network is a connected graph, that is, whether it's in one piece or not.

So obviously, you can ask this same question about recreational examples.

So if you return to the movie graph, maybe you're wondering: can you get from every single actor in the IMDB database to Kevin Bacon?

Or are there actors from which you cannot reach Kevin Bacon via a sequence of edges, a sequence of movies in which two actors have both played a role?

So that's something that boils down to a connectivity computation.

4:47

A second, probably a little less obvious, application of undirected connectivity is that it gives us a nice quick-and-dirty heuristic for doing clustering, if you have pairwise similarity information about objects.

Let me be a little more concrete.

Suppose you have a set of objects that you really care about.

This could be documents, maybe web pages that you crawl, something like that.

It could be a set of images, either your own or drawn from some database.

Or it could be, for example, a set of genomes.

Suppose further that you have a pairwise function which, for each pair of objects, tells you whether they are very much like each other or very much different.

So let's suppose that if two objects are very similar to each other, like two web pages that are almost the same, or two genomes where you can get from one to the other with a small number of mutations, then they have a low score.

So low numbers, close to zero, indicate that the objects are very similar to each other.

High numbers, let's say they can go up to a thousand or something, indicate that they are very different objects: two web pages that have nothing to do with each other, two genomes that are totally unrelated, or two images that seem to be of completely different people or even completely different objects.

5:58

Now here's a graph you can construct using these objects and the similarity data that you have about them.

So you can have a graph where the nodes are the objects.

Okay, so for each image, for each document, whatever, you have a single node, and then for a given pair of nodes, you put in an edge if and only if the two objects are very similar.

So for example, you could put in an edge between two objects if and only if the score is at most ten.

So remember, the more similar two objects are, the closer their score is to zero.

So you're going to get an edge between very similar documents, very similar genomes, very similar images.

Now, in this graph you've constructed, you can find the connected components.

So each of these connected components will be a group of objects which, more or less, are all very similar to each other.

So this would be a cluster of closely related objects in your database.

And you can imagine a lot of reasons why, given a large set of unstructured data, just a bunch of pictures, a bunch of documents or whatever, you might want to find clusters of highly related objects.
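This quick-and-dirty clustering heuristic might be sketched as follows; the scoring function, the threshold of ten, and the toy data are all illustrative assumptions, not part of the lecture.

```python
from collections import deque
from itertools import combinations

def cluster(objects, score, threshold=10):
    """Clusters = connected components of the graph with an edge between
    each pair of objects whose similarity score is at most `threshold`."""
    # Build adjacency lists: edge iff the pair's score is at most the threshold.
    adj = {x: [] for x in objects}
    for x, y in combinations(objects, 2):
        if score(x, y) <= threshold:
            adj[x].append(y)
            adj[y].append(x)
    # Find the connected components with BFS wrapped in an outer loop.
    explored, clusters = set(), []
    for x in objects:
        if x in explored:
            continue
        component, queue = [], deque([x])
        explored.add(x)
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in adj[u]:
                if v not in explored:
                    explored.add(v)
                    queue.append(v)
        clusters.append(component)
    return clusters

# Toy data: pretend documents sharing a first letter are near-duplicates
# (score 5) and all other pairs are unrelated (score 500).
docs = ["a1", "a2", "b1"]
sim = lambda x, y: 5 if x[0] == y[0] else 500
print(cluster(docs, sim))  # [['a1', 'a2'], ['b1']]
```

Building all pairwise scores takes quadratic time, of course; the linear-time guarantee applies only once the graph has been constructed.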

7:34

I'm also going to assume that the nodes have names; let's say that the names are from 1 to n.

So these names could just be the positions in the node array that these nodes occupy.

So this is going to be an outer for loop, which walks through the nodes in an arbitrary order, let's say from 1 to n.

This outer for loop is to ensure that every single node of the graph will be inspected for sure at some point in the algorithm.

Now, again, one of our maxims is that we should never do redundant work, so before we start exploring from some node, we check whether we've already been there.

8:03

And if we haven't seen i before, then we invoke breadth-first search, the subroutine we were talking about previously in the lecture, in the graph G, starting from the node i.
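The whole algorithm, an outer loop over the nodes wrapped around BFS, might be sketched like this in Python; the adjacency lists for the blue graph are an assumption, chosen to be consistent with the BFS layers the lecture describes.

```python
from collections import deque

def connected_components(adj):
    """Identify the connected components of an undirected graph, given as
    adjacency lists {node: [neighbors]}, in O(m + n) time."""
    explored = set()        # nodes stay marked across all of the BFS calls
    components = []
    for i in adj:           # outer loop over every node, in some order
        if i in explored:   # never do redundant work
            continue
        # BFS from i discovers exactly the nodes in i's connected component.
        component, queue = [], deque([i])
        explored.add(i)
        while queue:
            u = queue.popleft()
            component.append(u)
            for v in adj[u]:
                if v not in explored:
                    explored.add(v)
                    queue.append(v)
        components.append(component)
    return components

# The blue graph from the lecture (adjacency assumed): components
# {1, 3, 5, 7, 9}, {2, 4}, and {6, 8, 10}.
blue = {1: [3, 5], 3: [1, 5], 5: [1, 3, 7, 9], 7: [5], 9: [5],
        2: [4], 4: [2], 6: [8, 10], 8: [6, 10], 10: [6, 8]}
print(connected_components(blue))  # [[1, 3, 5, 7, 9], [2, 4], [6, 8, 10]]
```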

So to make sure this is clear, let's just run this algorithm on this blue graph to the right.

So we start in the outer for loop and we set i equal to 1.

And we ask: have we explored node number 1 yet?

Of course not; we haven't explored anything yet.

So the first thing we're going to do is invoke BFS on node number 1 here.

So now we start running the usual breadth-first search subroutine starting from this node 1, and layer one here is going to be nodes 3 and 5.

So we explore them in some order; for example, maybe node number 3 is what we explore second, and then node number 5 is what we explore third, and

8:51

then the second layer in this component is going to be the nodes 7 and 9.

So we'll explore them in some order as well, say 7 first, followed by 9.

So after this BFS initiated from node number 1 completes, of course it will have found everything it could possibly find, namely the five nodes in the same connected component as node number 1.

And of course, all five of these nodes will be marked as explored.

So once that exits, we return to the outer for loop, we increment i, we go to i equal to 2, and we ask: have we already explored node number 2? No, we have not.

And so now we invoke BFS again from node number 2.

So that will be the sixth node we explore.

There's not much to do from 2; all we can do is go to 4, so that's the seventh node we explore.

That BFS terminates after finding the nodes in this connected component, and then we go back to the outer for loop.

We increment i to 3 and ask: have we already seen node number 3?

Yes we have; we saw it in the first breadth-first search.

So we certainly don't bother to BFS from node 3; then we increment i to 4.

Have we seen 4?

Yes we have, in the second call to BFS.

Have we seen node 5?

Yes we have, in the first call to BFS.

Have we seen node 6?

No, we have not.

So the final invocation of breadth-first search begins from node number 6.

That's going to be the eighth node overall that we see.

And then we're going to see the nodes 8 and 10 in some order; so, for example, maybe we first explore node number 8.

That's one node of the first layer in this component, and then node number 10 is the other node of the first layer in this component.

So in general, what's going on? Well, we know what will happen when we invoke breadth-first search from a given node i.

We're going to discover exactly the nodes in i's connected component.

Right, anything where there's a path from i to that node, we'll find it.

That's the BFS guarantee, and that's also the definition of a connected component: all of the nodes which have a path to i.

10:39

Another thing that I hope was clear from the example, but just to reiterate it: across all of the search calls, when you explore a node, you remember that through the entire for loop.

So when we invoke breadth-first search from node number 1, we explore nodes 1, 3, 5, 7 and 9, and we keep those marked as explored for the rest of this algorithm.

And that's so we don't do redundant work when we get to later stages of the for loop.

11:04

So as far as what this algorithm accomplishes: well, it certainly finds every connected component.

There is absolutely no way it can miss a node, because this for loop literally walks through the nodes, all of them, one at a time.

So you're not going to miss a node.

Moreover, we know that as soon as you hit a connected component for the first time, and you do breadth-first search from that node, you're going to find the whole thing.

That's the breadth-first search guarantee.

As far as the running time, well, it's going to be exactly what we want.

It's going to be linear time, which again means proportional to the number of edges plus the number of vertices.

And again, depending on the graph, one of these might be bigger than the other.

So why is it O of m plus n?

Well, as far as the nodes: we have to do the initialization where we mark them all as unexplored, so that takes constant time per node.

We have just the basic overhead of a for loop, so that's constant time per node.

And then we have this check, constant time per node, so that's O of n.

And then recall we proved that within breadth-first search, you do an amount of work proportional to the size of the component: you do constant time for each node in that connected component.

Now, each of the nodes of the graph is in exactly one of the connected components.

So you'll do constant time for each node, in the BFS in which you discover that node.

So that's, again, O of n over all of the connected components.

And as far as the edges, note we don't even bother to look at edges until we're inside one of these breadth-first search calls.

They play no role in the outer for loop or in the pre-processing.

And remember what we proved about an invocation of breadth-first search: you only do a constant amount of work per edge in the connected component that you're exploring.

In the worst case, you look at an edge once from either endpoint, and each of those triggers a constant amount of work.

So when you discover a given connected component, the edge work is proportional to the number of edges in that connected component.

Each edge of the graph is in exactly one of the connected components, so over this entire for loop, over all of these BFS calls, each edge of the graph is only responsible for a constant amount of work of the algorithm.

So, summarizing: because breadth-first search from a given starting node works in time proportional to the size of that node's component, by piggybacking on that subroutine and looping over all of the nodes of the graph, we find all of the connected components in time proportional to the number of edges and nodes in the entire graph.
