

So this week we're having you actually think through your problem very carefully.

So not only do you have to pose these two questions that you'd like to answer and

think about what data you are going to use to answer those questions.

But we're actually having you anticipate your solution or

your designed process from the start.

So in this video, what we're going to do is think about the benefits

of analyzing your solution and analyzing your algorithm, before you even run it.

So we'll describe the benefits of doing this analysis.

And think about how we might evaluate a proposed solution before even

implementing it.

And that really helps us anticipate risks and plan out our design process and

our implementation process to really take into account those potential risks and

to plan for them.

So at this point this week, you've seen three avenues of exploration

that you might choose to go on as you're thinking about your capstone project.

We talked about three very different suites of algorithms and

different problems that you might answer with those algorithms.

So let's think about just the community building and

community detection algorithm for now.

And if we think about going ahead to implement it,

we know that we might want to apply it to that UCSD Facebook data.

And we might ask ourselves: is that candidate solution, where we look for the edges with high betweenness and cut them to find more and more communities, going to be practical for our data?

So looking at the text files that we have,

we see that we've got about 14,000 to 15,000 vertices.

Maybe we just want to restrict to the students on campus,

that's still 13,000 of them.

Are we going to be able to run this algorithm efficiently at that scale?

And how do we even answer a question like that?

So, one way to answer that question would be to write out all of the algorithm.

Figure out how to design it,

implement everything, maybe test it on some small cases, make sure it works.

And then swap out the file, put in that big monster data file and

hit run in Eclipse.

And then wait, and then wait, and then wait some more.

And at that point, if our algorithm somehow fails to terminate, or

takes too long to answer our question, that would be really sad,

because we'll have invested all that design and implementation time.

And so it would be nice to have a way of anticipating ahead of

all of that effort whether this approach is even tractable.

And we have tools for that.

So remember, in Course Two we talked about analyzing algorithms using asymptotic analysis.

And so we could look at each of the steps in our high level description

of the algorithm and

say roughly how many steps will this take as the size of our graph grows.

And we notice that our algorithm really consists of these sub-algorithms,

these basic steps where we're using other algorithms that we've known about and

that we've designed and analyzed before.

So we can use our previous knowledge to say that breadth-first search is linear time in the number of vertices and edges.

So that piece of my algorithm will take linear time in the size of my graph.

And at that point, once I've built my breadth-first search, what I'm doing in those other two steps that follow is really exploring that breadth-first structure.

And I only need to look at each piece of that a constant number of times, so

I'm still working in linear time in the size of the graph.

Now that breadth-first exploration of the graph needs to happen for

every potential source node in my graph.

And so I'm repeating this linear-time behavior on the order of the number of vertices many times: a linear-time subroutine, repeated a linear number of times.
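The repeated breadth-first step can be sketched as follows. This is a minimal illustration in Java, not the course's reference code, assuming an adjacency-list graph over vertex ids 0 to n-1:

```java
import java.util.*;

// Sketch: breadth-first search from one source over an adjacency list,
// O(V + E) per source. Repeating it for every potential source, as the
// betweenness step requires, gives O(V * (V + E)) in total.
public class BfsSketch {
    // Distances (in edges) from src; -1 means unreachable.
    static int[] bfs(List<List<Integer>> adj, int src) {
        int n = adj.size();
        int[] dist = new int[n];
        Arrays.fill(dist, -1);
        Deque<Integer> queue = new ArrayDeque<>();
        dist[src] = 0;
        queue.add(src);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (int v : adj.get(u)) {
                if (dist[v] == -1) {   // each edge is examined a constant number of times
                    dist[v] = dist[u] + 1;
                    queue.add(v);
                }
            }
        }
        return dist;
    }

    // One BFS per source: the linear-time subroutine, repeated a linear number of times.
    static int[][] allPairsBfs(List<List<Integer>> adj) {
        int[][] all = new int[adj.size()][];
        for (int s = 0; s < adj.size(); s++) all[s] = bfs(adj, s);
        return all;
    }
}
```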


So, all in all, we get roughly quadratic behavior for this first phase.

And then we have to repeat all of this

beginning piece of finding the betweenness of all edges and

picking out the maximum value of that betweenness until there are no more edges.

Because every time we cut an edge we have to recompute the betweenness from scratch.

Because the shortest paths that may have been in the graph before may no longer

be there.

Paths that were not shortest paths before may now have become shortest paths.

And so with our previous algorithm,

there was no way to reuse the calculations that we'd already done.

And so we need to start from scratch each time.

And we're repeating, perhaps until there's no more edges,

which would be order of number of edges many times.

And so, all in all, our algorithm is either quadratic or

cubic depending on how we decide when to end it.

Whether we always end after a predetermined number of iterations,

in which case it's going to be a quadratic algorithm.

Or if we wait until really there's no more edges in the graph, and

then in that situation we may be as long as a cubic algorithm.


Okay, we've done our analysis, it's either quadratic or cubic so it's polynomial.

Is that okay?

Who knows?

So the asymptotic analysis doesn't give us stopwatch times for

how long our algorithm will take.

But it does give us a feeling for how much longer an algorithm will take as we increase our data size.

So we know that for a quadratic algorithm every time we double the size of our graph

we're going to roughly quadruple the amount of time our algorithm needs to run.

And if our algorithm runs in cubic time, then every time we double the size of our data, our algorithm's run time will increase roughly eightfold.
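That doubling arithmetic can be checked directly. This is an illustration of growth factors, not a benchmark: scaling the input by a ratio r multiplies an O(n^k) cost by r^k.

```java
// Asymptotic growth factors, not stopwatch times:
// doubling n multiplies quadratic work by 2^2 = 4 and cubic work by 2^3 = 8.
public class ScalingSketch {
    static double growthFactor(double sizeRatio, int exponent) {
        return Math.pow(sizeRatio, exponent);
    }

    public static void main(String[] args) {
        System.out.println(growthFactor(2.0, 2)); // quadratic: 4.0
        System.out.println(growthFactor(2.0, 3)); // cubic: 8.0
    }
}
```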

So that's interesting, and maybe now running our algorithm on some test cases will give us some sense of how long it will run on our actual data with its 13,000 or 14,000 vertices.

Something else that's really useful, especially when our algorithms come from research papers that other people have written, is to see how big a data set their implementations ran on, and how well calibrated their estimates would be for the data that we're using.

And so, for example, for this particular algorithm, the research literature that

we've referenced before suggests that this algorithm runs pretty well when we've got

data of the size that we're working with here.

So for the UCSD data, we should be just fine.

So that analysis leads us to be cautiously optimistic.

We seem like we should go ahead and devote that time and energy to implementation.

But it's still worthwhile to anticipate where things might go awry,

where we might have some obstacles.

And so it's useful as we're implementing to think ahead to what

the roadblocks might be.

And so in each of these three situations that we have talked about so far,

these potential paths that you might take for your project.

There are some clear potential pitfalls that we have.

So for example, for the cascading simulation that Christine presented

there's a question whether a graph will really ever come to equilibrium.

If we're thinking about this phenomenon of preferences being propagated through the graph, maybe the graph structure and the preferences are set up so that we never reach equilibrium, and if we try to run our algorithm until we reach that equilibrium point, it will never stop.

Because maybe we oscillate between two equilibria, or maybe some other phenomenon happens.

And so, we shouldn't make the assumption that we'll get to an equilibrium.

Then there's Leo's example of looking for the minimum number of vertices that are able to broadcast something to an entire Twitter network.

He already brought up the point that this problem can be phrased in graph-theoretic terms as minimum dominating set, which is an NP-complete problem.

And so we have to worry about any algorithm that we run to solve that problem.

As far as we know, NP-complete problems don't have really efficient solutions.

So anything that we're going to do will either run really slowly,

which maybe won't work well for our real world data.

Or will be an approximation, in which case are we okay with just an approximation?

What sort of errors are we allowed and what are we willing to put up with?
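One standard workaround for such problems, sketched here as a hypothetical illustration rather than Leo's actual approach, is the greedy heuristic for dominating set: repeatedly pick the vertex that covers the most not-yet-dominated vertices. It runs in polynomial time, but it only approximates the minimum.

```java
import java.util.*;

// Greedy heuristic for minimum dominating set: an approximation, not an
// optimal answer. A vertex "dominates" itself and all of its neighbors.
public class GreedyDominatingSet {
    static Set<Integer> greedy(List<List<Integer>> adj) {
        int n = adj.size();
        Set<Integer> chosen = new HashSet<>();
        boolean[] dominated = new boolean[n];
        int remaining = n; // vertices not yet dominated
        while (remaining > 0) {
            // Pick the unchosen vertex that newly dominates the most vertices.
            int best = -1, bestGain = -1;
            for (int v = 0; v < n; v++) {
                if (chosen.contains(v)) continue;
                int gain = dominated[v] ? 0 : 1;
                for (int u : adj.get(v)) if (!dominated[u]) gain++;
                if (gain > bestGain) { bestGain = gain; best = v; }
            }
            chosen.add(best);
            if (!dominated[best]) { dominated[best] = true; remaining--; }
            for (int u : adj.get(best)) {
                if (!dominated[u]) { dominated[u] = true; remaining--; }
            }
        }
        return chosen;
    }
}
```

For a star graph, the greedy choice immediately picks the center; for less structured graphs it may choose more vertices than the true minimum, which is exactly the approximation trade-off discussed above.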

And for that community building or

community detection problem that we've been talking about.

We want to think about how our communities that we detect in the graph

really relate to ground truth in our data.

Because what we're really detecting when we look for these communities is something graph-theoretic and structural about the data.

And do we know, sociologically, that that will actually translate to communities in the real world?

Maybe, maybe not.

Maybe we need to do some more analysis there.

And so these sorts of pitfalls are good to keep in mind and it's useful to anticipate

those concerns about whether the algorithm will be correct,

whether it will actually find those communities.

Or whether it will actually correspond to the formal definitions that we have.

So we can have correctness both in terms of whether our model is correct and

then also whether our algorithm meets our model.

Then we can ask questions about practicality.

Will our algorithm work with the data size that we have or

will our approximation be good enough?

And then we have questions also about termination.

Now even if we think that our algorithm is great, and even if we're aware and

cognizant of these potential risks.

And we think through them very carefully as we're analyzing.

We also have more concrete

risks that we want to think about when we're budgeting our time for a project.

Because we know that part of the project time will be spent designing the classes.

So thinking about the object-oriented design of the classes, the inheritance structure, and the interfaces that you will be writing and implementing, thinking back to Course Three in the specialization, and even Course Two: the inheritance structures we talked about, and how you set up which methods and fields go in each of the objects and classes that you design.

And so we need to allocate some time for that design process to happen.

We also want to spend some time thinking about what people have done already.

We don't need to reinvent the wheel.

Can we use some library functions?

Can we look through some documentation to see if we can bring in functionality from outside, and really help our project go further by using what other people have done?

And then we want to think about data structures, and the specific low level

algorithms that we'll be using to optimize the performance of our algorithm.

If we're looking up information, should we be using a hash set or

a hash map instead of a list?

If we're continually changing the size of our data structure,

should we be using a linked list instead of an array list?
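The lookup half of that trade-off is easy to demonstrate. In this sketch (example names are mine, not course code), the same membership query is answered by both structures; `ArrayList.contains` scans the whole list in O(n), while `HashSet.contains` is expected O(1), which matters when you repeat the query over thousands of vertex ids.

```java
import java.util.*;

// Membership tests: O(n) per lookup on a list, expected O(1) on a hash set.
public class LookupSketch {
    static boolean containsAll(Collection<Integer> vertices, int[] queries) {
        for (int q : queries) {
            if (!vertices.contains(q)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> ids = Arrays.asList(3, 1, 4, 1, 5);
        Set<Integer> fast = new HashSet<>(ids);
        // Same answers either way; the HashSet just answers each query faster.
        System.out.println(containsAll(ids, new int[]{1, 5}));  // true
        System.out.println(containsAll(fast, new int[]{1, 5})); // true
    }
}
```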

We want to think through these potential bottlenecks and

the issues that will come up as we're implementing the process.

And at each point we want to both budget our time and

also foreshadow the work that's going to come.

So that we can drive ourselves in the correct direction.