The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

Loading...

来自 Stanford University 的课程

Greedy Algorithms, Minimum Spanning Trees, and Dynamic Programming

665 个评分

The primary topics in this part of the specialization are: greedy algorithms (scheduling, minimum spanning trees, clustering, Huffman codes) and dynamic programming (knapsack, sequence alignment, optimal search trees).

从本节课中

Week 2

Kruskal's MST algorithm and applications to clustering; advanced union-find (optional).

- Tim RoughgardenProfessor

Computer Science

So, by now we've spent quite a bit of time talking about the minimum cost

spanning tree problem. There's a number of motivations why we do

that. the first, it's just a uniquely great

problem for the study of greedy algorithms.

You can propose a bunch of different greedy algorithms, and quite unusually,

it's all of them seem to work. So you get correct greedy algorithms, but

it's quite subtle what's driving their correctness.

You also get a lot of practice, arguing about graphs, and arguing about exchange

arguments, and so on. The second reason, it's been worthwhile

spending time on these algorithms. It has given us some more practice with data

structures and how to use them to speed up algorithms,

namely heaps, to speed up Prim's algorithm,

to union-find data structure to speed up Kruskal's algorithm.

The third reason that we're talking about them is they have applications in their

own rights. That's the subject of this video and the

next and we're going to focus on applications to clustering problems.

So let me begin by just talking about the goal of clustering informally.

And then, I'll let you pin me down to a precise objective function on the next

slide. So in a clustering problem, the input is

n points that we think of as being embedded in space.

And it's actually quite rare that in the underlying problem that we care about is

it actually intrinsically geometric? Is it actually intrinsically points in

space? Usually, we're representing something we

care about. Maybe it's web pages.

Maybe it's images. Maybe it's a database as points in space.

And given a bunch of objects, we want to cluster them into, in some sense coherent

groups. For those of you coming from a machine

learning background, you'll often hear this problem referred to as unsupervised

learning, meaning the data is unlabeled. We're looking for just patterns and data

where the data is not annotated. This obviously is a fairly wishy-washy

description of a problem, so let's be a little bit more precise.

We're going to assume that part of the input is what we're going to call a

similarity measure. Meaning for any two objects, we're

going to have a function giving us a number indicating how similar or really

rather how dissimilar they are to each other.

In keeping with the geometric metaphor, we're going to refer to this function as

a distance function. One thing that's cool is we don't need to

impose many assumptions on this distance function.

The one thing we're going to assume is that it's symmetric,

meaning the distance from p to q is the same as the distance from q to p.

So what are some examples? Well, if you want to really go with the

geometric metaphor, if you're representing these points as in a space

RM for some dimension M, you can just use the Euclidean distance or if you prefer

some other norm, like say, l1 or l-infinity.

In many application domains, there are widely accepted similarity or distance

measures. one example would be for sequences, as we

discussed in introductory lecture the penalty of the best alignment between two

genome fragments. So now that we have this distance

function, what would it mean to have coherent groups?

Well, things which have small distance from each other which are similar, they

should generally be in the same group, and things that are very dissimilar that

have large distance between them, you would expect to mostly be in different

groups. So how can we evaluate how good a

clustering is, how well it's doing with the job of putting nearby points together

and dissimilar points in different groups?

Well, to be honest there's many ways of approaching and formalizing that problem.

The approach we're going to take is an optimization-based approach.

We're going to positive and objective function on clusterings and then seek out

the clustering that optimizes that objective function.

I want to warn you that's not the only way to approach the problem.

There are other interesting approaches, but optimization is a natural one.

Furthermore, just like in our scheduling application, there's more than one

objective function that people study and that is well-motivated.

So one very popular objective function would be, say the k-means objective,

which I encourage you to look up and read more about.

For this lecture, I'm just going to adopt one specific objective function.

It's natural enough, but it's by no means the only one or even the primary one.

But it'll serve the purpose of studying a natural greedy algorithm related to

minimum spanning tree algorithms which is optimal in a precise sense.

So let me develop the terminology needed to state the objective function and the

optimization problem precisely. One issue that always comes up in

clustering problems is how many clusters are you going to use?

So to keep things simple in this video, we're going to assume that part of the

input k indicates how many clusters you're supposed to use.

So we're going to assume you know a priori how many clusters you want.

So in some application domains, this is a totally reasonably assumption.

You know, you might know for example that you want exactly two clusters, the k = 2.

in some domains you may have you know, good domain knowledge from past

experience that you know how many clusters you expect to need that's, all

fine and good. Also you know, in practice, if you don't

really know what k is supposed to be, you can go ahead and run the algorithm we're

going to talk about for a bunch of different values of k and then you

symmetric or just eyeball it to figure out which is the best solution.

So the objective function we're going to look at is defined in terms of separated

pairs of points. That is points that are assigned to

distinct clusters. Now you know, if you have more than one

cluster, inevitably there's going to be some pairs of points.

Some points are going to be one groups, other points are going to be in the

different group. So separated points are inevitable and

the most alarming separated points are the ones that are the most similar, the

ones that have the smallest distance. If points are separated, we want them to

be for apart, so we're particularly concerned with

nearby points that are separated. So that's going to be our objective

function value, called the spacing of a clustering.

It's the distance between the closest together pair of separated points.

Now what do we want from the spacing of a clustering?

Well, we want all of the separated points to be as far apart as possible.

That is, we want the spacing to be big. The bigger the better.

So that naturally motivates the formal problem statement.

You're given as inputs, the, distance measure.

You're told the distance between each pair of points.

You're also told the desired number of clusters.

Amongst all ways of clustering the points into k clusters, find the clustering

which makes the spacing as big as possible.

So let's develop a greedy algorithm that seeks to make the spacing as big as

possible. And to facilitate the discussion, I'll

use an example point set with just six black points up here in the upper right

part of the slide. Now, the good idea behind this greedy

algorithm is to not worry about the constraints that, at the end of the day,

we can only output k different clusters. We're actually going to be infeasible,

we'll have too many clusters throughout the algorithm.

Only at the conclusion of the algorithm will we be down to k clusters, which will

be our final infeasible solution. So that frees us up to initialize the

procedure where the degenerate solution, where each point is in its own cluster.

So in our example point set, we have these six pink isolated clusters.

In general, you're going to have n clusters and we've got to get down to k.

Now, let's remember what the spacing objective is.

In the spacing objective, you go over all of the separated pairs of points.

So for this degenerate solution, it's all pairs of points.

And you look at the most alarming separated pair, that is those, that are

the, the closest to each other. So the spacing is the distance between

the closest pair of separated points. Now, in a greedy algorithm, you want to

increase your objective function as much as possible.

But actually, things are pretty cut and dried in this scenario.

Suppose I give you a clustering and you want to make the spacing bigger,

the only way you can do that is by taking the currently closest pair of separated

points and making them not separated any more.

That is putting them in the same cluster. So, it's in some sense obvious what you

want to do to make the objective function go up at all.

You gotta look at the pair of points that is defining the objective,

the closest pair of separated points, and you have to fuse them.

You have to fuse their clusters so that they're no longer separated.

In this example, it looks to me like the closest pair of points, which of course

are separated, is this pair in the upper right.

So, if we want to make the spacing bigger, then we fuse them into a common

cluster. [SOUND] So we started with six clusters.

Now, we're down to five. So now we reevaluate the spacing again of

this new clustering. We ask, what is the closest pair of

separated points? So that would seem to me to be this pair

in the bottom right. [SOUND] And again, the only way we can

increase the spacing by merging clustering is to merge these two isolated

clusters into one. Now, we do it again. We say, which pair

of points determines the current spacing, which the currently separated pair of

points that are nearest to each other? That to me would look like this pair

that's on the rightmost part of the picture.

The only way to merge two clusters and make the spacing actually go up is to

merge the clusters containing the pairs of points determining the current

spacing. So in this case, two different clusters

with two points, each would collapse into a single cluster with four points.

Let's assume that we wanted three clusters, clusters anyways, which is

where we are. So at this point, the greedy algorithm is

going to halt. So let me now spell out the pseudocode of

the greedy algorithm more generally, but it's exactly what you'd expect given

the discussion so far. All right.

So, I'd like you to stare at this pseudocode for ten seconds or however

long you need and try to relate this to an algorithm that we've seen in the

course. In particular, an algorithm that we've

seen quite recently. I hope it reminds you strongly of an

algorithm we've already covered. Specifically, I hope you see a strong

resemblance between this greedy algorithm and Kruskal's algorithm for computing a

minimum cost spanning tree. Indeed, we can think of this greedy

clustering algorithm as being exactly the same as Kruskal's minimum cost spanning

tree algorithm except aborted early. Stopped when there's k components

remaining, that is before the final k - 1 edge

additions. So, just to make sure the correspondence

is clear. What is the graph?

What are the edges? What are the edge costs?

Well, the objects in the clustering problem, that is the points.

Those correspond to vertices in a graph. The other part of the input of the

clustering problem are distances, which are given for every pair of points.

So those play the role that edge costs were playing in the minimum spanning tree

problem. Since we have a distance or an edge cost

for every single pair of points, we can think of the edge set in the

clustering problem as being the complete graph because we have an edge cost or a

distance for every single pair. So this type of agglomerative clustering

has a name. This idea of fusing components one at a

time using MST-lik criterion. It's called single-link clustering.

So a single n clustering is a good idea. If you work at all with clustering

problems or unsupervised learning problems, it definitely should be a tool

in your toolbox. we're going to justify its existence in

one particular way in the next video, when we show that it does indeed maximize

the spacing over all possible k clusterings.

But even if you don't care about the spacing objective function per se, you

want to be familiar with single-link clustering.

It has many other nice properties as well.