So, by now we've spent quite a bit of time talking about the minimum cost

spanning tree problem. There's a number of motivations why we do

that. the first, it's just a uniquely great

problem for the study of greedy algorithms.

You can propose a bunch of different greedy algorithms, and quite unusually,

it's all of them seem to work. So you get correct greedy algorithms, but

it's quite subtle what's driving their correctness.

You also get a lot of practice, arguing about graphs, and arguing about exchange

arguments, and so on. The second reason, it's been worthwhile

spending time on these algorithms. It has given us some more practice with data

structures and how to use them to speed up algorithms,

namely heaps, to speed up Prim's algorithm,

to union-find data structure to speed up Kruskal's algorithm.

The third reason that we're talking about them is they have applications in their

own rights. That's the subject of this video and the

next and we're going to focus on applications to clustering problems.

So let me begin by just talking about the goal of clustering informally.

And then, I'll let you pin me down to a precise objective function on the next

slide. So in a clustering problem, the input is

n points that we think of as being embedded in space.

And it's actually quite rare that in the underlying problem that we care about is

it actually intrinsically geometric? Is it actually intrinsically points in

space? Usually, we're representing something we

care about. Maybe it's web pages.

Maybe it's images. Maybe it's a database as points in space.

And given a bunch of objects, we want to cluster them into, in some sense coherent

groups. For those of you coming from a machine

learning background, you'll often hear this problem referred to as unsupervised

learning, meaning the data is unlabeled. We're looking for just patterns and data

where the data is not annotated. This obviously is a fairly wishy-washy

description of a problem, so let's be a little bit more precise.

We're going to assume that part of the input is what we're going to call a

similarity measure. Meaning for any two objects, we're

going to have a function giving us a number indicating how similar or really

rather how dissimilar they are to each other.

In keeping with the geometric metaphor, we're going to refer to this function as

a distance function. One thing that's cool is we don't need to

impose many assumptions on this distance function.

The one thing we're going to assume is that it's symmetric,

meaning the distance from p to q is the same as the distance from q to p.

So what are some examples? Well, if you want to really go with the

geometric metaphor, if you're representing these points as in a space

RM for some dimension M, you can just use the Euclidean distance or if you prefer

some other norm, like say, l1 or l-infinity.

In many application domains, there are widely accepted similarity or distance

measures. one example would be for sequences, as we

discussed in introductory lecture the penalty of the best alignment between two

genome fragments. So now that we have this distance

function, what would it mean to have coherent groups?

Well, things which have small distance from each other which are similar, they

should generally be in the same group, and things that are very dissimilar that

have large distance between them, you would expect to mostly be in different

groups. So how can we evaluate how good a

clustering is, how well it's doing with the job of putting nearby points together

and dissimilar points in different groups?

Well, to be honest there's many ways of approaching and formalizing that problem.

The approach we're going to take is an optimization-based approach.

We're going to positive and objective function on clusterings and then seek out

the clustering that optimizes that objective function.

I want to warn you that's not the only way to approach the problem.

There are other interesting approaches, but optimization is a natural one.

Furthermore, just like in our scheduling application, there's more than one

objective function that people study and that is well-motivated.

So one very popular objective function would be, say the k-means objective,

which I encourage you to look up and read more about.

For this lecture, I'm just going to adopt one specific objective function.

It's natural enough, but it's by no means the only one or even the primary one.

But it'll serve the purpose of studying a natural greedy algorithm related to

minimum spanning tree algorithms which is optimal in a precise sense.