Have you ever wondered what data can tell you? Do you want a deeper understanding of the core ways machine learning can improve business? Do you want to be able to talk with experts about everything from regression and classification to deep learning and recommender systems? In this course, you will get hands-on experience through a series of practical case studies. By the end of this course,


Course from University of Washington

Machine Learning Foundations: A Case Study Approach

7351 ratings


From the lesson

Clustering and Similarity: Retrieving Documents

A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? How do I automatically search over documents to find the one that is most similar? How do I quantitatively represent the documents in the first place? In this third case study, retrieving documents, you will examine various document representations and an algorithm to retrieve the most similar subset. You will also consider structured representations of the documents that automatically group articles by similarity (e.g., document topic). You will actually build an intelligent document retrieval system for Wikipedia entries in an iPython notebook.

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

- Emily Fox, Amazon Professor of Machine Learning, Statistics

In this module we've talked about

a document retrieval task and we've also talked about a notion of clustering where

we're trying to uncover some underlying structure in the data and we talked

about many different areas in which this notion of clustering can be really useful.

So let's go through the workflow for a clustering algorithm.

And you might feel like you already know this at this point, because you've seen this

workflow now in two other modules.

Well, wake up!

Because this one's going to be a bit different.

Okay, so let's talk about our training data.

Here, our training data for

a document clustering task is going to be a document ID and

document text table.

So we have a whole bunch of documents.

And we have all the text associated with each of those.

And then we're gonna extract some set of features.

And we talked about a lot of different ways to represent a document.

But the one that I'll use just as an example here is our tf-idf.

Term frequency-inverse document frequency representation.
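As a rough sketch of how this representation can be computed (a minimal helper of my own in plain Python, not the course's notebook code): term frequency is the word count within a document, and inverse document frequency downweights words that appear in many documents.

```python
import math

def tf_idf(docs):
    """Compute tf-idf vectors for a list of tokenized documents.

    tf  = count of the word in the document
    idf = log(#docs / #docs containing the word)
    """
    n = len(docs)
    # document frequency: in how many documents does each word appear?
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    vectors = []
    for doc in docs:
        tf = {}
        for word in doc:
            tf[word] = tf.get(word, 0) + 1
        # each document becomes a sparse dict: word -> tf * idf
        vectors.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return vectors

docs = [["the", "dog", "ran"], ["the", "cat", "sat"], ["dog", "and", "cat"]]
vecs = tf_idf(docs)
# "the" appears in 2 of 3 docs (idf = log(3/2)); "ran" in only 1 (idf = log 3)
```

Common words like "the" end up with small weights, while rare, discriminative words like "ran" are emphasized, which is exactly why tf-idf is a popular document representation.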

And then what we're gonna do is we're gonna try and

cluster our documents based on this representation.

So we're gonna put these features through some machine learning model.

Which in this case is a clustering model.

And we're gonna output for each document a cluster label.

So the output y hat is our cluster label.

Okay, so here's where things get interesting,

because we wanna assess the accuracy of our cluster labels.

Well, in this case we don't have true cluster labels, so

I should say this is our predicted, or estimated, cluster label.

But we don't have a true cluster label to compare against.

So this y here, it does not exist.

And that's because, as we talked about, we're in an unsupervised learning setting.

Unsupervised.

Okay.

So we don't have that but

somehow we wanna assess some measure of accuracy of our clustering.

So let's draw a little picture here which is gonna be our Voronoi tessellation and

our k-means algorithm where we have some set of cluster centers and we have data.

I should say our data look like this, I don't know,

I'm just gonna draw some points here.

And the measure of accuracy that we're gonna use,

the way we're gonna measure our quality is to look at how coherent our clustering is.

So we're gonna look at the distances from each observation to

their assigned cluster center.

And a good clustering algorithm has those distances very small.

Okay, so the goal is to minimize these distances, and so to measure

this quality, to measure these distances, what we need is our data.

We need our tf-idf vectors.

So those are gonna come in here, and then we also need the cluster centers.

And so W hat, that's our current estimate of

the model parameter in the k-means algorithm.

It's our cluster centers.

That's what W hat represents.

And so of course to measure these distances we also need W hat.

So instead of having actual cluster labels to assess accuracy we're

gonna take our document representation and our cluster centers.

Plug it into this quality measure,

which is looking at distances to cluster centers.

That's our measure of error, though it's not really an error.

It's just a measure of quality.

So I won't put that word there.

Okay, so I think that's a little confusing.

But let's just write distances to cluster centers.
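This coherence measure can be written down directly: sum the squared distances from each observation to its assigned cluster center. A minimal sketch (the function name `clustering_quality` is my own, not from the course):

```python
def clustering_quality(points, centers, assignments):
    """Sum of squared distances from each point to its assigned cluster center.

    Lower is better: a coherent clustering keeps points close to their centers.
    """
    total = 0.0
    for point, k in zip(points, assignments):
        center = centers[k]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
centers = [(1.0, 0.0), (10.0, 0.0)]
assignments = [0, 0, 1]
# squared distances: 1 + 1 + 0 = 2
print(clustering_quality(points, centers, assignments))  # 2.0
```

Note the inputs: the data (our tf-idf vectors) and the cluster centers W hat, exactly the two things flowing into the quality measure in the diagram; no true labels are needed.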

And then what's our algorithm?

We're talking about k-means as a method of doing clustering.

Of course there are others, but let's focus on k-means, well,

what's k-means doing?

Let's just redraw this diagram, actually, I can just switch to another color.

That will save us some time here.

Well, k-means is trying to minimize this distance, or the sum of these distances,

and the way it's doing that is iterative: it's updating,

so this was our W hat that we had before, and we're moving it

to a new W hat that represents the center of mass of these points.

So these points are getting shifted.

And this point will go directly on top of that observation

and so this is the workflow of clustering.

Let's just say it at a high level one more time.

We take our documents, we represent them in some way, using either raw word counts,

tf-idf, normalizations of these things.

There are lots of different options, bigrams, trigrams, things we can look at for

our document representation.

Then our clustering algorithm, like k-means, is producing cluster labels, and

iteratively, we're looping through here again and again, updating

our cluster centers, which are the parameters of this clustering model,

by looking at how far our assigned observations are from those cluster centers.
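That assign-then-recompute loop can be sketched in a few lines of plain Python (a hypothetical `kmeans` helper of mine, not the course notebook's implementation): assign each point to its nearest center, then move each center to the mean, the center of mass, of its assigned points.

```python
def kmeans(points, centers, iters=10):
    """Classic k-means loop: assign points to nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        assignments = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            assignments.append(dists.index(min(dists)))
        # update step: each center moves to the center of mass of its cluster
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[k])  # leave an empty cluster in place
        centers = new_centers
    return centers, assignments

points = [(0.0,), (1.0,), (9.0,), (10.0,)]
centers, labels = kmeans(points, [(0.0,), (10.0,)])
# converges to centers 0.5 and 9.5, with labels [0, 0, 1, 1]
```

Both steps can only decrease (or keep equal) the sum of distances from the quality measure above, which is why the loop converges.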

So in this module in contrast to other modules we actually presented you with

some of the algorithmic details behind the methods that we're looking at.

Specifically for clustering, we talked about the k-means algorithm, and then for

our document retrieval task, we also talked about doing nearest neighbor

search, and provided some of the algorithmic details for that, and

you specifically explore that in an iPython notebook for

doing Wikipedia entry retrieval.
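As a sketch of the nearest neighbor search behind such a retrieval system (function names and the toy corpus are mine, not the course's), a brute-force search simply compares the query's tf-idf-style vector against every document and returns the closest one:

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors (dicts: word -> weight)."""
    dot = sum(w * v.get(word, 0.0) for word, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return 1.0 - dot / (norm_u * norm_v)

def nearest_neighbor(query, corpus):
    """Brute-force 1-nearest-neighbor search: return the id of the
    document most similar to the query."""
    return min(corpus, key=lambda doc_id: cosine_distance(query, corpus[doc_id]))

corpus = {
    "sports": {"game": 3.0, "team": 2.0},
    "politics": {"vote": 4.0, "law": 1.0},
}
query = {"game": 1.0, "player": 1.0}
print(nearest_neighbor(query, corpus))  # sports
```

Brute force scales linearly in the number of documents; for a corpus the size of Wikipedia, practical systems use smarter index structures, but the notion of similarity is the same.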

So, at this point, you really should be able to go out there and

build a really cool retrieval system for doing news article retrieval.

Or any other really, really, really cool retrieval that I can't think of right now.

But of course there's lots of interesting examples.

So go out there and think of ideas that I can't think of right now.

[MUSIC]