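Are you curious about what data can tell you? Do you want a deep understanding of the core ways in which machine learning can improve a business? Do you want to be able to talk with experts about anything from regression and classification to deep learning and recommender systems? In this course, you will gain hands-on experience through a series of practical case studies. By the end of this course, …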

A course from the University of Washington

Machine Learning Foundations: A Case Study Approach

7053 ratings

From the lesson

Clustering and Similarity: Retrieving Documents

A reader is interested in a specific news article and you want to find similar articles to recommend. What is the right notion of similarity? How do I automatically search over documents to find the one that is most similar? How do I quantitatively represent the documents in the first place?

In this third case study, retrieving documents, you will examine various document representations and an algorithm to retrieve the most similar subset. You will also consider structured representations of the documents that automatically group articles by similarity (e.g., document topic).

You will actually build an intelligent document retrieval system for Wikipedia entries in an iPython notebook.

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

- Emily Fox, Amazon Professor of Machine Learning, Statistics

[MUSIC]

So the first thing we need to describe is how are we gonna represent the documents

that we're looking at.

Okay, so perhaps the most popular model to represent a document is something called

the bag of words model,

where we simply ignore the order of words that are present in the document.

And the reason it's called a bag of words model is we think of taking a bag,

throwing all the words from that document into the bag, shaking it up, and

the new document we've created, with the words all jumbled up,

has exactly the same representation, as I'll describe,

as the original document where the words were ordered.

And what we're gonna do, instead of considering the structure,

the order of the words,

is we're simply gonna count the number of instances of every word in the document.

So let's look at a specific example of this.

So in this document we're gonna imagine that there's just one sentence, and that

sentence says that Carlos calls the sport futbol, Emily calls the sport soccer.

Actually I guess that's really two sentences, but that's the entire document.

And what we're gonna do to count the number of instances

of words in this very short document is we're just gonna look at a vector.

And this vector is defined over the vocabulary

in whatever language we're looking at.

So maybe one word in our vocabulary is the name, Carlos.

Another place in this vector is the index for the word sport.

And then, somewhere else, we have the word futbol, luckily in our English vocabulary

that I'm writing here, and then, let's say, Emily is this last entry.

What words are we missing?

We're missing the word calls, and of course the word the.

Okay, so how many instances of Carlos?

Well, there's only one.

How many instances of the?

We have two of the.

Two of calls, two of sport,

one of futbol, and I forgot the word soccer.

One of Emily, and let's just throw soccer in here,

imagining this was its index, and this would be our word count vector.

For this document, every other entry would be zero.

And all these other entries represent all the other words that are out there in

the vocabulary, like the word cat, and

dog, and tree, and every other word you can think of.

So it's a very, very long and

sparse vector that counts the number of words that we see in this document.
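
As a rough illustration of the word count vector just described, here is a minimal Python sketch. The tokenization (lowercasing and keeping runs of letters) and the helper name `word_counts` are assumptions made for this example, not something from the lecture. A sparse dictionary stands in for the very long vector: any word that is not stored counts as zero.

```python
import random
import re
from collections import Counter


def word_counts(text):
    """Count how many times each word occurs, ignoring word order."""
    # Assumed tokenization: lowercase the text and keep runs of letters.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)


document = "Carlos calls the sport futbol. Emily calls the sport soccer."
counts = word_counts(document)

# Shake the bag: shuffling the words leaves the counts unchanged.
words = re.findall(r"[a-z]+", document.lower())
random.shuffle(words)
shuffled_counts = word_counts(" ".join(words))

print(counts)
print(shuffled_counts == counts)
```

Words such as cat, dog, and tree are never stored, and looking them up in the `Counter` gives 0, which mirrors the zero entries of the sparse vector.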

Okay, so we talked about this representation of our documents in terms

of just these raw word counts.

This bag of words model.

And now we want to talk about how we're gonna measure the similarity between

different documents because we're gonna use that in order to find

documents that are related to one another and so on, like we talked about before.

Carlos is reading an article, so what’s another article he might be interested in?

Okay, so imagine that this is the count vector that we have for

this article on soccer, with this famous Argentinian player, Messi.

And then there's another article here that I'm showing in blue and

the associated word counts.

And this article is about another famous soccer player, Pele.

Is that right?

>> Pele. >> Pele.

[LAUGH] So when we think about measuring similarity,

what we can do is simply look at an element-wise product over this vector.

So for every element in the vector, we're gonna

multiply the two elements appearing in these two different count vectors.

And add up over all the different elements in this vector.

So here I've done this math where we have 1 times 3,

all the other elements multiplied to 0,

except at some point that fifth entry in the vector we have 5 times 2.

And if we do this multiplication over the whole vector,

the sum of these terms is 13.

So that measures the similarity between these two articles on soccer.

But now let's compare to another article,

which happens to be something about a conflict in Africa.

And so I'm providing the examples of word counts that appear in this article.

And what we see, is when we go to measure the similarities between these articles,

using the method that I described of element-wise product, and

then adding, that the similarity here in this case is 0.
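
The similarity computation described here can be sketched in a few lines of Python. The three count vectors below are hypothetical: only the overlapping products (1 times 3 and 5 times 2) come from the lecture, and the remaining entries and the 13-word vocabulary size are assumptions chosen so that the results match the numbers quoted above.

```python
def similarity(x, y):
    """Sum of the element-wise products of two count vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must be defined over the same vocabulary")
    return sum(a * b for a, b in zip(x, y))


# Hypothetical count vectors over a shared 13-word vocabulary.
messi_article = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
pele_article = [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0]
africa_article = [0, 0, 1, 0, 0, 0, 9, 0, 0, 6, 0, 4, 0]

# The two soccer articles share two words: 1*3 + 5*2.
soccer_similarity = similarity(messi_article, pele_article)

# The soccer article and the conflict article share no words at all.
unrelated_similarity = similarity(messi_article, africa_article)

print(soccer_similarity, unrelated_similarity)
```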

Okay, so let's talk about an issue that arises when we

use these raw word counts to measure the similarity between documents.

So to do this, let's look at these green and blue articles that we had before.

And so I'm repeating the word count vectors that we had, and

what we calculated before was the fact that the similarity between these two

articles that are both about soccer is 13.

Okay, but

now let's look at what happens if we simply double the length of the documents.

So now every word that appeared in that original document appears twice in this

twice as long document.

So, the word count vector is simply two times the word count vector we had before.

So, when we go to calculate the similarity here,

what we see is now the similarity is calculated to be 52.

So, let's think about this.

What we're saying is that two documents that

are related to each other in the same way as before,

that are both talking about the same sport, but

are each just replicated twice, are a lot more similar.

We would say, yes, Carlos is a lot more interested in this longer document

than he was when he was reading the shorter documents.

So this doesn't make a lot of sense when we're trying to do document retrieval.

And it biases very strongly towards long documents.
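
To see where the jump from 13 to 52 comes from, here is a small Python sketch. The two vectors are hypothetical, chosen only so that their overlap reproduces the 13 quoted in the lecture. Doubling both documents scales each count by 2, so every product, and therefore the sum, is scaled by 2 times 2.

```python
def similarity(x, y):
    """Sum of the element-wise products of two count vectors."""
    return sum(a * b for a, b in zip(x, y))


# Hypothetical count vectors for the two soccer articles.
article_a = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
article_b = [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0]

# Each word now appears twice as often in each document.
doubled_a = [2 * count for count in article_a]
doubled_b = [2 * count for count in article_b]

original = similarity(article_a, article_b)
doubled = similarity(doubled_a, doubled_b)

print(original, doubled)
```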

So let's think about how we can cope with this.

So one solution

is very straightforward, where we're simply gonna normalize this vector.

So we take this word count vector and

we're gonna compute the norm of the vector.

And so if you guys remember computing the norm of a vector,

we simply add the square of every entry in the vector, and then take the square root.

So, in this case,

we have the square root of 1 squared plus 5 squared plus 3 squared plus 1 squared.

And that happens to be the number 6, and so,

the resulting normalized word count vector is shown on the bottom of the slide here.

And what this does is, it allows us to place all of our articles that we're

considering, regardless of their length, on equal footing.

And then use this normalized vector when we go to do retrieval.
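
The normalization step can be sketched in Python as follows. The nonzero counts 1, 5, 3, and 1 of the first vector are the ones read out in the lecture. Their positions, the whole second vector, and the helper names `norm`, `normalize`, and `similarity` are assumptions made for this example.

```python
import math


def norm(x):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(a * a for a in x))


def normalize(x):
    """Divide every entry by the norm so the result has length 1."""
    length = norm(x)
    if length == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [a / length for a in x]


def similarity(x, y):
    """Sum of the element-wise products of two vectors."""
    return sum(a * b for a, b in zip(x, y))


# Nonzero counts 1, 5, 3, 1 as in the lecture; positions are assumed.
article_a = [1, 0, 0, 0, 5, 3, 0, 0, 1, 0, 0, 0, 0]
# Hypothetical second article.
article_b = [3, 0, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0]

norm_a = norm(article_a)  # sqrt(1 + 25 + 9 + 1)

# Similarity of the normalized short documents.
short_similarity = similarity(normalize(article_a), normalize(article_b))

# Doubling both documents no longer changes the similarity.
long_similarity = similarity(
    normalize([2 * count for count in article_a]),
    normalize([2 * count for count in article_b]),
)

print(norm_a, short_similarity, long_similarity)
```

With normalized vectors, the doubled documents score the same as the originals, which is the equal footing the lecture is after.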

[MUSIC]