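Are you curious about what data can tell you? Do you want a deep understanding of the core ways machine learning can improve a business? Do you want to be able to talk with experts about anything from regression and classification to deep learning and recommender systems? In this course, you will gain hands-on experience through a series of practical case studies. By the end of this course,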

A course from University of Washington

Machine Learning Foundations: A Case Study Approach

7673 ratings

From this lesson

Classification: Analyzing Sentiment

How do you guess whether a person felt positively or negatively about an experience, just from a short review they wrote?

In our second case study, analyzing sentiment, you will create models that predict a class (positive/negative sentiment) from input features (text of the reviews, user profile information, ...). This task is an example of classification, one of the most widely used areas of machine learning, with a broad array of applications, including ad targeting, spam detection, medical diagnosis and image classification.

You will analyze the accuracy of your classifier, implement an actual classifier in an iPython notebook, and take a first stab at a core piece of the intelligent application you will build and deploy in your capstone.

- Carlos Guestrin, Amazon Professor of Machine Learning, Computer Science and Engineering

- Emily Fox, Amazon Professor of Machine Learning, Statistics

[MUSIC]

We talked about accuracy and errors that a classifier might make.

But there are different kinds of errors.

We call these the types of mistakes.

It's important to look at the types of mistakes a classifier might make.

And one way to do that is through what's called a confusion matrix.

So, let's dig into that a little bit.

So we're talking about the relationship between the true label and

whatever the classifier predicts, so the predicted label.

So let's say that if the true label is positive, and we predict a positive value

for that sentence, we call that a true positive because we got it right.

Similarly, if the true label is negative and

we predict negative, we call that a true negative.

That's good cuz we got that right.

Now, there are two kinds of mistakes that we can make.

So, for example, if the true label is positive, but

we predicted it as negative, we call that a false negative.

We said it was negative, but that was false cuz it's positive.

Similarly, if the true label is negative but we predicted it as positive,

we call that a false positive.

It was negative, but we predicted it as positive.
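
As a minimal sketch of the four outcomes just described, the Python below counts them from paired lists of true and predicted labels. The function name `count_outcomes`, the `+1`/`-1` label encoding, and the sample labels are assumptions made for illustration; none of them come from the lecture.

```python
def count_outcomes(true_labels, predicted_labels, positive=+1):
    """Count true/false positives and negatives for a two-class problem."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == positive and guess == positive:
            counts["TP"] += 1   # positive, and we predicted positive
        elif truth != positive and guess != positive:
            counts["TN"] += 1   # negative, and we predicted negative
        elif truth == positive:
            counts["FN"] += 1   # truly positive, but predicted negative
        else:
            counts["FP"] += 1   # truly negative, but predicted positive
    return counts

# Made-up labels for five reviews (+1 = positive sentiment, -1 = negative).
true_labels      = [+1, +1, -1, -1, +1]
predicted_labels = [+1, -1, -1, +1, +1]
outcomes = count_outcomes(true_labels, predicted_labels)
print(outcomes)
```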

And false positives and false negatives can have different impacts

on what can happen in practice with your classifier.

So let's look at a couple practical examples of that.

So let's look at two applications, and

what the cost is of false positives versus false negatives.

So, if you consider spam filtering, a false negative is

an email that was spam but went into my inbox because the classifier thought it was not spam.

So that's just annoying: I got another spam email in my inbox.

Maybe it's bad but not super bad.

However, if you look at a false positive,

that's an email that was not spam but got labeled as spam and went to my spam folder.

I never saw it, I lost that email forever.

That has a higher cost.

Now we can also look at medical diagnosis

as a second application.

So what's a false negative in medical diagnosis?

A false negative is when there's a disease that I have but

it didn't get detected, so the classifier said negative:

I don't have the disease.

So in this case, the disease goes untreated,

which can be a really bad thing.

But the false positives can also be a bad thing.

That is, I get classified as having the disease when I never had the disease.

In this case I get treated, potentially with a really bad drug or

bad side effects, for a disease that I never had.

So it's a little bit unclear what's worse, having a false positive or

a false negative.

In medical applications, it really depends on the cost of the treatment and

how many side effects it has versus how bad the disease can be.
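
One way to make this trade-off concrete is to attach a cost to each kind of mistake and compare totals. The sketch below does that for the spam example; the helper `total_cost`, the cost values, and the error counts are all invented for illustration and are not from the lecture.

```python
def total_cost(false_positives, false_negatives, cost_fp, cost_fn):
    """Total cost of a classifier's mistakes under assumed per-mistake costs."""
    return false_positives * cost_fp + false_negatives * cost_fn

# Hypothetical spam filters: losing a real email (false positive) is assumed
# to be ten times worse than seeing one more spam email (false negative).
filter_a = total_cost(false_positives=5, false_negatives=10, cost_fp=10, cost_fn=1)
filter_b = total_cost(false_positives=1, false_negatives=30, cost_fp=10, cost_fn=1)
print(filter_a, filter_b)
```

Under these assumed costs, the second filter comes out cheaper even though it makes more mistakes in total (31 versus 15), because it makes fewer of the expensive kind.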

Now this relationship between the true label and the predicted label,

with its false positives and false negatives, is called the confusion matrix.

That's the matrix we just drew.

So for example, let's say that we have a setting with 100 test examples.

And of those, 60 are positive and 40 are negative.

So there's a little bit of class imbalance but not too much.

So of those 60 positive examples, let's say I got 50 of them correct,

and of the 40 negatives

I got 35 of them correct.

Let's see what we've learned.

So out of the 100 examples I got 85 correct.

So we can talk about our accuracy.

Accuracy is 85 correct over 100, which is 0.85.

And we can also discuss the false positives

and the false negatives. So of the positives,

the ones I labeled as negative are false negatives.

There were ten of those, so I had ten false negatives. On the other hand,

of the true negatives, we got five false positives.

So in this example, we got 85% accuracy.

We got a higher false negative rate than false positive rate.
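
The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the variable names are ours, and the rates are computed relative to the size of each true class, which is a common definition but one the lecture does not spell out.

```python
# Counts from the 100-example case above.
true_positives = 50    # of the 60 positive examples
false_negatives = 10   # positives that were labeled negative
true_negatives = 35    # of the 40 negative examples
false_positives = 5    # negatives that were labeled positive

total = true_positives + false_negatives + true_negatives + false_positives
accuracy = (true_positives + true_negatives) / total

# Assumed definition: each rate is relative to the size of the true class.
false_negative_rate = false_negatives / (true_positives + false_negatives)
false_positive_rate = false_positives / (true_negatives + false_positives)

print(f"accuracy = {accuracy:.2f}")
print(f"false negative rate = {false_negative_rate:.3f}")
print(f"false positive rate = {false_positive_rate:.3f}")
```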

Now those words, false positive,

false negative, apply only to binary classification, that is, to two classes.

But the idea of a confusion matrix works well even when you have more classes.

So let's talk about a simple example of that.

So let's say that I have 100 test examples and this is for

medical diagnosis, so there's 3 classes, healthy, cold or flu.

And of the 100 test subjects, we had 70 that were healthy,

20 that had cold, and 10 that had the flu.

And let's suppose that we got 60 correct for

healthy, we got 12 correct for cold,

and we got 8 correct for flu.

So the total we got correct here

was 80, which is 60 plus 12 plus 8, so our accuracy is 80 divided by 100.

So that's 0.8, 80% accuracy.

But we can talk about the false predictions.

So from healthy there were ten mistakes.

And we can say it's more common to confuse healthy with having a cold

than it is with having the flu, because the flu is a more complex disease. So

of those ten mistakes,

eight were confused with cold and two were confused with flu.

Cold can go both ways.

So we made eight mistakes.

Maybe you can say half of them got confused with healthy and

half of them got diagnosed with something stronger, the flu.

And of the two mistakes for the flu, maybe we say that

nobody that came in with the flu was told, oh, you're healthy.

But two of those ten were thought to have just a cold and not the flu.
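
To see the three-class example as an actual matrix, here is a small Python sketch built from the counts above. Putting true labels on the rows and predicted labels on the columns is our assumption about the layout, and the variable names are ours as well.

```python
classes = ["healthy", "cold", "flu"]

# Assumed layout: rows are true labels, columns are predicted labels,
# both in the order given in `classes`.
confusion = [
    [60, 8, 2],   # truly healthy (70 subjects)
    [4, 12, 4],   # truly cold    (20 subjects)
    [0, 2, 8],    # truly flu     (10 subjects)
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))  # the diagonal
accuracy = correct / total

# Mistakes per true class: everything in the row that is off the diagonal.
mistakes = {name: sum(row) - row[i]
            for i, (name, row) in enumerate(zip(classes, confusion))}

print(f"accuracy = {accuracy:.2f}")
print(mistakes)
```

Reading along a row shows where one true class ends up: for example, the cold row splits its eight mistakes evenly between healthy and flu.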

So this is an example of a confusion matrix, we can really

understand the types of mistakes we made and we can interpret those.

And this is a really important thing to do in classification