0:00

Hello, and welcome back.

If you're trying to get a learning algorithm to do a task that humans can do, and your learning algorithm is not yet at the performance of a human, then manually examining the mistakes your algorithm is making can give you insight into what to do next. This process is called error analysis. Let's start with an example.

Let's say you're working on your cat classifier, and you've achieved 90% accuracy, or equivalently 10% error, on your dev set. And let's say this is much worse than you were hoping to do. Maybe one of your teammates looks at some of the examples the algorithm is misclassifying, and notices that it is miscategorizing some dogs as cats. And if you look at these two dogs, maybe they look a little bit like a cat, at least at first glance.

So maybe your teammate comes to you with a proposal for how to make the algorithm do better, specifically on dogs. You could imagine building a focused effort, maybe to collect more dog pictures, or maybe to design features specific to dogs, in order to make your cat classifier do better on dogs, so it stops misrecognizing these dogs as cats. So the question is, should you go ahead and start a project focused on the dog problem?

1:27

So, is that worth your effort? Well, rather than spending a few months doing this, only to risk finding out at the end that it wasn't that helpful, here's an error analysis procedure that can let you very quickly tell whether or not it could be worth your effort.

Here's what I recommend you do. First, get about, say, 100 mislabeled dev set examples, then examine them manually. Just count them up one at a time, to see how many of these mislabeled examples in your dev set are actually pictures of dogs.

Now, suppose it turns out that 5% of your 100 mislabeled dev set examples are pictures of dogs. That is, if 5 out of 100 of these mislabeled dev set examples are dogs, what this means is that of a typical set of 100 examples you're getting wrong, even if you completely solve the dog problem, you only get 5 out of 100 more correct. Or in other words, if only 5% of your errors are dog pictures, then the best you could hope to do, if you spent a lot of time on the dog problem, is that your error might go down from 10% to 9.5%. So this is a 5% relative decrease in error, from 10% down to 9.5%. And so you might reasonably decide that this is not the best use of your time. Or maybe it is, but at least this gives you a ceiling, an upper bound, on how much you could improve performance by working on the dog problem.

3:22

But now, suppose something else happens. Suppose that when you look at your 100 mislabeled dev set examples, you find that 50 of them are actually dog images. So 50% of them are dog pictures. Now you could be much more optimistic about spending time on the dog problem. In this case, if you actually solved the dog problem, your error would go down from this 10% down to potentially 5%. And you might decide that halving your error could be worth a lot of effort focused on reducing the problem of misrecognized dogs.
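The ceiling argument above is just arithmetic, and can be sketched in a few lines. This is a minimal illustration using the hypothetical numbers from the example (10% dev error, with either 5% or 50% of the sampled errors being dog pictures); the function name is my own, not from the lecture.

```python
def error_ceiling(overall_error, fraction_of_errors_in_category):
    """Best-case dev error if one category of mistakes were fully fixed."""
    return overall_error * (1 - fraction_of_errors_in_category)

# 5 of 100 mislabeled examples are dogs: at best, error drops from 10% to 9.5%.
print(error_ceiling(0.10, 0.05))  # about 0.095, i.e. 9.5% error

# 50 of 100 are dogs: solving the dog problem could halve the error, 10% to 5%.
print(error_ceiling(0.10, 0.50))  # about 0.05, i.e. 5% error
```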

I know that in machine learning, sometimes we speak disparagingly of hand-engineering things, or using too much human insight. But if you're building applied systems, then this simple counting procedure, error analysis, can save you a lot of time in terms of deciding what's the most important, or the most promising, direction to focus on.

4:19

In fact, if you're looking at 100 mislabeled dev set examples, maybe this is a 5- to 10-minute effort to manually go through the 100 examples and count up how many of them are dogs. And depending on the outcome, whether it's more like 5%, or 50%, or something else, this, in just 5 to 10 minutes, gives you an estimate of how worthwhile this direction is, and could help you make a much better decision about whether or not to spend the next few months focused on trying to solve the problem of misrecognized dogs.

So far, this slide has described using error analysis to evaluate whether or not a single idea, dogs in this case, is worth working on. Sometimes you can also evaluate multiple ideas in parallel during error analysis.

For example, let's say you have several ideas for improving your cat detector. Maybe you can improve performance on dogs. Or maybe you notice that sometimes what are called great cats, such as lions, panthers, cheetahs, and so on, are being recognized as small cats, or house cats, so you could maybe find a way to work on that. Or maybe you find that some of your images are blurry, and it would be nice if you could design something that just works better on blurry images.

5:57

What I would do is set up a spreadsheet. On the left side, a column goes through the set of images you plan to look at manually, so this maybe goes from 1 to 100, if you look at 100 pictures. And the columns of this table, of the spreadsheet, will correspond to the ideas you're evaluating: the dog problem, the problem of great cats, and blurry images. And I usually also leave space in the spreadsheet to write comments.

So remember, during error analysis, you're just looking at dev set examples that your algorithm has misrecognized.

6:30

So if you find that the first misrecognized image is a picture of a dog, then I'd put a check mark there. And to help myself remember these images, sometimes I'll make a note in the comments; so maybe that was a pit bull picture. If the second picture was blurry, then make a note there. If the third one was a lion, on a rainy day, in the zoo, that was misrecognized, then that's both a great cat and a blurry image. Make a note in the comment section, rainy day at zoo, and it was the rain that made it blurry, and so on.

7:05

Then finally, having gone through some set of images, I would count up what percentage of these errors fell into each category; that is, what percentage were attributed to the dog, great cat, or blurry categories. So maybe 8% of the images you examine turn out to be dogs, maybe 43% great cats, and 61% blurry. This just means going down each column and counting up what percentage of images have a check mark in that column.

As you're partway through this process, sometimes you notice other categories of mistakes. So, for example, you might find that Instagram-style filters, those fancy image filters, are also messing up your classifier. In that case, it's actually okay, partway through the process, to add another column for the multicolored filters, the Instagram filters and the Snapchat filters. Then go through and count those up as well, and figure out what percentage comes from that new error category.
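The spreadsheet tally above can be sketched as a short script. The rows and the category names ("dog", "great cat", "blurry") are hypothetical stand-ins mirroring the lecture's table, with one entry per misrecognized dev set image listing every column that gets a check mark.

```python
from collections import Counter

# One row per misrecognized dev set image: which categories it falls into
# (an image can get check marks in several columns), plus a free-form comment.
rows = [
    {"categories": {"dog"}, "comment": "pit bull"},
    {"categories": {"blurry"}, "comment": ""},
    {"categories": {"great cat", "blurry"}, "comment": "rainy day at zoo"},
    # ... and so on, for each of the ~100 examples you examine
]

# Going down each column: count how many images have a check mark there.
counts = Counter(cat for row in rows for cat in row["categories"])

for category, count in counts.items():
    print(f"{category}: {100 * count / len(rows):.0f}%")
```

Because the categories are just set elements rather than fixed columns, adding a new category partway through (say, "filters") only requires tagging later rows with it and recounting.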

8:12

The conclusion of this process gives you an estimate of how worthwhile it might be to work on each of these different categories of errors. For example, clearly in this example, a lot of the mistakes were made on blurry images, and quite a lot on great cat images. So the outcome of this analysis is not that you must work on blurry images; it doesn't give you a rigid mathematical formula that tells you what to do, but it gives you a sense of the best options to pursue.

It also tells you, for example, that no matter how much better you do on dog images or on Instagram images, you at most improve performance by maybe 8%, or 12%, in these examples. Whereas if you can do better on great cat images or blurry images, the ceiling on how much you could improve performance is much higher. So depending on how many ideas you have for improving performance on great cats or on blurry images, maybe you could pick one of the two. Or, if you have enough personnel on your team, maybe you can have two different teams: one working on improving errors on great cats, and a different team working on improving errors on blurry images.

9:27

But this quick counting procedure, which you can often do in at most a small number of hours, can really help you make much better prioritization decisions, and understand how promising different approaches are to work on.

9:40

So to summarize, to carry out error analysis, you should find a set of mislabeled examples in your dev set, look at those examples for false positives and false negatives, and just count up the number of errors that fall into various different categories. During this process, you might be inspired to generate new categories of errors, like we saw: if you're looking through the examples and you say, gee, there are a lot of Instagram filters or Snapchat filters messing up my classifier, you can create new categories during that process. But by counting up the fraction of examples that are mislabeled in different ways, often this will help you prioritize, or give you inspiration for new directions to go in.

Now, as you're doing error analysis, sometimes you'll notice that some of the examples in your dev set are mislabeled. So what do you do about that? Let's discuss that in the next video.
