0:15

Hello. Welcome to Lesson Two in Module 16.

This lesson is actually going to introduce how to perform statistical anomaly detection.

The idea here is,

that you can use visual or descriptive statistical techniques

to find anomalies in a dataset.

And we're going to look at some of the ways that you can do this,

so that by the end of this lesson you should be able to explain how

visual and statistical methods can be used to find anomalies,

for instance, fraud in a large dataset.

Imagine credit card transactions and you're trying to find fraud.

You'll also learn how to develop visualizations in Python that can be used to identify

anomalies and apply statistical techniques in Python to find anomalies.

This lesson only has the course notebook.

There are three different things that I want you to work on.

The first is visual analysis.

The second is the actual types of outliers you might see.

And the last is statistical approaches to finding anomalies.

So the first thing is actually using visualizations.

Now typically, what you do with visualizations is you're trying to explore the data,

understand what the data is telling you.

And often we look for clusters, or clumps of data,

or modes in a histogram,

to understand where most of the data is.

But with outlier detection or anomaly detection,

we're typically looking not for where most of the data is,

but where most of the data isn't.

We're looking for those small features that are off by themselves.

So we could look at this and we could say look there's

clearly most of the data right here if we put a box around it.

We're going to be able to take that data out and then look at

the low density regions where there might be outliers.

So we do this with a histogram.

We see our histogram,

we can of course apply a KDE to it.

That's what we're seeing here, a histogram of the four different iris dataset features.

You could see the histograms here.

Notice that these are bimodal. You might be tempted to think these are outliers,

but if you remember some of the data were off by themselves.

This is still quite a bit of data,

probably not an outlier.
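As a sketch of what the notebook is doing, here is one way to draw those per-feature histograms with a KDE overlaid. This assumes scikit-learn, SciPy, and Matplotlib are available; the file name is just an example.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde
from sklearn.datasets import load_iris

iris = load_iris()

# one histogram per iris feature, with a KDE curve overlaid
fig, axes = plt.subplots(2, 2, figsize=(8, 6))
for ax, col, name in zip(axes.ravel(), iris.data.T, iris.feature_names):
    ax.hist(col, bins=20, density=True, alpha=0.5)
    xs = np.linspace(col.min(), col.max(), 200)
    ax.plot(xs, gaussian_kde(col)(xs))  # smooth density estimate
    ax.set_title(name)
fig.tight_layout()
fig.savefig("iris_histograms.png")
```

Petal length and petal width are the features where you'll see the bimodal shapes most clearly.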

Now we could look at this in two dimensions as well,

and here we're seeing the two dimensional distribution.

There's a big clump of data here and a nice clump of data here in the sepal features.

In the petal features, it's the same thing,

a big clump here, but a nice relationship between them.
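A minimal sketch of those two-dimensional views, again using Matplotlib and scikit-learn's copy of the iris data; the output file name is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

# sepal length vs. sepal width on the left, petal length vs. petal width on the right
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(X[:, 0], X[:, 1], s=12)
ax1.set(xlabel=iris.feature_names[0], ylabel=iris.feature_names[1])
ax2.scatter(X[:, 2], X[:, 3], s=12)
ax2.set(xlabel=iris.feature_names[2], ylabel=iris.feature_names[3])
fig.tight_layout()
fig.savefig("iris_2d.png")

# the petal features are tightly related; the sepal features much less so
petal_corr = np.corrcoef(X[:, 2], X[:, 3])[0, 1]
```

That strong petal correlation is the "nice relationship" you can see in the scatter plot.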

So what about Outlier types?

There are a lot of ways you can end up with

outliers, not thinking just about fraud but in general,

ways you may get data that are outliers.

An outlier doesn't have to mean it's a bad data point.

I want to emphasize that. It could simply be that you have an extreme value.

Perhaps they're following an expected distribution,

but they just have a lot of scatter for some reason.

Maybe the process by which the data was measured was a high noise process.

It's like when you're on your cell phone and you start getting

into an area where the signal is not very good,

it gets very noisy.

And you could still be talking but it's very hard to hear,

that's a higher noise environment.

That's what we mean by extreme values.

Sometimes humans make errors.

You may have somebody entering data into a spreadsheet

and they put the wrong values in a different column.

And so we get these transcription errors.

We could also have incorrect measurements.

Somebody simply makes the measurement wrong or calibrates a machine wrong.

So what I'm going to do here is

show the visualizations again that we saw before.

But now what we've done is we've added different types of data.

We've added that high noise,

the incorrect column, and the wrong units.

So when you look at this, you can see there's

regular data here, and then this new data is out by itself.

That's the high noise, which makes it a little easier to detect.

You might say, well, the wrong-units data is all out here.

Sometimes it's easier to find these when

you actually look at them in a joint distribution.

So if we were to look at this sepal width against sepal length, we might see that.

And we'll explore that more in the future.
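Here is a small sketch of how you might inject those three outlier types into the iris data yourself. The specific corruptions (noise scale, swapped columns, a factor of ten) are hypothetical choices for illustration, not the exact values from the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris

rng = np.random.default_rng(0)
X = load_iris().data

# three hypothetical corruptions mirroring the outlier types above:
high_noise = X[:5] + rng.normal(0.0, 2.0, size=(5, 4))   # extreme scatter from a noisy process
wrong_column = X[:5][:, [1, 0, 3, 2]]                    # values typed into the wrong columns
wrong_units = X[:5] * 10.0                               # e.g. measurements recorded in the wrong units

X_dirty = np.vstack([X, high_noise, wrong_column, wrong_units])
```

Plotting `X_dirty` the same way as before shows the wrong-units points far off by themselves, while the swapped-column points only stand out in a joint distribution.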

The last thing I wanted to emphasize here was statistical detection.

One way you might do this is to say,

let's take the basic statistics of our features,

and see what they are,

and then if we have outliers we can actually say,

okay let's do trim statistics.

And the idea is that if we remove the instances

that are at the edges of our distribution,

and compute a robust mean,

and a robust standard deviation,

it may be easier to identify outliers.

So that's what we've done here.

We calculate the median, and the mean,

and standard deviation for our data.

And then we calculate the trimmed mean and the trimmed standard deviation.

Then we add the noisy data into our sample and we compute the same values.

And you could see that the mean and median are still pretty consistent.

The standard deviation is much higher.

And then when we compute the trimmed statistics on the noisy data,

the mean didn't change that much,

but the standard deviation dropped a lot.

And it's still higher than it was for the original data.

So this gives you a handle on how that noise may be impacting your statistical measurements.
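A minimal sketch of that comparison using SciPy's trimming utilities, with synthetic data standing in for the lesson's example; the distributions and the 10% trim fraction are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
clean = rng.normal(5.0, 0.5, size=200)                     # well-behaved feature
noisy = np.concatenate([clean, rng.normal(5.0, 5.0, 20)])  # add high-noise points

for name, x in [("clean", clean), ("noisy", noisy)]:
    trimmed = stats.trimboth(x, 0.1)  # drop the top and bottom 10% of values
    print(f"{name}: mean={x.mean():.2f}, std={x.std():.2f}, "
          f"trimmed mean={trimmed.mean():.2f}, trimmed std={trimmed.std():.2f}")
```

You should see the pattern described above: the mean barely moves, the raw standard deviation jumps with the noisy points, and trimming brings it back down, though not all the way to the clean value.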

So we can then make a plot,

and we could see here's our data set that same distribution.

We can then say, well we can apply

our trimmed statistics with two sigma and three sigma lines.

And we can say, let's do a three sigma cut and throw any data outside that out.

And that would be an example of trying to remove what we think are bad data.
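The sigma cut itself can be sketched like this, building on robust statistics from trimming; the synthetic data and thresholds are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(5.0, 0.5, 200),   # bulk of the data
                    rng.normal(5.0, 5.0, 20)])   # high-noise contamination

# robust center and spread from the trimmed sample
trimmed = stats.trimboth(x, 0.1)
mu, sigma = trimmed.mean(), trimmed.std()

# three-sigma cut: keep only points within 3 robust sigmas of the robust mean
keep = np.abs(x - mu) <= 3.0 * sigma
x_clean = x[keep]
print(f"kept {x_clean.size} of {x.size} points")
```

Using the trimmed sigma for the cut matters: the raw standard deviation is inflated by the very outliers you are trying to remove, so a cut based on it would be too permissive.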

So hopefully, that gives you a bit of a feel for how that works.

But now we can look at this in a different dimensionality.

We can actually look at this in two dimensions.

And when we do it in two dimensions,

you can see these data points were really

hard to pull out before, but now they're really easy.

And the high-noise ones, same thing.

So you can imagine trying to build

a representation of what this data typically looks like; it would mostly be right here.

If we were to draw a circle and pull out that high-density region,

we would get rid of a lot of the anomalies, or different types of outliers.

Now keep in mind, when you're looking at data,

you don't have the outliers marked in a different color.

But you still would see data out here and say,

these probably aren't right.

And this is suspicious because it's all down here, this clump all by itself.

So you probably would come in here and say these are the good data.

You still have some anomalies but you get rid of most of them.
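One simple way to formalize "drawing a circle" around the high-density clump is a robust distance cut. This is a minimal sketch with synthetic two-dimensional data standing in for the lesson's example; the cluster locations and the 90th-percentile radius are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
inliers = rng.normal([5.0, 3.0], 0.4, size=(200, 2))   # the big clump of good data
outliers = rng.normal([10.0, 0.5], 0.4, size=(10, 2))  # a suspicious clump off by itself
X = np.vstack([inliers, outliers])

# robust (median-based) center, so the anomalies can't drag it toward themselves
center = np.median(X, axis=0)
dist = np.linalg.norm(X - center, axis=1)

# "draw a circle" around the high-density region and keep what falls inside
radius = np.percentile(dist, 90)
good = X[dist <= radius]
```

This keeps most of the good data and discards the distant clump, at the cost of also dropping a few legitimate points near the edge, which is exactly the trade-off described above.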

So hopefully I've given you a feel for how

statistical and visual detection can be used to try to identify outliers or anomalies.

And you've gotten a feel for how you might be

able to apply that in a more general setting.

If you have any questions, let us know.

And of course, good luck.