So in this spam data, we can actually do

this for more variables than just two.

This is why principal components may be useful.

So here, I'm creating a variable that's just going to

be the color we're going to color our points by.

So the color is black if a message is not

spam, and red if it is spam.
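
As a minimal sketch of that coloring step, assuming a made-up list of labels in place of the real spam indicator (the names `labels` and `type_color` are hypothetical), it might look like this in Python:

```python
# Hypothetical labels standing in for the spam indicator in the data set.
labels = ["nonspam", "spam", "spam", "nonspam"]

# One color per observation: red for spam, black for everything else.
type_color = ["red" if label == "spam" else "black" for label in labels]

print(type_color)
```

The resulting list has one entry per observation, so it can be passed straight to a plotting function as the point colors.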

And this statement here

calculates the principal components on the entire data set.

So you'll notice that I've applied a function

to the data set, the log 10 transform.

And I've added one.

I've done this to make the data look a little bit more Gaussian.

Some of the variables are normal

looking, but some of the variables are skewed.

And you often have to do that

for principal component analysis to look sensible.

So then I calculated the principal components of the entire data set.
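
Here is a rough sketch of that transform-then-decompose step. It assumes NumPy and a randomly generated, skewed matrix `X` in place of the spam predictors, and it computes the components with a singular value decomposition of the centered data, which is one standard way to do it and not necessarily the exact call used in the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for skewed, non-negative predictors (200 rows, 6 columns).
X = rng.exponential(scale=5.0, size=(200, 6))

# log10(x + 1): the +1 keeps zeros finite, the log pulls in the long right tail.
X_log = np.log10(X + 1)

# Center each column, then take the SVD of the centered matrix.
X_centered = X_log - X_log.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Principal component scores: one row per observation, one column per component.
scores = X_centered @ Vt.T
pc1, pc2 = scores[:, 0], scores[:, 1]

print(scores.shape)
```

The columns `pc1` and `pc2` are the two new variables that get plotted against each other.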

So in this case I can now again

plot principal component one, versus principal component two.

Principal component one is no longer a simple sum of two variables.

It might be some quite complicated combination

of all the variables in the data set.

But it's the combination that explains the most variation in the data.

Principal component two is the combination that explains the second most variation.

And principal component three explains the third most and so forth.
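
To make the "combination of all the variables" idea concrete, here is a small sketch, again assuming NumPy and made-up data. The names `loadings_pc1` and `explained_ratio` are just illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical skewed data: 300 observations, 5 variables.
X = rng.exponential(scale=5.0, size=(300, 5))
X_log = np.log10(X + 1)
X_centered = X_log - X_log.mean(axis=0)

_, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# The first row of Vt holds one weight per original variable:
# principal component one is this weighted combination of all of them.
loadings_pc1 = Vt[0]

# Variance along each component, and its share of the total variance.
component_variance = s**2 / (X.shape[0] - 1)
explained_ratio = component_variance / component_variance.sum()

print(loadings_pc1)
print(explained_ratio)
```

The shares come out in decreasing order, so the first component carries the most variation, the second the next most, and so on.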

So if I plot principal component one,

which is just a variable that I've calculated,

versus principal component two,

which is another variable that I've calculated,

I can then color the points by the spam indicator.

So each of

these points corresponds to a single observation.

The red ones correspond to spam observations

and the black ones to ham observations.

You can see that in principal component

one space, that is, along principal component one,

there's a little bit of separation of the ham messages from the spam messages.

In other words, the spam messages tend to have

a little bit higher values of principal component one.
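
One way to check that kind of separation without a plot is to compare the average principal component one score in each group. This sketch assumes NumPy and synthetic data in which the hypothetical "spam" rows have larger values in the first three columns. Keep in mind that the sign of a principal component is arbitrary, so which group ends up on the higher side can flip:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 6

# Hypothetical labels: roughly 40% of the rows are marked as spam.
is_spam = rng.random(n) < 0.4

# Skewed data, with spam rows inflated in the first three variables.
X = rng.exponential(scale=2.0, size=(n, p))
X[is_spam, :3] *= 4

X_log = np.log10(X + 1)
X_centered = X_log - X_log.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
pc1 = X_centered @ Vt[0]

# Group means along principal component one.
mean_spam = pc1[is_spam].mean()
mean_ham = pc1[~is_spam].mean()

print(mean_spam, mean_ham)
```

A gap between the two means is the numeric counterpart of the red and black points sitting in slightly different places along the horizontal axis.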

So this is a way to reduce the size of your data set while still

capturing a large amount of variation, which

is the idea behind feature creation.
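
As a sketch of that reduction, assuming NumPy, made-up data, and an arbitrary choice of keeping two components (`k = 2` is an assumption, not a recommendation), the smaller data set could be built like this:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 400 observations, 10 skewed variables.
X = rng.exponential(scale=5.0, size=(400, 10))
X_log = np.log10(X + 1)
X_centered = X_log - X_log.mean(axis=0)

_, s, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Keep only the first k components as the new, smaller set of features.
k = 2
reduced = X_centered @ Vt[:k].T

# Fraction of the total variance that those k components retain.
retained = (s[:k] ** 2).sum() / (s**2).sum()

print(reduced.shape, retained)
```

The ten original columns become two, and `retained` reports how much of the variation survives the cut.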