0:00

This is a lecture about preprocessing covariates with principal components analysis. The idea is that often you have multiple quantitative variables, and sometimes they'll be highly correlated with each other. In other words, they'll be very close to being the exact same variable. In that case, it's not necessarily useful to include every variable in the model. You might want to include some summary that captures most of the information in those quantitative variables.

0:23

So, as an example, I'm going to use the spam data set again. I've loaded the caret package here, I've loaded the kernlab package, and I've loaded the spam data set. I again create a training and test set, and I'm going to perform all the operations I'm doing here only on the training set, so that all exploration, model creation, and feature building happens in the training set. The first thing I do is leave out just the 58th column of the training set, which in this case is the outcome, so I'm looking at all the other predictor variables. Then I calculate the correlation between all those columns — the correlation between all pairs of predictor variables — and I take its absolute value.

Â And I take its absolute value.

Â So I'm looking for all the predictor variables that that have

Â a very high correlation or are very similar to each other.

Â Every variable has a correlation of 1 with itself.

Â So I'm not interested in those variable.

Â You know, removing variables that have high

Â correlations with themselves, since they all do.

Â So I set the diagonal of those matrix, that comes out to be equal to 0.

Â That's basically just setting the correlation

Â between variables with itself, equal to 0.

Then I look at which of these variables have a high correlation with each other — in other words, which pairs of variables have a correlation greater than 0.8. It turns out two variables have a very high correlation with each other: num415 and num857. So when the number 415 appears in an email, the number 857 frequently appears with it. This is likely because there's a phone number that contains both of those numbers.
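The steps just described might look like the following in R (a sketch — the partition proportion and object names are my assumptions, not stated in the lecture):

```r
# Correlation screen on the training predictors (illustrative sketch)
library(caret)
library(kernlab)
data(spam)

inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

M <- abs(cor(training[, -58]))  # absolute correlations among predictors (column 58 is the outcome)
diag(M) <- 0                    # zero out each variable's correlation with itself
which(M > 0.8, arr.ind = TRUE)  # indices of the highly correlated pairs (here: num415, num857)
```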

1:51

So, if I look at the spam dataset at columns 34 and 32 — which I got from the previous correlation calculation — I see that it's these two columns that are highly correlated with each other. And if I plot those two columns against each other, I see exactly what I'd expect: the frequencies of 415 and 857 are incredibly highly correlated, and the points basically lie perfectly on a line. As the number 415 appears more frequently, so does the number 857. So the idea is that including both of these predictors in the model might not be very useful. The basic question is: how can we take those variables and turn them into, say, a single variable that might be better?

One idea is to think about a weighted combination of those predictors that explains most of what's going on — to pick the combination that captures the most information possible. The benefits here are that you're reducing the number of predictors you need to include in your model, which is nice, and you're also reducing noise: by averaging or combining variables together, you may average some of the noise away. So if you do this in a clever way, you can actually gain quite a bit by doing principal component analysis.

So one idea to think about is this: basically, what you're trying to do is figure out a combination of these variables that explains most of the variability. Just as an example, here's a combination I could do. I could take 0.71 times the num415 variable plus 0.71 times the num857 variable, and create a new variable called X, which is basically the sum of those two variables. Then I could take the difference of those two variables, by doing 0.71 times num415 minus 0.71 times num857. So X is adding the two variables together, and Y is subtracting the two variables.
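As a sketch, that sum and difference could be computed like this (assumes the `training` object from earlier; names are illustrative):

```r
# Weighted sum and difference of the two correlated frequency variables
X <- 0.71 * training$num415 + 0.71 * training$num857  # the "sum" combination
Y <- 0.71 * training$num415 - 0.71 * training$num857  # the "difference" combination
plot(X, Y)  # spread along X; Y clustered near 0
```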

3:44

So then, if I plot those variables against each other — the sum on the x-axis, the difference on the y-axis — you can see that most of the variability is happening along the x-axis. In other words, there are lots of points spread out across the x-axis, but most of the points are clustered right at 0 on the y-axis; almost all of these points have a y value of 0. So adding the two variables together captures most of the information in those two variables, and subtracting them captures much less. The idea here is that we might want to use the sum of the two variables as a predictor. That will reduce the number of predictors we have to use and remove some of the noise.

There are two related problems for how you do this in a more general sense. The first: find a new set of variables, based on the variables you have, that are uncorrelated and explain as much variability as possible. In other words, from the previous plot, we're looking for the X variable, which has lots of variation in it, and not the Y variable, which is almost always 0. The second: if you put all the variables together in one matrix, create the best matrix with fewer variables — in mathematical terms, a lower-rank matrix — that explains the original data. These two problems are very closely related; they're both the idea that we can use fewer variables to explain almost everything that's going on. The first is a statistical goal and the second is a data compression goal, but they're both very useful for machine learning.

5:14

So there are two related solutions, and they're very similar to each other. If X is a matrix with a variable in each column and an observation in each row — like a data frame you'd usually have in R — then the singular value decomposition is a matrix decomposition: it takes that matrix X and breaks it up into three matrices, a U matrix, a D matrix, and a V matrix. The columns of U are called the left singular vectors, the columns of V are called the right singular vectors, and D is a diagonal matrix whose entries are called the singular values.
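In R, that decomposition is available directly; a minimal sketch (assuming the `training` set from earlier, and scaling so the result lines up with principal components):

```r
# X = U D V^T via R's svd(); scale() so the SVD corresponds to the PCs
Xmat <- as.matrix(training[, -58])
s <- svd(scale(Xmat))
str(s$u)  # left singular vectors
str(s$d)  # singular values (the diagonal of D)
str(s$v)  # right singular vectors
```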

5:44

You will learn about this in the Getting and Cleaning Data or Exploratory Data Analysis classes, if you've taken those. The principal components are equal to the right singular vectors if you scale the data in the same way. In other words, the solutions to both of the problems I talked about on the previous slide are the same if you do the right scaling. So the idea here is that the variables in V are constructed to explain the maximum amount of variation in the data.

Just to show you how this works in a real example, suppose we take the spam data set and just those two variables that were highly correlated with each other, variables 34 and 32. We then do principal components — the same as the singular value decomposition — on the small data set consisting of just those two variables. If we plot the first principal component versus the second principal component, we see a plot very similar to the one I showed you earlier, where the first principal component looks like adding the two variables together, and the second principal component looks a lot like subtracting the two variables from each other.
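A sketch of that small example (object names are illustrative):

```r
# Principal components on just the two correlated columns
smallSpam <- spam[, c(34, 32)]       # num415 and num857
prComp <- prcomp(smallSpam)
plot(prComp$x[, 1], prComp$x[, 2])   # PC1 ~ the sum, PC2 ~ the difference
```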

So why would we do principal components instead of just adding and subtracting? Well, principal components lets you perform this operation even when you have more than just two variables. You may be able to reduce all of the variables down to a very small number of weighted sums and differences of the variables you've observed. So using principal components can let you take a large number of quantitative variables and reduce them quite a bit.

7:11

One other thing you can look at in this principal component object is the rotation matrix, which is basically how it's weighting the two variables to get each of the principal components. And here you can see why I put 0.71 in the sum and the difference on the earlier slide. Principal component one is just 0.7081 times num415 plus 0.7061 times num857. Principal component two is just the difference again: 0.7061 times num415 minus 0.7081 times num857. So basically, in this particular case, the first principal component — the one that explains the most variability — is just adding the two variables up, and the component that explains the second most variability in these two variables is taking the difference between them.
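Printing the rotation matrix shows the weights quoted above (output reconstructed from the numbers in the lecture; exact values depend on the data split):

```r
prComp$rotation
#            PC1     PC2
# num415  0.7081  0.7061
# num857  0.7061 -0.7081
```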

7:59

In the spam data we can actually do this for many more variables than just those two; this is why principal components may be useful. Here, I'm creating a variable that's just going to be the color we color our points by: black if a message is not spam, and red if it is spam. And this statement here calculates the principal components on the entire data set. You'll notice that I've applied a function to the data set — the log10 transform — and added one. I've done this to make the data look a little bit more Gaussian, because some of the variables are skewed rather than normal looking, and you often have to do that for principal component analysis to look sensible. So then I calculate the principal components of the entire data set.
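The full-data-set version might be sketched as follows (the color encoding and names are illustrative):

```r
# PCA on all predictors after a log10(x + 1) transform to reduce skewness
typeColor <- ifelse(spam$type == "spam", "red", "black")
prComp <- prcomp(log10(spam[, -58] + 1))
plot(prComp$x[, 1], prComp$x[, 2], col = typeColor,
     xlab = "PC1", ylab = "PC2")
```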

In this case I can now again plot principal component one versus principal component two. Principal component one is no longer a simple addition of two variables; it might be some quite complicated combination of all the variables in the data set, but it's the combination that explains the most variation in the data. Principal component two is the combination that explains the second most variation, principal component three explains the third most, and so forth. So if I plot principal component one — that's just a variable I've calculated — versus principal component two — another variable I've calculated — and color them by the spam indicator, then each of these points corresponds to a single observation: the red ones are spam messages and the black ones are ham messages. You can see that along principal component one there's a little bit of separation of the ham messages from the spam messages; in other words, the spam messages tend to have somewhat higher values of principal component one. So this is a way to reduce the size of your data set while still capturing a large amount of the variation, which is the idea behind feature creation.

9:54

You can do this in caret as well, using the preProcess function — basically doing a similar type of operation with the caret package. You pass the preProcess function the same data set as before, you tell it what method to use — in this case principal component analysis, or "pca" — and you tell it the number of principal components to compute. Then you can calculate the values of each new principal component. The principal components here are two variables, principal component one and principal component two, and they're basically a model that you fit to the data. The idea is that if you get a new observation, you have to predict what the principal components will look like for that new observation. So we pass this preProcess object and the data set to the predict function, and that gives us the principal components. If we plot them against each other — spamPC 1, that's principal component 1, versus principal component 2 — you again see a little bit of separation between the ham and the spam messages, in both principal component one and principal component two.
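The caret version of the same computation might look like this (assumes the `typeColor` variable from the earlier plot):

```r
# caret's preProcess with method = "pca"; pcaComp = 2 asks for two components
preProc <- preProcess(log10(spam[, -58] + 1), method = "pca", pcaComp = 2)
spamPC  <- predict(preProc, log10(spam[, -58] + 1))  # the PC scores
plot(spamPC[, 1], spamPC[, 2], col = typeColor)
```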

You can do this, like I showed you before, by doing preprocessing with the method "pca" using the preProcess function, and then creating training predictions using the predict function. Then you fit a model that relates the training outcome to the principal components. Notice that I haven't used the full training set as the data for fitting my model; I've just used the principal components for the model fitting.
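A sketch of fitting on the training-set principal components (using glm as the model here is my assumption; names are illustrative):

```r
# Compute PCs on the training set only, then fit a model to those PCs
preProc <- preProcess(log10(training[, -58] + 1), method = "pca", pcaComp = 2)
trainPC <- predict(preProc, log10(training[, -58] + 1))
trainPC$type <- training$type  # attach the outcome to the PC data frame
modelFit <- train(type ~ ., method = "glm", data = trainPC)
```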

In the test data set, you have to use the same principal components that you calculated on the training set for the test variables. So the idea here is that we again pass in the preProcess object we calculated on the training set, but now we pass it the new testing data. This predict function is going to take the principal components we calculated from training and get the values for the test data set on those same principal components.
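That step is one line (re-using the `preProc` object fit on the training set):

```r
# Apply the SAME preProcess object, fit on training, to the test data
testPC <- predict(preProc, log10(testing[, -58] + 1))
```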

11:51

Then what you can do is predict, using the modelFit, on the test principal components, and you can use the confusionMatrix function in caret to get the accuracy. And here, we calculated a relatively small number of principal components but still have a relatively high accuracy in prediction. So principal component analysis can reduce the number of variables while maintaining accuracy.
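Sketched out, the accuracy check looks like this (assumes the `modelFit` and `testPC` objects from above):

```r
# Compare true test labels against predictions made from the test-set PCs
confusionMatrix(testing$type, predict(modelFit, testPC))
```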

12:18

The other thing you can do is decide not to use the predict function separately for this analysis; you can build it right into your training exercise. If you take the train function from the caret package, pass it the training set, and tell it to preprocess with principal component analysis, it will do that preprocessing as part of the training process. Then, when you do prediction on a new data set, you just pass it the testing data, and it will actually calculate the PCs for you. The reason I showed you the more elaborate way first — calculating the PCs and passing them to the model — is so you can see what's going on under the hood when you pass a command like this to the train function in the caret package.
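The all-in-one version might be sketched as (again assuming glm as the model):

```r
# Let train() handle the PCA preprocessing internally
modelFit <- train(type ~ ., method = "glm", preProcess = "pca",
                  data = training)
# predict() on raw test data; the PCs are computed for you
confusionMatrix(testing$type, predict(modelFit, testing))
```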

13:02

This is most useful for linear-type models, including linear discriminant analysis, linear and generalized linear regression, things like that. It can make it a little bit harder to interpret the predictors. In the case where I only had two variables, it was just the sum and the difference of those variables, so it was very easy to interpret what that meant. In general, though, if you do principal components on a large number of quantitative variables, each principal component might be quite a complex weighted sum of the variables you've observed, and so it can be very hard to interpret.

You also have to watch out for outliers; outliers can really wreak havoc on calculating principal components. You deal with that by doing an exploratory analysis first and identifying outliers, and by doing transforms — like the log10 transform of the data I did here; you might do Box-Cox transformations as well. And again, plotting the predictors to identify problems is the key way to figure out whether this is working.
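As a hedged illustration, caret's preProcess can chain a Box-Cox transform ahead of the PCA step in a single call:

```r
# Box-Cox transform the predictors, then compute two principal components
preProc <- preProcess(training[, -58], method = c("BoxCox", "pca"),
                      pcaComp = 2)
trainPC <- predict(preProc, training[, -58])
```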

For more information, you can see the Exploratory Data Analysis class, where we talk about principal component analysis and SVD in more detail. And the book The Elements of Statistical Learning has a quite nice, if a little bit technical, overview of how principal components work for machine learning.
