This lecture's about preprocessing predictor variables. And, so as we saw on previous lectures, you need to plot the variables upfront so you can see if there's any sort of weird behavior of those variables. And sometimes predictors will look very strange or the distribution will be very strange, and you might need to transform them in order to make them more useful for prediction algorithms. This is particularly true when you're using model based algorithms. Like linear discriminate analysis, naive Bayes, linear regression and things like that. We'll talk about all those methods later in the class, but just keep in mind that pre-processing can be more useful often when you're using model based approaches, then when you're using more non parametric approaches. So, why preprocess? Here again I'm loading the later, the caret package, and I'm learning the kernlab package and then I'm attaching the spam data. Again just like I talked about previously, when you're deciding how to preprocess data, or how to explore data we only look at the training set. So, we split data right away into training and testing data and we set the testing data aside for later. Now if I look at one of the variables, so again, this is spam data, so we're trying to predict whether the data is spam or if it's good emails, ham. And, so the variables are things like how many capitals do we see in a row? What's the run length for the number of capitals in a row in an email? If you take, make a histogram of those values, you see, for example,. That almost all of the run links are very small, but there are a few that are much, much larger. This is an example of a variable that is very skewed, and, so it's very hard to deal with in model based predictors and so you might want to preProcess. So, if you take the mean of this variable, it's about 4.7. But the standard deviation is huge, it's much much larger. So, it's much more highly variable variable. And so, what you might want to do is do some sort of preprocessing, so the machine learning algorithms don't get tricked by the fact that it's skewed and highly variable. So, one way that you can do this is by basically standardizing variables, and the usual way to standardize variables, is to take the variable values, and subtract their mean. Then put, so you take the value, subtract the mean, and then divide that whole quantity by the standard deviation. When you do that the mean of the variables that you have will be zero. And the standard deviation will be one, so that will reduce a lot of that variability that we saw previously. And it will, standardize the variable somewhat. One thing to keep in mind is, again, when we apply a prediction algorithm to the test set. We have to be aware that we can only use parameters that we estimated in the training set. In other words, when we apply this same standardization to the test set, we have to use the mean from the training set, and the standard deviation from the training set, to standardize the testing set values. What does this mean? It means that when you do the standardization, the mean will not be exactly zero in the test set. And the standard deviation will not be exactly one, because we've standardized by parameters estimated in the training set, but hopefully they'll be close to those values even though we're using not the exact values built in the test set. You can also use the preProcess function to do a lot of standardization for you. So, the preprocess function is a function that is built into the caret package. And here I'm passing it all of the training variables except for one, except for the 58th in the data set, which is the actual outcome that we care about. And I'm telling it to center every variable and scale every variable. That will do that same transformation that we talked about previously to the data, where you subtract the mean and divide by the standard deviation. And you can see that by looking at the mean of the value capitalAve, just like we did before. And you can see that after using the preProcess function the mean is zero, and the standard deviation is one. So, preprocess can be used to perform a lot of the preprocessing tool, techniques that you, you used to have to do by hand. The other thing that you can do is you can use the object that's created using the preprocessing technique to apply that same preprocessing to the test set. So, here this preObj was the object used on the previous slide. That was the object that we created by preprocessing the training set. So, looking at that value we can see, now, if we pass the testing set to the predict function. Then what it'll do is it will take the values calculated on the preprocessing object and apply them to the test set object, and so again, in the pre, post process test set data the mean won't exactly be equal to zero for any variable and the standard deviation won't exactly be equal to one, but they'll be close, because we used the training set values to normalize. You can also pass the preprocessed commands directly to the train function in caret, as an argument. So, for example here we can send to the preprocessed argument of the train function, the command, the parameters center and scale, and that will center and scale all of the predictors, before using those predictors in the prediction model. The other thing that you can do is do other kinds of transformation. So, centering and scaling is one approach, and that will take care of some the problems that we see in these data. You can remove, very strongly biased predictors or predictors that have super high variability. The other thing that you can do is use other different kinds of transformations one example is the box cox transforms. Box cox transforms are a set of transformations that take continuous data, and try to make them look like normal data and they do that by estimating a specific set of parameters using maximum likelihood. So, if I use the preprocess function and I tell it to perform box cox transformations on each of the variables, and then I predict. Each of the different variables using that preprocess object on the training set. I can look at the capital average values, and I can make a histogram of those values. And now, remember in the original plot on the histogram, they looked like a huge pile at zero and a few values that were large. And now you see something that look a little bit more like a normal distribution, a little bit more like a bell curve here. You will notice that it doesn't take care of all of the problems. So, for example there's still a stack set of values here at zero and in the Q-Q plot, so this is showing the theoretical quantiles of the normal distribution. Versus the sample quintiles that we calculated for our preProcess data, you can see that they don't perfectly line up and in particular there's this again chunk down here at the bottom these don't lie on a perfect 45 degree line, and that's because if you have a bunch of values that are exactly equal to zero this is a continuous transform and it doesn't take care of values that are repeated. So, it doesn't take care of a lot of the problems that would happen, would occur with using a variable that's highly skewed though. So, the thing that we can do is also impute data for these data sets. So, it's very common to have missing data. And when you're using missing data in the data sets, the prediction algorithms often fail. Prediction algorithms are built not to handle missing data in most cases. So, if you have some missing data. You can impute them using something called k-nearest neighbor's imputation. So here we set the seed again, because this is a randomized algorithm, and we want to get reproducible results. And I take just these capital average values and, I create a new variable called CapAve. Then I generate randomly a bunch of values using the rbinom function to set equal to NA and I set those values to be missing. So, now this variable capAve is exactly like the capitalAve valuable only it has a subset of values that are missing. So, now I want to know how would I handle those missing values? I did this so that we could see how we could handle missing values in a dataset. So, one thing that you going to do is you going to get news as preProcess function and tell it to do k-nearest neighbors imputation. K-nearest neighbors computation find the k. So if k equal to ten, then the 10 nearest, data vectors that look most like data vector with the missing value, and average the values of the variable that's missing and compute them at that position. So, if we do that, then we can predict on our training set, all of the new values, including the ones that have been imputed with the k-nearest imputation algorithm. We can then standardize those values, using the same standardization procedure that we did before, by subtracting the mean and divided by the standard dev, deviation. One thing you can do is you can look at the comparison between the actual, in this case, when we set some of the values to be equal to NA in advance, we can look at the values that were imputed, and the values that were truly there before we removed them and made them NAs. And we can see how close those two values are to each other. And so, you can see, for example, that. Those values are relatively close to each other. Most of the differences are very close to zero. Here you can see the values are mostly very close to zero. So the imputation work relatively well. You can also do look at just the values that were imputed. So again here I'm looking at a capAve quantile of the same difference between the imputed values. And the true values, they're only for the ones that were missing. And here you can see again, most of the values are close to zero, but here we're only looking at the ones we're missing, so clearly some of them are more variable than previously. And then you can look at the ones that were not the ones that we selected to be NA. And you can see that they're even closer to each other, and so the ones that got imputed are a little bit further apart. But aren't that much further apart. So, there is a lot more to learn about training and testing sets in terms of transformations. But train, but the things to keep in mind are that training in test sets must be processed in the same way. The caret package handles a lot of this under the hood in the sense that if you train a data set using preProcess functions. Built into the train function and caret way it applies that preprocessed function to the test set. It will handle all of the preprocessing in the correct way for you. In general, you need to pay attention to the fact that anything you do to the training set. Will create a set of parameters. You must use only those parameters when you apply it to the test set. You can't estimate new transformations on the test set alone. And that means that the test set transformations will likely be imperfect. Especially if the test and training sets are diff, collected at different times or in different ways. Some of the transformations won't necessarily work as well. All of the tra, transformations I'm talking about, so far other than imputation are based on continuous variables. When dealing with factor variables it's a little bit more difficult to know what's the right transformation. Most machine learning algorithms are built to deal with either binary predictors, in which the binary predictors are not pre-processed. Or continuous predictors in which case sometimes it's expected that, the data are preprocessed to look more normal. You can go to this link that I've linked here to look, into more information about how to preProcess data for prediction using the caret package.