0:18

The functionality that's built into the caret package includes the following. For example, we can use the preprocessing tools in the caret package to clean data and get the features set up so that they can be used for prediction. We can also do cross validation and data splitting within the training set, using the createDataPartition and createTimeSlices functions. We can train and test prediction functions with the train and predict functions, and apply them to new data sets. And we can do model comparison using the confusionMatrix function, which will give you information about how well the models did on new data sets.

0:57

There are a large number of machine learning algorithms built into R. These range from very popular statistical machine learning algorithms, like linear discriminant analysis and regression, to algorithms more widely used in computer science, like support vector machines, classification and regression trees, random forests, and boosting. All of these algorithms are built by a variety of different developers coming from different backgrounds, so the interface to each of these prediction algorithms is slightly different. As one example, consider the class of different prediction algorithms you could apply, everything from linear discriminant analysis down to boosting. For each of these algorithms, you can imagine creating a model object, say obj. That object will have a different class, say lda or glm and so forth. For each of these objects we can try to predict, but we have to pass slightly different parameters to the predict function each time in order to get the prediction of the outcome. For example, for a glm fit we have to say type = "response" to get the prediction of the response from that model fit. Or, if we use rpart, we predict with type = "prob" in order to predict the response. In each case they're a little bit different, and the caret package provides a unifying framework that allows you to predict using just one function, without having to specify all the options you might care about in order to get the same prediction out.
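As a small sketch of this inconsistency (not from the lecture; I'm using the built-in mtcars data and hypothetical fit names just for illustration), the same "give me predicted probabilities" task needs different arguments depending on which package fit the model:

```r
library(rpart)

# Two fits of the same binary outcome with two different packages
fit_glm   <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
fit_rpart <- rpart(factor(am) ~ mpg + wt, data = mtcars)

# glm: probabilities come back with type = "response"
p_glm   <- predict(fit_glm, newdata = mtcars, type = "response")

# rpart: probabilities come back with type = "prob"
p_rpart <- predict(fit_rpart, newdata = mtcars, type = "prob")
```

caret hides exactly this kind of per-package difference behind a single predict call.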

2:22

So here's a quick example using the caret package. We'll go into the details of how this is done very specifically in later examples. Here we've loaded the caret package, and we've loaded the kernlab package as well to get the spam data set. What we can do first is partition the data set up into a training and a test set. Here I'm going to use the spam type, and I'm going to split it into the training set and the test set, using about 75% of our data to train the model and 25% to test. Then I can subset the data into the training data using the inTrain object that comes out of createDataPartition, and I can create the testing data set by finding all those samples that aren't in the training set. This gives me a subset of the data that is just for training and a subset that is just for testing, and you can do this with a simple interface.
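A sketch of the partitioning step just described (this follows the standard caret workflow; the variable names are my choice):

```r
library(caret)
library(kernlab)
data(spam)

# Put about 75% of the rows in the training set, stratified on the outcome
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]   # everything not in the training set
```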

3:17

Next you can fit a model. Here I'm going to use the train command from the caret package, and again I'm trying to predict type. I use the tilde and the dot to say: use all the other variables in this data frame to predict the type. I tell it which data set to build the training model on, in this case the training data set we created on the previous slide, and then I tell it which method I'd like to use; you can use glm, or you can use a bunch of other models. What this does is create a model fit from the train function, using the 3451 samples in the training set and the 57 predictors to predict which class each message belongs to, based on a glm model. It can use a bunch of different ways of testing whether this model will work well and use them to select the best model. In this case it used resampling: it does bootstrapping with 25 replicates, and it corrects for the potential bias that might come from bootstrap sampling.
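The fitting step above can be sketched as follows (assuming training was created with createDataPartition as described):

```r
library(caret)
library(kernlab)
data(spam)

inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

# type ~ . means "predict type from all the other variables"
modelFit <- train(type ~ ., data = training, method = "glm")
modelFit   # prints sample size, predictors, resampling scheme, accuracy
```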

4:16

So once we fit that model, we can actually look at it by examining the finalModel component of the modelFit object. The way you do that is you take the modelFit object and type a dollar sign and then finalModel, and it will tell you the actual fitted values that you got for that glm model.
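As a one-line sketch (assuming modelFit is the object returned by the train call described above):

```r
# The underlying glm fit that caret selected is stored in finalModel
modelFit$finalModel
```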

4:36

Then you predict on new samples using the predict command. Again, it's a unified framework, so we just type predict, pass it the modelFit that we got from the train function in caret, and pass it the data we would like it to predict on; in this case, the new data is the testing data. When you do that, it will give you a set of predictions that correspond to the responses, and you can use those to evaluate whether your model fit works well or not.
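The prediction step can be sketched like this (same standard caret workflow as above; the object names are my choice):

```r
library(caret)
library(kernlab)
data(spam)

inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]
modelFit <- train(type ~ ., data = training, method = "glm")

# The same predict() call works no matter which method train() used
predictions <- predict(modelFit, newdata = testing)
```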

One way you can do that is by calculating the confusion matrix, using the confusionMatrix function; note the capital M here, and don't miss it when you're typing confusionMatrix. You pass in the predictions that you got from your model fit, and then the actual outcome on the testing samples; in this case the type, whether it was a spam or ham message. It will then report the confusion matrix: a table of the cases you predicted to be nonspam that were actually nonspam, the cases where it was spam and you predicted it to be spam, and so forth. It also gives you a bunch of summary statistics: for example the accuracy, a 95 percent confidence interval for the accuracy, and information about how well predictions and outcomes correspond in other categories, such as the sensitivity and the specificity. So the confusionMatrix function wraps up a bunch of different accuracy measures that you might want to get out when you're evaluating the model fit.
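The evaluation step above, as a sketch (assuming predictions and testing come from the predict and createDataPartition steps described earlier):

```r
# Compare predicted labels against the true labels on the test set;
# note the capital M in confusionMatrix
confusionMatrix(predictions, testing$type)
```

This prints the 2x2 table plus accuracy, its 95% confidence interval, sensitivity, specificity, and related measures in one call.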

For a lot more information about caret, we're going to cover a lot of it in this class in terms of how you actually apply the caret package. But I've found these tutorials to be very nice, and they can be very useful for covering material that we don't cover in this class. There's also a very nice paper in the Journal of Statistical Software that introduces the caret package, if you want further information.
