0:18

The functionality that's built into the caret package includes some of the following.

So, for example, we can use the preprocessing tools in the caret package to clean data and get the features set up so that they can be used for prediction.

We can also do cross-validation and data splitting within the training set, using the createDataPartition and createTimeSlices functions.

We can also build models with the train function, and we can use the predict function to apply those trained models to new data sets.

We can also do model comparison using the confusionMatrix function, which will give you information about how well the model did on new data sets.
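
As a quick taste of the data-splitting functions, here is a small sketch of createTimeSlices on a toy series; the window sizes are illustrative, not prescribed by the lecture:

```r
library(caret)

# Time-ordered resampling on a toy series of 20 points:
# each slice trains on a 10-point window and tests on the next 3 points
slices <- createTimeSlices(1:20, initialWindow = 10, horizon = 3)

names(slices)       # the "train" and "test" window lists
slices$train[[1]]   # indices 1..10
slices$test[[1]]    # indices 11..13
```

Unlike random splitting with createDataPartition, these slices preserve the time ordering of the observations.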

0:57

There are a large number of machine learning algorithms built into R. These range from very popular statistical machine learning algorithms, like linear discriminant analysis and regression, to algorithms more widely used in computer science, like support vector machines, classification and regression trees, random forests, and boosting.

All of these algorithms were built by a variety of different developers, all coming from different backgrounds, so the interface to each of these prediction algorithms is slightly different.

As one example of this, consider the class of different prediction algorithms that you could have applied, everything from linear discriminant analysis down to boosting.

And so for each of these different algorithms, you can imagine creating an object, call it obj. That object will have a different class, say lda or glm, and so forth.

And for each of these objects, if we apply the predict function, we have to pass slightly different parameters each time in order to get the prediction of the outcome.

So, for example, for a glm model fit, we have to say type = "response" to get the predicted response from that model fit. Or, for example, if we want to use rpart, we have to predict with type = "prob" in order to get the predicted class probabilities.

In each case, the call is a little bit different, and the caret package provides a unifying framework that allows you to predict using just one function, without having to specify all the options you might otherwise need in order to get the same prediction out.
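
To make the mismatch concrete, here is a small sketch using the built-in mtcars data (not the spam data from the lecture) to show the two different predict conventions side by side:

```r
library(rpart)

# Two models fit outside of caret, each with its own predict() convention
fit_glm   <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
fit_rpart <- rpart(factor(am) ~ mpg + wt, data = mtcars)

head(predict(fit_glm,   type = "response"))  # glm wants type = "response"
head(predict(fit_rpart, type = "prob"))      # rpart wants type = "prob"
```

With caret, the same predict(modelFit, newdata = ...) call works regardless of which method was used to fit the model.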

2:22

So here's a quick example using the caret package. We'll go into the details of exactly how this is done in later examples.

So here we've loaded in the caret package, and we've loaded the kernlab package as well to get the spam data set.

And so, what we can do first is partition the data set into a training set and a test set.

Here I'm going to use the spam type variable, and I'm going to split the data into a training set and a test set. I'm going to use about 75% of our data to train the model and 25% to test it.

Then what I can do is subset the data into the training set using the inTrain object that comes out of createDataPartition, and I can create the testing set by selecting all the samples that aren't in the training set. This gives me a subset of the data that's just for training and a subset that's just for testing.

And you can do all of this with a simple interface.
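
The partitioning step just described looks roughly like this; the object names follow the lecture slides, and the row counts are approximate:

```r
library(caret)
library(kernlab)
data(spam)

# Put ~75% of the messages in the training set, the rest in the test set
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

dim(training)  # roughly 3451 rows
dim(testing)   # roughly 1150 rows
```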

3:17

Next you can fit a model. Here I'm going to use the train command from the caret package, and again I'm trying to predict type. I use the tilde and the dot to say: use all the other variables in this data frame in order to predict type. I tell it which data set to build the model on, in this case the training set we created on the previous slide. Then I just tell it which method I'd like to use; here that's glm, but you can use a bunch of other models as well.

And so what this does is create a model fit from the train function, using the 3,451 samples in the training set and the 57 predictors to predict which class each message belongs to, based on a glm model.

And train can test in a bunch of different ways whether this model will work well, and use those results to select the best model.

In this case it used resampling: bootstrapping with 25 replicates, with a correction for the potential bias that might come from bootstrap sampling.
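
Put together, the model-fitting step sketched above is just the following; the seed is my addition, since the bootstrap resampling is random:

```r
library(caret); library(kernlab); data(spam)

inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

set.seed(32343)  # resampling is random; fix the seed for reproducibility
modelFit <- train(type ~ ., data = training, method = "glm")
modelFit  # prints the sample/predictor counts and the resampling summary
```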

4:16

So once we fit that model, we can actually look at it by examining the finalModel component of the modelFit object. The way you do that is to take the modelFit object and type a dollar sign, and then finalModel. It will show you the actual fitted values that you got for that glm model.

4:36

Then you predict on new samples by using the predict command. Again, it's a unified framework, so we just type predict, pass it the modelFit that we got from the train function in caret, and pass it the data we would like it to predict on.

So in this case, the new data is the testing data.

When you do that, it will give you a set of predictions that correspond to the responses, and you can use those to evaluate whether your model fit works well or not.
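
Continuing the example, and repeating the setup so the snippet stands on its own:

```r
library(caret); library(kernlab); data(spam)

inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]
modelFit <- train(type ~ ., data = training, method = "glm")

# The unified interface: the same call works no matter which method was used
predictions <- predict(modelFit, newdata = testing)
head(predictions)  # a factor of predicted labels, spam / nonspam
```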

One way that you can do that is by calculating the confusion matrix, using the confusionMatrix function; note the capital M there. Don't miss that when you're typing confusionMatrix.

Then you pass in the predictions that you got from your model fit, and then the actual outcomes for the testing samples. In this case, that's the type variable: whether each message was spam or ham. It will then report the confusion matrix.

So it'll give you a table of the cases that you predicted to be nonspam that actually were nonspam, the cases that were spam and that you predicted to be spam, and so forth.

And then it gives you a bunch of summary statistics: for example, the accuracy, a 95% confidence interval for the accuracy, and other measures of how well the predictions and outcomes correspond, such as the sensitivity and the specificity.

So the confusionMatrix function wraps a bunch of different accuracy measures that you might want to get out when you're evaluating the model fit.
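
The call itself is simple; here it is on a small toy example with made-up labels, so the snippet runs on its own (in the lecture's example, the arguments would be the predictions from the model fit and testing$type):

```r
library(caret)

# Made-up predicted and actual labels, standing in for
# predict(modelFit, newdata = testing) and testing$type
predicted <- factor(c("spam", "spam", "nonspam", "nonspam", "spam"))
actual    <- factor(c("spam", "nonspam", "nonspam", "nonspam", "spam"))

confusionMatrix(predicted, actual)  # table plus accuracy, CI, sensitivity, ...
```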

For a lot more information about caret: we're going to cover a lot of it in this class, in terms of how you actually apply the caret package. But I've found that these tutorials are very nice, and they can be very useful for covering material that we don't cover in this class.

And there's also a very nice paper in the Journal of Statistical Software that introduces the caret package, if you want further information.