0:00

This is a brief lecture about some of the training control options that you have when training models with the caret package.

For this lecture, we're going to be using the spam example again just to illustrate

how these ideas work.

So we load the caret package, the kernlab package, and

then we attach the spam dataset.

Then we use the createDataPartition function to create a set of indices corresponding to the training set, and we set about 75% of the data to be in the training set. Then we define training and testing sets using those indices.
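As a rough sketch, that setup looks something like this in R; the 75% split matches what was just described:

# Load caret, kernlab, and the spam data set
library(caret)
library(kernlab)
data(spam)

# createDataPartition returns indices for roughly 75% of the rows
inTrain <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]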

So usually, when you fit a model, you just use the train function like this, accepting whatever defaults the train function chooses for you, other than perhaps the method you're going to use to fit the model and which data set you're going to use.
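A minimal version of that default call might look like this; method = "glm" is simply the choice used in the spam example, and everything else is left at train's defaults:

# Fit a model using train's defaults for everything except the method and data
modelFit <- train(type ~ ., data = training, method = "glm")
modelFit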

0:45

You can go a little bit further than this, though.

For example, you can use a large set of options for training.

So, here are a couple of them.

One, you can use this preProcess parameter to set a bunch of preprocessing options.

We'll talk about that in a future lecture.

You can also set weights.

In other words, you can upweight or downweight certain observations.

This is particularly useful if you have a very unbalanced training set, where you have a lot more examples of one type than another.

You can set the metric: by default, for factor variables, in other words for categorical variables, the metric it tries to maximize is accuracy.

For continuous variables it's the root mean squared error,

like we talked about in a previous lecture.

1:40

So the metric options built into the train function for continuous outcomes are RMSE, or root mean squared error.

You can also use RSquared.

This is the RSquared that you get from a regression model if

you remember that from the inference class.

RSquared is a measure of linear agreement between the variables that you're

predicting and the variables that you predict with.

2:11

Accuracy is the fraction correct, so that's just the fraction of predictions that you get right. And that's the default metric for categorical outcomes.

You can also tell it to use Kappa, which is a measure of concordance.

I've linked here to a definition of that measure.

It's a more in-depth, more complicated measure that's frequently used in some prediction competitions.
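As a sketch, those arguments are passed straight to train; the preprocessing choices and the Kappa metric here are only illustrative, and a weights vector would be supplied in the same way:

# Illustrative use of the preProcess and metric arguments to train
modelFit <- train(type ~ ., data = training,
                  method = "glm",
                  preProcess = c("center", "scale"),
                  metric = "Kappa")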

2:35

The trainControl argument allows you to be much more precise about

the way that you train models.

You can tell it which method to use for resampling the data, whether it's bootstrapping or cross-validation, which we'll talk about in a minute.

You can tell it the number of times to do bootstrapping or cross validation.

You could also tell it how many times to repeat that whole process if you want to

be careful about repeated cross-validation.

You can tell it the size of the training set with the p parameter, and then you can set a bunch of other parameters that depend on the specific problem you're working on. So for example, for time course data, initialWindow tells it the number of time points that will be in the training data, and horizon is the number of time points that you'll be predicting.
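For example, a time-slice control object might be set up roughly like this; the window sizes are made-up values purely for illustration:

# Time-slice resampling: initialWindow training points, horizon points predicted
tsControl <- trainControl(method = "timeslice",
                          initialWindow = 20,
                          horizon = 10,
                          fixedWindow = TRUE)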

You can also have it return the actual predictions themselves from each of

the iterations when it's building the model.

You can also have it use a different summary function than the default summary if you'd like it to.

And then you can set preprocessing options as well as things like prediction bounds,

and you can set the seeds for all the different resampling layers.

This is particularly useful if you're going to be parallelizing your

computations across multiple cores.

We're not going to cover that too extensively here, but if you're training

models on large numbers of samples with a high number of predictors,

using parallel processing can be highly useful for improving the computational efficiency of your analysis.
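Putting a few of these together, a customized trainControl might look roughly like this; the particular choices here are illustrative, not recommendations:

ctrl <- trainControl(method = "cv",             # resampling method
                     number = 10,               # number of folds or resamples
                     savePredictions = TRUE,    # keep the hold-out predictions
                     summaryFunction = defaultSummary,
                     allowParallel = TRUE)      # use a parallel backend if one is registered

modelFit <- train(type ~ ., data = training, method = "glm", trControl = ctrl)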

For resampling, there are a bunch of methods that are offered,

so this is again passed to the trainControl function.

You can use standard bootstrapping, you can use bootstrapping that adjusts for

the fact that multiple samples are repeatedly

resampled when you're doing that subsampling.

This will reduce some of the bias due to bootstrapping.

You can use cross validation which is a method that we've talked about in

previous lectures.

You could also use repeated cross-validation if you want to do cross-validation multiple times with different random draws.

You could use leave-one-out cross-validation, and remember there's a bias-variance trade-off between using a large number of folds and a smaller number of folds when doing cross-validation.

You can also tell it the number of bootstrap samples or

the number of subsamples to take and the number of times to repeat that subsampling

if you're doing something like repeated cross validation.

All of these parameters can be set.

In general the defaults work pretty well,

but if you have large numbers of samples or you have a model that requires

fine tuning across a large number of parameters, you may want to increase for

example the number of cross-validation or bootstrap samples that you take.
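As a sketch, the resampling method, the number of resamples, and the number of repeats are all set in trainControl; the values here are illustrative:

# 10-fold cross-validation repeated 3 times
rcvControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Other resampling options include "boot", "boot632", "cv", and "LOOCV"
bootControl <- trainControl(method = "boot", number = 25)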

5:15

Finally, an important component of training these models is setting the seed.

So what I mean by that is most of these procedures rely on random

resampling of the data.

And if you rerun the procedure, or rerun the code, you will get a slightly different answer because of the random draws that were made when you were doing cross-validation.

If you set a seed, a random number seed, what that'll do is that'll ensure that

the same random numbers get generated each time.

It's a little bit of a difficult concept to get your head around but

the idea is that the computer is generating pseudo random numbers.

And if you set the seed it will generate the same sequence of pseudo random

numbers again.

6:19

So, here's an example of that.

If I set the seed using the set.seed function in R and give it an integer number, it puts the random number generator in a consistent state for performing this analysis. So it will generate a set of random numbers that is consistent each time.

So then if I fit my model, like this, when it generates bootstrap samples, it will generate those bootstrap samples according to the random numbers that come from this seed.

If I then reset the seed to the same value and fit the model again, now as a new object, modelFit3 instead of modelFit2, then I will get exactly the same bootstrap samples and exactly the same measures of accuracy back out again.
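Concretely, that sequence looks something like this; the seed value itself is arbitrary, as long as the same one is set before each fit:

set.seed(1235)
modelFit2 <- train(type ~ ., data = training, method = "glm")

# Resetting the same seed before the second fit reproduces the same resamples
set.seed(1235)
modelFit3 <- train(type ~ ., data = training, method = "glm")

# The resampled accuracy estimates should now match exactly
modelFit2$resample
modelFit3$resample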

This is important when you're training models and then you want to share your training data set with someone else and ensure that they get the same answer when they run the same code.

There's more information about this in the caret tutorial which I think is very good.

And also in this document about model training and

tuning with the caret package.