0:00

This is a brief lecture about some of the training control options that you have when training models with the caret package. For this lecture, we're going to use the spam example again to illustrate how these ideas work. So we load the caret package, the kernlab package, and then we attach the spam dataset. Then we use the createDataPartition function to create a set of indices corresponding to the training set, putting about 75% of the data in the training set. Then we define training and testing sets using those indices. So usually what you would do when you fit a model is just use the train function, like this, accepting whatever defaults the train function chooses for you, other than maybe the method that you're going to use to fit and which dataset you're going to use.
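As a sketch of the setup just described (this mirrors the standard caret workflow from the lecture; the 75% split and the glm method follow the slides, while the seed value here is arbitrary):

```r
library(caret)
library(kernlab)
data(spam)

# Put ~75% of the observations in the training set
inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]
testing  <- spam[-inTrain, ]

# Fit with mostly default settings: only the method and the data are specified
set.seed(32343)
modelFit <- train(type ~ ., data = training, method = "glm")
```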

0:45

You can go a little bit further than this, though. For example, you can use a large set of options for training. So, here are a couple of them. One, you can use the preProcess parameter to set a bunch of preprocessing options; we'll talk about that in a future lecture. You can also set weights, in other words upweight or downweight certain observations. This is particularly useful if you have a very unbalanced training set where you have a lot more examples of one class than another. You can also set the metric: by default, for factor variables, in other words for categorical variables, the metric is accuracy, which train tries to maximize. For continuous variables it's the root mean squared error, like we talked about in a previous lecture.
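A hedged sketch of passing those arguments to train (preProcess, weights, and metric are real train arguments; the particular choices below are only illustrative, not recommendations):

```r
library(caret)
library(kernlab)
data(spam)

# Illustrative per-observation weights: upweight the rarer class
w <- ifelse(spam$type == "spam", 2, 1)

modelFit <- train(type ~ ., data = spam,
                  method = "glm",
                  preProcess = c("center", "scale"),  # preprocessing options
                  weights = w,                        # observation weights
                  metric = "Accuracy")                # metric to optimize
```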

1:40

So the metric options that are built into the train function for continuous outcomes are RMSE, or root mean squared error, and Rsquared. This is the R-squared that you get from a regression model, if you remember that from the inference class. R-squared is a measure of linear agreement between the variable that you're predicting and the variables that you predict with.

2:11

Accuracy is the fraction correct, so that's just the proportion of predictions that you get right, and that's the default metric for categorical outcomes. You can also tell it to use Kappa, which is a measure of concordance. I've linked here to a definition of that measure. It's a more in-depth, more complicated measure that's frequently used in some prediction competitions.
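For example (a minimal sketch, reusing the spam data from earlier), selecting a model on Kappa instead of Accuracy is just a change of the metric argument:

```r
library(caret)
library(kernlab)
data(spam)

# Optimize the concordance measure Kappa rather than raw Accuracy
modelFit <- train(type ~ ., data = spam, method = "glm", metric = "Kappa")
```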

2:35

The trainControl function allows you to be much more precise about the way that you train models. You can tell it which method to use for resampling the data, whether it's bootstrapping or cross-validation, which we'll talk about in a minute. You can tell it the number of times to do bootstrapping or cross-validation, and you can also tell it how many times to repeat that whole process if you want to be careful about repeated cross-validation. You can tell it the size of the training set with the p parameter, and then you can set a bunch of other parameters that depend on the specific problem you're working on. So for example, for time course data, initialWindow tells it the size of the training dataset, in other words the number of time points that will be in the training data, and horizon is the number of time points that you'll be predicting. You can also have it return the actual predictions themselves from each of the iterations when it's building the model. You can also have it use a different summary function than the default if you'd like. And then you can set preprocessing options as well as things like prediction bounds, and you can set the seeds for all the different resampling layers. This is particularly useful if you're going to be parallelizing your computations across multiple cores. We're not going to cover that too extensively here, but if you're training models on large numbers of samples with a high number of predictors, using parallel processing can be highly useful for improving the computational efficiency of your analysis.
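A minimal sketch of what that might look like (method, number, repeats, and savePredictions are real trainControl arguments; the specific values here are just illustrative):

```r
library(caret)

# Sketch: 10-fold cross-validation, repeated 3 times,
# keeping the held-out predictions from each resampling iteration
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,            # number of folds
                     repeats = 3,            # repeat the whole CV process
                     savePredictions = TRUE) # return the resample predictions

# The control object is then passed to train via trControl, e.g.:
# modelFit <- train(type ~ ., data = training, method = "glm", trControl = ctrl)
```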

For resampling, there are a bunch of methods on offer, and again these are passed to the trainControl function. You can use standard bootstrapping, or you can use bootstrapping that adjusts for the fact that some samples are repeatedly resampled when you're doing that subsampling; this will reduce some of the bias due to bootstrapping. You can use cross-validation, which is a method that we've talked about in previous lectures. You can also use repeated cross-validation if you want to do the cross-validation several times with different random draws. You can use leave-one-out cross-validation, and remember there's a bias-variance trade-off between using a large number of folds and a smaller number of folds when doing cross-validation. You can also tell it the number of bootstrap samples or the number of folds to take, and the number of times to repeat that subsampling if you're doing something like repeated cross-validation. All of these parameters can be set. In general the defaults work pretty well, but if you have large numbers of samples, or you have a model that requires fine-tuning across a large number of parameters, you may want to increase, for example, the number of cross-validation folds or bootstrap samples that you take.
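The resampling choices above map onto method strings in trainControl (these are standard caret values; "boot632" is the bias-adjusted bootstrap mentioned above):

```r
library(caret)

# Each of these is a valid resampling specification for trainControl:
boot_ctrl <- trainControl(method = "boot",       number = 25)  # standard bootstrap
b632_ctrl <- trainControl(method = "boot632",    number = 25)  # bias-adjusted bootstrap
cv_ctrl   <- trainControl(method = "cv",         number = 10)  # 10-fold cross-validation
rcv_ctrl  <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
loo_ctrl  <- trainControl(method = "LOOCV")                    # leave-one-out CV
```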

5:15

Finally, an important component of training these models is setting the seed. What I mean by that is that most of these procedures rely on random resampling of the data. If you rerun the procedure, or rerun the code, you will get a slightly different answer, because a different random draw was made when you were doing the cross-validation. If you set a seed, a random number seed, that will ensure that the same random numbers get generated each time. It's a little bit of a difficult concept to get your head around, but the idea is that the computer is generating pseudo-random numbers, and if you set the seed it will generate the same sequence of pseudo-random numbers again.
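This is easy to see with base R alone, before any model fitting enters the picture (a minimal sketch; the seed value 32343 is arbitrary):

```r
set.seed(32343)          # fix the pseudo-random number generator state
first_draw <- rnorm(5)   # five pseudo-random normal deviates

set.seed(32343)          # reset to the same state
second_draw <- rnorm(5)  # the identical five numbers come back out

identical(first_draw, second_draw)  # TRUE
```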

6:19

So, here's an example of that. If I set the seed using the set.seed function in R and give it an integer, it will fix the random number generator so that it produces a consistent set of random numbers for the analysis. So then if I fit my model, like this, when it generates bootstrap samples, it will generate those bootstrap samples according to the random numbers that come from this seed. If I then reset the seed to the same value and fit the model again, now under a different name, modelFit3 instead of modelFit2, I will get exactly the same bootstrap samples and exactly the same measures of accuracy back out again.
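Putting it together, a hedged sketch of the pattern just described (the object names modelFit2 and modelFit3 follow the lecture; the seed value here is arbitrary):

```r
library(caret)
library(kernlab)
data(spam)

inTrain  <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[inTrain, ]

set.seed(1235)
modelFit2 <- train(type ~ ., data = training, method = "glm")

set.seed(1235)  # same seed => same bootstrap samples => same accuracy estimates
modelFit3 <- train(type ~ ., data = training, method = "glm")
```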

This is important when you're training models and you then want to share your code and training dataset with someone else and ensure that they get the same answer when they run the same code. There's more information about this in the caret tutorial, which I think is very good, and also in this document about model training and tuning with the caret package.
