0:08

In the past few weeks, you've been using cross-validation to estimate test

error, and you have validated the selected model on a test data set.

Validation and cross-validation are critical in the machine learning process.

So it is important to spend a little more time on these concepts.

As we noted in the Bias-Variance Tradeoff video,

estimating the best statistical model on the training set

capitalizes on random, sample-specific patterns and associations among variables.

One of the challenges in machine learning is figuring out which model is best.

So how do we know which model is the best model?

0:46

We need to be able to estimate the test error which is the estimate of the error

in a model when it is tested on different observations.

We can then select the model that has the smallest test error.

There are different ways of estimating the test error.

One way to do this is to randomly split the data into training and test or

validation data sets.

The model is developed on the training data set,

then applied to the test data set to predict the values of

the response variable for the observations in the test data set.
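The split-fit-evaluate procedure just described can be sketched in a few lines of code. This is a minimal illustration, not part of the lecture: the synthetic data, the one-predictor least-squares model, and the 70/30 split ratio are all invented for the example.

```python
import random

random.seed(0)

# Synthetic data: y is roughly 2*x plus noise (an illustrative assumption).
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(100)]

# Randomly split the observations into training and test (validation) sets.
random.shuffle(data)
split = int(0.7 * len(data))
train, test = data[:split], data[split:]

def fit_linear(points):
    """Least-squares fit of y = a + b*x, returned as (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b = (sum((x - mx) * (y - my) for x, y in points)
         / sum((x - mx) ** 2 for x, _ in points))
    return my - b * mx, b

def mean_squared_error(model, points):
    """Mean squared error of the model's predictions on held-out points."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in points) / len(points)

model = fit_linear(train)                    # develop the model on the training set
err = mean_squared_error(model, test)        # estimate the test error on the test set
```

Note that `err` depends entirely on which observations the shuffle happened to place in the test set, which is exactly the drawback discussed next.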

This is called the validation set approach.

While easy to implement, the validation set approach has a couple of drawbacks.

First, the test error estimate can be highly variable depending on

which observations are included in the training and test data sets,

because the model is estimated on only a single training set

and then validated on only a single test set.

Second, because we end up splitting the observations into two data sets,

the model is developed on only a subset of the data.

Statistical methods generally perform worse when there are fewer observations,

leading to greater estimation error in the training data set,

which leads to poor performance of the statistical model in the test data set.

To address these drawbacks, we can use Cross Validation.

The goal of Cross Validation is to define a data set to test the model

during the training phase.

It involves partitioning the training data set into subsets,

where one subset is held out to test the performance of the model.

This data set is called the validation data set.

2:15

There are different cross-validation methods.

The Leave One Out Cross Validation method

holds out one observation from the training set for validation.

The statistical model is fit on the rest of the observations, and the response for

the single observation is predicted based on the values of the predictors

and the regression coefficients from the model estimated on the n-1 observations.

Then the process is repeated by holding out a different observation for

validation and training the data on the other observations.

Because the test error is based on only a single observation,

it is highly variable and is therefore a poor estimate of the true test error.

But, if we repeat this process for

every observation each time holding out a different validation observation and

using the rest of the observations to train the model, then we will end up

with as many test error estimates as observations in the full data set.

These individual test error estimates can be averaged

to get an overall test error estimate.
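The hold-out-one, fit, predict, and average loop can be sketched as follows. The tiny data set and the "predict the training mean" model are invented for illustration; any statistical model could stand in for the mean.

```python
# Leave-one-out cross-validation (LOOCV) sketch using only the standard library.
ys = [2.0, 4.0, 6.0, 8.0, 10.0]

squared_errors = []
for i, held_out in enumerate(ys):
    # Train on the n-1 remaining observations...
    training = ys[:i] + ys[i + 1:]
    prediction = sum(training) / len(training)  # the fitted "model"
    # ...and test on the single held-out observation.
    squared_errors.append((held_out - prediction) ** 2)

# Average the n individual test error estimates into one overall estimate.
loocv_error = sum(squared_errors) / len(squared_errors)
```

Each individual squared error here is highly variable (it ranges from 0 to 25 on this toy data), but the average over all n held-out observations is a much more stable estimate.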

The advantage to using Leave One Out Cross Validation is that the regression

coefficients will have less bias, because the model is fit on all but

one observation in the data set.

3:29

In addition, unlike the single validation set approach, the parameter estimates will

not vary as a result of how the data is split into training and test data sets,

because the test error is estimated multiple times and then averaged.

The disadvantage is that because the model is fit n times, where n is equal to

the number of observations in the data set, the Leave One Out Cross Validation

approach can be time consuming and computationally intensive,

especially in large data sets.

K-fold Cross Validation is a kind of compromise between the validation set and

leave one out cross validation approaches.

4:21

In K-fold Cross Validation, the data are partitioned into k folds, and each fold

is held out in turn for validation while the model is trained on the remaining

k-1 folds. Then the error for each of the folds is averaged,

and the model with the smallest amount of error is selected.
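A minimal sketch of the k-fold loop, again using only the standard library. The choice k=5, the synthetic data, and the mean-only model are assumptions made for illustration.

```python
import random

random.seed(0)
ys = [random.gauss(10, 2) for _ in range(50)]  # invented data
k = 5

random.shuffle(ys)
fold_size = len(ys) // k
fold_errors = []
for f in range(k):
    # Hold out fold f for validation; train on the other k-1 folds.
    validation = ys[f * fold_size:(f + 1) * fold_size]
    training = ys[:f * fold_size] + ys[(f + 1) * fold_size:]
    prediction = sum(training) / len(training)  # the fitted "model"
    fold_errors.append(sum((y - prediction) ** 2 for y in validation)
                       / len(validation))

# Average the per-fold errors into one overall test error estimate.
cv_error = sum(fold_errors) / k
```

Here the model is fit only k = 5 times rather than the 50 times LOOCV would require on the same data.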

One major advantage to K-fold Cross Validation over Leave One Out Cross Validation

is that it requires considerably fewer computational resources.

Because rather than fitting the statistical model as many times as

the number of observations in your data set, you fit it

a substantially smaller number of times, typically fewer than 20.

Some statistical learning methods have computationally intensive fitting

procedures and data sets can have an extremely large number of observations.

This makes leave one out cross validation less feasible.

So, K-fold Cross Validation is a nice compromise between single data set

validation and leave one out cross validation.

In addition, K-fold Cross Validation often provides more

accurate estimates of the test error rate, than does leave one out cross validation.

Again, this has to do with the bias variance trade-off.

We know that the validation set approach can overestimate the test error rate,

because the training set will have only a proportion

of the number of observations in the full data set.

In the leave one out cross validation approach,

the training data set will have only one less observation than the full data set so

it can provide essentially unbiased estimates of the test error rate.

The leave one out cross validation approach is actually superior for

providing less biased estimates of the test error rate.

But bias is not the only thing we're concerned about.

We're also concerned about variance.

When it comes to having less variance in the test error rate,

the K-fold approach is superior to the leave one out cross validation approach.

Leave one out cross validation has higher variance

than does K-fold Cross Validation.

This is because in leave one out cross validation,

the training data sets of n-1 observations contain pretty much the same observations each time.

As a result, the estimates calculated in each cross-validation sample

will be highly correlated with each other, and

the mean of these highly correlated estimates will have greater variance.

With K-fold validation there's considerably less overlap

in the cross-validation samples,

which means less correlation between the cross-validation estimates.

And consequently, less variance.

For many statistical methods, cross validation is

easily conducted with procedures or functions that will do it automatically.

We just need to specify the type of cross validation.

7:08

If we decide to go with the k-fold cross validation approach,

then we have to specify the number of folds.

The number of folds can vary but

you will typically see k-fold cross validation with k=5 or k=10.

There is a bias-variance trade-off associated with the choice of how many

folds to specify in k-fold cross validation.

Using k=5 or k=10

has been found to estimate the test error rate with low bias and variance.

That is why these values of k are often used.
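To see the choice of k in action, the fold loop can be wrapped in a function that takes k as a parameter and run with both common values. As before, the standard-normal data and mean-only model are invented for illustration.

```python
import random

random.seed(1)
ys = [random.gauss(0, 1) for _ in range(100)]  # invented data

def kfold_error(values, k):
    """Average held-out squared error over k folds (mean-only model)."""
    size = len(values) // k
    errors = []
    for f in range(k):
        # Fold f is the validation set; the rest is the training set.
        val = values[f * size:(f + 1) * size]
        train = values[:f * size] + values[(f + 1) * size:]
        pred = sum(train) / len(train)
        errors.append(sum((y - pred) ** 2 for y in val) / len(val))
    return sum(errors) / k

# The two values of k you will typically see in practice.
err_5 = kfold_error(ys, 5)
err_10 = kfold_error(ys, 10)
```

With either choice, the model is fit only a handful of times, and both estimates land close to the true error of the mean-only model on this data.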