0:19

1,000,000 to 100,000,000 user-item pairs, where the items are movies that were rated by specific users.

And they break that up into a training data set and a test data set.

And remember that all of the model building and evaluation by the people building the model will happen on the training data, and then, at the very end of the competition, they would apply it to the test data set.

So one problem that comes up very quickly is that accuracy on the training set, what's called resubstitution accuracy, is often optimistic.

In other words, we're trying a bunch of different models and we're picking the best one on the training set, and that one will always be tuned a little bit to the quirks of that data set, and may not be an accurate representation of what the prediction accuracy would be on a new sample.

So a better estimate comes from an independent data set, in this case, say, the test set accuracy.

But there's a problem: if we keep using the test set to evaluate the out-of-sample accuracy, then, in a sense, the test set has become part of the training set, and we still don't have an outside, independent evaluation of the test set error.

So to estimate the test set accuracy, what we would like to do is use something about the training set to get a good estimate of what the test set accuracy will be, so that we can build our models entirely using the training set, and only evaluate them once on the test set, just like the study design calls for.

So the way that people do that is cross-validation.

So the idea is: you take your training set, just the training samples, and we sub-split that training set into a training set and a test set.

Then we build a model on the training set that's a subset of our original training set, and evaluate it on the test set that's, again, a subset of our original training set.

We repeat this over and over and average the estimated errors.

And that's something like estimating what's going to happen when we get a new test set.
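The sub-split-and-average loop just described can be sketched in code. This is a minimal Python illustration only (the course itself uses R and the caret package; `fit` and `error` here are placeholder callables, not any real library API):

```python
import random

def sub_split_error(xs, ys, fit, error, n_repeats=10, test_frac=0.3, seed=0):
    """Repeatedly sub-split the TRAINING data into a sub-training set and a
    sub-test set, fit on one, score on the other, and average the errors.
    The real held-out test set is never touched in this loop."""
    rng = random.Random(seed)
    n = len(xs)
    n_test = max(1, int(n * test_frac))
    errors = []
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        errors.append(error(model, [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return sum(errors) / len(errors)
```

Here `fit` returns any model object and `error` scores it on held-out points; averaging over the repeated splits is what turns one noisy holdout estimate into a more stable one.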

Â 2:05

So again, the idea is we take the training set, and we split the training set itself up into training and test sets over and over again, keep rebuilding our models, and pick the one that works best on the test set.

This is useful for picking variables to include in the model: we can fit a bunch of different models, with various different variables included, and use the one that fits best on these cross-validated test sets.

And then we can also pick the type of prediction function to use: we can try a bunch of different algorithms and, again, pick the one that does best on the cross-validation sets.

Or we can pick the parameters in the prediction function and estimate them.

Again, we can do all of this because, even though we're sub-splitting the training set into a training set and a test set, we actually leave the original test set completely alone; it's never used in this process.

And so when we apply our ultimate prediction algorithm to the test set, it'll still be an unbiased measurement of what the out-of-sample accuracy will be.
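That discipline, choosing among candidates by a cross-validated error computed only within the training set and touching the real test set exactly once at the end, fits in a few lines. A hypothetical Python sketch (`cv_error_of` and `evaluate` are assumed callables, not a real API):

```python
def pick_best_model(candidates, cv_error_of):
    """Select the candidate with the lowest cross-validated error.
    cv_error_of(model) must be computed entirely from sub-splits of the
    training set; the held-out test set plays no role in this choice."""
    return min(candidates, key=cv_error_of)

# The real test set is then used exactly once, AFTER the choice is made:
#   final_model = pick_best_model(models, cv_error_of)
#   final_error = evaluate(final_model, test_x, test_y)  # unbiased estimate
```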

So there are different ways that people can use training and test sets.

One example is random subsampling. So imagine that every observation we're trying to predict is arrayed out along this axis here, and the color represents whether we include it in the training set or the testing set. Again, these are only the training samples. What we might do is just take a subsample of them and call them the testing sample.

So in this case, the light gray bars here are all the samples that, in this particular iteration, we call test samples. Then we would build our predictor on all the dark gray samples.

And so, again, this is only within the training set: we take the dark gray samples and build a model, and then apply it to predict the light gray samples, and evaluate the accuracy.

We can do this for several different random subsamples. So this first row is one random sampling, the second row is a second random sampling, and the third row is a third random sampling. We do that over and over again, and then average what the errors will be.

Â 3:59

Another approach that's commonly used is what's called K-fold cross-validation. So the idea here is we break our data set up into k equal-size data sets.

So, for example, for three-fold cross-validation, this is what it would look like. Here's the first data set right here, then there's a middle data set right here, and the last, third data set right here. And so what we would do is, on the first fold, we would build a prediction model on this training data, and we would apply it to this test data.

Then we would train our model on just the dark gray components of the second fold, and apply it to the middle fold for evaluation.

Then, finally, we would do the same thing down here: we would build our model on this dark gray part, and apply it to the light gray part.

And again, we would average the errors that we got across all of those experiments, and we would get an estimate of the average error rate that we would get in an out-of-sample estimation procedure.

So again, all of this model building and evaluation is happening within the training set, which we've subdivided into a sub-training set and a sub-testing set to evaluate models.
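The fold bookkeeping for k-fold cross-validation can be sketched as follows. This is an illustration in Python, not the caret implementation; it uses contiguous folds to match the figure's three blocks:

```python
def kfold_splits(n, k):
    """Break indices 0..n-1 into k contiguous, roughly equal folds.
    Each fold serves once as the held-out (light gray) part while the
    remaining k-1 folds (dark gray) are used to build the model."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, folds[i]))
    return splits
```

For three-fold cross-validation on nine samples, this yields test folds [0..2], [3..5], and [6..8], each paired with the other six indices for training.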

Â 5:06

Another very common approach is called leave-one-out cross-validation. So here, we just leave out exactly one sample, and we build the prediction function on all the remaining samples.

And then we predict the value of the one sample that we left out. Then we leave out the next sample, build on all the remaining values, and predict the sample that we left out, and so forth, until we've done that for every single sample in our data set.

So again, this is another way to estimate the out-of-sample accuracy rate.
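Leave-one-out is just the k = n extreme of k-fold cross-validation. A minimal Python sketch:

```python
def leave_one_out_splits(n):
    """For each sample i, train on the other n-1 samples and predict i.
    This is k-fold cross-validation with k equal to the sample size."""
    return [([j for j in range(n) if j != i], i) for i in range(n)]
```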

Â 5:35

So, some considerations. First of all, for time series data, this doesn't work if you just randomly subsample the population; you actually have to use chunks. You have to get blocks of time that are all contiguous, and that's because one time point might depend on all the time points that came previously, and you're ignoring a huge, rich structure in the data if you just randomly take samples.
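One common way to honor that constraint is "forward chaining": always train on an initial contiguous block of time and test on the block that immediately follows. A Python sketch (the equal-block-size choice here is an arbitrary illustration):

```python
def forward_chain_splits(n, n_splits=3):
    """Time-series splits that keep time contiguous: train on everything
    up to a cutoff, test on the next block, so no future observation
    ever leaks into the training data."""
    block = n // (n_splits + 1)
    splits = []
    for i in range(1, n_splits + 1):
        cutoff = i * block
        splits.append((list(range(cutoff)),
                       list(range(cutoff, cutoff + block))))
    return splits
```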

For k-fold cross-validation, the larger the k that you take, the less bias but the more variance you'll get; and the smaller the k, the more bias but the less variance. In other words, if you took a very large k, say, for example, a ten-fold or a twenty-fold cross-validation, that means you'll get a very accurate estimate of the bias between your predicted values and your true values, because in each fold you're leaving only a few samples out, and so you're using most of your data to train your model.

6:16

But it'll be highly variable; in other words, it'll depend a lot on which random subsets you take. For smaller ks, we won't necessarily get as good an estimate of the out-of-sample error rate, because each model is trained on a smaller fraction of the data, but there'll be less variance. In the extreme case, if you have, for example, only two-fold cross-validation, there are only a very small number of subsets that can make up a two-fold cross-validation, and so you get less variance.

Here, the random sampling must be done without replacement; in other words, we're subsampling our data sets. That's of course a disadvantage, because it means that we have to break our training set up even further into smaller samples.

If you do random sampling with replacement, this is called the bootstrap. That's something that you may have learned about in your earlier inference classes, if you've taken those.

The bootstrap, in this particular example, will in general underestimate the error rate. And the reason why is that, if you do the bootstrap, you sample with replacement from your samples, so some samples will appear more than once. And when samples appear more than once, that means that if you get one right, you'll definitely get the other right. And so you actually get underestimates of the error rate. This can be corrected, but the way to do it is rather complicated: it's something called the 0.632 bootstrap, which is not exactly a great name for a method, but it sort of explains how you can account for the fact that you have this underestimate of the error rate.
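You can see the mechanism directly: when you resample n items with replacement, duplicates appear, and on average about a 1/e fraction (roughly 36.8%) of the original samples never get drawn at all; the complementary ~0.632 inclusion rate is where the correction's name comes from. A small Python illustration:

```python
import random

def bootstrap_sample(indices, seed=0):
    """Draw len(indices) items WITH replacement; duplicates are expected."""
    rng = random.Random(seed)
    return [rng.choice(indices) for _ in indices]

sample = bootstrap_sample(list(range(1000)))
unique_fraction = len(set(sample)) / 1000  # typically close to 0.632
```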

You can do any of these approaches when you fit models with the caret package, like we'll be learning in this class.

If you cross-validate to pick predictors, you again must estimate the errors on an independent data set in order to get a true out-of-sample value. So in other words, if you do cross-validation to pick your model, the cross-validated error rates, since you always picked the best model, will not necessarily be a good representation of what the real out-of-sample error rate is. And the best way to do that is, again, by applying your prediction function just one time to an independent test set.

