0:00

In the last lecture we learned about in-sample and out-of-sample errors, as well as overfitting. In this lecture we'll talk about prediction study design, or how to minimize the problems that can be caused by in-sample versus out-of-sample errors.

So in prediction study design, the first thing that we need to do is to define our error rate. In this lecture we're just going to use a generic error rate, but in the next lecture we'll talk about the different possible error rates you can choose.

0:21

Then you need to split your data, the data that you're going to be using to build and validate your model, into three components; well, two components and one optional component. There's a training set that must be created in order to build your model, a testing set to evaluate your model, and optionally a validation set as well, which is also going to be used to validate your model.

So what you do on the training set is pick features, using cross-validation, for example. We'll talk about what cross-validation is later in the course, but the idea is basically to use the training set to pick which of the features are most important in your model. Then you use that same technique to actually pick the prediction function and estimate all the parameters that you might be interested in. We build a model using the training set.

If there's no validation set, then we apply the best model that we have to our test set exactly one time. Why do we only apply it one time? If we applied multiple models to our testing set and picked the best one, then we're using the test set, in some sense, to train the model. In other words, we're still getting an optimistic view of what the error would be on a completely new dataset. So you should apply the prediction model to the test set exactly one time.

1:31

If there's a validation set and a test set, then you might apply your best prediction models all to your test set and refine them a little bit. What you might find is that some features don't work so well when you're doing out-of-sample prediction, and you might refine and adjust your model a little bit. But now, again, like I said, your test set error is going to be a slightly optimistic estimate of what your actual out-of-sample error will be. And so what we do is apply only the best model, exactly one time, to the validation set to get our prediction.

So in other words, the idea is that there is one dataset that's held out from the very start, that you apply exactly one model to, exactly one time, and that you never use for any training or tuning or testing; that will give you a good estimate of your out-of-sample error rate.
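The discipline described here, a held-out set that is scored exactly once, can be sketched in plain Python. This is my own illustration, not anything from the lecture; the scoring function passed in is a stand-in for a final model's error-rate computation:

```python
class HoldoutSet:
    """Wraps a held-out dataset so it can be scored exactly once."""

    def __init__(self, data):
        self._data = data
        self._used = False

    def evaluate(self, score_fn):
        # Refuse a second evaluation: repeated scoring would let the
        # hold-out set leak into model selection.
        if self._used:
            raise RuntimeError("Hold-out set may be scored only once.")
        self._used = True
        return score_fn(self._data)

# Illustrative use: in practice score_fn would compute the final
# model's error on the held-out observations.
holdout = HoldoutSet([1, 2, 3])
print(holdout.evaluate(len))  # 3
# A second call to holdout.evaluate(...) raises RuntimeError.
```

The point of the guard is purely procedural: once the hold-out set has been used, any further peeking turns it into a training set.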

2:15

An important point to keep in mind when you're doing this is to know what the benchmarks are. This is an example of a leaderboard from a [UNKNOWN] competition. They often give you benchmarks, marked in grey here on this plot, that show what happens if you, say, make all the predicted values equal to 0. It's a good idea to know what the prediction benchmarks are, because if your prediction algorithm is performing way better than it should be, or way worse than it should be, the benchmarks will give you some idea of what you might be doing wrong, and they'll tell you if your prediction algorithm is going astray.

Â 2:49

So this is the studied design that was used in the Netflix prize, so they had

Â a 100 million ident-user-ident pairs, in other words,

Â movies that people had given their preference on.

Â They split that into a training data set, which they were going

Â to give to, users that we're going to build models for prediction.

Â Then, they held out, a bunch of ratings that we're not going

Â to be shown at all to the people that are building the models.

Â And so, what they provided for people who were building motels,

Â was a training set, and what they called a probe data set.

Â So, you would train your model on the training data

Â set and then you would apply it to the probe

Â data set to get some idea of what the out

Â of sample error would be, before submitting it to Netflix.

Â Then what Netflix would do is they would take your

Â predictions and they would apply it only to a quiz set.

Â So this quiz set you didn't get at get to actually see, and you couldn't build your

Â model on, but if we give you some better

Â idea of how well your model would perform at

Â a sample, but in general people could submit multiple

Â submissions to this quiz set and so they might

Â actually tune their models a little bit and get

Â a little bit better performance on the quiz set.

Â And so what they did ultimately was for the

Â final evaluation of all the different teams, they applied the

Â model just one time to this test set that

Â was held out to the very end of the competition.

Â And so at the very end of the competition

Â they got an unbiased estimate of how well this model

Â would work on a completely new set of data that

Â the participants in the competition never had a look at.

Â 4:15

So this idea was actually very important in their study design and it

Â actually turned out that some teams that did better on the quiz set, actually

Â didn't do quite as well on the test set and that was because they

Â were tuning in their models or over fitting their models to the quiz set.

Â So this is an important take home message for

Â you, in the sense that if you're building prediction

Â models, you always have to hold one data set,

Â and leave it completely aside while you're building your models.

Â This is now used by group professionals so

Â Kaggle is a company that actually runs lots

Â of co, competitions for prediction competitions for a

Â whole bunch of different data sets provided by companies.

Â And they always do a similar model in the sense that they always have a leader board

Â that consists of the predictions that people have

Â submitted for a validation data set that they have.

Â But it's not necessarily the data set that they'll

Â use to validate their methods at the very end.

Â That data set is held out until the very end, and each

Â person gets to apply their algorithm to that data set only one time.

Â 5:09

Something to keep in mind is that when you're splitting your

Â data sets up into training, testing and validation sets, they can

Â get a little bit small, but you need to avoid small

Â sample sizes, particularly if you're dealing with the test set size.

Â And the reason why is, suppose you were predicting a binary outcome, so in my

Â case, a very common thing to try to do is to predict diseased versus healthy.

Â And in general, it might be something like, whether people will

Â click on an ad, or whether they won't click on an ad.

Â Then one classifier is just flipping a coin.

Â You could always just flip a coin, and say, they'll be diseased if

Â the coin is heads, and not diseased if the coin comes out tails.

Â And so the probability of a perfect classification, using this

Â really silly algorithm is one half raised to the sample size.

Â In other words, half the time you'll be

Â right, just by chance by flipping the coin.

Â And each time, supposing each prediction is independent,

Â then each time that you flip a coin, then

Â you'll get one half times that, a number will

Â be the decrease in accuracy that you would get.

Â So if you were pr-, test set has only one sample in

Â it, then you have about a 50/50 chance of getting that sample right.

Â So, even if you got prediction accuracy of 100% on the test set,

Â you would have a 50% chance of that, even with a coin flip.

Â With n equals 2, you only have a, you

Â still only have a 25% chance of 100% accuracy.

Â And with n equals 10 in your test set, now

Â you have, only about a .1% chance of getting 100% accuracy.

Â So if you see that 100% accuracy you'll feel a little bit

Â more confident that it's actually true and it's not something that's just random.

Â So this suggests that we should make sure

Â that especially our test sizes are of relatively

Â large size so we can be sure that

Â we're not just getting good prediction accuracy by chance.
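The arithmetic above can be checked with a few lines of plain Python (my own sketch, not code from the course), computing the chance that a fair-coin classifier gets a test set of size n entirely right:

```python
def perfect_by_chance(n):
    # Probability that n independent fair-coin predictions are all
    # correct: (1/2) ** n.
    return 0.5 ** n

for n in [1, 2, 10]:
    print(f"n = {n:2d}: {perfect_by_chance(n):.4%}")
```

For n = 1 this gives 50%, for n = 2 it gives 25%, and for n = 10 it gives about 0.098%, matching the roughly 0.1% figure quoted above.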

6:54

So, some rules of thumb. These are by no means set in stone, but they are reasonable rules of thumb that I've used, and I think a lot of people have used similar ones. When you get a new dataset, if it's large enough, you'll set 60% of your dataset to be training, 20% to be testing, and 20% to be validation. This is again assuming that your test and validation datasets won't be too small if you do that. If you have a medium sample size, what you might do is take 60% of your dataset to be training and 40% to be testing. This means you don't get to refine your models on a test set and then apply them to a validation set, but it might ensure that your testing set is of sufficient size. Finally, if you have a very small sample size, first of all you might reconsider whether you have enough samples to build a prediction algorithm in the first place. But suppose you're dead set on building a prediction or machine learning algorithm; then the idea might be to do cross-validation and report the caveat of the small sample size, and the fact that you never got to evaluate the predictions on an out-of-sample or testing dataset.
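The 60/20/20 rule of thumb can be sketched as a simple random split in plain Python; the function, its defaults, and the seed are my own illustrative choices, not part of the lecture:

```python
import random

def split_indices(n, train=0.6, test=0.2, seed=42):
    """Randomly partition indices 0..n-1 into train/test/validation.

    The remaining fraction (1 - train - test) goes to validation.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    n_train = int(n * train)
    n_test = int(n * test)
    return (idx[:n_train],
            idx[n_train:n_train + n_test],
            idx[n_train + n_test:])

train_idx, test_idx, val_idx = split_indices(100)
print(len(train_idx), len(test_idx), len(val_idx))  # 60 20 20
```

For the medium-sample 60/40 variant you would pass `test=0.4`, leaving the validation set empty.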

7:55

Some principles to remember: the test set, or the validation set, should be set aside and never looked at while building your model. In other words, you need to have one dataset to which you apply only one model, only one time, and that dataset should be completely independent of anything you use to build the prediction model.

In general you want to randomly sample the training and test sets, and what "random" means may depend on the type of sampling that you want to do. For example, if you have time course data, in other words data collected over time, you might want to build your training set in chunks of time, but again, random chunks of time. Your datasets must reflect the structure of the problem. In other words, if your data have sources of dependence over time or across space, you need to sample your data in chunks. This is called backtesting in finance, and it's basically the idea that you want to use chunks of data that consist of consecutive observations over time.
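The chunked sampling described above might look something like this minimal sketch in plain Python; the chunk size is an illustrative choice, and the function is my own, not anything from the lecture:

```python
def time_chunks(n, chunk_size):
    """Split time-ordered indices 0..n-1 into consecutive chunks,
    preserving temporal order within each chunk."""
    return [list(range(start, min(start + chunk_size, n)))
            for start in range(0, n, chunk_size)]

# 12 time points in chunks of 4: whole chunks, rather than shuffled
# individual points, would then be assigned to training or testing,
# so temporal dependence stays within a chunk.
chunks = time_chunks(12, 4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
```

Randomly assigning whole chunks, instead of individual observations, to training and testing is what keeps the split honest when nearby observations are correlated.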

All subsets should reflect as much diversity as possible. Random assignment does this; you might also try balancing by features. That can be a little bit tricky, but it is often a useful idea.
