0:00

This lecture is about in-sample and out-of-sample errors. This is one of the most fundamental concepts that we deal with in machine learning and prediction, so it's worth understanding it with a very simple example.

In-sample error is the error you get on the same data you used to train your predictor. This is sometimes called resubstitution error in the machine learning literature. In-sample error is always going to be a little bit optimistic compared to the error you would get on a new sample. The reason is that, in your specific sample, your prediction algorithm will sometimes tune itself a little bit to the noise in that particular data set. So when you get a new data set, there will be different noise, and the accuracy will go down a little bit.

So what we do is look at the out-of-sample error rate; this is sometimes called the generalization error in machine learning. The idea is that once we build a model on a sample of data we have collected, we might want to test it on a new sample, one collected by a different person or at a different time, in order to get a realistic expectation of how well that machine learning algorithm will perform on new data. So almost always, out-of-sample error is what you care about.

If you see an error rate that was reported only on the data the machine learning algorithm was built with, you know that's very optimistic, and it probably won't reflect how the model will perform in real practice. In-sample error is always less than out-of-sample error, so that's something to keep in mind.

And the reason is overfitting. Basically, you're matching your algorithm to the data you have at hand, and you're matching it a little bit too well. So sometimes you want to give up a little bit of accuracy on the sample you have in order to gain accuracy on new data sets. In other words, when the noise is a little bit different, your algorithm will still be robust.
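The idea above can be sketched in a few lines. This is not from the lecture: it's a minimal Python sketch with synthetic data, where a threshold classifier is tuned to maximize training accuracy and then evaluated on a fresh sample, so you can see in-sample versus out-of-sample accuracy side by side.

```python
import random

random.seed(42)

# Toy data: the label is 1 when the feature is positive, but we observe it with noise.
def make_data(n):
    rows = []
    for _ in range(n):
        x = random.gauss(0, 1)
        y = 1 if x + random.gauss(0, 0.5) > 0 else 0
        rows.append((x, y))
    return rows

train, new_sample = make_data(100), make_data(100)

def accuracy(cut, rows):
    # Classify as 1 whenever the feature exceeds the cutoff.
    return sum((1 if x > cut else 0) == y for x, y in rows) / len(rows)

# "Train" by picking the cutoff that maximizes accuracy on the training data,
# i.e. tuning the predictor to this particular sample (noise included).
cuts = sorted(x for x, _ in train)
best = max(cuts, key=lambda c: accuracy(c, train))

in_sample = accuracy(best, train)           # resubstitution accuracy
out_of_sample = accuracy(best, new_sample)  # accuracy on a fresh sample
print(f"in-sample accuracy:     {in_sample:.2f}")
print(f"out-of-sample accuracy: {out_of_sample:.2f}")
```

Because the cutoff was chosen to look as good as possible on the training data, the in-sample number is typically the optimistic one.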

Â 1:42

So just to show you a really simple example, I thought I'd illustrate in-sample versus out-of-sample errors with a kind of trivial example. Here's what I've done: I've again gone to the kernlab package and looked at the spam data set. Remember, that was the data set where we collected information about spam messages, messages from robots and things like that, and ham messages, the messages we actually care about. What I do is take a very small sample of that spam data set, just ten messages, and look at whether you see a lot of capital letters. So I'm looking at the average number of capital letters that you observe in a particular email. I've plotted the first ten examples here versus their index. In red are the spam messages, and in black are the ham messages. You can see, for example, that some of the spam messages, like this one up here, have a lot more capital letters than the ham messages. That sort of makes sense intuitively.

So we might want to build a predictor, based on the average number of capital letters, for whether you are a spam message or a ham message. One thing we could do is build a predictor that says if you have a lot of capitals then you're a spam message, and if you don't then you're a non-spam message. Here's what that rule could look like: if your capital average is above 2.7, we're going to call you spam, and if you're below 2.40, you're classified as non-spam. And then one thing we can do is try to train this algorithm very, very well, so that it predicts perfectly on this data set.

If we go back to this plot of the different values, you can see there's one spam message right down here in the lower right-hand corner that has a slightly lower capital average than the highest non-spam value. So we could build a prediction algorithm that captures that spam value as well. What we would do is add a rule that picks out that one value in the training set: if you're between 2.40 and 2.45, you're called spam as well. That's designed to make the training set accuracy perfect. And you can see that if we apply this rule to the training set, we do get perfect accuracy: if you're non-spam, we perfectly classify you as non-spam, and if you're spam, we perfectly classify you as spam.
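The lecture's code is in R using the kernlab spam data; as a stand-in, here is a Python sketch of the "complicated" rule with ten made-up capital-average values, including one quirky spam message in the 2.40-2.45 window, so you can see how the carved-out window buys perfect training accuracy.

```python
# Made-up training set: (capital_ave, label) pairs standing in for the
# ten messages sampled from the kernlab spam data in the lecture.
train = [
    (0.8, "nonspam"), (1.1, "nonspam"), (2.3, "nonspam"),
    (1.9, "nonspam"), (2.38, "nonspam"),
    (4.1, "spam"), (3.3, "spam"), (6.0, "spam"),
    (2.43, "spam"), (5.2, "spam"),   # 2.43 is the quirky low-capital spam message
]

def rule1(capital_ave):
    """Complicated rule: the 2.40-2.45 window is carved out specifically
    to catch the one quirky spam message in the training set."""
    if capital_ave > 2.7:
        return "spam"
    if capital_ave < 2.40:
        return "nonspam"
    if capital_ave <= 2.45:
        return "spam"
    return "nonspam"

train_accuracy = sum(rule1(x) == label for x, label in train) / len(train)
print(train_accuracy)  # 1.0: perfect on the training set
```

Every training message, including the quirky one, is classified correctly, which is exactly the "perfect training accuracy" the lecture describes.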

Â 4:02

An alternative rule would not train quite so tightly to the training set, but would still use the basic principle that if you have a high number of capital letters, then you're a spam message. That rule might look something like this: if your capital average is above 2.80, you're a spam message; if you're less than or equal to 2.80, then you're a non-spam message. This rule on the training set would miss that one value. In other words, we would predict non-spam for that one spam message that had a slightly lower capital average in our training set. So overall, on the training set, the accuracy is a little bit lower for this rule, and it's a little bit more simplistic.
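Again as a Python stand-in for the lecture's R code (same made-up values as before): the simplified rule uses a single cutoff, placed above the quirky spam value, so that one training message is deliberately missed and training accuracy drops slightly below perfect.

```python
# Same made-up training set as before.
train = [
    (0.8, "nonspam"), (1.1, "nonspam"), (2.3, "nonspam"),
    (1.9, "nonspam"), (2.38, "nonspam"),
    (4.1, "spam"), (3.3, "spam"), (6.0, "spam"),
    (2.43, "spam"), (5.2, "spam"),
]

def rule2(capital_ave):
    # One threshold, no special case for the quirky spam message.
    return "spam" if capital_ave > 2.80 else "nonspam"

train_accuracy = sum(rule2(x) == label for x, label in train) / len(train)
print(train_accuracy)  # 0.9: the spam message at 2.43 is misclassified
```

Nine out of ten training messages are right; the one quirky spam value is the single miss, just as the lecture describes.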

So then we can apply both rules to all the spam data, in other words to all the values, not just the ones in the small training set, and these are the results we get. This is a table with our predictions in the rows and the actual values in the columns. The errors we make are the off-diagonal elements of this little matrix. So we can look at the number of times we're right using the more complicated rule: that's the sum of the times our prediction equals the actual value in the spam data set, which happens 3,366 times. We can also look at the more simplified rule, the one that just uses a single threshold, and count the number of times it matches the real spam type. And you can see that we get the right answer about 30 more times when we use the more simplified rule.
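The lecture runs this comparison on the full kernlab spam data; the Python sketch below uses a made-up "full data set" instead, constructed so that a few new non-spam messages happen to land in the complicated rule's special 2.40-2.45 window. That is exactly the kind of new noise that flips the ranking: the carved-out window now causes errors that the single-threshold rule avoids.

```python
from collections import Counter

# Hypothetical stand-in for "all the spam data" (made-up values, not kernlab's):
# mostly clear-cut messages, plus three new non-spam messages whose capital
# average falls in rule1's special 2.40-2.45 window.
full = (
    [(0.5 + 0.01 * i, "nonspam") for i in range(100)]
    + [(3.0 + 0.05 * i, "spam") for i in range(100)]
    + [(2.41, "nonspam"), (2.42, "nonspam"), (2.44, "nonspam")]
)

def rule1(x):  # complicated rule, tuned to the training-set quirk
    if x > 2.7:
        return "spam"
    if x < 2.40:
        return "nonspam"
    return "spam" if x <= 2.45 else "nonspam"

def rule2(x):  # simplified rule, single cutoff above the quirk window
    return "spam" if x > 2.80 else "nonspam"

correct = {}
for name, rule in [("rule1", rule1), ("rule2", rule2)]:
    # Keys are (prediction, actual); the off-diagonal cells are the errors.
    table = Counter((rule(x), label) for x, label in full)
    correct[name] = table[("nonspam", "nonspam")] + table[("spam", "spam")]
    print(name, dict(table))

print(correct)  # the simplified rule gets more messages right
```

On this data, rule1's special window misclassifies the three new non-spam messages as spam, so the simpler rule2 comes out ahead, mirroring the roughly 30-message gap the lecture reports on the real data set.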

So what's the reason the simplified rule does better than the more complicated rule? The reason is overfitting. In every data set we have two parts: the signal, the part we're trying to use to predict, and the noise, the random variation we get because the data are measured noisily. The goal of a predictor is to find the signal and ignore the noise. In any small data set, you can always build a perfect in-sample predictor, just like we did with that spam data set; you can always carve up the prediction space to capture every single quirk of that data set. But when you do that, you capture both the signal and the noise. For example, in that training set there was one spam value that had a slightly lower capital average than some of the non-spam values. But that was just because we randomly picked a data set where that was true, where that value was low. So that predictor won't necessarily perform as well on new samples, because we've tuned it too tightly to the observed training set.

So this lecture has two purposes. One is to introduce the idea of in-sample and out-of-sample errors: in-sample errors are errors on the training set we actually built the predictor with, and out-of-sample errors are errors on a data set that wasn't used to build the predictor. The other is to introduce the idea of overfitting: we want to build models that are simple and robust enough that they don't capture the noise, while they do capture all of the signal.
