
When we build a supervised learning model, whether it's for classification or regression, it's not only important for the model to make correct predictions on the examples it saw in the training data. If that were the only definition of success, then supervised learning algorithms could simply memorize all of the training data and always have 100 percent accuracy. Of course, that's not what happens in the real world. What we actually want from a supervised machine learning algorithm is the ability to predict a class or target value correctly on a test set of future examples that haven't been seen before. This ability to perform well on a held-out test set is the algorithm's ability to generalize.

How do we know that our trained supervised model will generalize well? In other words, perform well on an unseen test set. Well, machine learning typically makes an important assumption: that the future test set has the same properties as, or more technically, is drawn from the same underlying distribution as, the training set. This means that if we observe that our model has good accuracy on the training set, and the training set and test set are similar, we can also expect that it will do well on the test set.
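As a minimal sketch of this idea (my own illustration, not from the lecture), we can hold out a test set and compare training and test accuracy; the synthetic dataset and choice of k-NN model here are illustrative assumptions:

```python
# Sketch (assumed setup): estimating generalization with a held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real training set (illustrative choice).
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print('Train accuracy:', clf.score(X_train, y_train))
print('Test accuracy:', clf.score(X_test, y_test))
```

If the two accuracies are close, the model appears to be generalizing; a large gap between them is the warning sign discussed next.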

Unfortunately, sometimes this doesn't happen, due to an important problem known as overfitting. Informally, overfitting typically occurs when we try to fit a complex model with an inadequate amount of training data. An overfitting model uses its ability to capture complex patterns to become great at predicting lots and lots of specific data samples or areas of local variation in the training set. But it often misses seeing global patterns in the training set that would help it generalize well on the unseen test set. It can't see these more global patterns because, intuitively, there's not enough data to constrain the model to respect these global trends. As a result, the training set accuracy is a hopelessly optimistic indicator of the likely test set accuracy if the model is overfitting. Understanding, detecting, and avoiding overfitting is perhaps the most important aspect of applying supervised machine learning algorithms that you should master.

There are serious real-world consequences to overfitting. The reading for this week contains one example from the medical world you might find interesting. To understand the phenomenon of overfitting better, let's look at a few visual examples.

The first example that we'll look at for overfitting involves regression. In this chart, on the x-axis we have a single input variable that might be, for example, the size of a piece of property, and we have a target variable on the y-axis. For example, this might be the market selling price of a house that sits on that piece of property. So in regression, we're looking to find the relationship between the input variable and the target variable, and we start off with the idea of a model that we want to fit that might explain this relationship. One example of a model we could try to fit would be a linear model: predicting that there's a linear relationship between the input variable and the target variable. So, the regression model might fit a straight line to these points. In this case, the model has underfit the data. The model is too simple for the actual trends that are present in the data; it doesn't even do well on the training points. These blue points represent the training points, the input to the training process for the regression. When we underfit, we have a model that's too simple, doesn't even do well on the training data, and thus is not at all likely to generalize well to test data.

On the other hand, if we pick a better model, for example the idea that this might be a quadratic relationship, we might actually be able to apply that and get what looks like a much better fit. This would be an example of a reasonably good model fit: we've captured the general trend of the points while also ignoring the little variations that might exist due to noise.

A third example might be that we hypothesize the relationship between the input variable and the target variable is a function of several different parameters, let's say a higher-degree polynomial, so something that's very bumpy. If we try to fit this more complex model to the same set of training data, we might end up with something that looks like this. This more complex model has more parameters, so it's able to capture more subtle behavior. But it also has much higher variance, as you can see, so it's more focused on capturing the local variations in the training data rather than trying to find the more global trend that we can see as humans in the data. So, this is an example of overfitting. In this case, there's not enough data to constrain the parameters of the model enough for it to recognize the global trend.
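The three regression fits described above can be sketched in code (this is my own illustration; the quadratic data-generating function and the specific degrees are assumptions, not from the lecture):

```python
# Sketch (assumed setup): under-, good, and over-fitting with polynomial regression.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(30, 1))
# Illustrative quadratic trend plus noise, standing in for the house-price data.
y = 2 * X[:, 0] ** 2 + rng.normal(scale=0.1, size=30)

for degree in (1, 2, 15):  # underfit, reasonable fit, likely overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    print(degree, model.score(X, y))  # R^2 on the training data
```

Training R^2 can only go up as the degree grows, which is exactly why training accuracy alone is a hopelessly optimistic indicator: the degree-15 fit chases noise that a test set would expose.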

The second example looks at overfitting in classification. This diagram shows a simple two-dimensional classification problem where each data instance is represented by a point, and where there are two features associated with each data instance. This is a binary classification problem, so points are labeled either red, over here, or blue. The problem in classification is to find a decision boundary. As we saw with k-nearest neighbors, we want to take this feature space and find regions that are associated with the different labels. One type of classifier that we'll look at shortly is called the linear classifier, where we try to find a decision boundary that's essentially a straight line through the space.

Just as in regression, we can have underfitting models, good-fitting models, and overfitting models. An example of an underfitting model would be something that really didn't look much at the data, maybe something like this that only looked at one feature, for example. Again, this is an overly simplistic model. It doesn't even capture the division between the classes in the training data well, and so would be unlikely to generalize; this is an underfitting model.

A reasonably good model that fits well would be a linear model that finds this general difference between the positive class over here and the negative class over here. You can see that it's aware of the global pattern of having most of the blue negative points in the upper left and most of the red positive points more toward the lower right. And it's robust in the sense that it ignores the fact that there may occasionally be a blue point in the red region or a red point in the blue region. Instead, it's found this more global separation between the two classes. So, this is a reasonably well-fitting model.

An overfitting model, on the other hand, would typically be a model that has lots of parameters, so it can capture complex behavior. It would try to find something very clever, where it tried to completely separate the red points from the blue points in a way that resulted in a highly variable decision boundary. This has the questionable advantage that it captures the training data classes very well; it predicts the training data classes almost perfectly. But as you can see, if the actual division between the classes is better captured by the linear model, the overfit model is going to make lots of mistakes in the regions where it's trying to be too perfect, in some sense. You'll see this typically with overfitting: the overfit model is highly variable. Again, it's trying to capture too many of the local fluctuations and doesn't have enough data to see the global trend that would result in better overall generalization performance on new, unseen data.
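As an illustration of this contrast (my own sketch, not from the lecture), we can compare a linear classifier against a very flexible one on noisy data; the choice of an unpruned decision tree as the "overfitting" stand-in and the dataset parameters are assumptions:

```python
# Sketch (assumed setup): a simple linear classifier vs a very flexible one
# on data with label noise, comparing train vs test accuracy.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, like the stray red/blue points in the diagram.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LogisticRegression().fit(X_tr, y_tr)
flexible = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)  # unpruned

for name, clf in [('linear', linear), ('flexible', flexible)]:
    print(name, 'train:', clf.score(X_tr, y_tr), 'test:', clf.score(X_te, y_te))
```

The flexible model typically scores near-perfectly on the training set by carving out the noisy points, yet the simpler linear model often matches or beats it on the test set.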

This third example shows the effect of modifying the K parameter in the k-nearest neighbors classifier that we saw previously. The three plots shown here show the decision boundaries for K=10, K=5, and K=1, and here we're using the fruit dataset again, with the height of a piece of fruit on the x-axis and the width on the y-axis. In the K=10 case, as you recall, K=10 means that for each query point that we want to get a prediction for, let's say over here, we have to look at the 10 nearest points to the query point. We won't go through all 10 here, but let's just say there are several nearby, and we'll average or combine their votes to predict the label for this point. So in the K=10 case, we need the votes from 10 different data instances in the training set to make our prediction.

As we reduce K, say to K=5, we only need five neighbors to make a prediction. For example, if the query point was here, we'd look at these four and then possibly whatever was closest, let's say that one or this one. And finally, the K=1 case is the most unstable, in the sense that for any query point, we only look at the single nearest neighbor to that point.

And so, the effect of reducing K in the k-nearest neighbors classifier is to increase the variance of the decision boundaries, because the decision boundary can be affected by outliers. A point that's far, far away has a much greater effect on the decision boundary in the K=1 case than it would in the K=10 case, when the votes of nine other neighbors are also needed. So you can see that by adjusting the value of K for the k-nearest neighbors classifier, we can control, in some sense, the degree of model fitting that's appropriate for a dataset.

Now, the actual value of K that works best can only be determined by evaluating on a test set. But the general idea is that as we decrease K for k-NN classifiers, we increase the risk of overfitting because, for the reasons I mentioned before, when K=1 for example, we're trying to capture very local changes in the decision boundary that may not lead to good generalization behavior for future data.
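This K-versus-fit trade-off can be sketched as follows (my own illustration; the synthetic dataset stands in for the fruit data, and the label-noise level is an assumed parameter):

```python
# Sketch (assumed setup): how K affects k-NN fit on noisy data.
# K=1 typically fits the training set perfectly but generalizes worse.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 10):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f'K={k}  train={clf.score(X_tr, y_tr):.2f}  '
          f'test={clf.score(X_te, y_te):.2f}')
```

With K=1, each training point is its own nearest neighbor, so training accuracy is essentially perfect; the interesting column is the test accuracy, which is what choosing K on a held-out set optimizes.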
