In this lecture, we're going to introduce the concept of overfitting, and we're going to expand the discussion of training and test sets from the previous course. Okay, so to introduce the idea of overfitting, consider the following setting. We have some data points drawn on an x- and y-axis here, and we'd like to ask: which model provides the best possible fit to this data? What kind of model is going to make the best predictions, if you like? So thinking, as we have been, about linear models, we might say, okay, a line is a reasonably good fit. It approximately follows the general trend, deviating from the data points by just a little bit, so that's a fairly good fit. Well, I might argue that a better fit to this data would be something sort of crazy, like a 20th-degree polynomial, all right? So I can draw some polynomial curve here that fits my data points almost exactly. And if I choose a polynomial of high enough degree, I can certainly fit as many data points as I want perfectly. So we would say this polynomial function has a lower error, and therefore maybe it's a better fit than the line. Okay, so that seems like an incorrect conclusion. The polynomial function might be the best fit to the data in terms of the mean-squared error we introduced, but our intuition tells us that data like this probably is not really following a 20th-degree polynomial. It seems like too complex a solution. Maybe the line wasn't perfect and we need something a little more complex, but we don't need something as wild and as strongly deviating as a 20th-degree polynomial. Okay, but why is the complex fit not the right choice? After all, if we fit this more complex function, if we throw in more random features, if we fit high-degree polynomials, we are reducing our mean-squared error, and we are getting an R squared statistic closer to one.
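To make this concrete, here is a small sketch (using synthetic, made-up data) showing that the training mean-squared error keeps shrinking as the polynomial degree grows, even though the underlying relationship is just a noisy line:

```python
import numpy as np

# Hypothetical illustration: noisy data whose true relationship is linear.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 2.0 * x + 1.0 + rng.normal(scale=0.2, size=x.shape)

# Training MSE shrinks as the polynomial degree grows, even though
# the extra flexibility is only chasing the noise.
for degree in [1, 5, 15]:
    coeffs = np.polyfit(x, y, deg=degree)
    preds = np.polyval(coeffs, x)
    mse = np.mean((y - preds) ** 2)
    print(f"degree {degree:2d}: training MSE = {mse:.4f}")
```

The degrees, sample size, and noise level here are arbitrary choices for illustration; the pattern of decreasing training error holds regardless.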
So we're getting closer and closer to perfect prediction by adding these high-degree polynomial terms or random features to our model. So why is it not the right thing to do? Well, the basic reason is that we should be evaluating the mean-squared error, or the R squared statistic, or any other evaluation metric we might introduce, on data that was not used to train the model. Another way of putting that is that even though our polynomial curve followed the data very closely, we somehow don't expect that when we observe new data, it will really lie on that polynomial curve. It will probably be closer to the line, or to some kind of simpler function. So rather than just computing the mean-squared error or R squared statistic on the training data, we should select a good model by finding one that generalizes well to new data. That is, which model do we actually expect will work on data that was not used to train it? To do this, we're going to split our data set into a training portion and a testing portion. The testing portion is going to simulate observing new data: we'll train the model using the first portion, and we'll test the model using the second portion. Thinking about that in terms of something like our linear regression equations, we have this matrix X and our vector of labels y. We can split them up just by taking some fraction of the rows of X as our training features, along with the corresponding values of y as our training labels. The remaining rows of X and the remaining values of y would be our test set, and we'll give examples of how that really works in code later on. So all we have to do is the following. We split our data into a training and a testing component, and the training set is used to tune the model parameters.
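A minimal sketch of that row-wise split might look like the following (the 80/20 split fraction and the data here are arbitrary, made-up choices):

```python
import numpy as np

# Synthetic data: a feature matrix X and label vector y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # 100 examples, 3 features (made up)
y = rng.normal(size=100)

# Shuffle the row indices so both portions are random samples,
# then take the first 80% of rows for training, the rest for testing.
perm = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = perm[:split], perm[split:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
```

Libraries such as scikit-learn provide helpers for this (e.g. `train_test_split`), but the underlying operation is just this kind of index slicing.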
In other words, the training set is used to select the parameter values theta, based on the training features X and the training labels y, and then we evaluate that model on unseen data. So we use the test portion to ask: how well does that value of theta actually work if we apply it to data that was not used to select theta? And that gives us an estimate of the model's generalization performance, its ability to make predictions on new data. Okay, so some guidelines for how we should select training and test sets. The first cardinal rule is that the training and test sets should be non-overlapping samples of the data. You should not be training on any data you're using for evaluation later on, or you will not accurately estimate the model's generalization ability. Each of them should also be a random sample of the data. So it would be dangerous, say, to use a lot of old data as the training set and newer data as the test set, because the newer data could have different characteristics compared to the old data. You're better off taking random, similar-looking samples of the data to build the training and the test set: if we'd like to measure the generalization ability of the model, we'd like to be evaluating on data that is different from, but follows the same general distribution as, the training data. Okay, and there are a few rules of thumb we might consider when choosing the sizes of our training versus our test sets. The first is that the training set should be large enough that we don't overfit too badly. For example, if we had a model with 50 features or 50 parameters, then we might want an order of magnitude more training examples, say 500 or so, to ensure that we're not overfitting to a very limited amount of data. Secondly, to balance against this, we'd also like a reasonably large test set. In other words, we'd like a test set that captures the range and variance that our data exhibits.
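The fit-on-train, evaluate-on-test workflow described above can be sketched as follows. This is a minimal illustration on synthetic data, using an ordinary least-squares solve for theta (the bias column, noise level, and split sizes are all assumptions for the example):

```python
import numpy as np

# Synthetic data: y depends linearly on one feature, plus noise.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.uniform(0, 1, 100)])  # bias + feature
theta_true = np.array([1.0, 2.0])
y = X @ theta_true + rng.normal(scale=0.2, size=100)

# Non-overlapping split: first 80 rows train, last 20 rows test.
X_train, y_train = X[:80], y[:80]
X_test, y_test = X[80:], y[80:]

# theta is selected using ONLY the training features and labels.
theta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Evaluate that same theta on the unseen test portion.
train_mse = np.mean((X_train @ theta - y_train) ** 2)
test_mse = np.mean((X_test @ theta - y_test) ** 2)
print(f"train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

The test MSE here is the quantity we treat as an estimate of generalization performance; the training MSE alone would tell us nothing about how theta behaves on new data.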
If we have only a very small number of instances in our test set, they could be outliers that don't really accurately measure the model's overall generalization ability. Finally, there might be other constraints, like running time: if a model takes a very long time to train, we can't train it on a huge data set, and that might limit the size of the training set we can select. Okay, so that's a quick introduction to the concept of training and test sets, and to this idea of a model's generalization ability, which we might use as a more robust measure of a model's performance than the mean-squared error on the training set. In particular, we showed how a training and test set can be used to estimate a model's generalization ability, and later on, we'll illustrate these same concepts with some code.