So in this lecture, we're just going to give a brief introduction to the content of course three, which is all about evaluating the machine learning systems we've developed so far. A few topics will come up when we try to evaluate regression and classification models like the ones we've built. First, how do we decide whether a model is a good model or a bad model? And the notion of what "good" means may change depending on the context we care about. Second, and critically, if we're trying to decide among several models which of them is better, or whether some change we made to a model actually resulted in an improvement, how can we perform that kind of comparison? Finally, there's the issue of generalization ability: if we develop a model that works well on the data we used to train it, how can we be sure it will still work well once we see new data that may differ somehow from the data we've already seen?

To elaborate on each of those points: in terms of evaluating models, how do we actually go about evaluating a regression or classification system? Should we consider the average error the model makes, or the average squared error? If it's a classifier, should we consider the accuracy, or the number of false positives? There are all these alternatives we could use to decide which model is best, so which one is the right one in different situations, and what's the motivation behind these different choices? (A small sketch of how these quantities are computed appears just after this passage.)

Secondly, how do we decide whether a model is good, and under what circumstances does the definition of good performance change? Consider a few examples: a classifier that recognizes fingerprints, versus one that detects pedestrians, versus one that determines whether there's a weapon or something dangerous in luggage at the airport. The notion of what makes a model good versus bad might change across those settings in ways that I'll describe later. And given each of these settings, or the different targets we might have for what it means for a model to be good, can we design a classifier that specifically targets one of those situations?

Next, suppose we'd like to compare the performance of different models. Say we're given two classifiers, or two regressors, and we'd like to decide which of them is better; how do we go about doing this? If we've trained the two models on some data, is it enough to just ask which of the two classifiers is more accurate, or which of the regressors has lower error? For example, suppose we're making design choices about our model: if we use a one-hot encoding to represent time, should we do so at the level of weeks or at the level of months? Using a finer granularity will probably give a model with lower error or higher accuracy on our training set, but does that actually mean the model is better, and how can we perform that kind of comparison correctly? That brings us to the last idea: if we have a model that works well on the data we used to train it, can we be sure it will work well on data we haven't seen yet?
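Just to make those metric alternatives concrete, here is a minimal sketch, in Python with NumPy, of how each quantity mentioned above could be computed. The arrays of predictions and labels are made-up placeholder values, not data from the course.

```python
import numpy as np

# Hypothetical regression targets and predictions (placeholder numbers).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mean_abs_error = np.mean(np.abs(y_true - y_pred))      # average error magnitude
mean_sq_error = np.mean((y_true - y_pred) ** 2)        # average squared error

# Hypothetical binary classification labels and predictions (placeholders).
labels = np.array([1, 0, 1, 1, 0, 0])
preds = np.array([1, 0, 0, 1, 1, 0])

accuracy = np.mean(labels == preds)                        # fraction predicted correctly
false_positives = np.sum((preds == 1) & (labels == 0))     # predicted positive, actually negative

print(f"MAE={mean_abs_error:.3f}  MSE={mean_sq_error:.3f}")
print(f"accuracy={accuracy:.3f}  false positives={false_positives}")
```

Each of these numbers summarizes the same model in a different way, which is exactly why the choice among them matters in different situations.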
So going back to the example from the previous slide, if I increase the granularity of my one-hot encoding, we'd expect that with higher and higher levels of granularity we get a more and more complex model that is more and more capable of fitting the data we've already seen. But does that really make it better? And if it doesn't, how can we develop a training regime that corrects for that? That's a quick summary of the issues we'll face, and in the rest of the course we'll try to develop solutions to these problems by extending the regression and classification models we've already developed.
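To preview the kind of experiment we'll use to study this, here is a minimal sketch, assuming NumPy and scikit-learn and entirely synthetic data, that fits a linear regressor using a monthly versus a weekly one-hot encoding of time and compares the error on the training data with the error on held-out data. The data and the specific encodings are placeholders to illustrate the idea, not anything from the course.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: 730 "days" whose target depends only on the month, plus noise.
days = np.arange(730)
months = (days // 30) % 12
weeks = (days // 7) % 52
y = np.sin(2 * np.pi * months / 12) + rng.normal(scale=0.3, size=days.size)

def one_hot(codes, n_categories):
    # Build a one-hot matrix with one column per category.
    out = np.zeros((codes.size, n_categories))
    out[np.arange(codes.size), codes] = 1.0
    return out

for name, X in [("monthly one-hot", one_hot(months, 12)),
                ("weekly one-hot", one_hot(weeks, 52))]:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    err_tr = np.mean((model.predict(X_tr) - y_tr) ** 2)
    err_te = np.mean((model.predict(X_te) - y_te) ** 2)
    print(f"{name}: training MSE={err_tr:.3f}, held-out MSE={err_te:.3f}")
```

Typically the finer, weekly encoding shows a lower training error simply because it has more parameters with which to fit the noise, while the held-out error does not improve in the same way. Separating the data we train on from the data we evaluate on is exactly the kind of training regime the rest of the course will develop.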