Okay. In this lecture, having discussed the different evaluation schemes we might use for regression models, we're going to talk about one of the most common evaluation measures, which is called the mean squared error. We're going to introduce some probability to motivate the choice of this particular error measure. So the purpose of this lecture is, on the one hand, just to introduce one of the error measures that is commonly used, and, as a more advanced topic, to give the mathematical motivation behind it. Okay. So this is the basic question we're trying to answer: if we have some regression model, how should we go about evaluating it? We might have a picture that looks something like this. We have some data; in this case, it's height versus weight measurements. We've tried to fit a line that approximately fits that data, and of course, it doesn't fit exactly. There's some error, some deviation between the data points and the line of best fit. So how do we measure that error, or how do we come up with an evaluation metric that says whether this line is good or bad? The error, literally speaking, is just going to mean the difference between the correct value and the prediction the model made. That would be given by an equation like this: y_i, the label of the ith data point, minus X_i.Theta, the value of our model's prediction. Okay. So how do we then characterize the error across the entire model? Well, one of the most common error measures is what's called the mean squared error. The first equation here is just the standard mathematical notation for the mean squared error; the bottom equation is a simpler way of writing the same thing.
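As a minimal sketch, the mean squared error described above can be computed directly; the labels, features, and parameter values below are made up purely for illustration.

```python
import numpy as np

# Hypothetical data: labels y, features X (with a bias column), and fitted parameters theta
y = np.array([152.0, 158.0, 171.0, 179.0])                   # labels (e.g., weights)
X = np.array([[1, 60.0], [1, 65.0], [1, 70.0], [1, 75.0]])   # features (e.g., heights)
theta = np.array([30.0, 2.0])                                # model parameters

errors = y - X @ theta          # y_i - X_i . theta for each data point
mse = np.mean(errors ** 2)      # square each error, then average
print(mse)                      # -> 2.5
```

The errors here are [2, -2, 1, -1], so the mean of their squares is (4 + 4 + 1 + 1) / 4 = 2.5.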
So all that's going on here is that we're summing over all of the predictions made by our model, from i equals one to N, taking the ith label, comparing it to the model's prediction, squaring that value, and averaging the results. So we're averaging how large the squared deviation of the model's prediction is from the labels. Okay. I've said this is a commonly used error measure. Why is that? Why would we use the mean squared error and not something else, like the mean absolute error? The mean absolute error, for example, might seem like a very obvious way to measure the error of the model: I'm just counting how large those errors are in absolute value, the difference between the label and the model's prediction. Instead, we are squaring those values. So why do we do that? It might seem unnatural in a way. If we used the mean absolute error, we would be saying: if I observe a small error, I incur a small penalty; if I observe a large error, I incur a large penalty. We're not really saying that. By taking the mean squared error, we're squaring that penalty. We're saying: if I observe a small error, I incur a small penalty, and if I observe a large error, I square it, so I incur a really huge penalty. Okay. So what's going on here? Well, we've said that our label is equal to our prediction plus some error of the model. So it might be useful to characterize what kinds of errors are likely and what kinds are unlikely. When I said on the previous slide that we are giving very large penalties to large errors and small penalties to small errors, what I really mean is that large errors must be much, much rarer than small errors. Small errors are going to be common; large errors are going to be increasingly less common. To make that a bit more formal, we can think about probability distributions. We can draw a plot like this which shows how large my error is; it can be positive or negative.
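To see the difference in penalties concretely, here is a small sketch comparing the two error measures on made-up error values. Both sets of errors have the same total absolute deviation, but squaring makes the single large error dominate.

```python
import numpy as np

# Two made-up sets of errors with the same total absolute deviation
errors_spread = np.array([2.0, 2.0, 2.0, 2.0])   # many small errors
errors_outlier = np.array([0.0, 0.0, 0.0, 8.0])  # one large error

for errors in (errors_spread, errors_outlier):
    mae = np.mean(np.abs(errors))   # mean absolute error: linear penalty
    mse = np.mean(errors ** 2)      # mean squared error: quadratic penalty
    print(f"MAE = {mae:.1f}, MSE = {mse:.1f}")
```

Both sets have a mean absolute error of 2.0, but the mean squared error jumps from 4.0 to 16.0, because squaring makes the one large error count far more heavily.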
So I'm just plotting on the X axis the value of my label minus my prediction. On the Y axis, I can say how common an error of that magnitude is. What I've said here is that small errors should be very common and large errors should be uncommon. So what that really means is that the error is going to follow some shape like this. I don't know if I've drawn this shape perfectly, but that's basically what's called a bell curve or a normal distribution. It's saying small errors are common, large errors are uncommon. So a particular position on this plot gives the error for one data point, and the corresponding value of this distribution is the probability of observing an error that large. Okay. Then this equation down the bottom is just writing that down in terms of what's called a normal distribution. This is a common distribution that characterizes this bell-curve shape. Okay. So when I make the statement that my errors might approximately follow a bell curve, where small errors are common and large errors are uncommon, what I'm really saying is that the error distribution follows a Gaussian, or normal, or bell-curve shape. I can write that down as this equation down the bottom. It says my label is equal to my prediction, which is X_i.Theta, plus some error which is generated by this Gaussian distribution. Okay. What really is a Gaussian distribution? I drew this bell curve on the previous slide, but in mathematical terms, it corresponds to the following equation. So for the errors to follow a Gaussian distribution, I can write down this probabilistic expression, which says: what is the probability of a label given some particular features? It's given by this fairly complex and ugly-looking equation that really just corresponds to the shape of a bell curve. But you might also already see how this corresponds a little bit to the mean squared error function.
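As a sketch of that bell-curve equation, here is the zero-mean Gaussian density written out in code. Evaluating it at a small error gives a much larger value than at a large error, matching the picture above; the choice of sigma = 1 is just an illustrative assumption.

```python
import math

def normal_pdf(error, sigma=1.0):
    """Zero-mean Gaussian density: small errors are likely, large errors are rare."""
    return math.exp(-error ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(normal_pdf(0.5))   # small error -> relatively high density (about 0.35)
print(normal_pdf(3.0))   # large error -> much lower density (about 0.004)
```

Because the error appears squared inside the exponential, the density falls off very quickly as the error grows, which is exactly the "large errors are very rare" assumption.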
So if we look at the top right of this equation, we see a term that looks like our error squared. Okay. We're taking the product over all of these probabilities to get the probability of observing a sequence of errors across our entire data set. All right. So if we think about choosing the model that would maximize that probability, what that would really mean is selecting the model such that the errors we observe are most consistent with that bell-curve shape. We should choose the value of theta such that we observe lots of small errors and very few large errors. So we can think about maximizing this expression at the top of the slide, and since we only care about maximizing it, some of the constants from that equation disappear. The second trick here is that maximizing a product of terms is equivalent to maximizing the sum of the logs of those terms. So we can convert this to a summation, and that causes the exponential component of the function to disappear. Maximizing is equivalent to minimizing the negative of the same expression, and there you can see the equation I'm ultimately left with is just minimizing the mean squared error. Okay. So what I've really shown in this lecture is the relationship between the mean squared error and probability distributions. I've said that if we make this assumption that small errors are common and large errors are very uncommon, or that the error distribution, if you want to call it that, follows a bell-curve shape, then we will arrive at the mean squared error: having errors that follow a bell-curve distribution corresponds to a specific expression, and maximizing the probability of that expression corresponds to minimizing the mean squared error. Okay.
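The derivation above can be checked numerically. Under the Gaussian assumption, the negative log-likelihood is the sum of squared errors divided by 2 sigma squared, plus a constant, so both criteria must pick the same theta. This is a small sketch on synthetic data; the true slope 3.0, the noise scale 0.5, and the grid of candidate thetas are all made-up choices for illustration.

```python
import numpy as np

# Synthetic 1-D regression data (made-up slope and noise level)
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 3.0 * x + rng.normal(scale=0.5, size=50)

thetas = np.linspace(2.0, 4.0, 201)   # candidate parameter values
sigma = 0.5

def neg_log_likelihood(theta):
    # Negative log of the product of Gaussian densities over all data points
    errors = y - theta * x
    return np.sum(errors ** 2 / (2 * sigma ** 2) + 0.5 * np.log(2 * np.pi * sigma ** 2))

def mse(theta):
    return np.mean((y - theta * x) ** 2)

best_by_nll = thetas[np.argmin([neg_log_likelihood(t) for t in thetas])]
best_by_mse = thetas[np.argmin([mse(t) for t in thetas])]
print(best_by_nll == best_by_mse)   # -> True: both criteria select the same theta
```

The two objectives differ only by a positive scaling and an additive constant, so they induce exactly the same ordering over candidate models, which is why minimizing one is equivalent to minimizing the other.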
So I know this is a fairly advanced lecture, but really, at a high level, the concept we're trying to illustrate is the relationship between the likelihood of a model and the error measure: we can write down our regression problem in terms of probability distributions, in terms of bell curves that describe likely versus unlikely errors. If we can do that, we can ask: what is the most likely model given this distribution of errors? That gives us the idea of minimizing an error, specifically minimizing the mean squared error. Okay. So, long story short, that's why the mean squared error, which we'll use throughout this course, is used so commonly, as opposed to alternatives like the mean absolute error. On your own, you might want to take some of the models we've developed so far and compute things like the mean squared error, the mean absolute error, and related statistics, and see how these things relate to each other. Under what kinds of circumstances will you have a high mean squared error and a low mean absolute error, or vice versa?
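As a starting point for that exercise, here is a sketch with two hypothetical sets of predictions: a model with consistent moderate errors, and a model that is nearly perfect except for a single outlier. The labels and predictions are invented for illustration.

```python
import numpy as np

y = np.zeros(10)                 # hypothetical labels

pred_a = np.full(10, 2.0)        # Model A: consistent moderate errors everywhere
pred_b = np.zeros(10)
pred_b[0] = 10.0                 # Model B: near-perfect, but one large outlier

for name, pred in (("A", pred_a), ("B", pred_b)):
    err = y - pred
    print(f"Model {name}: MAE = {np.mean(np.abs(err)):.1f}, MSE = {np.mean(err ** 2):.1f}")
```

Model A comes out with MAE 2.0 and MSE 4.0, while Model B has the lower MAE of 1.0 but the higher MSE of 10.0: a single large error is enough to make the mean squared error worse, even when most predictions are perfect.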