All right. In this lecture, we're going to introduce the concept of model complexity, along with some different notions of it, and show how we can use these notions to encourage our model to favor simpler over more complex solutions. We'll also build on a concept introduced previously: overfitting. We talked before about how it can be dangerous to choose models that are extremely complex, like very high degree polynomials, over simpler models. What we're really trying to do in this lecture is introduce some concepts that let us say: if I want to fit simpler models rather than more complex ones, is there some measurement I can make that actually says how complex a particular solution is, so that maybe I can penalize it? Reminding ourselves about overfitting: when we have a model that works well on training data but doesn't generalize to new data, that's what we call overfitting, and that's what we try to avoid. We'd like to penalize model complexity in order to avoid overfitting. What can we do to achieve this in practice? What we really need is some cost function that says how complex a particular model choice is. Then we would like to train a model that balances two goals. For one, the model should be reasonably accurate; in other words, it should have a low mean squared error on the data I used to train it. But secondly, the model should not be too complex. Going back to the example of a very high degree polynomial versus a straight line, or a linear function: the high degree polynomial will have high accuracy but very high complexity, whereas a linear function will have reasonable accuracy and very low complexity. So we should balance these two notions against each other in the right way.
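To make that balance concrete, here's a minimal sketch in Python of what such a combined cost function might look like. The data, the trade-off weight `lam`, and the particular complexity function are all hypothetical placeholders, since the lecture hasn't yet chosen a concrete complexity measure.

```python
import numpy as np

def mse(theta, X, y):
    """Mean squared error of a linear model X @ theta against targets y."""
    return float(np.mean((X @ theta - y) ** 2))

def total_cost(theta, X, y, complexity, lam):
    """Accuracy term plus lam times a complexity term.

    `complexity` and `lam` are placeholders; the lecture fills in
    concrete complexity measures later.
    """
    return mse(theta, X, y) + lam * complexity(theta)

# tiny illustrative example: a perfect fit (zero error) still pays
# a complexity price for its non-zero parameters
X = np.eye(2)
y = np.array([1.0, 2.0])
theta = np.array([1.0, 2.0])
cost = total_cost(theta, X, y, complexity=lambda t: np.sum(t ** 2), lam=0.1)
# mse is 0, complexity is 1 + 4 = 5, so cost = 0 + 0.1 * 5 = 0.5
```

The point of the sketch is simply that the trained model now has to justify every bit of complexity it uses with a corresponding reduction in error.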
We might favor a linear model, or something close to a linear model, over something more complex like a polynomial function. The real question then is, how do we come up with a measurement that says whether a particular model is simple or complex? Can we take some model, like a linear function or a high degree polynomial, and write down some equation which assigns a score to those two models and helps us decide which of the two is actually simpler? We're thinking about linear models for most of these lectures. In other words, we have a model that looks like the following: we have some feature vector, which in this case might be a high degree polynomial, so a constant term, a linear term, a quadratic term, a cubic term, and so on and so forth. Each of our features is associated with some parameter, one of our Thetas. So a measure of complexity somehow should come from looking at the values of each of these Theta coefficients. If we think about what a model corresponding to a high degree polynomial might look like, the values for each of our Theta coefficients might be given by this plot on the bottom left. Here, we have the various indices, i, and the corresponding values for Theta_i on the y-axis, and because this is a high degree polynomial, many of those terms would be non-zero: a non-zero value for the offset, a non-zero value for the linear term, a non-zero value for the quadratic term, and so on and so forth. Alternatively, if we just had a linear function, then our coefficients would look like this: the plot on the bottom left now shows an offset term, so a non-zero value for Theta_0, and a linear term, so a non-zero value for Theta_1, and all the other terms would be zero. That's what a linear function would look like.
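As a small illustration, we might sketch the two kinds of coefficient vectors like this; the specific Theta values are made up for illustration and aren't taken from the lecture's plots.

```python
import numpy as np

def poly_features(x, degree):
    # feature vector [1, x, x^2, ..., x^degree]
    return np.array([x ** d for d in range(degree + 1)])

# illustrative coefficients: a degree-5 polynomial has many non-zero Thetas...
theta_poly = np.array([0.5, -2.0, 3.1, -1.7, 0.9, -0.2])
# ...while a linear function keeps only Theta_0 and Theta_1 non-zero
theta_linear = np.array([0.5, -2.0, 0.0, 0.0, 0.0, 0.0])

# both models use the same feature vector; they differ only in which
# coefficients are non-zero
y_poly = poly_features(2.0, 5) @ theta_poly
y_linear = poly_features(2.0, 5) @ theta_linear
```

Viewed this way, both models live in the same family; the linear one is just the special case where all the higher-order coefficients are zero.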
So in other words, if we would like to say the high degree polynomial function is more complex than the linear function, one way of doing it would just be to say, well, one of those two functions has more non-zero coefficients associated with it. So we can just count the number of non-zero parameters in Theta. That might be our naive or obvious definition of model complexity, and it's one definition of model simplicity. Another commonly used definition might look at the overall magnitude of the model parameters. Not only does the model on the left, corresponding to the high degree polynomial, have many parameters that are non-zero, but the values of those parameters are also very large: we have very large coefficients associated with many of the terms in this polynomial. We might say that's more complex than a model which has very small coefficients associated with all of those terms. So there are two notions of model complexity that we might use. One notion says a simple model has few non-zero parameters. Another notion says a simple model has parameters that are small and centered around zero. Given those two notions, how can we actually write down an equation that precisely measures them, and then penalize these different notions of complexity? I'll just give away the answer now, which hopefully is reasonably easy to read, and then I'll give a rough argument as to why these notions apply. If we would like to choose models that have a small number of non-zero parameters, our notion of complexity might be the sum of absolute values of those parameters. Whereas if we would like a model whose parameters are quite small, the notion of complexity that we might use is what you'd normally call the magnitude of that vector: the sum of the squares of all the values, taking the square root of that term. Or, equivalently, you could ignore the square root and say we'd just like to penalize the sum of squares of all the values.
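These measures are easy to compute directly. Here's a short sketch, using a made-up parameter vector, of the non-zero count, the sum of absolute values, and the sum of squares with and without the square root:

```python
import numpy as np

theta = np.array([2.0, -3.0, 0.0, 1.0])  # hypothetical parameter vector

num_nonzero = int(np.count_nonzero(theta))   # 3 non-zero parameters
l1_norm = float(np.sum(np.abs(theta)))       # |2| + |-3| + |0| + |1| = 6
sum_of_squares = float(np.sum(theta ** 2))   # 4 + 9 + 0 + 1 = 14
l2_norm = float(np.sqrt(sum_of_squares))     # sqrt(14); dropping the square
                                             # root doesn't change the ordering
                                             # of models, only the scale
```

Dropping the square root is harmless because if one model has a smaller sum of squares than another, it also has a smaller square root of that sum.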
Those two measures are called the l_1 and l_2 norms, if you refer back to some of the notation given at the beginning of the course. Those are the two functions we might try to penalize: we would say we want small values for these two functions, so a small sum of absolute values or a small sum of squared values, if we would like to encourage one of these two notions of simplicity. Being a little bit hand-wavy, why do these two different functions correspond to different notions of complexity? Consider a very simple example like this one: a simple linear model with two features, where we're trying to estimate height based on whether someone's gender is female and whether somebody has long hair. This is a model with two features, both of which are fairly well correlated with height, but more critically, the two features are highly correlated with each other. They're basically encoding, roughly speaking, two copies of the same information. In this case, we have two alternative ways we could come up with an equivalent model. One would be to take these two parameters, Theta_1 and Theta_2, and balance the weight between them; that's the model on the left. Another would be to say that one of those features might be slightly more predictive than the other, so maybe gender is more predictive than hair length; we'll just take that feature alone and ignore the other one, and that's the model on the right. Now, we can compare those models in terms of their l_2 and l_1 norms. If we consider the sum of squared values, the model on the left has a much smaller sum of squared values than the model on the right. If you think about taking two values and assigning weight between them, the sum of squares is going to be lowest when those values are approximately equal to each other, and it's going to be very large when you have one big and one small value.
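We can check this arithmetic in a couple of lines. The specific weights below are hypothetical stand-ins for the two models in the example: 5 and 5 for the balanced model, 10 and 0 for the single-feature model.

```python
import numpy as np

# hypothetical weights: model 1 splits the weight across the two
# correlated features; model 2 puts everything on one feature
model_1 = np.array([5.0, 5.0])
model_2 = np.array([10.0, 0.0])

sum_sq_1 = float(np.sum(model_1 ** 2))   # 25 + 25 = 50
sum_sq_2 = float(np.sum(model_2 ** 2))   # 100 + 0 = 100: twice as large
l1_1 = float(np.sum(np.abs(model_1)))    # 5 + 5 = 10
l1_2 = float(np.sum(np.abs(model_2)))    # 10 + 0 = 10: the l_1 norms tie
```

So under the sum-of-squares measure the balanced model is clearly cheaper, while under the sum-of-absolute-values measure the two are indistinguishable.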
On the other hand, the l_1 norms, or the sums of absolute values, of those two models are equivalent. What that means is that if you're penalizing the sum of squared values, you're going to favor the model on the left, because it's cheap or simple in terms of its sum of squared values. But if you're penalizing the sum of absolute values, you might favor the model on the right, just because it makes slightly more accurate predictions: if you have one feature which is slightly better than another, then you're best off placing all of your weight on that feature and ignoring the other feature completely. So that's a very rough argument for what's actually quite a deep and difficult result. To cut a long story short, if we penalize the l_1 norm, or sum of absolute values, we'll probably end up selecting something like model 2, whereas if we penalize the l_2 norm, we'll select something like model 1. Again, just to roughly summarize, penalizing the l_1 norm is going to lead to sparse parameter vectors with few non-zero terms, while penalizing the l_2 norm is going to give us a model whose weights are spread approximately uniformly across the features. There's no real right answer saying one of those two notions is better than the other; which is the better definition of complexity or simplicity really depends on the specific application and which of those notions works better in terms of generalization performance. In the following lecture, we'll see how we can actually take those notions of model complexity and incorporate them into a model. So in this lecture, we introduced two different measures for quantifying and penalizing model complexity in machine learning models, and we explained the consequences of these different choices. On your own, I would suggest taking one of the models you've already developed in a previous lecture in this class and computing the complexities of those models according to these l_1 and l_2 measures.
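As a starting point for that exercise, here's a sketch of the whole story in one toy experiment: a brute-force grid search over the two Thetas for a pair of highly correlated features, once with a sum-of-squares penalty and once with a sum-of-absolute-values penalty. The data, penalty weight, and grid are all invented for illustration, and real libraries solve these problems far more efficiently than this exhaustive search.

```python
import numpy as np

# a toy version of the lecture's scenario: the second feature is a
# slightly noisy copy of the first, so the two are highly correlated
x1 = np.array([1.0, 1.0, -1.0, -1.0])
z = np.array([1.0, -1.0, 1.0, -1.0])      # noise orthogonal to x1
X = np.column_stack([x1, x1 + 0.1 * z])
y = 2.0 * x1

def fit(penalty, lam=0.5):
    # brute-force search for the (Theta_1, Theta_2) pair minimizing
    # mean squared error plus lam times the given complexity penalty
    grid = np.linspace(-3.0, 3.0, 121)
    best_theta, best_cost = None, np.inf
    for t1 in grid:
        for t2 in grid:
            theta = np.array([t1, t2])
            cost = np.mean((X @ theta - y) ** 2) + lam * penalty(theta)
            if cost < best_cost:
                best_theta, best_cost = theta, cost
    return best_theta

theta_l2 = fit(lambda t: np.sum(t ** 2))     # splits the weight evenly
theta_l1 = fit(lambda t: np.sum(np.abs(t)))  # zeroes out the noisier feature
```

Running this, the l_2-penalized fit spreads the weight roughly evenly across both features, while the l_1-penalized fit drives the second coefficient to zero, which is exactly the sparse-versus-uniform behavior described above.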