0:00

In this video, I'd like to convey to you the main intuitions behind how regularization works, and we'll also write down the cost function that we'll use when we're using regularization. With the hand-drawn examples that we have on these slides, I think I'll be able to convey part of the intuition. But an even better way to see for yourself how regularization works is if you implement it and see it work for yourself. If you do the programming exercises after this, you get a chance to see regularization in action for yourself. So, here's the intuition.

In the previous video, we saw that if we were to fit a quadratic function to this data, it would give us a pretty good fit. Whereas if we were to fit an overly high-order polynomial, we'd end up with a curve that may fit the training set very well, but that overfits the data and doesn't generalize well. Consider the following: suppose we were to penalize the parameters theta3 and theta4 and make them really small. Here's what I mean.

Here's our optimization objective, an optimization problem where we minimize our usual squared-error cost function. Let's say I take this objective and modify it by adding plus 1000 times theta3 squared, plus 1000 times theta4 squared. 1000 I'm just writing down as some huge number. Now, if we were to minimize this function, the only way to make this new cost function small is if theta3 and theta4 are small, right? Because otherwise, if you have 1000 times theta3 squared, this new cost function is going to be big. So when we minimize this new function, we're going to end up with theta3 close to zero, and theta4 close to zero, and that's as if we're getting rid of those two terms over there. And if we do that, if theta3 and theta4 are close to zero, then we're basically left with a quadratic function. So we end up with a fit to the data that's a quadratic function, plus maybe tiny contributions from the small terms theta3 and theta4, which may be very close to zero. [COUGH] And so we end up with, essentially, a quadratic function, which is good, because this is a much better hypothesis.
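The effect described above can be checked numerically. This is just a sketch, not the course's code: the data, the 4th-order feature set, and the closed-form solve are all my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
# Roughly quadratic data (coefficients 1, 2, 3) with a little noise
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.1, size=x.size)

# Design matrix for a 4th-order polynomial: columns 1, x, x^2, x^3, x^4
X = np.vander(x, 5, increasing=True)
m = x.size

# Minimizing (1/(2m)) * ||X @ theta - y||^2 + 1000*theta3^2 + 1000*theta4^2
# is still a quadratic problem, so we can set the gradient to zero and solve.
P = np.diag([0.0, 0.0, 0.0, 1.0, 1.0])  # picks out theta3 and theta4
theta = np.linalg.solve(X.T @ X / m + 2000.0 * P, X.T @ y / m)

print(theta)  # theta[3] and theta[4] are driven essentially to zero
```

Because the penalized objective is still quadratic in theta, the linear solve finds its exact minimizer; the huge penalty leaves theta3 and theta4 near zero while the first three coefficients recover the quadratic.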

In this particular example, we looked at the effect of penalizing two of the parameter values being large. More generally, here's the idea behind regularization. The idea is that having small values for the parameters will usually correspond to having a simpler hypothesis. So in our last example, we penalized just theta3 and theta4, and when both of these were close to zero, we wound up with a much simpler hypothesis that was essentially a quadratic function. But more broadly, if we penalize all the parameters, we can think of that as trying to give us a simpler hypothesis as well, because when these parameters are close to zero, in this example that gave us a quadratic function. More generally, it's possible to show that having smaller values of the parameters usually corresponds to smoother functions as well, which are simpler and therefore less prone to overfitting. I realize that the reasoning for why having all the parameters be small gives us a simpler hypothesis may not be entirely clear to you right now, and it is kind of hard to explain unless you implement it yourself and see it for yourself. But I hope that the example of having theta3 and theta4 be small, and how that gave us a simpler hypothesis, helps explain why, or at least gives some intuition as to why, this might be true.

Let's look at a specific example. For housing price prediction, we may have the hundred features that we talked about, where maybe x1 is the size, x2 is the number of bedrooms, x3 is the number of floors, and so on. And unlike the polynomial example, we don't know that theta3 and theta4 are the high-order polynomial terms. If we have just a set of 100 features, it's hard to pick in advance which ones are less likely to be relevant. So we have 100, or 101, parameters, and we don't know which parameters to pick to try to shrink. So in regularization, what we're going to do is take our cost function, here's my cost function for linear regression, and modify it to shrink all of my parameters, because I don't know which one or two to try to shrink. I'm going to modify my cost function to add a term at the end, like so, and then we add square brackets here as well: we're going to add an extra regularization term at the end, to shrink every single parameter. And so this term will tend to shrink all of my parameters theta1, theta2, theta3, up through theta100.

By the way, by convention, the summation here starts from one, so I'm not actually going to penalize theta0 being large. That's sort of a convention, that the sum is from i equals 1 through n, rather than i equals 0 through n. But in practice it makes very little difference: whether you include theta0 or not, it'll make very little difference to the results. By convention, though, we usually regularize only theta1 through theta100. Writing down the regularized optimization objective, or regularized cost function, again, here it is: here is J of theta, where this term on the right is a regularization term, and lambda here is called the regularization parameter.
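From the description here, the regularized cost adds lambda times the sum of the squared parameters (from j = 1 through n) inside the usual squared-error objective. A minimal sketch of that cost in code, with variable names (X, y, theta, lam) of my own choosing:

```python
# A sketch of the regularized linear-regression cost described here.
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/(2m)) * [sum of squared errors + lam * sum_{j=1..n} theta_j^2].

    Note the regularization sum starts at j = 1: by convention,
    theta[0] (the intercept term) is not penalized.
    """
    m = y.size
    errors = X @ theta - y
    reg = lam * np.sum(theta[1:] ** 2)  # skips theta[0]
    return (errors @ errors + reg) / (2 * m)

# Tiny check: theta = [0, 1] fits this data exactly, so only the
# penalty term contributes: J = lam * 1**2 / (2 * 2) = 0.25 for lam = 1.
X = np.array([[1.0, 1.0], [1.0, 2.0]])
y = np.array([1.0, 2.0])
theta = np.array([0.0, 1.0])
print(regularized_cost(theta, X, y, lam=1.0))  # 0.25
```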

Â 6:23

And what lambda does is control a tradeoff between two different goals. The first goal, captured by the first term in the objective, is that we would like to fit the training data well, we would like to fit the training set well. And the second goal is that we want to keep the parameters small, and that's captured by the second term, by the regularization term. What lambda, the regularization parameter, does is control the tradeoff between these two goals: between the goal of fitting the training set well, and the goal of keeping the parameters small and therefore keeping the hypothesis relatively simple, to avoid overfitting. Now consider our housing price prediction example.

Whereas previously, if we fit a very high-order polynomial, we may have wound up with a very wavy or curvy function like this. If you still fit a high-order polynomial with all the polynomial features in there, but you just make sure to use this sort of regularized objective, then what you can get out is in fact a curve that isn't quite a quadratic function, but is much smoother and much simpler, maybe a curve like the magenta line that gives a much better hypothesis for this data. Once again, I realize it can be a bit difficult to see why shrinking the parameters can have this effect, but if you implement this algorithm yourself with regularization, you will be able to see this effect firsthand.

In regularized linear regression, if the regularization parameter lambda is set to be very large, then what will happen is that we will end up penalizing the parameters theta1, theta2, theta3, theta4 very highly. That is, if our hypothesis is this one down at the bottom, and we penalize theta1, theta2, theta3, theta4 very heavily, then we end up with all of these parameters close to zero: theta1 will be close to zero, theta2 will be close to zero, and theta3 and theta4 will end up being close to zero. And if we do that, it's as if we're getting rid of these terms in the hypothesis, so that we're just left with a hypothesis that says housing prices are simply equal to theta0. That amounts to fitting a flat horizontal straight line to the data, and this is an example of underfitting. In particular, this hypothesis, this straight line, just fails to fit the training set well. It's just a flat straight line, it doesn't go anywhere near most of our training examples. Another way of saying this is that this hypothesis has too strong a preconception, or too high a bias, that housing prices are just equal to theta0, and despite the clear data to the contrary, it chooses to fit just a flat horizontal line to the data. I didn't draw that very well. So for regularization to work well, some care should be taken to choose a good value for the regularization parameter lambda as well. When we talk about model selection later in this course, we will talk about a variety of ways for automatically choosing the regularization parameter lambda.
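The effect of lambda at both extremes can be sketched numerically. This is my own illustration, with invented data and a direct normal-equation solve rather than anything from the lecture: with lambda equal to zero the high-order fit is free to overfit, while with a huge lambda every penalized parameter is driven toward zero, leaving the flat, underfit line described above.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.size)  # made-up "housing" data

X = np.vander(x, 6, increasing=True)  # 5th-order polynomial features
E = np.eye(6)
E[0, 0] = 0.0  # by convention, theta0 is not penalized

def fit(lam):
    # Minimizer of (1/(2m)) * [||X @ theta - y||^2 + lam * sum_{j>=1} theta_j^2]
    return np.linalg.solve(X.T @ X + lam * E, X.T @ y)

wavy = fit(0.0)   # no regularization: free to overfit the training set
flat = fit(1e6)   # lambda very large: every penalized parameter ~ 0

print(flat)  # roughly [average of y, 0, 0, 0, 0, 0] -- an underfit flat line
```

With the huge lambda, the intercept theta0 ends up near the mean of y, which is exactly the flat horizontal line, while the unregularized fit achieves the lower training error of the two.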

So, that's the idea behind regularization, and the cost function we'll use in order to use regularization. In the next two videos, let's take these ideas and apply them to linear regression and to logistic regression, so that we can then get them to avoid overfitting problems.
