In this video, I'd like to convey to you the main intuitions behind how regularization works, and we'll also write down the cost function that we'll use when we're using regularization. With the hand-drawn examples that we have on these slides, I think I'll be able to convey part of the intuition. But an even better way to see for yourself how regularization works is if you implement it and see it work for yourself. And if you do the programming exercises after this, you get a chance to see regularization in action for yourself. So, here's the intuition.

In the previous video, we saw that if we were to fit a quadratic function to this data, it would give us a pretty good fit to the data. Whereas if we were to fit an overly high-order polynomial, we end up with a curve that may fit the training set very well, but that overfits the data and may not generalize well. Consider the following.

Suppose we were to penalize and make the parameters theta three and theta four really small. Here's what I mean. Here's our optimization objective, here's an optimization problem where we minimize our usual squared error cost function. Let's say I take this objective and modify it, and add to it plus 1000 times theta three squared, plus 1000 times theta four squared. 1000 I'm just writing down as some huge number. Now, if we were to minimize this function, well, the only way to make this new cost function small is if theta three and theta four are small, right? Because otherwise, if you have 1000 times theta three squared, this new cost function is going to be big. So when we minimize this new function, we're going to end up with theta three close to zero, and theta four close to zero. And that's as if we're getting rid of these two terms over there.

And if we do that, well then, if theta three and theta four are close to zero, then we're basically left with a quadratic function. And so it ends up with a fit to the data that's a quadratic function, plus maybe tiny contributions from the small terms theta three and theta four, which may be very close to zero. And so we end up with, essentially, a quadratic function, which is good, because this is a much better hypothesis.
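To see this driving-to-zero effect concretely, here is a minimal sketch in Python. The data, the degree-four polynomial features, and the choice of penalty weight are illustrative assumptions, not from the lecture; the point is only that adding 1000 times theta three squared plus 1000 times theta four squared to the squared-error objective pushes those two coefficients toward zero. This penalized least-squares problem happens to have a closed-form solution, which the sketch uses:

```python
import numpy as np

# Toy data: y is roughly quadratic in x (hypothetical numbers for illustration).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.standard_normal(20)

# Design matrix with polynomial features x^0 .. x^4.
X = np.vander(x, 5, increasing=True)

# Minimize ||X @ theta - y||^2 + 1000*theta3^2 + 1000*theta4^2.
# This has the closed form theta = (X^T X + D)^(-1) X^T y
# with D = diag(0, 0, 0, 1000, 1000).
D = np.diag([0.0, 0.0, 0.0, 1000.0, 1000.0])
theta = np.linalg.solve(X.T @ X + D, X.T @ y)

print(theta)  # theta[3] and theta[4] come out very close to zero
```

With the huge penalty on theta three and theta four, the fitted curve is essentially the quadratic the data came from, which is exactly the intuition above.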

In this particular example, we looked at the effect of penalizing two of the parameter values from being large. More generally, here's the idea behind regularization. The idea is that having small values for the parameters will usually correspond to having a simpler hypothesis.

So for our last example, we penalized just theta three and theta four, and when both of these were close to zero, we wound up with a much simpler hypothesis that was essentially a quadratic function. But more broadly, if we penalize all the parameters, we can think of that as trying to give us a simpler hypothesis as well, because when these parameters are close to zero, in this example, that gave us a quadratic function. More generally, it's possible to show that having smaller values of the parameters usually corresponds to smoother functions as well, which are simpler and therefore less prone to overfitting. I realize that the reasoning for why having all the parameters be small gives us a simpler hypothesis may not be entirely clear to you right now, and it is kind of hard to explain unless you implement it and see it for yourself. But I hope that the example of having theta three and theta four be small, and how that gave us a simpler hypothesis, helps explain why, or at least gives some intuition as to why, this might be true.

Let's look at a specific example. For housing price prediction, we may have a hundred features, as we talked about, where maybe x1 is the size, x2 is the number of bedrooms, x3 is the number of floors, and so on. And we may have a hundred features. And unlike the polynomial example, we don't know that theta three and theta four are the high-order polynomial terms. So if we have just a set of 100 features, it's hard to pick in advance which ones are less likely to be relevant. So we have 100, or 101, parameters, and we don't know which parameters to pick to try to shrink.

So in regularization, what we're going to do is take our cost function, here's my cost function for linear regression, and what I'm going to do is modify this cost function to shrink all of my parameters, because I don't know which one or two to try to shrink. So I'm going to modify my cost function to add a term at the end, like so, and then we add square brackets here as well. We're going to add an extra regularization term at the end, to shrink every single parameter, and so this term would tend to shrink all of my parameters theta one, theta two, theta three, up to theta 100.

By the way, by convention, the summation here starts from one, so I'm not actually going to penalize theta zero being large. That's sort of a convention, that the sum is from i equals one through n, rather than i equals zero through n. But in practice it makes very little difference: whether you include theta zero or not, it'll make very little difference in the results. But by convention, usually we regularize only theta one through theta 100.

Writing down our regularized optimization objective, our regularized cost function, again, here it is. Here is J of theta, where this term on the right is a regularization term, and lambda here is called the regularization parameter.
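As a concrete illustration, here is one way this regularized cost could be computed in Python, assuming the standard form from this course, J(theta) = (1/(2m)) * [sum of squared errors + lambda * sum of theta j squared for j from one to n]. The function name and the tiny data set are hypothetical, made up for this sketch; note how the penalty sum starts at theta one, so theta zero is not penalized:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear regression cost:
    J(theta) = (1/(2m)) * (sum((X @ theta - y)^2) + lam * sum(theta[1:]^2)).
    By convention, theta[0] (the intercept term) is not penalized."""
    m = len(y)
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)
    return (residual @ residual + penalty) / (2 * m)

# Tiny hand-checkable example (hypothetical numbers):
X = np.array([[1.0, 1.0],
              [1.0, 2.0]])        # first column of ones is the intercept feature
y = np.array([1.0, 2.0])
theta = np.array([0.0, 1.0])      # this theta fits the data exactly, zero error
print(regularized_cost(theta, X, y, lam=4.0))  # only the penalty remains: 4*1/(2*2) = 1.0
```

Because the squared error is zero here, the whole cost comes from the regularization term, which makes it easy to see how lambda trades off fit against parameter size.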