0:00

In this video, I'd like to convey to you the main intuitions behind how regularization works, and we'll also write down the cost function that we'll use when we're using regularization. With the hand-drawn examples that we have on these slides, I think I'll be able to convey part of the intuition. But an even better way to see for yourself how regularization works is if you implement it and see it work for yourself. And if you do the programming exercises after this, you get a chance to see regularization in action for yourself. So, here's the intuition.

In the previous video, we saw that if we were to fit a quadratic function to this data, it would give us a pretty good fit to the data. Whereas, if we were to fit an overly high-order polynomial, we'd end up with a curve that may fit the training set very well, but that overfits the data and doesn't generalize well. Consider the following.

Suppose we were to penalize and make the parameters theta3 and theta4 really

small. Here's what I mean.

Here's our optimization objective; here's an optimization problem where we minimize our usual squared error cost function. Let's say I take this objective and I modify it, and add to it 1000 times theta three squared, plus 1000 times theta four squared. 1000 I'm just writing down as some huge number. Now, if we were to minimize this function, well, the only way to make this new cost function small is if theta three and theta four are small, right? Because otherwise, if theta three were large, then 1000 times theta three squared would make this new cost function very large.

So when we minimize this new function, we're going to end up with theta three close to zero, and theta four close to zero. And that's as if we're getting rid of these two terms over there. And if we do that, well, if theta three and theta four are close to zero, then we're basically left with a quadratic function. And so we end up with a fit to the data that's a quadratic function, plus maybe tiny contributions from the small terms theta three and theta four, which may be very close to zero. [COUGH] And so we end up with, essentially, a quadratic function, which is good, because this is a much better hypothesis.
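To see this penalty at work numerically, here is a small sketch in NumPy. The dataset, the fourth-order features, and the solution method are all illustrative choices of mine, not the course's actual data or code: it fits a fourth-order polynomial by solving the normal equations for the modified objective, with the extra 1000 times theta three squared plus 1000 times theta four squared penalty folded in.

```python
import numpy as np

# Hypothetical data that is roughly quadratic in x (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.1, x.size)

# Polynomial features: [1, x, x^2, x^3, x^4].
X = np.column_stack([np.ones_like(x), x, x**2, x**3, x**4])
m = x.size

# Objective: (1/2m) * sum of squared errors + 1000*theta3^2 + 1000*theta4^2.
# Setting its gradient to zero gives the linear system below; P holds the
# huge penalty weights on theta3 and theta4 only.
P = np.diag([0.0, 0.0, 0.0, 1000.0, 1000.0])
theta = np.linalg.solve(X.T @ X / m + 2.0 * P, X.T @ y / m)

# theta[3] and theta[4] come out crushed toward zero, leaving an
# essentially quadratic hypothesis.
print(theta)
```

Minimizing in closed form here is just a convenience; gradient descent on the same objective would land in the same place, with theta three and theta four near zero.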

In this particular example, we looked at the effect of penalizing two of the parameter values being large. More generally, here's the idea behind regularization. The idea is that having small values for the parameters will usually correspond to having a simpler hypothesis.

So for our last example we penalized just theta3 and theta4 and when both of these

were close to zero we wound up with a much simpler hypothesis that was

essentially a quadratic function. But more broadly, if we penalize all the parameters, we can think of that as trying to give us a simpler hypothesis as well, because when these parameters are close to zero, in this example, that gave us a quadratic function. But more generally, it's possible to show that having smaller values of the parameters usually corresponds to smoother functions as well, which are simpler and therefore less prone to overfitting. I realize that the reasoning for why having all the parameters be small gives us a simpler hypothesis may not be entirely clear to you right now, and it is kind of hard to explain unless you implement it yourself and see it for yourself. But I hope that the example of having theta three and theta four be small, and how that gave us a simpler hypothesis, helps explain why, or at least gives some intuition as to why, this might be true.

Let's look at a specific example. For housing price prediction, we may have the hundred features that we talked about, where maybe x1 is the size, x2 is the number of bedrooms, x3 is the number of floors, and so on. And we may have a hundred features. Unlike the polynomial example, we don't know that theta three and theta four are the high-order polynomial terms. So if we have just a set of 100 features, it's hard to pick in advance which are the ones that are less likely to be relevant. So we have 100 or 101 parameters, and we don't know which parameters to pick to try to shrink.

So in regularization, what we're going to do is take our cost function. Here's my cost function for linear regression. And what I'm going to do is modify this cost function to shrink all of my parameters, because I don't know which one or two to try to shrink. So I'm going to modify my cost function to add a term at the end, like so, and then we add square brackets here as well. We're going to add an extra regularization term at the end to shrink every single parameter, and so this term will tend to shrink all of my parameters theta one, theta two, theta three, up to theta 100.

By the way, by convention, the summation here starts from one, so I'm not actually going to penalize theta zero being large. That's sort of a convention, that the sum is from i equals one through n, rather than i equals zero through n. But in practice it makes very little difference: whether or not you include theta zero, it'll make very little difference in the results. But by convention, usually we regularize only theta one through theta 100.

Writing down our regularized optimization objective, our regularized cost function, again: here it is, here is J of theta, where this term on the right is a regularization term, and lambda here is called the regularization parameter.
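To make that concrete, here is a sketch of what this regularized cost function could look like in code. This is a hypothetical NumPy version of mine, not the course's official implementation: the first term is the usual squared error, and the second is lambda times the sum of squared parameters, with theta zero left out of the sum by convention.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J(theta) = (1/2m) * [ sum_i (h(x_i) - y_i)^2
                             + lam * sum_{j=1}^{n} theta_j^2 ].
    Note the regularization sum starts at j = 1: theta[0] is not
    penalized, by convention."""
    m = y.size
    errors = X @ theta - y              # h(x_i) - y_i for every example
    reg = lam * np.sum(theta[1:] ** 2)  # skip theta[0]
    return (errors @ errors + reg) / (2 * m)

# Tiny illustrative example with made-up numbers.
X = np.array([[1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 2.0])
theta = np.array([0.5, 0.5])
print(regularized_cost(theta, X, y, lam=2.0))  # prints 0.1875
```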

6:23

And what lambda does is it controls a tradeoff between two different goals. The first goal, captured by the first term in the objective, is that we would like to fit the training data well; we would like to fit the training set well. And the second goal is that we want to keep the parameters small, and that's captured by the second term, by the regularization term. And what lambda, the regularization parameter, does is control the tradeoff between these two goals: between the goal of fitting the training set well, and the goal of keeping the parameters small, and therefore keeping the hypothesis relatively simple, to avoid overfitting. For our housing price prediction example.

Whereas previously, if we had a very high-order polynomial, we may have wound up with a very wavy or curvy function like this. If you still fit a high-order polynomial, with all the polynomial features in there, but you just make sure to use this sort of regularized objective, then what you can get out is, in fact, a curve that isn't quite a quadratic function, but is much smoother and much simpler, maybe a curve like the magenta line, that gives a much better hypothesis for this data. Once again, I realize it can be a bit difficult to see why shrinking the parameters can have this effect, but if you implement this algorithm yourself with regularization, you will be able to see this effect firsthand. In regularized linear regression, if the regularization parameter lambda is set to be very large,

then what will happen is we will end up penalizing the parameters theta one, theta two, theta three, theta four very highly. That is, if our hypothesis is this one down at the bottom, and we penalize theta one, theta two, theta three, theta four very heavily, then we end up with all of these parameters close to zero. Theta one will be close to zero, theta two will be close to zero, and theta three and theta four will end up being close to zero. And if we do that, it's as if we're getting rid of these terms in the hypothesis, so that we're just left with a hypothesis that says that housing prices are simply equal to theta zero. That is akin to fitting a flat horizontal straight line to the data, and this is an example of underfitting. In particular, this hypothesis, this straight line, just fails to fit the training set well. It's just a flat straight line; it doesn't go anywhere near most of our training examples.

And another way of saying this is that this hypothesis has too strong a preconception, or too high a bias, that housing prices are just equal to theta zero, and despite the clear data to the contrary, it chooses to fit a sort of flat horizontal line. I didn't draw that very well; it fits just a horizontal flat line to the data. So for regularization to work well,

some care should be taken to choose a good value for the regularization parameter lambda. And when we talk about model selection later in this course, we will talk about a variety of ways for automatically choosing the regularization parameter lambda as well.
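The effect of setting lambda very large, as in the underfitting example above, can also be seen numerically. Here is a small hypothetical sketch, with made-up data and an illustrative lambda of one million, using the closed-form normal equation for regularized linear regression and leaving theta zero unpenalized.

```python
import numpy as np

# Made-up, clearly sloped data (illustrative only).
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 4.0 * x + rng.normal(0.0, 0.2, x.size)
X = np.column_stack([np.ones_like(x), x, x**2, x**3, x**4])

def ridge_fit(X, y, lam):
    # Normal equation for the regularized objective; the (0, 0) entry of
    # the penalty matrix is zeroed so theta_0 is not regularized.
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)

theta = ridge_fit(X, y, lam=1e6)
# theta[1:] are all crushed toward zero, so the hypothesis collapses to
# the flat line h(x) ~= theta[0] ~= mean(y): underfitting.
print(theta)
```

Trying the same sketch with a moderate lambda, such as 1.0, instead recovers a fit close to the true sloped line, which is the tradeoff the lecture describes.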

So, that's the idea behind regularization, and the cost function we'll use in order to apply regularization. In the next two videos, let's take these ideas and apply them to linear regression and to logistic regression, so that we can then get them to avoid overfitting problems.