0:00

In this video, I'd like to convey to you the main intuitions behind how regularization works. And we'll also write down the cost function that we'll use when we're applying regularization.

With the hand-drawn examples that we have on these slides, I think I'll be able to convey part of the intuition. But an even better way to see for yourself how regularization works is to implement it and see it work for yourself. And if you do the appropriate exercises after this, you'll get the chance to see regularization in action for yourself.

So, here is the intuition. In the previous video, we saw that if we were to fit a quadratic function to this data, it gives us a pretty good fit. Whereas if we were to fit an overly high-order polynomial, we end up with a curve that may fit the training set very well, but overfits the data and doesn't generalize well.

0:57

Consider the following: suppose we were to penalize the parameters theta 3 and theta 4, and make them really small. Here's what I mean. Here is our optimization objective, or here is our optimization problem, where we minimize our usual squared error cost function. Let's say I take this objective and modify it, and add to it plus 1000 times theta 3 squared, plus 1000 times theta 4 squared, where 1000 is just some huge number I'm writing down.
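In symbols (using the fourth-order hypothesis from the previous video), the modified objective being described looks like this:

\[
\min_\theta \; \frac{1}{2m} \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2 \; + \; 1000\,\theta_3^2 \; + \; 1000\,\theta_4^2
\]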

Now, if we were to minimize this function, the only way to make this new cost function small is if theta 3 and theta 4 are small, right? Because otherwise, if you have a thousand times theta 3 squared, this new cost function is going to be big. So when we minimize this new function, we are going to end up with theta 3 close to 0 and theta 4 close to 0, and it's as if we're getting rid of these two terms over there.

2:03

And if we do that, well then, if theta 3 and theta 4 are close to 0, we are left with a quadratic function. So we end up with a fit to the data that's, you know, a quadratic function plus maybe tiny contributions from the terms with theta 3 and theta 4, which are very close to 0. And so we end up with essentially a quadratic function, which is good, because this is a much better hypothesis.

In this particular example, we looked at the effect of penalizing two of the parameter values for being large. More generally, here is the idea behind regularization: if we have small values for the parameters, then having small values for the parameters will usually correspond to having a simpler hypothesis.

So, for our last example, we penalized just theta 3 and theta 4, and when both of these were close to zero, we wound up with a much simpler hypothesis that was essentially a quadratic function. But more broadly, if we penalize all the parameters, we can think of that as trying to give us a simpler hypothesis as well, because when these parameters are all close to zero, as in this example, that gave us a quadratic function. More generally, it is possible to show that having smaller values of the parameters usually corresponds to smoother, simpler functions, which are therefore also less prone to overfitting.

I realize that the reasoning for why having all the parameters be small corresponds to a simpler hypothesis may not be entirely clear to you right now, and it is kind of hard to explain unless you implement it and see it for yourself. But I hope that the example of having theta 3 and theta 4 be small, and how that gave us a simpler hypothesis, helps explain why, or at least gives some intuition as to why, this might be true. Let's look at a specific example.

4:12

For housing price prediction, we may have the hundred features that we talked about, where maybe x1 is the size, x2 is the number of bedrooms, x3 is the number of floors, and so on. And we may have a hundred features. Unlike in the polynomial example, we don't know that theta 3 and theta 4 are the high-order polynomial terms. So if we have just a set of a hundred features, it's hard to pick in advance which ones are less likely to be relevant. So we have a hundred, or a hundred and one, parameters, and we don't know which ones to pick, which parameters to try to shrink.

So, in regularization, what we're going to do is take our cost function (here's my cost function for linear regression) and modify it to shrink all of my parameters, because, you know, I don't know which one or two to try to shrink. So I am going to modify my cost function to add a term at the end.
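That added term is the regularization term. Written out, the regularized cost function for linear regression being described here is:

\[
J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \bigl( h_\theta(x^{(i)}) - y^{(i)} \bigr)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]
\]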

5:36

By the way, by convention the summation in the regularization term starts from one, so I am not actually going to penalize theta zero for being large. That is sort of the convention: the sum runs from one through n, rather than from zero through n. But in practice it makes very little difference whether or not you include theta zero; it makes very little difference to the results. By convention, though, we usually regularize only theta 1 through theta 100. Writing down our regularized optimization objective, our regularized cost function, again:

Here it is: here's J of theta, where the term on the right is the regularization term, and lambda here is called the regularization parameter. What lambda does is control a trade-off between two different goals.

The first goal, captured by the first term of the objective, is that we would like to fit the training data well; we would like to fit the training set well. And the second goal is that we want to keep the parameters small, and that's captured by the second term, the regularization term. And what lambda, the regularization parameter, does is control the trade-off between these two goals: between the goal of fitting the training set well and the goal of keeping the parameters small, and therefore keeping the hypothesis relatively simple, to avoid overfitting.
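As a concrete illustration, here is a minimal NumPy sketch of this regularized cost function. It assumes a linear hypothesis with a design matrix X whose first column is all ones; the function and variable names are just illustrative, not from the course materials.

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """Regularized linear regression cost J(theta).

    X     : (m, n+1) design matrix, first column all ones
    y     : (m,) vector of targets
    theta : (n+1,) parameter vector
    lam   : the regularization parameter lambda
    """
    m = len(y)
    errors = X @ theta - y
    # First goal: fit the training set well (the usual squared error term).
    fit_term = (errors @ errors) / (2 * m)
    # Second goal: keep theta_1 ... theta_n small (theta_0 is not penalized).
    reg_term = lam * (theta[1:] @ theta[1:]) / (2 * m)
    return fit_term + reg_term
```

A larger lam pushes the minimizer toward smaller parameter values; lam = 0 recovers the unregularized squared error cost.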

For our housing price prediction example, whereas previously, if we had fit a very high-order polynomial, we may have wound up with a very wiggly or curvy function like this, if you instead still fit a high-order polynomial with all the polynomial features in there, but make sure to use this sort of regularized objective, then what you get out is in fact a curve that isn't quite a quadratic function, but is much smoother and much simpler, and maybe looks like the magenta line that, you know, gives a much better hypothesis for this data.

Once again, I realize it can be a bit difficult to see why shrinking the parameters can have this effect, but if you implement regularization yourself, you will be able to see this effect firsthand.
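For example, here is a small sketch of the kind of experiment you could run to see the effect. It uses scikit-learn and made-up synthetic data, so treat it as an illustration rather than the course's own exercise: the same degree-10 polynomial features are fit once without regularization and once with an L2 penalty on the coefficients (ridge regression), which plays the role of the regularized objective above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data: a smooth underlying trend plus noise.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 3.0, size=20)).reshape(-1, 1)
y = 2.0 + 3.0 * np.sqrt(x.ravel()) + rng.normal(scale=0.2, size=20)

# Degree-10 polynomial, no regularization: tends to give a wiggly, overfit curve.
overfit = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                        LinearRegression())
overfit.fit(x, y)

# Same polynomial features, but with an L2 penalty on the coefficients
# (the intercept, like theta_0, is not penalized): a much smoother curve.
regularized = make_pipeline(PolynomialFeatures(degree=10, include_bias=False),
                            Ridge(alpha=1.0))
regularized.fit(x, y)

print("unregularized coefficients:", np.round(overfit[-1].coef_, 1))
print("regularized coefficients:  ", np.round(regularized[-1].coef_, 1))
```

Comparing the two sets of coefficients (and plotting the two fitted curves, if you like) shows the regularized parameters coming out much smaller and the fitted curve much smoother.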

8:00

In regularized linear regression, if the regularization parameter lambda is set to be very large, then what will happen is that we will end up penalizing the parameters theta 1, theta 2, theta 3, theta 4 very highly. That is, if our hypothesis is the one down at the bottom, and we penalize theta 1, theta 2, theta 3, theta 4 very heavily, then we end up with all of these parameters close to zero, right? Theta 1 will be close to zero; theta 2 will be close to zero; theta 3 and theta 4 will end up being close to zero.

And if we do that, it's as if we're getting rid of these terms in the hypothesis, so that we're just left with a hypothesis that says, well, housing prices are equal to theta zero. And that is akin to fitting a flat horizontal straight line to the data.
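In symbols, a very large lambda forces

\[
\theta_1 \approx \theta_2 \approx \dots \approx \theta_n \approx 0, \quad \text{so} \quad h_\theta(x) \approx \theta_0,
\]

which is exactly that flat horizontal line.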

And this is an example of underfitting, and in particular this hypothesis, this straight line, just fails to fit the training set well. It's just a flat straight line; it doesn't go anywhere near most of the training examples. Another way of saying this is that this hypothesis has too strong a preconception, or too high a bias, that housing prices are just equal to theta zero, and despite the clear data to the contrary, it chooses to fit, you know, a sort of flat horizontal line (I didn't draw that very well), just a horizontal flat line to the data. So for regularization to work well, some care should be taken to choose a good value for the regularization parameter lambda as well.

And when we talk about model selection later in this course, we'll talk about a variety of ways of automatically choosing the regularization parameter lambda as well. So that's the idea behind regularization, and the cost function we use in order to apply regularization. In the next two videos, let's take these ideas and apply them to linear regression and to logistic regression, so that we can then get them to avoid overfitting.