In this case all of these parameters, theta one, theta two, theta three and so

on will be heavily penalized.

And so we end up with most of these parameter values being close to zero.

And the hypothesis will be roughly h of x just equal or

approximately equal to theta zero.

And so we end up with the hypothesis that more or less looks like that.

It's more or less a flat, constant straight line, and so

this hypothesis has high bias, and it badly underfits this dataset.

A horizontal straight line is just not a very good model for this dataset.

At the other extreme is if we have a very small

value of lambda such as if lambda were equal to zero.

In that case, given that we're fitting a high-order polynomial,

this is our usual overfitting setting.

In that case given that we're fitting a high-order polynomial basically without

regularization or with very minimal regularization we end up with our usual

high variance overfitting setting.

Basically, if lambda is equal to zero, we're just fitting

without regularization, so the hypothesis overfits.

And it's only if we have some intermediate value of lambda that is neither

too large nor too small that we end up with parameters theta that give us

a reasonable fit to this data.
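
To make the two extremes concrete, here is a minimal sketch in Python with NumPy. The data, the polynomial degree of 8, and the helper names `poly_features`, `fit_regularized`, and `half_mse` are assumptions made for illustration and do not come from the lecture. The fit uses the closed-form solution of the regularized least-squares problem, with theta zero left unpenalized.

```python
import numpy as np

def poly_features(x, degree):
    # Columns are x**0, x**1, ..., x**degree; column 0 is the intercept term.
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    # Minimizes (1/(2m)) * sum((X @ theta - y)**2) + (lam/(2m)) * sum(theta[1:]**2).
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # theta zero is not penalized
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

def half_mse(X, y, theta):
    # One half of the average squared error, with no regularization term.
    residual = X @ theta - y
    return residual @ residual / (2 * y.size)

# Made-up data: a noisy curve sampled at 12 points (an assumption for this sketch).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(x.size)
X = poly_features(x, degree=8)

theta_large = fit_regularized(X, y, lam=1e6)  # very large lambda
theta_zero = fit_regularized(X, y, lam=0.0)   # no regularization

# With a very large lambda, theta one onward shrink toward zero, so the
# hypothesis is roughly the constant theta zero (close to the mean of y).
print("large lambda, theta[1:]:", theta_large[1:])
print("large lambda, theta[0]: ", theta_large[0], " mean of y:", y.mean())

# With lambda equal to zero, the high-order polynomial tracks the training points closely.
print("training error, lambda = 0:    ", half_mse(X, y, theta_zero))
print("training error, lambda = 1e6:  ", half_mse(X, y, theta_large))
```

A low training error at lambda equal to zero does not by itself mean a good model; that is the overfitting case, which is why the error on held-out data matters.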

So, how can we automatically choose a good value for

the regularization parameter lambda?

Just to reiterate, here's our model and here's our learning algorithm objective.

For the setting where we're using regularization,

we define J train of theta to be something different,

namely the optimization objective but without the regularization term.

Previously, in an earlier video, when we were not using regularization,

I defined J train of theta to be the same as J of theta, the cost function.

But when we're using regularization, with the extra regularization term, we're going to define J train

to be just my sum of squared errors on the training set, or my average squared error

on the training set, without taking into account that regularization term.

And similarly, we're also going to define the cross-validation set error and

the test set error, as before, to be the average sum of squared errors on

the cross-validation and test sets.

So, just to summarize, my definitions of J train, J cv, and

J test are just the average squared error, or rather one half of the average squared error,

on my training, validation, and test sets, without the extra regularization term.
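
Written out, these definitions might look as follows. This is a sketch that assumes the usual regularized linear regression objective, in which theta zero is not penalized; the symbols m, m_cv, and m_test for the sizes of the three sets and the subscripted examples are notational assumptions.

```latex
\begin{align*}
% Objective minimized by the learning algorithm (includes the regularization term)
J(\theta) &= \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
           + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2 \\
% Errors used for evaluation (no regularization term)
J_{\text{train}}(\theta) &= \frac{1}{2m} \sum_{i=1}^{m}
    \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 \\
J_{\text{cv}}(\theta) &= \frac{1}{2m_{\text{cv}}} \sum_{i=1}^{m_{\text{cv}}}
    \left( h_\theta(x_{\text{cv}}^{(i)}) - y_{\text{cv}}^{(i)} \right)^2 \\
J_{\text{test}}(\theta) &= \frac{1}{2m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}}
    \left( h_\theta(x_{\text{test}}^{(i)}) - y_{\text{test}}^{(i)} \right)^2
\end{align*}
```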

So this is how we can automatically choose the regularization parameter lambda.

What I usually do is maybe have some range of values of lambda I want to try out.

So, I might be considering not using regularization.

Well, here are a few values I might try.

I might be considering lambda equal to 0.01, 0.02, 0.04, and so on.

And I usually set these up in multiples of two until some maybe larger value.

If I were doing this in multiples of two I should end up with 10.24

instead of 10 exactly.

But, this is close enough and the third or

fourth decimal places won't affect your result that much.

So this gives me maybe 12 different models that I'm trying to select amongst

corresponding to 12 different values of the regularization parameter lambda.

And of course you can also go to values less than 0.01 or values larger than ten,

but I've just truncated that here for convenience.
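
The list of candidate values described here can be generated in a couple of lines. This is only a sketch; the variable name `lambdas` is an assumption, and the last value is kept at 10.24 rather than rounded to 10.

```python
# No regularization first, then 0.01 doubled repeatedly: 0.01, 0.02, 0.04, ..., 10.24.
lambdas = [0.0] + [0.01 * 2**k for k in range(11)]

for index, lam in enumerate(lambdas, start=1):
    print(f"model {index}: lambda = {lam:g}")
```

Zero plus eleven doublings gives the 12 candidate models.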

Given these 12 models, what we can do is then the following.

We can take this first model, with lambda equal to zero, and minimize my cost

function J of theta, and this will give me some parameter vector theta.

And similar to the earlier video, let me just denote this as theta superscript one.

And then I can take my second model,

with lambda set to 0.01, and minimize my cost function,

now using lambda equals 0.01 of course, to get some different parameter vector theta.

Let me denote that theta two.

And after that I end up with theta three, so this is the parameter vector for my third model, and so

on, until for my final model, with lambda set to ten or

10.24, I end up with theta 12.

Next I can take all of these hypotheses, all of these parameters and

use my cross validation set to evaluate them.

So I can look at my first model,

my second model, and so on, each fit with a different value of the regularization parameter, and

evaluate them on my cross-validation set; that is, basically measure the average squared

error of each of these parameter vectors theta on my cross-validation set.

And I would then pick whichever one of these 12 models gives me the lowest

error on the cross-validation set.

And let's say, for the sake of this example, that I end up picking theta five,

the fifth of these models, because that has the lowest cross-validation error.

Having done that,

finally, what I would do if I want to report test

set error is to take the parameter theta five that I've selected and

look at how well it does on my test set.

And once again,

here it is as if we've fit this parameter lambda to my cross-validation set,

which is why I'm setting aside a separate test set

that I'm going to use to get a better estimate of how well my

parameter vector theta will generalize to previously unseen examples.
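
The whole procedure (fit one parameter vector per value of lambda, pick the one with the lowest cross-validation error, then report its test error) can be sketched as below. The synthetic data, the 60/20/20 split, the degree-8 polynomial, and all function and variable names are assumptions made for this illustration. The fit uses a closed-form solve rather than whatever optimizer one would use in practice.

```python
import numpy as np

def poly_features(x, degree):
    # Columns are x**0, x**1, ..., x**degree; column 0 is the intercept term.
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    # Minimizes the regularized cost J(theta); theta zero is not penalized.
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

def half_mse(X, y, theta):
    # One half of the average squared error, with no regularization term.
    residual = X @ theta - y
    return residual @ residual / (2 * y.size)

# Made-up data, split into training, cross-validation, and test sets.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 90)
y = np.sin(3.0 * x) + 0.2 * rng.standard_normal(x.size)
X = poly_features(x, degree=8)
X_train, y_train = X[:54], y[:54]
X_cv, y_cv = X[54:72], y[54:72]
X_test, y_test = X[72:], y[72:]

# The 12 candidate values of lambda: 0, 0.01, 0.02, ..., 10.24.
lambdas = [0.0] + [0.01 * 2**k for k in range(11)]

# Step 1: minimize J(theta) on the training set once per lambda (theta 1 ... theta 12).
thetas = [fit_regularized(X_train, y_train, lam) for lam in lambdas]

# Step 2: evaluate each theta on the cross-validation set, without regularization.
cv_errors = [half_mse(X_cv, y_cv, theta) for theta in thetas]

# Step 3: pick the model with the lowest cross-validation error.
best = int(np.argmin(cv_errors))

# Step 4: report the error of the selected theta on the separate test set.
test_error = half_mse(X_test, y_test, thetas[best])

print(f"selected model {best + 1} with lambda = {lambdas[best]:g}")
print(f"cross-validation error = {cv_errors[best]:.4f}, test error = {test_error:.4f}")
```

Because lambda was chosen using the cross-validation set, the cross-validation error of the selected model tends to be an optimistic estimate, which is why the test error is reported separately.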

So that's model selection applied to selecting

the regularization parameter lambda.

The last thing I'd like to do in this video is get a better understanding of

how cross-validation and

training error vary as we vary the regularization parameter lambda.

And so just a reminder, all right?

That was our original cost function J of theta.

But for this purpose,

we're going to define training error without using the regularization term,

and cross-validation error without using the regularization term.
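
As a rough illustration of how these two quantities can be computed over a range of lambda values, here is a small sketch. The data, the split, the degree-8 polynomial, and the helper names are all assumptions; both errors are measured without the regularization term, even though each theta is obtained by minimizing the regularized cost.

```python
import numpy as np

def poly_features(x, degree):
    # Columns are x**0, x**1, ..., x**degree; column 0 is the intercept term.
    return np.vander(x, degree + 1, increasing=True)

def fit_regularized(X, y, lam):
    # Minimizes the regularized cost J(theta); theta zero is not penalized.
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

def half_mse(X, y, theta):
    # One half of the average squared error, with no regularization term.
    residual = X @ theta - y
    return residual @ residual / (2 * y.size)

# Made-up data with a training set and a cross-validation set.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, 60)
y = np.sin(3.0 * x) + 0.2 * rng.standard_normal(x.size)
X = poly_features(x, degree=8)
X_train, y_train = X[:40], y[:40]
X_cv, y_cv = X[40:], y[40:]

lambdas = [0.0] + [0.01 * 2**k for k in range(11)]
train_errors = []
cv_errors = []
for lam in lambdas:
    theta = fit_regularized(X_train, y_train, lam)        # trained with regularization
    train_errors.append(half_mse(X_train, y_train, theta))  # measured without it
    cv_errors.append(half_mse(X_cv, y_cv, theta))

for lam, j_train, j_cv in zip(lambdas, train_errors, cv_errors):
    print(f"lambda = {lam:<6g} J_train = {j_train:.4f}  J_cv = {j_cv:.4f}")
```

For an exact minimizer of the regularized cost, the training error measured this way cannot go down as lambda increases. How the cross-validation error behaves depends on the data.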