In this video, we'll discuss how to regularize our models,

how to reduce their complexity,

so they don't overfit.

As you remember, there was an example with eight data points and eight parameters in

our linear model. This model overfitted to

our data, and the parameters of this model were very large.

But if we use the appropriate model for the same problem,

in this case a model with three features,

x, x to the second degree, and x to the third degree,

then the model will be good:

it will fit the target function,

the green line, and the parameters will not be very large.

Actually, we can use this property, that overfitted models have large weights and

good models have not very large weights, to solve the problem of overfitting.

To do it, we modify our loss function.

So we take our initial loss function,

L of W and we add a regularizer,

R of W that penalizes our model for large weights.

We add it with a coefficient lambda,

the regularization strength, which controls the tradeoff between

model quality on a training set and model complexity.

And then we minimize this new loss function,

L of W plus lambda multiplied by R of W. For example,

we can use the L2 penalty as a regularizer,

it just sums the squares of our parameters,

not including the bias, that's important.

So this is a very simple penalty,

it's differentiable, so we can use any gradient descent method to optimize it.

And this regularizer just drives all the coefficients closer to zero.

So it penalizes our model for very large weights.
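
The regularized loss described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the initial loss L(w) is taken to be mean squared error, the data is synthetic, and the names `l2_regularized_loss` and `fit_gd`, the learning rate, and the step count are hypothetical choices, not something from the lecture.

```python
import numpy as np

def l2_regularized_loss(w, b, X, y, lam):
    """Mean squared error plus lam * sum of squared weights (bias b is excluded)."""
    residual = X @ w + b - y
    return np.mean(residual ** 2) + lam * np.sum(w ** 2)

def fit_gd(X, y, lam, lr=0.05, n_steps=2000):
    """Plain gradient descent on the regularized loss (hypothetical settings)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_steps):
        residual = X @ w + b - y
        grad_w = 2.0 * X.T @ residual / n + 2.0 * lam * w  # penalty adds 2 * lam * w
        grad_b = 2.0 * np.mean(residual)                    # no penalty on the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data, assumed only for this illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=50)

w_plain, b_plain = fit_gd(X, y, lam=0.0)  # no regularization
w_reg, b_reg = fit_gd(X, y, lam=1.0)      # L2 penalty with lambda = 1
print(np.linalg.norm(w_plain), np.linalg.norm(w_reg))
```

With the penalty switched on, the weight vector comes out with a smaller norm, while the bias is left free to match the offset of the targets.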

Actually, it can be shown that this unconstrained optimization problem

is equivalent to a constrained optimization problem.

We just take our initial loss function L of W,

minimize it with respect to W, and add the constraint that the L2 norm,

the squared L2 norm of our weight vector,

of our parameter vector, is no larger than C,

where there is a one-to-one correspondence between C and lambda, the regularization strength.

So we take our loss function and we select the closest point to the minimum of

this function that lies inside the ball of radius square root of C with the center at zero.
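
Written out, the two problems from this part of the lecture look as follows. The notation (the double bars with subscript 2 for the L2 norm) is an assumption about how the slide writes it, and the correspondence between lambda and C is usually stated for convex loss functions.

```latex
\min_{w} \; L(w) + \lambda \lVert w \rVert_2^2
\qquad \Longleftrightarrow \qquad
\min_{w} \; L(w) \quad \text{subject to} \quad \lVert w \rVert_2^2 \le C
```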

And then if we return to our example with eight data points and a model of

eighth degree and apply an L2 penalty with regularization coefficient one,

then we get this model.

It's much simpler than the previous model.

It fits our true target function well and the coefficients are not very large.

So, L2 penalty does its job.
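
A rough way to reproduce this kind of experiment is sketched below. The data from the slide is not available, so the eight points here are synthetic (a hypothetical noisy cubic), the loss is assumed to be the sum of squared errors, and `poly_features` and `ridge_fit` are illustrative names. The fit uses the closed-form solution for the L2-penalized least squares problem, with the bias left out of the penalty by centering.

```python
import numpy as np

def poly_features(x, degree):
    """Columns x, x^2, ..., x^degree (the bias is handled separately)."""
    return np.column_stack([x ** k for k in range(1, degree + 1)])

def ridge_fit(X, y, lam):
    """Minimize sum of squared errors + lam * ||w||^2 with an unpenalized bias."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centering takes the bias out of the problem
    d = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 8)                   # eight data points
y = x ** 3 - x + 0.1 * rng.normal(size=x.size)  # hypothetical target plus noise
X = poly_features(x, degree=8)

w_weak, b_weak = ridge_fit(X, y, lam=1e-6)   # almost no regularization
w_ridge, b_ridge = ridge_fit(X, y, lam=1.0)  # regularization coefficient one
print(np.abs(w_weak).max(), np.abs(w_ridge).max())
```

Comparing the two printed values shows how much the penalty shrinks the largest coefficient on this particular synthetic sample.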

There is another penalty function called the L1 penalty.

We take the absolute values of all weights and sum them.

And once again, we don't include the bias into this sum.

This regularizer is not differentiable because there is

no derivative of the absolute value at zero,

so we need to use some advanced optimization techniques to optimize this function.

But this penalty has a nice property.

It leads to sparse solutions.

It drives some coefficients,

some parameters exactly to zero.

So our model depends only on some subset of features.
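
One of the optimization techniques that handles the non-differentiable L1 term is proximal gradient descent, where each gradient step on the initial loss is followed by soft thresholding. The sketch below is an assumption about how one might do it, not the method used in the lecture: the loss is mean squared error, the data is synthetic, and `soft_threshold` and `fit_l1` are hypothetical names.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink every entry toward zero by t; entries within [-t, t] become exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_l1(X, y, lam, n_steps=5000):
    """Proximal gradient descent for mean squared error + lam * sum(|w|), bias unpenalized."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    # Step size 1 / (Lipschitz constant of the MSE gradient, bias column included).
    A = np.hstack([X, np.ones((n, 1))])
    lr = n / (2.0 * np.linalg.norm(A, 2) ** 2)
    for _ in range(n_steps):
        residual = X @ w + b - y
        grad_w = 2.0 * X.T @ residual / n
        grad_b = 2.0 * np.mean(residual)
        w = soft_threshold(w - lr * grad_w, lr * lam)  # gradient step, then shrinkage
        b = b - lr * grad_b                            # plain gradient step for the bias
    return w, b

# Synthetic data, assumed for illustration: only features 0 and 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
true_w = np.array([2.0, 0.0, 0.0, -3.0, 0.0, 0.0])
y = X @ true_w + 1.0 + 0.1 * rng.normal(size=100)

w_l1, b_l1 = fit_l1(X, y, lam=0.5)
print(w_l1)
```

The soft-thresholding step is what produces exact zeros: a weight whose gradient signal is weaker than the penalty gets clipped to zero on every iteration instead of merely becoming small.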

Once again, we can show that

this unconstrained optimization problem is equivalent to a constrained one,

where we minimize our initial loss L of W and have a constraint

that the L1 norm of our weight vector is no larger than

C. And in our example with eight data points,

if we use the L1 penalty with coefficient 0.01,

then we get this solution.

It's good too: it fits the data well, and also four of the eight coefficients are zero,

so the solution is indeed sparse.

Of course, there are other regularization techniques.

For example, we could reduce the dimensionality of our data.

For example, remove some redundant features or maybe apply

principal component analysis to get some new good features. Or we can augment our data.

For example, if we work with images,

we can distort them,

flip, rotate, or something else.

So we have more data and it's harder for our model to overfit on it.
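
As a minimal sketch of such augmentation, assuming images are stored as NumPy arrays of shape height by width (optionally with a channel axis last), the function below returns a few distorted copies of one image. The name `augment` and the particular set of transformations are illustrative, and whether a flip or rotation keeps the label valid depends on the task (a flipped digit, for instance, may no longer be the same digit).

```python
import numpy as np

def augment(image):
    """Return the image together with a few flipped and rotated copies."""
    return [
        image,
        np.fliplr(image),      # horizontal flip (left-right)
        np.flipud(image),      # vertical flip (up-down)
        np.rot90(image, k=1),  # rotation by 90 degrees in the height-width plane
    ]

# Tiny stand-in for a grayscale image, used only for illustration.
image = np.arange(12).reshape(3, 4)
variants = augment(image)
print([v.shape for v in variants])
```

Each training image then contributes several samples instead of one, all sharing the original label.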

We can use dropout, which we'll discuss in the following weeks of our course.

We can also use early stopping.

So if we use gradient descent,

we can stop, for example, at the hundredth iteration,

so our model doesn't have a way to overfit:

it stops early, before it overfits to our data.
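
The lecture describes stopping at a fixed iteration. A common variant, sketched here as an assumption rather than as the lecturer's method, picks the stopping point by watching the loss on a held-out validation set and stopping once it has not improved for a while. The function name `fit_early_stopping`, the `patience` parameter, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_early_stopping(X_tr, y_tr, X_val, y_val, lr=0.01, max_steps=10000, patience=20):
    """Gradient descent on training MSE; keep the weights with the best validation MSE."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_val = w.copy(), np.inf
    bad_steps, steps_done = 0, 0
    for _ in range(max_steps):
        grad = 2.0 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w = w - lr * grad
        steps_done += 1
        val = np.mean((X_val @ w - y_val) ** 2)
        if val < best_val:
            best_w, best_val, bad_steps = w.copy(), val, 0
        else:
            bad_steps += 1
            if bad_steps >= patience:  # validation loss stopped improving
                break
    return best_w, best_val, steps_done

# Synthetic data with more features than training points, assumed for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 30))
y = X[:, 0] - 2.0 * X[:, 1] + 0.5 * rng.normal(size=40)
w, val_loss, steps = fit_early_stopping(X[:20], y[:20], X[20:], y[20:])
print(steps, val_loss)
```

Stopping at a fixed iteration such as the hundredth corresponds to setting `max_steps=100` and ignoring the validation check.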

And of course, we can just collect more data.

The more data we have, the harder it is for our model to overfit.

So on large samples it should just generalize,

it should learn some dependencies from our data.

In this video, we discussed regularization techniques.

For example, L2 and L1 penalties, which work well for linear models.

And we mentioned some other regularization techniques that are good for larger models,

for example, for neural networks.

And we'll discuss these regularization techniques in detail in the following weeks.