0:00

By now you've seen a couple different learning algorithms, linear regression

and logistic regression. They work well for many problems, but

when you apply them to certain machine learning applications they can run into a

problem called over fitting that can cause them to perform very poorly.

What I'd like to do in this video is explain to you what is this over fitting

problem, and in the next few videos after this, we'll talk about a technique called

regularization, that will allow us to or to ameliorate, or to reduce this over

fitting problem and get these learning algorithms to work much better.

So, what is over fitting? Let's keep using our running example of

predicting housing prices with linear regression where we want to predict the

price as a function of the size of the house.

One thing we could do is fit a linear function to this data.

And if we do that, maybe we get that sort of straight line fit to the data.

But this isn't a very good model. Looking at the data, it seems pretty

clear that, as the size of the house increases.

The, housing prices plateau, or kind of flattens out as we move to the

right. And so, this, algorithm doesn't fit the

trading set very well, and we call this problem underfitting,

and another term for this is that this algorithm has high bias.

both of these roughly mean that it's just not even fitting the trading data very

well. The term bias is kind of a historical, or

technical one, but the idea is that if fitting a straight line to the data, then

it's as if the algorithm has a very strong preconception, or a very strong

bias that housing prices are going to vary linearly with their size.

And despite the data to the contrary, despite the evidence to the contrary, as

preconceptions still bias, still causes it to fit a straight line and this ends

up being a poor fit to the data. Now, in the middle,

we could fit a quadratic function to the data, and with this data set,

we fit a quadratic function, maybe we get that kind of curve, and that works pretty

well. And at the other extreme would be if we

were to fit say, a fourth of the polynomials of the data.

So here, we have five parameters, theta zero through theta four.

And with that, we can actually fill the curve that process through all five of

our training examples. We might get a curve that looks like

this. That, on the one hand, seems to do a very

good job fitting the trading set. It is processed through all of my data at

least, but this is sort of a very wiggly curve.

It's like going up and down going all over the place, and we don't actually

think that's such a good model for predicting housing prices.

So, this problem we call overfitting, and another term for this is that this

algorithm has high variance. The term high variance, is another, sort

of historical, or technical one, but the intuition is that, if we're

fitting such a high older polynomial, then the hypothesis can fit, you know, is

almost as if it can fit almost any function, and the space of the possible

hypothesis, is just to large, or is to variable.

And we don't have enough data to constrain it, to give us a good

hypothesis. So that's how overfitting, and in the

middle there isn't really a name, but so I'm just going to write, you know, just

write where a second degree polynomial, a quadratic function, seems to be just

right for fitting this data. To recap a bit, the problem of

overfitting comes when, if we have too many features,

then the learn hypothesis may fit the trading set very well.

So, your cost function may actually be very close to zero,

or maybe even zero exactly. But you may then end up with a curve like

this, that you know, tries too hard to fit the training set, so that it even

fails to generalize to new examples. And it fails to predict prices on new

examples well. And here the term generalize refers to

how well a hypothesis applies even to new examples.

That is to data, to houses that it hasn't seen in the training set.

On this slide, we looked at overfitting for the case of linear regression.

A similar thing can apply to logistic regression as well.

Here's the other logistic regression example, with two features, x1 and x2.

One thing we could do is fit logistic regression with just a simple hypothesis

like this, where, as usual, g is my sigmoid

function. And if you do that, you end up with a

hypothesis, trying to use maybe just a straight line to separate the positive

and the negative examples. And this doesn't look at a, like a very

good fit to the hypothesis. And so, once again, this is an example of

underfitting, or of a hypothesis having high bias.

In contrast if you were to add to your features these quadratic terms, then you

could get a decision boundary that might look more like this.

And, you know, that's a pretty good fit to the data,

probably what probably about as good as we could get on this training set.

And, finally at the other extreme, if you were to fit a very high order polynomial.

If you were to generate lots of high-order polynomial terms as features,

then the logistic progression may contort itself.

We tried really hard to find a decision boundary that

fits your training data, or go to great lengths to contort a cell to fit every

single training example well. And, you know, if the features X1 and X2

are for predicting maybe the cancer tumor, you know cancer is a malignant

benign breast tumors. This doesn't, this really doesn't look like a very good

hypothesis for making predictions. And so once again, this is an instance of

overfitting, and of hypothesis having high variance and not really, and being

unlikely to generalize well to new examples.

Later in this course, when we talk about debugging and diagnosing things that can

go wrong with learning algorithms. We'll give you specific tools to

recognize when over fitting and also when under fitting may be occurring.

But for now, let's talk about the problem of, if we think overfitting is occurring,

what can we do to address it? In the previous examples, we had one or

two dimensional data, so we could just plot the hypothesis and see what was

going on and select the appropriate degree polynomial.

So earlier, for the housing prices example, we could

just plot the hypothesis and you know maybe see that it was fitting this other

very wiggly function that goes all over the place predicting housing prices and

we could then use figure like these to select an appropriate degree polynomial.

So plotting hypothesis could be one way to try to decide what degree polynomial

to use, but that doesn't always work, and in fact more often,

we may have learning problems that, where we just have a lot of features.

And there is not just a matter of selecting

what degree polynomial, and in fact, it, when we have so many features, it also

becomes much harder to plot the data, and becomes much harder to visualize it, to

decide what features to complete or not. So concretely, if we're trying to predict

housing prices, sometimes we can just have a lot of different features, and all

of the features seem, you know, maybe they seem kind of useful, but if we have

a lot of features and very little training data, then overfitting can

become a problem. In order to address overfitting, there

are two main options for things that we can do.

The first option is to try to reduce the number of features.

One thing that we could do is manually look at the list of features and use that

to try to decide which are the more important features, and therefore which

are the features we should keep and which of the features we should throw out.

Later in this class we'll also talk about model selection algorithms which are

algorithms. But automatically deciding which features

to keep and which features to throw out. This idea of reducing the number of

features can work well and can reduce over fitting and when we talk about model

selection we'll go into this in much greater depth.

But a disadvantage is that by throwing away some of the features, it's also

throwing away some of the information you have about the problem.

For example, maybe all of those features are actually useful for predicting the

price of a house, so maybe we don't actually want to throw some of our

information or throw some of our features away.

The second option, which we'll talk in, which we'll talk

about in the next few videos, is regularization.

Here, we're going to keep all the features that we're going to reduce the

magnitude, or the values of the parameters, theta j.

And this method works well, we'll see, when we have a lot of features.

Each of which contributes a little bit to predicting the value of y.

Like we had, like we saw in the housing price prediction example,

where we could have a lot of features, each of which are, you know, somewhat

useful, so maybe we don't want to throw them

away. So this describes the idea of

regularization at a very high level, and, I realize that all of these details

probably don't make sense to you yet. But in the next video we'll start to

formulate exactly how, to apply regularization and exactly what

regularization means. And, then we'll start to figure out how

to use this to make our learning algorithms work well and avoid

overfitting.