
By now, you've seen a

couple different learning algorithms, linear

regression and logistic regression.

They work well for many problems,

but when you apply them

to certain machine learning applications, they

can run into a problem called

overfitting that can cause them to perform very poorly.

What I'd like to do in

this video is explain to

you what this overfitting

problem is, and in the

next few videos after this,

we'll talk about a technique called

regularization, that will allow

us to ameliorate or to

reduce this overfitting problem and

get these learning algorithms to maybe work much better.

So what is overfitting?

Let's keep using our running

example of predicting housing

prices with linear regression

where we want to predict the

price as a function of the size of the house.

One thing we could do is

fit a linear function to

this data, and if we

do that, maybe we get

that sort of straight line fit to the data.

But this isn't a very good model.

Looking at the data, it seems

pretty clear that as the

size of the house increases, the

housing price plateaus, or kind

of flattens out as we move to the right, and so

this algorithm does not

fit the training data well. We

call this problem underfitting, and

another term for this is

that this algorithm has high bias.

Both of these roughly

mean that it's just not even fitting the training data very well.

The term is kind of

a historical or technical one,

but the idea is that

if we're fitting a straight line to

the data, then it's as

if the algorithm has a

very strong preconception, or a

very strong bias, that housing

prices are going to vary

linearly with their size, despite the data to the contrary.

Despite the evidence to the

contrary, its preconception, or

its bias, still causes

it to fit a straight line,

and this ends up being a poor fit to the data.

Now, in the middle, we could

fit a quadratic function instead, and,

with this data set, if we fit the

quadratic function, maybe we get

that kind of curve

and that works pretty well.

And, at the other extreme, would be if we were to fit, say, a fourth-order polynomial to the data.

So, here we have five parameters,

theta zero through theta four,

and, with that, we can actually fit a curve

that passes through all five of our training examples.

You might get a curve that looks like this.

On one hand, this curve passes through the training data perfectly, but it's a very wiggly curve that goes up and down all over the place, and we don't actually think it's a good model for predicting housing prices. This problem is called overfitting, and another term for it is that the algorithm has high variance.

The term high variance is another

historical or technical one.

But, the intuition is that,

if we're fitting such a high

order polynomial, then, the

hypothesis can fit, you know,

it's almost as if it can

fit almost any function and

this space of possible hypotheses

is just too large, it's too variable.

And we don't have enough data

to constrain it to give

us a good hypothesis so that's called overfitting.

And in the middle, there isn't really

a name, but I'm just going to write, you know, "just right,"

where a second-degree polynomial, a quadratic function,

seems to be just right for fitting this data.
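Here's a minimal sketch in Python of those three fits, using made-up house sizes and prices just for illustration; np.polyfit, which does a least-squares polynomial fit, stands in for our learning algorithm here.

```python
import numpy as np

# Hypothetical training set: house sizes (in 1000s of sq ft) and prices (in $1000s).
# These five points are invented just to illustrate the three fits.
sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
prices = np.array([200.0, 280.0, 330.0, 355.0, 360.0])  # prices flatten out on the right

for degree in (1, 2, 4):
    # np.polyfit returns the polynomial coefficients, from theta_degree down to theta_0.
    theta = np.polyfit(sizes, prices, degree)
    predictions = np.polyval(theta, sizes)
    train_cost = np.mean((predictions - prices) ** 2) / 2  # squared-error cost on the training set
    print(f"degree {degree}: training cost = {train_cost:.4f}")

# Typical behavior: the straight line (degree 1) has a noticeably larger training cost (underfitting),
# the quadratic does well, and the degree-4 polynomial drives the training cost to essentially zero,
# since a five-parameter curve can pass through all five points exactly (overfitting).
```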

To recap a bit, the

problem of overfitting comes

about when we have

too many features; then the

learned hypothesis may fit the training set very well.

So, your cost function

may actually be very close

to zero, or maybe

even exactly zero, but you

may then end up with a

curve like this that, you

know, tries too hard to

fit the training set, so that it

even fails to generalize to

new examples and fails to

predict prices on new examples

as well. And here the

term "generalize" refers to

how well a hypothesis applies even to new examples,

that is, to data, to

houses that it has not seen in the training set.
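To see that failure to generalize numerically, here's a small sketch, again with invented housing numbers: we fit on a handful of training houses and then evaluate the cost on a few held-out houses the fit has never seen.

```python
import numpy as np

# Invented housing data, split into a small training set and a held-out set.
train_x = np.array([1.0, 1.4, 1.8, 2.2, 2.6, 3.0])
train_y = np.array([210.0, 265.0, 310.0, 340.0, 355.0, 362.0])
test_x = np.array([1.2, 2.0, 2.8])
test_y = np.array([240.0, 325.0, 358.0])

def cost(theta, x, y):
    """Average squared-error cost of the polynomial with coefficients theta on (x, y)."""
    return np.mean((np.polyval(theta, x) - y) ** 2) / 2

for degree in (2, 5):
    theta = np.polyfit(train_x, train_y, degree)
    print(f"degree {degree}: train cost = {cost(theta, train_x, train_y):.2f}, "
          f"held-out cost = {cost(theta, test_x, test_y):.2f}")

# The high-degree fit typically has near-zero training cost but a much larger
# held-out cost: it fits the training set very well yet generalizes poorly.
```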

On this slide, we looked at

overfitting for the case of linear regression.

A similar thing can apply to logistic regression as well.

Here is a logistic regression

example with two features x1 and x2.

One thing we could do is

fit logistic regression with

just a simple hypothesis like this,

where, as usual, g is my sigmoid function.

And if you do that, you end up

with a hypothesis, trying to

use, maybe, just a straight

line to separate the positive and the negative examples.

And this hypothesis doesn't look like a very good fit to the data.

So, once again, this

is an example of underfitting

or of the hypothesis having high bias.

In contrast, if you were

to add to your features

these quadratic terms, then,

you could get a decision

boundary that might look more like this.

And, you know, that's a pretty good fit to the data.

Probably, about as

good as we could get, on this training set.

And, finally, at the other

extreme, if you were to

fit a very high-order polynomial, if

you were to generate lots of

high-order polynomial terms as features,

then logistic regression may contort

itself, may try really

hard to find a

decision boundary that fits

your training data or go

to great lengths to contort itself,

to fit every single training example well.

And, you know, if the

features x1 and

x2 are for predicting, say,

whether breast tumors

are malignant or benign,

this really doesn't

look like a very good hypothesis for making predictions.

And so, once again, this is

an instance of overfitting

and of a hypothesis having

high variance,

and being unlikely to generalize well to new examples.
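Here's a rough sketch of that effect using scikit-learn and a made-up two-feature dataset; expanding x1 and x2 into higher and higher order polynomial terms lets logistic regression fit the training set more and more tightly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up binary classification data with two features x1, x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # roughly circular class boundary
y[rng.choice(60, size=6, replace=False)] ^= 1        # flip a few labels as noise

for degree in (1, 2, 8):
    # degree 1: straight-line boundary (can underfit);
    # degree 2: quadratic terms, roughly the "just right" boundary for this data;
    # degree 8: many high-order terms, free to contort around individual examples.
    model = make_pipeline(PolynomialFeatures(degree), LogisticRegression(C=1e4, max_iter=10000))
    model.fit(X, y)
    print(f"degree {degree}: training accuracy = {model.score(X, y):.2f}")

# A large C weakens scikit-learn's built-in regularization, so the effect of the
# extra polynomial terms on the training fit is easier to see.
```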

Later, in this course, when we

talk about debugging and diagnosing

things that can go wrong with

learning algorithms, we'll give you

specific tools to recognize

when overfitting and, also,

when underfitting may be occurring.

But, for now, let's talk about

the problem of, if we

think overfitting is occurring,

what can we do to address it?

In the previous examples, we had

one- or two-dimensional data, so

we could just plot the hypothesis and see what was going

on and select the appropriate degree polynomial.

So, earlier for the housing

prices example, we could just

plot the hypothesis and, you

know, maybe see that it

was fitting the sort of

very wiggly function that goes all over the place to predict housing prices.

And we could then use figures

like these to select an appropriate degree polynomial.

So plotting the hypothesis could

be one way to try to

decide what degree polynomial to use.
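For the low-dimensional case, a plotting sketch like this one (same invented housing numbers as in the earlier sketches, drawn with matplotlib) is enough to eyeball whether a given degree gives a smooth curve or a wiggly one that goes all over the place.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented house sizes and prices, as before.
sizes = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
prices = np.array([200.0, 280.0, 330.0, 355.0, 360.0])
grid = np.linspace(0.9, 3.1, 200)  # dense grid of sizes for drawing each hypothesis

plt.scatter(sizes, prices, color="black", label="training examples")
for degree in (1, 2, 4):
    theta = np.polyfit(sizes, prices, degree)
    plt.plot(grid, np.polyval(theta, grid), label=f"degree {degree}")
plt.xlabel("size (1000s of sq ft)")
plt.ylabel("price ($1000s)")
plt.legend()
plt.show()  # the degree-4 curve is visibly more wiggly between the training points
```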

But that doesn't always work.

And, in fact, more often we

may have learning problems where we just have a lot of features.

And then it is not

just a matter of selecting what degree polynomial to use.

And, in fact, when we

have so many features, it also

becomes much harder to plot

the data and it becomes

much harder to visualize it,

to decide what features to keep or not.

So concretely, if we're trying

to predict housing prices, sometimes we can just have a lot of different features.

And all of these features seem, you know, maybe they seem kind of useful.

But, if we have a

lot of features and very little

training data, then

overfitting can become a problem.

In order to address

overfitting, there are two

main options for things that we can do.

The first option is to try

to reduce the number of features.

Concretely, one thing we

could do is manually look through

the list of features, and, use

that to try to decide which

are the more important features, and, therefore,

which are the features we should

keep, and, which are the features we should throw out.

Later in this course, we'll also

talk about model selection algorithms,

which are algorithms for automatically

deciding which features

to keep and, which features to throw out.

This idea of reducing the

number of features can work

well and can reduce overfitting.

And, when we talk about model

selection, we'll go into this in much greater depth.
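As one sketch of this first option, here's what an automatic feature selection step might look like with scikit-learn; the feature names and data are hypothetical, and this simple univariate scoring is just a stand-in for the model selection algorithms we'll cover later.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Hypothetical housing features; only a few of the columns actually matter.
feature_names = ["size", "num_bedrooms", "age", "lot_size", "distance_to_school", "noise_level"]
rng = np.random.default_rng(1)
X = rng.normal(size=(40, len(feature_names)))
y = 300 + 80 * X[:, 0] + 25 * X[:, 1] - 15 * X[:, 2] + rng.normal(scale=10, size=40)

# Keep the 3 features that score highest under a simple univariate regression test.
selector = SelectKBest(score_func=f_regression, k=3).fit(X, y)
kept = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print("features kept:", kept)

# Throwing the other columns away reduces the risk of overfitting, but it also discards
# whatever information those features carried about house prices.
```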

But, the disadvantage is that, by

throwing away some of the

features, you are also throwing

away some of the information you have about the problem.

For example, maybe, all of

those features are actually useful

for predicting the price of a

house, so, maybe, we don't actually

want to throw away some of

our information, or throw some of our features away.