0:09

Concretely, here's the gradient descent update rule.

And what I want to do in this video is tell you

about what I think of as debugging, and some tips for

making sure that gradient descent is working correctly.

And second, I wanna tell you how to choose the learning rate alpha or

at least how I go about choosing it.

Here's something that I often do to make sure that gradient descent is working

correctly.

The job of gradient descent is to find the value of theta for

you that hopefully minimizes the cost function J(theta).

What I often do is therefore plot the cost function J(theta) as

gradient descent runs.

So the x axis here is a number of iterations of gradient descent and

as gradient descent runs you hopefully get a plot that maybe looks like this.

0:59

Notice that the x axis is number of iterations.

Previously we where looking at plots of J(theta) where the x axis, where

the horizontal axis, was the parameter vector theta but this is not what this is.

Concretely, what this point is,

is I'm going to run gradient descent for 100 iterations.

And whatever value I get for theta after 100 iterations,

I'm going to get some value of theta after 100 iterations.

And I'm going to evaluate the cost function J(theta).

For the value of theta I get after 100 iterations,

and this vertical height is the value of J(theta).

For the value of theta I got after 100 iterations of gradient descent.

And this point here that corresponds to the value of J(theta) for

the theta that I get after I've run gradient descent for 200 iterations.

1:55

So what this plot is showing is, is it's showing the value of your cost function

after each iteration of gradient decent.

And if gradient is working properly then J(theta)

should decrease after every iteration.

2:17

And one useful thing that this sort of plot can tell you also is that if you look

at the specific figure that I've drawn, it looks like by the time you've gotten

out to maybe 300 iterations, between 300 and 400 iterations,

in this segment it looks like J(theta) hasn't gone down much more.

So by the time you get to 400 iterations,

it looks like this curve has flattened out here.

And so way out here 400 iterations, it looks like gradient descent has more or

less converged because your cost function isn't going down much more.

So looking at this figure can also help you judge whether or

not gradient descent has converged.

2:57

By the way, the number of iterations the gradient descent takes to converge for

a physical application can vary a lot, so maybe for

one application, gradient descent may converge after just thirty iterations.

For a different application, gradient descent may take 3,000 iterations,

for another learning algorithm, it may take 3 million iterations.

It turns out to be very difficult to tell in advance how many iterations gradient

descent needs to converge.

And is usually by plotting this sort of plot, plotting the cost function as we

increase in number in iterations, is usually by looking at these plots.

But I try to tell if gradient descent has converged.

It's also possible to come up with automatic convergence test,

namely to have a algorithm try to tell you if gradient descent has converged.

And here's maybe a pretty typical example of an automatic convergence test.

And such a test may declare convergence if your cost function J(theta)

decreases by less than some small value epsilon,

some small value 10 to the minus 3 in one iteration.

But I find that usually choosing what this threshold is is pretty difficult.

And so in order to check your gradient descent's converge

I actually tend to look at plots like these, like this figure on the left,

rather than rely on an automatic convergence test.

Looking at this sort of figure can also tell you, or give you an advance warning,

if maybe gradient descent is not working correctly.

Concretely, if you plot J(theta) as a function of the number of iterations.

Then if you see a figure like this where J(theta) is actually increasing,

then that gives you a clear sign that gradient descent is not working.

And a theta like this usually means that you should be using learning rate alpha.

4:59

But if your learning rate is too big then if you start off there,

gradient descent may overshoot the minimum and send you there.

And if the learning rate is too big,

you may overshoot again and it sends you there, and so on.

So that, what you really wanted was for it to start here and for

it to slowly go downhill, right?

But if the learning rate is too big,

then gradient descent can instead keep on overshooting the minimum.

So that you actually end up getting worse and

worse instead of getting to higher values of the cost function J(theta).

So you end up with a plot like this and if you see a plot like this,

the fix is usually just to use a smaller value of alpha.

Oh, and also, of course, make sure your code doesn't have a bug of it.

But usually too large a value of alpha could be a common problem.

6:04

I'm not going to prove it here,

but under other assumptions about the cost function J, that does hold true for

linear regression, mathematicians have shown that if your learning rate

alpha is small enough, then J(theta) should decrease on every iteration.

So if this doesn't happen probably means the alpha's too big,

you should set it smaller.

But of course, you also don't want your learning rate to be too small

because if you do that then the gradient descent can be slow to converge.

6:31

And if alpha were too small, you might end up starting out here, say,

and end up taking just minuscule baby steps.

And just taking a lot of iterations before you finally get to the minimum,

and so if alpha is too small, gradient descent can make very slow progress and

be slow to converge.

To summarize, if the learning rate is too small,

you can have a slow convergence problem, and if the learning rate is too large,

J(theta) may not decrease on every iteration and it may not even converge.

In some cases if the learning rate is too large, slow convergence is also possible.

But the more common problem you see is

just that J(theta) may not decrease on every iteration.

And in order to debug all of these things, often plotting that J(theta)

as a function of the number of iterations can help you figure out what's going on.

Concretely, what I actually do when I run gradient descent

is I would try a range of values.

So just try running gradient descent with a range of values for

alpha, like 0.001 and 0.01.

So these are factor of ten differences.

And for these different values of alpha are just plot J(theta) as a function of

number of iterations, and

then pick the value of alpha that seems to be causing J(theta) to decrease rapidly.

In fact, what I do actually isn't these steps of ten.

So this is a scale factor of ten of each step up.

What I actually do is try this range of values.

8:06

And so on, where this is 0.001.

I'll then increase the learning rate threefold to get 0.003.

And then this step up,

this is another roughly threefold increase from 0.003 to 0.01.

And so these are, roughly, trying out gradient descents with each

value I try being about 3x bigger than the previous value.

So what I'll do is try a range of values until I've found one

value that's too small and made sure that I've found one value that's too large.

And then I'll sort of try to pick the largest possible value, or

just something slightly smaller than the largest reasonable value that I found.

And when I do that usually it just gives me a good learning rate for my problem.

And if you do this too, maybe you'll be able to choose a good learning rate for

your implementation of gradient descent.