0:00

In the last video you learned about gradient checking.

In this video, I want to share with you some practical tips or

some notes on how to actually go about implementing this for your neural network.

First, don't use grad check in training, only to debug.

So what I mean is that, computing d theta approx i, for

all the values of i, this is a very slow computation.

So to implement gradient descent, you'd use backprop to compute d theta and

just use backprop to compute the derivative.

And it's only when you're debugging that you would compute this

to make sure it's close to d theta.

But once you've done that, then you would turn off the grad check, and

don't run this during every iteration of gradient descent,

because that's just much too slow.

Second, if an algorithm fails grad check, look at the components,

look at the individual components, and try to identify the bug.

So what I mean by that is if d theta approx is very far from d theta,

what I would do is look at the different values of i to see which are the values of

d theta approx that are really very different than the values of d theta.

So for example, if you find that the values of theta or d theta,

they're very far off, all correspond to dbl for some layer or for

some layers, but the components for dw are quite close, right?

Remember, different components of theta correspond to different components

of b and w.

When you find this is the case, then maybe you find that the bug is in how

you're computing db, the derivative with respect to parameters b.

And similarly, vice versa, if you find that the values that are very far,

the values from d theta approx that are very far from d theta,

you find all those components came from dw or from dw in a certain layer,

then that might help you hone in on the location of the bug.

This doesn't always let you identify the bug right away, but

sometimes it helps you give you some guesses about where to track down the bug.

1:56

Next, when doing grad check,

remember your regularization term if you're using regularization.

So if your cost function is J of theta equals 1 over m sum of your

losses and then plus this regularization term.

And sum over l of wl squared, then this is the definition of J.

And you should have that d theta is gradient of J with

respect to theta, including this regularization term.

So just remember to include that term.

Next, grad check doesn't work with dropout, because in every iteration,

dropout is randomly eliminating different subsets of the hidden units.

There isn't an easy to compute cost function J that dropout is

doing gradient descent on.

It turns out that dropout can be viewed as optimizing some cost function J, but

it's cost function J defined by summing over all exponentially large

subsets of nodes they could eliminate in any iteration.

So the cost function J is very difficult to compute, and

you're just sampling the cost function

every time you eliminate different random subsets in those we use dropout.

So it's difficult to use grad check to double check your

computation with dropouts.

So what I usually do is implement grad check without dropout.

So if you want, you can set keep-prob and dropout to be equal to 1.0.

And then turn on dropout and hope that my implementation of dropout was correct.

3:30

There are some other things you could do, like fix the pattern of nodes dropped and

verify that grad check for that pattern of [INAUDIBLE] is correct, but

in practice I don't usually do that.

So my recommendation is turn off dropout, use grad check to double check that your

algorithm is at least correct without dropout, and then turn on dropout.

Finally, this is a subtlety.

It is not impossible, rarely happens, but it's not impossible that your

implementation of gradient descent is correct when w and b are close to 0, so

at random initialization.

But that as you run gradient descent and w and b become bigger,

maybe your implementation of backprop is correct only when w and b is close to 0,

but it gets more inaccurate when w and b become large.

So one thing you could do, I don't do this very often,

but one thing you could do is run grad check at random initialization and

then train the network for a while so that w and

b have some time to wander away from 0, from your small random initial values.

And then run grad check again after you've trained for some number of iterations.

So that's it for gradient checking.

And congratulations for coming to the end of this week's materials.

In this week, you've learned about how to set up your train, dev, and test sets,

how to analyze bias and variance and what things to do if you have high bias versus

high variance versus maybe high bias and high variance.

You also saw how to apply different forms of regularization,

like L2 regularization and dropout on your neural network.

So some tricks for speeding up the training of your neural network.

And then finally, gradient checking.

So I think you've seen a lot in this week and

you get to exercise a lot of these ideas in this week's programming exercise.

So best of luck with that, and

I look forward to seeing you in the week two materials.