0:00

When you implement back propagation you'll find that there's a test called gradient checking that can really help you make sure that your implementation of back prop is correct. Because sometimes you write all these equations and you're just not 100% sure if you've got all the details of back propagation right. So in order to build up to gradient checking, let's first talk about how to numerically approximate computations of gradients, and in the next video, we'll talk about how you can implement gradient checking to make sure the implementation of back prop is correct.

So let's take the function f and replot it here, and remember this is f of theta equals theta cubed, and let's again start off with some value of theta. Let's say theta equals 1. Now instead of just nudging theta to the right to get theta plus epsilon, we're going to nudge it to the right and nudge it to the left, to get theta minus epsilon as well as theta plus epsilon. So this is 1, this is 1.01, this is 0.99, where, again, epsilon is the same as before, it is 0.01.

It turns out that rather than taking this little triangle and computing the height over the width, you can get a much better estimate of the gradient if you take this point, f of theta minus epsilon, and this point, f of theta plus epsilon, and you instead compute the height over the width of this bigger triangle. So for technical reasons which I won't go into, the height over width of this bigger green triangle gives you a much better approximation to the derivative at theta.

And you can see for yourself that taking this bigger green triangle is as if you have two smaller triangles, right? This one on the upper right and this one on the lower left. And you're kind of taking both of them into account by using this bigger green triangle. So rather than a one-sided difference, you're taking a two-sided difference.

So let's work out the math. This point here is f of theta plus epsilon. This point here is f of theta minus epsilon. So the height of this big green triangle is f of theta plus epsilon minus f of theta minus epsilon. And then the width, well, this half is epsilon and the whole thing is 2 epsilon. So the width of this green triangle is 2 epsilon. So the height over the width is going to be first the height, so that's f of theta plus epsilon minus f of theta minus epsilon, divided by the width, which is 2 epsilon, which we write down here.
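In code, the height over width of the bigger green triangle can be sketched as follows. This is a minimal illustration, not from the lecture; the function name is our own choice.

```python
def two_sided_diff(f, theta, epsilon=0.01):
    """Approximate f'(theta) as the height of the big triangle
    over its width: (f(theta+eps) - f(theta-eps)) / (2*eps)."""
    return (f(theta + epsilon) - f(theta - epsilon)) / (2 * epsilon)
```

For f of theta equals theta cubed at theta equals 1, this returns a value very close to the true derivative, 3.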

Â 2:38

And this should hopefully be close to g of theta.

So plugging in the values, remember f of theta is theta cubed. So theta plus epsilon is 1.01, so I take the cube of that, minus the cube of 0.99, divided by 2 times 0.01. Feel free to pause the video and check this on a calculator. You should get that this is 3.0001. Whereas from the previous slide, we saw that g of theta, which was 3 theta squared, is equal to 3 when theta is 1, so these two values are actually very close to each other.

The approximation error is now 0.0001. Whereas on the previous slide, when we'd taken the one-sided difference, just from theta to theta plus epsilon, we had gotten 3.0301, and so the approximation error was 0.03 rather than 0.0001.
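The worked example above can be checked directly in a few lines of Python. This is our own sketch of the calculation from the slide, with the true derivative g of theta equals 3 theta squared for comparison.

```python
f = lambda theta: theta ** 3   # the function from the slide
theta, eps = 1.0, 0.01

two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)  # ≈ 3.0001
one_sided = (f(theta + eps) - f(theta)) / eps              # ≈ 3.0301
g = 3 * theta ** 2                                         # true derivative: 3.0

print(abs(two_sided - g))  # approximation error ≈ 0.0001
print(abs(one_sided - g))  # approximation error ≈ 0.0301
```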

So with this two-sided difference way of approximating the derivative, you find that this is extremely close to 3. And so this gives you a much greater confidence that g of theta is probably a correct implementation of the derivative of f.

Â 3:58

When you use this method for gradient checking in back propagation, it turns out to run twice as slow as if you were to use a one-sided difference. It turns out that in practice, I think it's worth it to use this method because it's just much more accurate.

A little bit of optional theory for those of you that are a little bit more familiar with calculus, and it's okay if you don't get what I'm about to say here. It turns out that for very small values of epsilon, the derivative is approximately f of theta plus epsilon minus f of theta minus epsilon over 2 epsilon, and the formal definition of the derivative is the limit of exactly that formula on the right as epsilon goes to 0. The definition of a limit is something you learned if you took a calculus class, but I won't go into that here.

And it turns out that for a nonzero value of epsilon, you can show that the error of this approximation is on the order of epsilon squared, and remember epsilon is a very small number. So if epsilon is 0.01, which it is here, then epsilon squared is 0.0001. The big O notation means the error is actually some constant times this, but in this case that is exactly our approximation error, so the big O constant happens to be 1.

Whereas in contrast, if we were to use the other formula, the one-sided difference, then the error is on the order of epsilon. And again, when epsilon is a number less than 1, epsilon is actually much bigger than epsilon squared, which is why that formula is a much less accurate approximation than this formula on the left. Which is why, when doing gradient checking, we'd rather use this two-sided difference, where you compute f of theta plus epsilon minus f of theta minus epsilon and then divide by 2 epsilon, rather than the one-sided difference, which is less accurate.
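You can see this O(epsilon squared) versus O(epsilon) behavior numerically by shrinking epsilon and watching how fast each error goes to zero. This is a small sketch of our own, again using f of theta equals theta cubed at theta equals 1, where the true derivative is 3.

```python
f = lambda t: t ** 3  # same example function, true derivative at 1 is 3

for eps in (1e-1, 1e-2, 1e-3):
    two = (f(1 + eps) - f(1 - eps)) / (2 * eps)  # two-sided difference
    one = (f(1 + eps) - f(1)) / eps              # one-sided difference
    # two-sided error shrinks like eps**2; one-sided only like eps
    print(eps, abs(two - 3), abs(one - 3))
```

Each time epsilon shrinks by 10x, the two-sided error shrinks by roughly 100x, while the one-sided error shrinks by only about 10x.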

Â 6:13

So you've seen how by taking a two-sided difference, you can numerically verify whether or not a function g, g of theta, that someone else gives you is a correct implementation of the derivative of a function f. Let's now see how we can use this to verify whether or not your back propagation implementation is correct, or if there might be a bug in there that you need to go and tease out.