
All right, I think this will be an exciting video. In this video, you'll see how to implement gradient descent for your neural network with one hidden layer. I'm going to just give you the equations you need to implement in order to get back propagation, that is, gradient descent, working, and then in the video after this one I'll give some more intuition about why these particular equations are the correct equations for computing the gradients you need for your neural network.

So, your neural network with a single hidden layer, for now, will have parameters W[1], b[1], W[2] and b[2]. As a reminder, if you have n_x, or alternatively n[0], input features, n[1] hidden units, and n[2] output units (in our examples so far we've only had n[2] equals 1), then the matrix W[1] will be n[1] by n[0]; b[1] will be an n[1]-dimensional vector, so you can write it as an n[1] by 1 matrix, really a column vector. The dimensions of W[2] will be n[2] by n[1], and the dimension of b[2] will be n[2] by 1. Again, so far we've only seen examples where n[2] is equal to 1, where you have just a single output unit.
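As a quick illustration, here is a minimal sketch of how those shapes might be set up in numpy. The small random initialization (and the 0.01 scale for the weights) is an assumption about how you might initialize, not something specified in this video.

```python
import numpy as np

def initialize_parameters(n0, n1, n2):
    """Create parameters with the shapes described above.
    n0: input features, n1: hidden units, n2: output units."""
    W1 = np.random.randn(n1, n0) * 0.01   # (n1, n0), small random values
    b1 = np.zeros((n1, 1))                # (n1, 1) column vector
    W2 = np.random.randn(n2, n1) * 0.01   # (n2, n1)
    b2 = np.zeros((n2, 1))                # (n2, 1)
    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}
```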

You also have a cost function for your neural network. For now, I'm just going to assume that you're doing binary classification. In that case, the cost, as a function of your parameters, is 1 over m times the sum over your m training examples of the loss function, where L is the loss when your neural network predicts y hat (which is really a[2]) and the ground truth label is equal to y. If you're doing binary classification, the loss function can be exactly what you used for logistic regression earlier.
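As a rough sketch, assuming the logistic regression (cross-entropy) loss just mentioned, the cost could be computed in a vectorized way like this; the names A2 and Y, both of shape 1 by m, are just illustrative.

```python
import numpy as np

def compute_cost(A2, Y):
    """Cross-entropy cost, averaged over m examples.
    A2: predictions y-hat, shape (1, m); Y: labels, shape (1, m)."""
    m = Y.shape[1]
    cost = -np.sum(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)) / m
    return cost
```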

So, to train the parameters of your algorithm, you need to perform gradient descent. When training a neural network, it is important to initialize the parameters randomly, rather than to all zeros; we'll see later why that's the case. After initializing the parameters to something, each loop of gradient descent would compute the predictions, so you basically compute y hat (i) for i equals 1 through m, say. Then you need to compute the derivatives: you need to compute dW[1], which is the derivative of the cost function with respect to the parameter W[1]; you need to compute another variable, which we're going to call db[1], which is the derivative, or the slope, of your cost function with respect to the variable b[1]; and so on, similarly for the other parameters W[2] and b[2]. Then, finally, the gradient descent update would be to update W[1] as W[1] minus alpha, the learning rate, times dW[1]; b[1] gets updated as b[1] minus the learning rate times db[1]; and similarly for W[2] and b[2]. Sometimes I use colon equals and sometimes equals; either notation works fine. So this would be one iteration of gradient descent, and then you repeat this some number of times until your parameters look like they're converging.
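One iteration of that loop might look something like the following sketch; forward_propagation and backward_propagation are placeholders for the computations described below (they are sketched later in this section), and alpha is the learning rate.

```python
def gradient_descent_step(parameters, X, Y, alpha=0.01):
    """One iteration: predict, compute gradients, update parameters."""
    A2, cache = forward_propagation(X, parameters)        # compute y-hat for all m examples
    grads = backward_propagation(parameters, cache, X, Y)  # compute dW1, db1, dW2, db2
    for key in ("W1", "b1", "W2", "b2"):
        parameters[key] = parameters[key] - alpha * grads["d" + key]
    return parameters
```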

In previous videos, we talked about how to compute the predictions, how to compute the outputs, and we saw how to do that in a vectorized way as well. So the key is knowing how to compute these partial derivative terms: dW[1] and db[1], as well as the derivatives dW[2] and db[2]. What I'd like to do is just give you the equations you need in order to compute these derivatives, and I'll defer to the next video, which is an optional video, to go into greater depth about how we came up with those formulas.

So, to summarize again, the equations for forward propagation: you have Z[1] = W[1]X + b[1], and then A[1] = g[1](Z[1]), the activation function in that layer applied element-wise to Z[1]; then Z[2] = W[2]A[1] + b[2]; and then finally, and this is all vectorized across your training set, A[2] = g[2](Z[2]). For now, if we assume you're doing binary classification, then this activation function really should be the sigmoid function, so I'll just throw that in here. So that's the forward propagation, or the left-to-right forward computation, for your neural network.
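Here is a minimal vectorized sketch of those four forward propagation equations, assuming X has shape (n[0], m), a tanh hidden activation (the hidden activation isn't fixed by this video, so that choice is an assumption), and a sigmoid output:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_propagation(X, parameters):
    """X: (n0, m). Returns A2 (predictions) plus cached values for backprop."""
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    Z1 = W1 @ X + b1          # (n1, m)
    A1 = np.tanh(Z1)          # hidden activation g[1]; tanh is an assumed choice
    Z2 = W2 @ A1 + b2         # (n2, m)
    A2 = sigmoid(Z2)          # output activation g[2] = sigmoid for binary classification
    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}
    return A2, cache
```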

Next, let's compute the derivatives; this is the back propagation step. You compute dZ[2] = A[2] minus the ground truth Y. Just as a reminder, all of this is vectorized across examples, so the matrix Y is a 1 by m matrix that lists all of your m examples horizontally. Then it turns out dW[2] = 1/m dZ[2] A[1] transpose, and db[2] = 1/m np.sum(dZ[2], axis=1, keepdims=True). In fact, these first three equations are very similar to gradient descent for logistic regression. Just a little detail: np.sum is the Python numpy command for summing across one dimension of a matrix, in this case summing horizontally, and what keepdims does is prevent Python from outputting one of those funny rank 1 arrays, where the dimensions are (n,). By setting keepdims=True, this ensures that Python outputs, for db[2], a vector that is n[2] by 1. Technically this will be, I guess, n[2] by 1; in this case that's just a 1 by 1 number, so maybe it doesn't matter, but later on we'll see when it really matters.
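As a small illustration of that keepdims detail (a toy example, not from the video itself): summing a matrix along axis 1 without keepdims gives a rank 1 array, while keepdims=True preserves the column-vector shape you want for db.

```python
import numpy as np

dZ2 = np.random.randn(4, 5)                  # pretend (n2, m) = (4, 5)
s1 = np.sum(dZ2, axis=1)                     # shape (4,)   -- funny rank 1 array
s2 = np.sum(dZ2, axis=1, keepdims=True)      # shape (4, 1) -- proper column vector
print(s1.shape, s2.shape)                    # (4,) (4, 1)
```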

So far, what we've done is very similar to logistic regression, but as you continue to run back propagation, you would compute dZ[1] = W[2] transpose dZ[2], times, element-wise, g[1] prime of Z[1]. This quantity g[1] prime is the derivative of whatever activation function you used for the hidden layer; for the output layer, I assume that you're doing binary classification with the sigmoid function, so that's already baked into the formula for dZ[2]. This times is an element-wise product: W[2] transpose dZ[2] is going to be an n[1] by m matrix, and this element-wise derivative term g[1] prime of Z[1] is also going to be an n[1] by m matrix, so this times is an element-wise product of two matrices. Then, finally, dW[1] = 1/m dZ[1] X transpose, and db[1] = 1/m np.sum(dZ[1], axis=1, keepdims=True).

Whereas previously keepdims maybe mattered less, when n[2] is equal to 1 so db[2] was just a 1 by 1 thing, a real number, here db[1] will be an n[1] by 1 vector, and so you want np.sum to output something of this dimension, rather than a funny rank 1 array of that dimension, which could end up messing up some of your later calculations. The other way would be to not use the keepdims parameter, but to explicitly call reshape to reshape the output of np.sum into this dimension, which is the shape you want for db.
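Putting the six back propagation equations together, a vectorized sketch might look like this. It assumes the tanh hidden activation from the forward propagation sketch above, so that g[1] prime of Z[1] is 1 minus A1 squared; that activation choice is an assumption rather than something fixed by this video.

```python
import numpy as np

def backward_propagation(parameters, cache, X, Y):
    """X: (n0, m), Y: (1, m). Returns gradients dW1, db1, dW2, db2."""
    m = X.shape[1]
    W2 = parameters["W2"]
    A1, A2 = cache["A1"], cache["A2"]

    dZ2 = A2 - Y                                          # (n2, m)
    dW2 = (dZ2 @ A1.T) / m                                # (n2, n1)
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m          # (n2, 1)
    dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)                    # element-wise; g1'(Z1) = 1 - tanh(Z1)^2
    dW1 = (dZ1 @ X.T) / m                                 # (n1, n0)
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m          # (n1, 1)
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
```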

So, that was forward propagation in, I guess, four equations, and back propagation in, I guess, six equations. I know I just wrote down these equations, but in the next optional video, let's go over some intuitions for how the six equations of the back propagation algorithm were derived. Please feel free to watch that or not, but either way, if you implement these algorithms, you will have a correct implementation of forward prop and backprop, and you'll be able to compute the derivatives you need in order to apply gradient descent to learn the parameters of your neural network. It is possible to implement this algorithm and get it to work without deeply understanding the calculus; a lot of successful deep learning practitioners do so. But if you want, you can also watch the next video, just to get a bit more intuition about the derivation of these equations.