
All right, I think this will be an exciting video. In this video, you'll see how to implement gradient descent for your neural network with one hidden layer. I'm going to just give you the equations you need to implement in order to get backpropagation, and gradient descent, working, and then in the video after this one I'll give some more intuition about why these particular equations are the correct equations for computing the gradients you need for your neural network.

So your neural network with a single hidden layer, for now, will have parameters W[1], b[1], W[2], and b[2]. As a reminder, if you have n_x, or alternatively n[0], input features, n[1] hidden units, and n[2] output units (in our examples so far we've just had n[2] = 1), then the matrix W[1] will be n[1] by n[0]; b[1] will be an n[1]-dimensional vector, so you can write it down as an n[1] by 1 matrix, really a column vector; the dimensions of W[2] will be n[2] by n[1]; and the dimension of b[2] will be n[2] by 1, where again, so far we've only seen examples where n[2] is equal to 1, where you have just a single output unit. You also have a cost function for the neural network, and for now I'm just going to assume that you're doing binary classification. In that case, the cost J(W[1], b[1], W[2], b[2]) is going to be 1/m times the sum over your m training examples of the loss, the average of that loss function, where L is the loss when your neural network predicts y-hat (this is really A[2]) and the ground-truth label is equal to y. If you're doing binary classification, the loss function can be exactly what you used for logistic regression earlier. So, to train the parameters of your

algorithm, you need to perform gradient descent. When training a neural network, it is important to initialize the parameters randomly, rather than to all zeros.
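One common way to do that random initialization looks like this sketch (the 0.01 scaling factor and the helper name are my assumptions, not from the video; the biases can safely start at zero):

```python
import numpy as np

def initialize_parameters(n0, n1, n2):
    """Randomly initialize the weights with small values; biases start at zero.
    n0, n1, n2 are the input, hidden, and output layer sizes."""
    W1 = np.random.randn(n1, n0) * 0.01  # shape (n[1], n[0])
    b1 = np.zeros((n1, 1))               # shape (n[1], 1)
    W2 = np.random.randn(n2, n1) * 0.01  # shape (n[2], n[1])
    b2 = np.zeros((n2, 1))               # shape (n[2], 1)
    return W1, b1, W2, b2
```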

We'll see later why that's the case. After initializing the parameters to something, each loop of gradient descent would compute the predictions, so you basically compute y-hat(i) for i = 1 through m, say. Then you need to compute the derivatives: dW[1], which is the derivative of the cost function with respect to the parameter W[1]; another variable, which we're going to call db[1], which is the derivative, or the slope, of your cost function with respect to the variable b[1]; and so on, similarly for the other parameters W[2] and b[2]. Then, finally, the gradient descent update would be to update W[1] as W[1] minus alpha (the learning rate) times dW[1], and b[1] gets updated as b[1] minus the learning rate times db[1], and similarly for W[2] and b[2]. Sometimes I use := and sometimes =; either notation works fine. So this would be one iteration of gradient descent, and then you repeat this some number of times until your parameters look like they're converging. So, in previous videos,

we talked about how to compute the predictions and how to compute the outputs, and we saw how to do that in a vectorized way as well. So the key is to know how to compute these partial derivative terms, dW[1] and db[1], as well as the derivatives dW[2] and db[2]. What I'd like to do is just give you the equations you need in order to compute these derivatives, and I'll defer to the next video, which is an optional video, to go into greater depth about how we came up with those formulas.
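The update step described earlier (W := W - alpha * dW, and likewise for b) can be sketched as a small helper; the dictionary layout and the function name here are my own assumptions for illustration:

```python
import numpy as np

def update_parameters(parameters, grads, alpha):
    """One gradient descent step: W := W - alpha * dW, and likewise for b."""
    return {key: parameters[key] - alpha * grads["d" + key]
            for key in ("W1", "b1", "W2", "b2")}
```

Each iteration of gradient descent computes the predictions, computes the gradients, and then applies an update like this; the loop repeats until the parameters look like they are converging.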

So, just to summarize again, the equations for forward propagation: you have Z[1] = W[1]X + b[1], and then A[1] = g[1](Z[1]), the activation function for that layer applied element-wise to Z[1]. Then Z[2] = W[2]A[1] + b[2], and then finally, with everything vectorized across your training set, A[2] = g[2](Z[2]). For now, if we assume you're doing binary classification, then this output activation function really should be the sigmoid function, so I'll just throw that in here.
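Those four forward propagation equations can be sketched in NumPy like this (using tanh for the hidden layer is my assumption; the video leaves g[1] generic):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, W1, b1, W2, b2):
    """Vectorized forward pass; X stacks the m examples as columns, shape (n[0], m)."""
    Z1 = W1 @ X + b1   # (n[1], m)
    A1 = np.tanh(Z1)   # (n[1], m), hidden activation g[1] (tanh assumed here)
    Z2 = W2 @ A1 + b2  # (n[2], m)
    A2 = sigmoid(Z2)   # (n[2], m), sigmoid output for binary classification
    return Z1, A1, Z2, A2
```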

So that's the forward propagation, or the left-to-right forward computation, for your neural network. Next, let's compute the derivatives; this is the backpropagation step. It computes dZ[2] = A[2] - Y, where Y is the ground truth. Just as a reminder, all of this is vectorized across examples, so the matrix Y is a 1 by m matrix that lists all of your m examples horizontally. Then it turns out that dW[2] = (1/m) dZ[2] A[1]ᵀ. In fact, these first three equations are very similar to gradient descent for logistic regression.
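The third of these equations uses np.sum with keepdims=True; the difference keepdims makes is easy to see on a small made-up array:

```python
import numpy as np

dZ = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]])  # pretend shape (n, m) = (2, 3)

s_rank1 = np.sum(dZ, axis=1)                 # rank-1 array, shape (2,)
s_col   = np.sum(dZ, axis=1, keepdims=True)  # column vector, shape (2, 1)

print(s_rank1.shape)  # (2,)
print(s_col.shape)    # (2, 1)
```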

The third equation is db[2] = (1/m) np.sum(dZ[2], axis=1, keepdims=True). Just a little detail: np.sum is a Python NumPy command for summing across one dimension of a matrix, in this case summing horizontally. What keepdims does is prevent np.sum from outputting one of those funny rank-1 arrays, where the dimension is (n,); by having keepdims=True, this ensures that Python outputs, for db[2], a vector that is n by 1. Technically this will be, I guess, n[2] by 1, which in this case is just a 1 by 1 number, so maybe it doesn't matter, but later on we'll see when it really matters.

So far, what we've done is very similar to logistic regression. But now, as you continue to run backpropagation, you would compute dZ[1] = W[2]ᵀ dZ[2] * g[1]'(Z[1]). This quantity g[1]' is the derivative of whatever activation function you used for the hidden layer; for the output layer, I assumed that you're doing binary classification with the sigmoid function, and that's already baked into the formula for dZ[2]. The * here is an element-wise product: W[2]ᵀ dZ[2] is going to be an n[1] by m matrix, and the element-wise derivative term g[1]'(Z[1]) is also going to be an n[1] by m matrix, so this * is an element-wise product of two matrices. Then, finally, dW[1] = (1/m) dZ[1] Xᵀ, and db[1] = (1/m) np.sum(dZ[1], axis=1, keepdims=True). Whereas previously keepdims maybe mattered less, since with n[2] equal to 1 db[2] is just a 1 by 1 thing, a real number, here db[1] will be an n[1] by 1 vector, and so you want np.sum to output something of this dimension, rather than a funny rank-1 array of that dimension, which could end up messing up some of your later calculations. The other way would be to not use the keepdims parameter, but to explicitly call reshape to reshape the output of np.sum into the dimension you would like db to have.
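The six backpropagation equations can be put together in a sketch like this (again assuming tanh for the hidden layer, so that g[1]'(z) = 1 - tanh(z)^2; the function name and argument layout are my assumptions):

```python
import numpy as np

def backward_propagation(X, Y, W2, Z1, A1, A2):
    """Vectorized backprop for one hidden layer; the m examples are columns."""
    m = X.shape[1]
    dZ2 = A2 - Y                                          # (n[2], m)
    dW2 = (1.0 / m) * dZ2 @ A1.T                          # (n[2], n[1])
    db2 = (1.0 / m) * np.sum(dZ2, axis=1, keepdims=True)  # (n[2], 1)
    dZ1 = (W2.T @ dZ2) * (1.0 - np.tanh(Z1) ** 2)         # (n[1], m); * is element-wise
    dW1 = (1.0 / m) * dZ1 @ X.T                           # (n[1], n[0])
    db1 = (1.0 / m) * np.sum(dZ1, axis=1, keepdims=True)  # (n[1], 1)
    return dW1, db1, dW2, db2
```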

So that was forward propagation in, I guess, four equations, and backpropagation in, I guess, six equations. I know I just wrote down these equations, but in the next optional video, let's go over some intuitions for how the six equations of the backpropagation algorithm were derived. Please feel free to watch that or not; either way, if you implement these algorithms, you will have a correct implementation of forward prop and backprop, and you'll be able to compute the derivatives you need in order to apply gradient descent to learn the parameters of your neural network. It is possible to implement this algorithm and get it to work without deeply understanding the calculus; a lot of successful deep learning practitioners do so. But if you want, you can also watch the next video just to get a bit more intuition about the derivation of these equations.
