0:00

In a previous video you saw the basic building blocks of implementing a deep neural network: a forward propagation step for each layer, and a corresponding backward propagation step. Let's see how you can actually implement these steps. We'll start with forward propagation. Recall that what this will do is take as input a[l-1] and output a[l], along with the cache z[l]. And we just said that, from an implementational point of view, maybe we'll cache W[l] and b[l] as well, just to make the function calls a bit easier in the programming exercises. The equations for this should already look familiar. The way to implement the forward function is z[l] = W[l] a[l-1] + b[l], and then a[l] = g[l](z[l]), the activation function applied to z. If you want a vectorized implementation, then it's Z[l] = W[l] A[l-1] + b[l], where adding b[l] relies on Python broadcasting, and A[l] = g[l](Z[l]), applied element-wise to Z. And remember, in the diagram for the forward step we have this chain of boxes going forward, so you initialize it by feeding in a[0], which is equal to x. So the input to the first box is really a[0], the input features: either x for one training example, if you're doing one example at a time, or A[0] if you're processing the entire training set. That's the input to the first forward function in the chain, and then just repeating this step allows you to compute forward propagation from left to right.
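As a concrete illustration, here is one way the forward step might look in NumPy. The function and helper names (`linear_activation_forward`, `relu`, `sigmoid`) are my own for this sketch, not from the lecture:

```python
import numpy as np

def relu(z):
    # ReLU activation, applied element-wise
    return np.maximum(0, z)

def sigmoid(z):
    # Sigmoid activation, applied element-wise
    return 1 / (1 + np.exp(-z))

def linear_activation_forward(A_prev, W, b, activation):
    """One forward step: Z = W A_prev + b, then A = g(Z).

    A_prev has shape (n_prev, m) for m training examples, so adding
    b, of shape (n, 1), relies on NumPy broadcasting.
    """
    Z = W @ A_prev + b
    A = activation(Z)
    # Cache Z (needed for backprop) along with W, b, and A_prev
    cache = (A_prev, W, b, Z)
    return A, cache
```

Note that the cache stores A_prev too, since the backward step will need it to compute dW.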

Next, let's talk about the backward propagation step. Here your goal is to take as input da[l] and output da[l-1], dW[l], and db[l]. Let me just write out the steps you need to compute these things: dz[l] is equal to da[l], element-wise product with g[l]'(z[l]); then compute the derivatives dW[l] = dz[l] a[l-1]^T (I didn't explicitly put a[l-1] in the cache, but it turns out you need this as well); then db[l] = dz[l]; and finally da[l-1] = W[l]^T dz[l]. I don't want to go through the detailed derivation, but it turns out that if you take this definition of da and plug it in here, then you get the same formula as we had previously for how to compute dz[l] as a function of the previous dz[l+1]. In fact, if I just plug that in, you end up with dz[l] = W[l+1]^T dz[l+1] * g[l]'(z[l]). I know this looks like a lot of algebra, but you can double-check for yourself that this is the equation we wrote down for backpropagation last week, when we were doing a neural network with just a single hidden layer. And as a reminder, this * denotes the element-wise product. So all you need is those four equations to implement your backward function.
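Those four equations, for a single training example, might be sketched in code like this; the names here are illustrative, and `g_prime` is assumed to compute the derivative of the layer's activation:

```python
import numpy as np

def linear_activation_backward(dA, cache, g_prime):
    """One backward step for a single training example.

    cache holds (a_prev, W, b, z) saved during the forward pass;
    g_prime computes the activation's derivative element-wise.
    All vectors are column vectors of shape (n, 1).
    """
    a_prev, W, b, z = cache
    dz = dA * g_prime(z)          # dz[l] = da[l] * g[l]'(z[l])
    dW = dz @ a_prev.T            # dW[l] = dz[l] a[l-1]^T
    db = dz                       # db[l] = dz[l]
    dA_prev = W.T @ dz            # da[l-1] = W[l]^T dz[l]
    return dA_prev, dW, db
```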

Finally, let me just write out the vectorized version. The first line becomes dZ[l] = dA[l], element-wise product with g[l]'(Z[l]); no surprise there. dW[l] becomes (1/m) dZ[l] A[l-1]^T, and db[l] becomes (1/m) np.sum(dZ[l], axis=1, keepdims=True); we talked about the use of np.sum in the previous week to compute db. And finally, dA[l-1] = W[l]^T dZ[l]. So this allows you to take as input the quantity dA[l] and output dW[l] and db[l], the derivatives you need, as well as dA[l-1]. That's how you implement the backward function.
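The vectorized backward step could be sketched as follows, again with illustrative names, where the cache holds what the forward pass stored and m is the number of training examples:

```python
import numpy as np

def linear_activation_backward_vec(dA, cache, g_prime):
    """Vectorized backward step over m examples (columns of A_prev)."""
    A_prev, W, b, Z = cache
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                              # element-wise product
    dW = (1 / m) * dZ @ A_prev.T                      # averaged over the m examples
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)  # keepdims keeps shape (n, 1)
    dA_prev = W.T @ dZ
    return dA_prev, dW, db
```

The `keepdims=True` ensures db stays a column vector rather than collapsing to a rank-1 array, so it broadcasts correctly during the parameter update.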

So just to summarize: take the input x. The first layer might have a ReLU activation function, the second layer might use another ReLU activation function, and the third layer might have a sigmoid activation function if you're doing binary classification; this outputs y-hat. Then, using y-hat, you can compute the loss, and this allows you to start your backward iteration (I'll draw the arrows first, so I don't have to change pens too much). You then have backprop compute the derivatives: dW[3], db[3], dW[2], db[2], dW[1], db[1]. Along the way, the cache transfers z[1], z[2], z[3], and you pass backward dA[2] and dA[1]. This could also compute dA[0], but we won't use that, so you can just discard it. And so this is how you implement forward prop and backprop for a three-layer neural network.
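The three-layer forward chain just described could be wired up as a loop like this. It's a sketch with hypothetical names, assuming ReLU for the two hidden layers and sigmoid for the output layer:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def three_layer_forward(X, params):
    """Forward pass for a 3-layer net: ReLU, ReLU, then sigmoid.

    params maps 'W1', 'b1', ..., 'b3' to the layer parameters;
    X holds one training example per column. Returns y_hat and the
    per-layer caches needed for backprop.
    """
    activations = [relu, relu, sigmoid]
    A = X  # A[0] = X initializes the forward chain
    caches = []
    for l, g in enumerate(activations, start=1):
        W, b = params[f"W{l}"], params[f"b{l}"]
        Z = W @ A + b
        caches.append((A, W, b, Z))  # cache A_prev, W, b, Z for backprop
        A = g(Z)
    return A, caches
```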

Now, there's just one last detail I didn't talk about, which is this: for the forward recursion, we would initialize it with the input data x. How about the backward recursion? Well, it turns out that da[L], when you're using logistic regression, when you're doing binary classification, is equal to -y/a + (1-y)/(1-a). It turns out that the derivative of the loss function with respect to the output prediction y-hat can be shown to be equal to this. If you're familiar with calculus and you take the loss function L and differentiate with respect to y-hat, that is, with respect to a, you can show that you get that formula. So this is the formula you should use for da for the final layer, capital L. And of course, if you have a vectorized implementation, then you initialize the backward recursion not with this but with the capital-A version for layer L, which stacks the same quantity for the different examples: -y(1)/a(1) + (1-y(1))/(1-a(1)) for the first training example, and so on down to the m-th training example, -y(m)/a(m) + (1-y(m))/(1-a(m)). So that's how you initialize the vectorized version of backpropagation. So you've now seen the basic building blocks of both forward propagation and backward propagation.
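That initialization of the backward recursion is a one-liner in NumPy; the function name here is my own:

```python
import numpy as np

def init_backward(AL, Y):
    """Initialize the backward recursion for binary classification
    with the cross-entropy loss:

        dA[L] = -(y/a) + (1-y)/(1-a)

    computed element-wise over all m examples, where Y and AL both
    have shape (1, m). Assumes AL is strictly between 0 and 1.
    """
    return -(Y / AL) + (1 - Y) / (1 - AL)
```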

Now, if you implement these equations, you will get a correct implementation of forward prop and backprop, to get the derivatives you need. You might be thinking: well, that's a lot of equations; I'm slightly confused, I'm not quite sure I see how this works. If you're feeling that way, my advice is: when you get to this week's programming assignment, you will be able to implement these for yourself, and they'll feel much more concrete. And I know there are a lot of equations, and maybe some equations didn't completely make sense, but you can work through the calculus and the linear algebra, which is not easy (so feel free to try, but this is actually one of the more difficult derivations in machine learning). It turns out the equations we wrote down are just the calculus equations for computing the derivatives, especially in backprop. But once again, if this still seems a little bit abstract or mysterious to you, my advice is that once you've done the programming exercise, it will feel a bit more concrete to you.

Although I have to say, even today when I implement a learning algorithm, sometimes even I'm surprised when my learning algorithm implementation works, and it's because a lot of the complexity of machine learning comes from the data rather than from the lines of code. Sometimes you implement a few lines of code, not quite sure what they will do, and they almost magically work, and it's because the magic is actually not in the piece of code you write, which is often not too long. It's not exactly simple, but it's not 10,000 or 100,000 lines of code; it's that you feed it so much data. So even though I've worked in machine learning for a long time, sometimes it still surprises me a bit when my learning algorithm works, because a lot of the complexity of your learning algorithm comes from the data rather than necessarily from your writing thousands and thousands of lines of code.

All right, so that's how you implement deep neural networks, and again, this will become more concrete when you've done the programming exercise. Before moving on, in the next video I want to discuss hyperparameters and parameters. It turns out that when you're training deep nets, being able to organize your hyperparameters well will help you be more efficient in developing your networks. In the next video, let's talk about exactly what that means.