To apply SGD with mini-batches, we actually need to take the loss for

all the elements in our batch, right?

So for an SGD step, we have a loss on our batch.

Let's identify it as Lb.

And we actually need to calculate the derivative of that loss with respect

to our parameters W.

And to do that, we compute the gradient of the loss on every sample in the batch and

just sum them up.

So we just applied the rule here,

that the derivative of the sum is the sum of the derivatives.

Now, let's look at one summand in that sum.

Let's see how we can calculate the derivative of

the loss function with respect to Wij.

And we already know how to do that because we have already done that like two

slides before.

And let's use this notation to come up with the rule for

the mini-batch backward pass.

For two samples, you can actually see that to calculate dLb/dWij,

you need to apply that known rule two times.

So you take dL/dz_1j times x_1i, plus dL/dz_2j times x_2i, and so forth.

So you can actually see that what we got is really similar to dot product, right?

And matrix multiplication is all about dot product.

So maybe we can come up with some matrices that will give us this

result in terms of matrix multiplication.

And you can actually find those matrices.

You can actually see that dLb/dW can be computed in a matrix

notation by just taking X transpose and multiplying it by dLdZ.

Where dLdZ is a known thing, it's a derivative

of a scalar with respect to a matrix, and we know how to compute that.

You just do that element-wise.

X transpose is a simple thing as well.

You just replace rows with columns and here you are.

Now, let's just check that this rule actually works.

Let's check it for W3,2.

Let's check that if you take the third row of matrix X transpose and

take a dot product with the second column of matrix dLdZ,

that will yield the formula that we have come up with.

You can pause the video and check that it actually is correct, so it works.
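You can also do this check in numpy, the way the lecture suggests. This is just an illustrative sketch, with arbitrary random shapes (2 samples, 4 features, 3 outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 4))      # batch of 2 samples, 4 features
dLdZ = rng.standard_normal((2, 3))   # incoming gradient, 3 outputs

# Matrix form of the rule: dLb/dW = X^T @ dLdZ
dW = X.T @ dLdZ

# Entry for W_{3,2} (0-based indices [2, 1]): third row of X^T
# dotted with the second column of dLdZ, i.e. a sum over samples.
manual = sum(X[n, 2] * dLdZ[n, 1] for n in range(2))
print(np.isclose(dW[2, 1], manual))  # True
```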

Unfortunately, you should also calculate the derivative of loss

function with respect to x and this is where it is a little bit tricky.

Let's apply a chain rule, so the approach is standard,

let's try to apply a chain rule element wise.

So let's take for example, object i and

let's try to calculate the derivative of a loss

on that object with respect to some feature j of that object.

So how to do that?

Let's apply chain rule.

So first to go to the X, we need to go through Z, right?

Because a chain rule is just a path in the graph.

So let's write that out, and

let's notice that dzik/dxij is actually wjk.

So it is thanks to the fact

that z is just a linear combination of the feature values X.

Then, let's replace w with w transpose and

let's swap the indices of that element.

Then, you can actually see that to calculate the derivative of our batch

loss with respect to X, you need to calculate dLdZ and

multiply it by W transpose.
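A minimal numpy sketch of this formula, with arbitrary shapes chosen for illustration (batch × features for X, batch × outputs for dLdZ):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 3))      # 4 input features, 3 outputs
dLdZ = rng.standard_normal((2, 3))   # incoming gradient for a batch of 2

# dLb/dX = dLdZ @ W^T: row i is the gradient for sample i's features.
dX = dLdZ @ W.T
print(dX.shape)  # (2, 4) -- same shape as X, as it should be
```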

Why does that work?

Because you can notice that to make dLb/dX, you need to take a sum

of derivatives of our losses with respect to matrix X, for every element, right?

For every instance in our batch.

And you can also notice that each instance actually

gives you only one non-zero row in dLi/dX.

Because that instance is only dependent on its own features, right?

So that means that effectively, when we are doing a sum of those rows,

we're not really making a sum.

We just take all those rows and stack them into a matrix.

So that's why this matrix notation really works.

So this is pretty cool.

You can see that you can apply the backward pass and the forward pass for

mini-batches, or just for one instance, pretty efficiently.

You can do that with matrix multiplication and you can do that with numpy.

Let's see, on one slide, a summary of what we have come up with.

And now let's implement that in numpy.

The Forward pass for a dense layer is done pretty easily.

The forward pass just gets all your inputs, that is, features and weights,

and it takes just the dot product, the matrix multiplication,

because that is how we do the forward pass.
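In numpy, a sketch of that forward pass could look like this (the function name is mine, and the bias term is left out for simplicity):

```python
import numpy as np

def dense_forward(X, W):
    """Forward pass of a dense layer: Z = X @ W.

    X has shape (batch, in_features), W has shape (in_features, out_features).
    A bias term is omitted here for simplicity.
    """
    return X @ W

X = np.arange(6, dtype=float).reshape(2, 3)   # batch of 2, 3 features
W = np.ones((3, 4))                           # 3 inputs, 4 outputs
Z = dense_forward(X, W)
print(Z.shape)  # (2, 4)
```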

Backward pass is pretty easy as well.

And let me remind you that in the interface of backward pass,

we also give the incoming gradient.

And that is where it becomes pretty handy, because we need that

incoming gradient to calculate dX and dW efficiently.

And we actually derived those formulas: you can take dLdZ and

multiply it either by W transpose or

by X transpose to get the derivatives with respect to X or W.

So this is implemented pretty efficiently with numpy, as well.
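Putting the two passes together, a toy dense layer might look like the sketch below. The class and method names are mine, not from the assignment, and weight updates are left out:

```python
import numpy as np

class Dense:
    """A minimal dense-layer sketch (no bias, no weight updates)."""

    def __init__(self, in_features, out_features, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((in_features, out_features)) * 0.01

    def forward(self, X):
        self.X = X                  # cache the input for the backward pass
        return X @ self.W

    def backward(self, dLdZ):
        # The two formulas from the slides:
        self.dW = self.X.T @ dLdZ   # gradient w.r.t. the weights
        dX = dLdZ @ self.W.T        # gradient w.r.t. the input
        return dX                   # passed on to the previous layer

layer = Dense(3, 2)
X = np.ones((4, 3))                   # batch of 4 samples
Z = layer.forward(X)                  # shape (4, 2)
dX = layer.backward(np.ones((4, 2)))  # shape (4, 3)
```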

And you can notice one more reason why we use dL dZ in backward pass interface.

Because otherwise, we would have to calculate, for example, dz dx.

And this is something scary because z and x are both matrices.

And it's not clear how to calculate the derivative of a matrix

with respect to another matrix, it's a no-go.

So we should have an incoming gradient, and thanks to the incoming gradient,

we can do this efficiently with matrix multiplication.
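As a quick sanity check (not from the lecture, just an illustration), you can compare these matrix formulas against a finite-difference approximation, using a toy scalar loss that sums all outputs:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((2, 3))
W = rng.standard_normal((3, 4))

# Toy scalar "loss": sum of all outputs, so dL/dZ is a matrix of ones.
loss = lambda X, W: (X @ W).sum()
dLdZ = np.ones((2, 4))

# Analytic gradients from the matrix formulas.
dW = X.T @ dLdZ
dX = dLdZ @ W.T

# Numerical check of one entry of dW by finite differences.
eps = 1e-6
W_plus = W.copy(); W_plus[1, 2] += eps
numeric = (loss(X, W_plus) - loss(X, W)) / eps
print(np.isclose(dW[1, 2], numeric))  # True
```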

To summarize, you can do forward pass for

a dense layer with a matrix multiplication.

You can do a backward pass with matrix multiplication as well.

And this is pretty cool, and this is where GPU comes into play.

Because on GPUs, you can crunch matrices pretty fast.

What's more, it's easy to code with numpy and as a matter of fact,

we have an honors assignment for those of you who want to do that with numpy.

In the next video, we will take a quick look at other matrix derivatives.

They're scary, but you should know about them.
