Hi. In this video,

we will take a look at other cases of matrix derivatives,

and we will start with vector function.

A vector function is a function that takes a vector as an input,

and outputs a vector as well.

We can actually take a composition of these functions,

and we will get another vector function.

This will be very useful for recurrent neural networks later in our specialization.

What we're interested in right now,

is how to find the derivative of every output of our function with respect to every input.

You can actually see that we can place all these derivatives in a matrix,

where every respective element is precisely the partial derivative we need.

This matrix is called the Jacobian,

and let's see why it is so useful.

Let's try to find how we can compute dh_i/dx_j,

and we can just apply a chain rule here.

You can actually see that our chain rule becomes precisely

the definition of matrix multiplication.

So, that is why it is so useful.

Because for vector functions,

a chain rule is pretty simple.

You just take a product of the Jacobian matrices.
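As a small sketch (not part of the lecture), we can check this numerically with NumPy; the functions f and g below are illustrative choices, not anything from the course:

```python
import numpy as np

# Verify that the Jacobian of a composition h(x) = g(f(x)) is the
# product of the Jacobians J_g and J_f (illustrative f and g).
A = np.array([[1.0, 2.0], [3.0, 4.0]])

def f(x):                      # vector -> vector: a linear map
    return A @ x

def g(y):                      # vector -> vector: elementwise square
    return y ** 2

def jacobian(func, x, eps=1e-6):
    """Numerical Jacobian: J[i, j] = d func(x)[i] / d x[j]."""
    y0 = func(x)
    J = np.zeros((y0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = eps
        J[:, j] = (func(x + dx) - y0) / eps
    return J

x = np.array([0.5, -1.0])
J_f = A                                   # Jacobian of a linear map is A itself
J_g = np.diag(2 * f(x))                   # Jacobian of the elementwise square
J_numeric = jacobian(lambda x: g(f(x)), x)

# Chain rule for vector functions: multiply the Jacobians.
assert np.allclose(J_g @ J_f, J_numeric, atol=1e-4)
```

The product J_g @ J_f matches the numerical Jacobian of the composition, which is exactly the matrix-multiplication form of the chain rule.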

Now, let's see how we can find the derivative of

a product of two matrices with respect to a matrix A.

To do that, you will actually have to find

all the partial derivatives of

all your outputs with respect to every element of matrix A.

So, you can see that for that,

you will need a structure with four dimensions.

That is, you will need a four-dimensional array.

For that, we have a different name, a tensor.

So, our partial derivative will be a tensor.
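As a sketch in NumPy (the shapes below are arbitrary), for C = A·B this tensor has four indices; from C[i, j] = Σ_k A[i, k]·B[k, j] it follows that dC[i, j]/dA[k, l] is B[l, j] when i = k and zero otherwise:

```python
import numpy as np

# Sketch of the 4-D derivative tensor for C = A @ B (arbitrary shapes).
# From C[i, j] = sum_k A[i, k] * B[k, j] we get
#   dC[i, j] / dA[k, l] = (i == k) * B[l, j].
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = A @ B

dC_dA = np.zeros(C.shape + A.shape)       # shape (2, 4, 2, 3): four dimensions
for i in range(C.shape[0]):
    for j in range(C.shape[1]):
        dC_dA[i, j, i, :] = B[:, j]       # zero everywhere except k == i

# Spot-check one slice with a finite difference: bump A[1, 2] a little
# and see how every element of C changes.
eps = 1e-6
A_bumped = A.copy()
A_bumped[1, 2] += eps
numeric = ((A_bumped @ B) - C) / eps      # dC / dA[1, 2], element by element
assert np.allclose(numeric, dC_dA[:, :, 1, 2], atol=1e-4)
```

Note how most entries of the tensor are zero; this is the redundancy discussed below.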

Let's see how we can apply chain rule in this case.

I should note that there is no common notation for these tensor derivatives,

so we will use one possible notation.

Let's say that, to find dC/dA,

we will first find all the derivatives

of every output with respect to the whole matrix A.

We can already do that because,

every element of matrix C is a scalar,

and we know how to find the derivative of a scalar with respect to the whole matrix.

It will be just a matrix.

You can actually see that here,

we will have a matrix of matrices.

So, this is something of a higher dimension.

Let's say that we already have dL/dC,

our incoming gradient of our scalar loss with respect to

matrix C. That's why it has the same shape as matrix C. Now,

let's apply a chain rule.

So, let's try to find dL/dA.

To do that, you will have to pass through every element of

matrix C during your chain rule computation.

You can actually see that our chain rule becomes just a linear combination of the matrices dC_ij/dA,

one for every element of C. You can actually verify that it works in our case,

and this chain rule is still applicable.
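To make this concrete, here is a sketch in NumPy (the shapes and the incoming gradient are made up for illustration) of the chain rule as a linear combination of dC/dA matrices, checked against a finite difference:

```python
import numpy as np

# Sketch of the tensor chain rule for C = A @ B:
#   dL/dA = sum over i, j of dL/dC[i, j] * dC[i, j]/dA,
# i.e. a linear combination of matrices, one per element of C.
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = A @ B
dL_dC = rng.standard_normal(C.shape)      # pretend incoming gradient

# Build dC/dA: dC[i, j]/dA[k, l] = (i == k) * B[l, j] -- mostly zeros.
dC_dA = np.zeros(C.shape + A.shape)
for i in range(C.shape[0]):
    for j in range(C.shape[1]):
        dC_dA[i, j, i, :] = B[:, j]

# Chain rule as a linear combination over every element of C.
dL_dA = np.zeros_like(A)
for i in range(C.shape[0]):
    for j in range(C.shape[1]):
        dL_dA += dL_dC[i, j] * dC_dA[i, j]

# Check against finite differences of the scalar loss L = sum(dL_dC * (A @ B)),
# a loss chosen so that its gradient with respect to C is exactly dL_dC.
loss = lambda A_: np.sum(dL_dC * (A_ @ B))
eps = 1e-6
for k in range(A.shape[0]):
    for l in range(A.shape[1]):
        A_bumped = A.copy()
        A_bumped[k, l] += eps
        assert abs((loss(A_bumped) - loss(A)) / eps - dL_dA[k, l]) < 1e-4
```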

But, there are some issues though.

You can actually see that to apply our chain rule here,

you will have to crunch a lot of zeros.

So, it will not be very fast;

you are wasting time here.

Another problem is that your framework might not have

a high-performance procedure for computing a linear combination of matrices.

So, it might become a problem as well.

But luckily, we don't need to do that,

because we can find dL/dA directly,

skipping the dC/dA tensor computation.

We just don't need that tensor,

and we already know how to do that.

If you recall our MLP lecture,

you will actually remember that to find dL/dA,

you need to take the product of dL/dC and B transposed.

Because, our dense layer in our MLP is precisely a matrix product.

So, we already know how to find the derivative of matrix products.

A good property of this rule is that,

right now, we're taking a product of small matrices.

We're taking a matrix product which is efficient on GPU and CPU,

and you can actually make sure that it yields precisely the same result.
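As a quick sketch (shapes made up for illustration), the shortcut is a single small matrix product, and we can check it against a finite difference of the same kind of scalar loss:

```python
import numpy as np

# Sketch of the direct rule for C = A @ B:
#   dL/dA = dL/dC @ B.T,
# one small matrix product instead of a 4-D tensor contraction.
rng = np.random.default_rng(2)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
dL_dC = rng.standard_normal((2, 4))       # incoming gradient, same shape as C

dL_dA = dL_dC @ B.T                       # same shape as A
dL_dB = A.T @ dL_dC                       # the companion rule for B

# Check one entry of dL/dA with a finite difference of the scalar loss
# L = sum(dL_dC * (A @ B)), whose gradient with respect to C is dL_dC.
loss = lambda A_: np.sum(dL_dC * (A_ @ B))
eps = 1e-6
A_bumped = A.copy()
A_bumped[1, 2] += eps
assert abs((loss(A_bumped) - loss(A)) / eps - dL_dA[1, 2]) < 1e-4
assert dL_dA.shape == A.shape and dL_dB.shape == B.shape
```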

So, this is pretty cool.

Thankfully, in deep learning frameworks,

we only need to calculate gradients of a scalar loss with respect to a scalar,

vector, matrix, or even tensor.

This is how tf.gradients works in the TensorFlow framework, for example.

Deep learning frameworks have optimized versions of backward pass for standard layers,

like a dense layer.

You can actually understand right now,

that we have an incoming gradient in our backward pass interface for a reason.

Because, usually, our loss is a scalar,

and we need to transform our incoming gradient.

You can actually see that,

for example in the case of MLP,

we can do that efficiently, skipping the tensor derivative computation.

So, this is pretty neat,

it is already optimized for you.

So, the takeaways of this video are the following: for vector functions,

the chain rule says that you need to multiply respective Jacobians.

Each Jacobian is a matrix,

and the matrix product is efficient.

A matrix by matrix derivative is a tensor though.

The chain rule for such tensors is not very useful in practice.

It is not very efficient,

it takes a lot of space,

and it stores redundant information.

But thankfully, in deep learning frameworks,

we usually need to track gradients of a scalar loss with respect to all other parameters.

That's why you can do that efficiently,

and you can skip that tensor derivative computation.

In the next video, I will introduce you to the TensorFlow framework.