0:00

In the last video,

we learned how gradient descent works for the case of a single neuron.

Then, we wondered how gradient descent should work for

feedforward neural networks that have many layers.

To use such networks,

we need to train the adjustable parameters in these networks.

But large networks may have many layers of neurons,

with up to hundreds of thousands,

or even millions of parameters.

To make gradient descent work in practice in such settings,

we need an efficient numerical method for calculating that many gradients.

Such an algorithm, called backpropagation, that allows

gradient descent to work efficiently with large neural networks was

suggested in 1986 in a groundbreaking paper by Rumelhart, Hinton, and Williams.

As its name suggests,

backpropagation works backwards from the

outputs to the inputs, using the chain rule of derivatives recursively.

This may already sound familiar to you from our previous video about TensorFlow,

and how it implements the reverse-mode autodiff for

automatic calculation of derivatives of arbitrary functions.

And if it does,

it is for a good reason,

because backpropagation is exactly gradient descent where all derivatives are computed

using the reverse-mode autodiff method.

To see how it works in detail,

let's recall how reverse-mode autodiff works in TensorFlow.

In our video on TensorFlow,

we saw how it works on a simple example of a function of two variables X and Y.

The main idea there was a combination of a forward and backward pass

and a reliance on a chain rule for calculation of derivatives.
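This forward-and-backward idea can be sketched in a few lines of Python. The function f(x, y) = x^2*y + y + 2 and its node decomposition below are illustrative assumptions, not necessarily the exact example from the TensorFlow video:

```python
# Illustrative sketch of reverse-mode autodiff on f(x, y) = x^2*y + y + 2.
# The function and its node decomposition are assumptions for illustration.

def forward(x, y):
    # Forward pass: evaluate the graph node by node, storing intermediates.
    n1 = x * x         # n1 = x^2
    n2 = n1 * y        # n2 = x^2 * y
    n3 = n2 + y + 2    # n3 = f(x, y)
    return n1, n2, n3

def backward(x, y):
    # Backward pass: apply the chain rule from the output back to the inputs.
    n1, n2, n3 = forward(x, y)
    dn3 = 1.0              # df/dn3 = 1 (the output itself)
    dn2 = dn3 * 1.0        # n3 = n2 + y + 2  =>  dn3/dn2 = 1
    dy = dn3 * 1.0         # direct dependence of n3 on y
    dn1 = dn2 * y          # n2 = n1 * y      =>  dn2/dn1 = y
    dy += dn2 * n1         # n2 also depends on y: dn2/dy = n1
    dx = dn1 * 2 * x       # n1 = x^2         =>  dn1/dx = 2x
    return dx, dy

dx, dy = backward(3.0, 4.0)
print(dx, dy)  # analytically, df/dx = 2xy = 24 and df/dy = x^2 + 1 = 10
```

One forward pass plus one backward pass yields all partial derivatives at once, which is exactly why reverse-mode autodiff scales to functions of very many parameters.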

Now, let's see how essentially the same method

works to calculate gradients for a neural network,

with respect to all of its parameters.

Assume that we minimize a mean squared expected loss

for a training set as we did for linear regression.

But this time, the function Y hat of W

is given by the final node, fN of some neural network.

For example, we might have a neural network with two hidden layers of this type.

We can schematically write the output fN of

such neural network as a compound function shown here.

The best way to understand this formula is to read it from the outputs to the inputs.

First, the function fN depends on its parameters WN,

and on inputs to this function.

But the inputs to fN,

are given by outputs of the previous layer,

which is denoted f sub N, minus one here.

This function depends on its own vector of parameters,

W sub N minus one, and so on.

Now, let's see how the chain rule works

backwards from the top of the network to compute all derivatives recursively.

Let's start with the last node fN.

There are two weights,

W sub N one,

and W sub N two,

that enter the final node fN of this network.

Computing the gradient with respect to either weight W sub Ni,

where i equals one or two, is done as a straightforward application of the chain rule.

So we have that the derivative of the loss function,

with respect to W sub Ni,

is given by a product of the derivative of the loss with

respect to the output fN of the last node,

times the derivative of the last node with respect to the weight W sub Ni.

Let's write it as a product of delta N,

times the derivative of fN with respect to the input weight W sub Ni,

where delta N is the derivative of

the loss function with respect to the output of node fN.
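A minimal numerical sketch of this last-layer step may help. The linear form of the final node, the numbers, and the single-example squared loss below are all illustrative assumptions:

```python
import numpy as np

# Sketch of the last-layer gradient (all names and numbers are illustrative).
# Assume the final node is linear: f_N = W_N1 * h1 + W_N2 * h2,
# and the loss is the squared error L = (f_N - y)^2 for one example.
h = np.array([0.5, -1.0])      # outputs of the previous layer (inputs to f_N)
w = np.array([2.0, 1.0])       # weights W_N1, W_N2
y = 0.5                        # target value

f_N = w @ h                    # forward pass through the last node
delta_N = 2.0 * (f_N - y)      # delta_N = dL/df_N, the "error" at the last layer

# Chain rule: dL/dW_Ni = delta_N * df_N/dW_Ni, and here df_N/dW_Ni = h[i]
grad_w = delta_N * h
print(grad_w)
```

Note how the loss-dependent factor delta_N is computed once and shared by both weight gradients; only the second factor differs per weight.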

In this expression, the term delta N

depends on the loss observed for this set of parameters,

while the second factor does not depend on the loss but only

depends on the form of the function fN and the weights used there.

Now let's continue going backwards and consider the last hidden layer,

L sub N minus one.

For derivatives with respect to weights W sub N minus one,

that specify the functions f sub N minus one here,

we again apply the chain rule to express it as a product of these three derivatives.

First, we take the derivative of the loss with respect to fN,

then the derivative of fN with respect to f sub N minus one,

comma I, and then finally,

the derivative of f sub N minus one,

with respect to weights W sub N minus one for i equal one or two.

But let's note a nice trick here.

Let's denote this product of

the derivative of the loss with respect to fN, and of fN with respect to f sub N minus one,

as delta sub N minus one, comma i.

As we already computed delta sub N,

then we can easily evaluate the expression for delta sub N minus one.

Now, note that by introducing delta sub N minus one as we just did, the

derivatives of the loss with respect to

weights W sub N minus one,

have exactly the same form as in

the previous expression, with the only difference that now we have the index N minus one,

instead of N. So,

instead of the error delta N at the last layer,

we have a new error delta sub N minus one,

made of delta N and the derivative of fN,

with respect to f sub N minus one.

In other words, the error

delta N has backpropagated to level N minus one from the last layer LN,

and became delta sub N minus one.
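Continuing the sketch from the last layer, the backpropagated error can be computed like this (again with linear nodes and illustrative numbers as assumptions):

```python
import numpy as np

# Sketch of one backpropagation step (all names and numbers are illustrative).
# With a linear last node f_N = W_N1 * f_{N-1,1} + W_N2 * f_{N-1,2},
# we have df_N/df_{N-1,i} = W_Ni, so the error backpropagates as
# delta_{N-1,i} = delta_N * W_Ni.
delta_N = -1.0                       # error at the last layer, already computed
w_N = np.array([2.0, 1.0])           # weights entering the last node

delta_Nm1 = delta_N * w_N            # errors at layer N-1

# The weight gradients then reuse the same pattern as at the last layer:
# dL/dW_{N-1,i} = delta_{N-1,i} * df_{N-1,i}/dW_{N-1,i}
inputs_Nm1 = np.array([0.3, 0.7])    # illustrative inputs to the f_{N-1} nodes
grad_W_Nm1 = delta_Nm1 * inputs_Nm1
print(delta_Nm1, grad_W_Nm1)
```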

Otherwise, the calculation of gradients of the final loss function with respect to weights of

the last hidden layer is the same as for the very last layer.

Now, we are already starting to see a pattern here.

Let's see how we can continue such recursive calculation for the layer L sub N minus two.

If we want to calculate the derivative of the loss function with respect

to the weight W sub N minus two, comma two,

we have to include both nodes at the higher level L sub N minus one.

These are nodes shown here.

So the derivative will include a sum of two terms,

each having a product of four derivatives in it.

But we already know that this product is equal to delta sub N minus one.

That's this. So, the whole expression can be written

again in a very similar form to the previous expression as

a product of delta sub N minus two times this derivative.

Again, the error for this layer is given by a

linear combination of the error deltas from the previous layer times those derivatives.

These calculations can be continued for all layers and all weights of a neural network.

This produces a fast recursive calculation

of all derivatives with respect to all weights.
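The full recursion can be sketched for a tiny network with two hidden layers. The tanh activations, layer sizes, and squared-error loss are assumptions for illustration; a finite-difference check confirms the recursively computed gradients:

```python
import numpy as np

# Illustrative sketch: recursive backpropagation through a tiny network with
# two tanh hidden layers and a linear output (shapes and activations assumed).
rng = np.random.default_rng(0)
sizes = [2, 3, 3, 1]                          # input, two hidden layers, output
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    acts = [x]
    for W in Ws[:-1]:
        acts.append(np.tanh(W @ acts[-1]))    # hidden layers
    acts.append(Ws[-1] @ acts[-1])            # linear final node f_N
    return acts

def backprop(x, t):
    acts = forward(x)
    grads = [None] * len(Ws)
    delta = 2.0 * (acts[-1] - t)              # dL/df_N for squared error
    grads[-1] = np.outer(delta, acts[-2])     # last-layer weight gradients
    delta = Ws[-1].T @ delta                  # backpropagate the error
    for k in range(len(Ws) - 2, -1, -1):
        delta = delta * (1.0 - acts[k + 1] ** 2)   # tanh'(z) = 1 - tanh(z)^2
        grads[k] = np.outer(delta, acts[k])   # same pattern at every layer
        delta = Ws[k].T @ delta               # error for the layer below
    return grads

x, t = np.array([0.5, -0.2]), np.array([1.0])
grads = backprop(x, t)

# Finite-difference check of one weight against the recursive result
eps = 1e-6
Ws[0][0, 0] += eps; lp = float((forward(x)[-1] - t) ** 2)
Ws[0][0, 0] -= 2 * eps; lm = float((forward(x)[-1] - t) ** 2)
Ws[0][0, 0] += eps
print(abs((lp - lm) / (2 * eps) - grads[0][0, 0]) < 1e-5)
```

The backward loop touches every weight exactly once, which is what makes the calculation fast even for very large networks.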

Once they are all calculated,

we can use the gradients to perform one step of gradient descent.

For the next step,

the whole procedure is repeated again until convergence.
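This outer loop can be sketched as follows; the one-parameter "network" and its quadratic loss are stand-ins for a real backprop pass:

```python
# Minimal sketch of the outer loop: compute gradients (via backprop),
# take one gradient-descent step, and repeat until convergence.
# The single parameter w with loss (w - 3)^2 is an illustrative stand-in.
def loss_grad(w):
    return 2.0 * (w - 3.0)        # stands in for a full backprop pass

w, lr = 0.0, 0.1                  # initial weight and learning rate
for step in range(1000):
    g = loss_grad(w)
    w_new = w - lr * g            # one gradient-descent step
    if abs(w_new - w) < 1e-9:     # stop once the updates become tiny
        break
    w = w_new
print(round(w, 4))                # approaches the minimum at w = 3
```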

So, we saw in this video that backpropagation

of errors provides the most practical approach to

using gradient descent with

neural networks that might have a very large number of weight parameters.

Now, if we think of a software implementation

of backpropagation in terms of a computational graph,

made of nodes representing weights and activation functions of a neural network,

then we immediately realize that in TensorFlow,

it's actually available to us via TensorFlow's autodiff functionality.

Now, we are almost ready to start playing with neural nets,

gradient descent, and backpropagation, all in TensorFlow.

Only one step remains,

which is to see how it works for real world data sets that tend to be large.

It turns out that a version of the gradient descent method called

stochastic gradient descent is best suited for such tasks.

Let's see how it works in the next video.