In this video, we will discuss

two main problems that arise in training Recurrent Neural Networks,

the problems of exploding and vanishing gradients.

As you already know,

we can use the backpropagation algorithm to train a recurrent neural network,

but in this case, we would backpropagate

gradients not only through layers but also through time.

As an example, in the previous video,

we derived the expression for the gradient of the loss L with

respect to the recurrent weight matrix W. And now we know

that to compute the gradient of the loss at time step t

with respect to W, we should sum up

the contributions from all the previous time steps to this gradient.

Now let's look at the expression for the gradient more closely.

As you can see, there is a product of Jacobian matrices in each term of the sum.

And if we look at the particular term which corresponds to

the contribution from some step k to the gradient at

time step t, then we see that the more steps there are between time moments k and t,

the more elements are in this product.

So the values of Jacobian matrices have

a very strong influence especially on the contributions from faraway steps.
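
The expression on the slide is not reproduced in this transcript. As a rough reminder, one common way to write it is sketched below. The symbols \hat{y}_t (the prediction at step t) and L_t (the loss at step t), and the choice to start the sum at k = 0, are assumptions made here for illustration and may differ from the notation used in the video.

```latex
\frac{\partial L_t}{\partial W}
  = \sum_{k=0}^{t}
    \frac{\partial L_t}{\partial \hat{y}_t}\,
    \frac{\partial \hat{y}_t}{\partial h_t}
    \left( \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}} \right)
    \frac{\partial h_k}{\partial W}
```

In this form the term with index k contains t - k Jacobian factors, so the further step k is from step t, the longer the product.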

Let's suppose for a moment that we have only one hidden unit in our network.

So h is a scalar.

Then all the elements in the expression for the gradient are also scalars.

So the gradient itself is a scalar and all the Jacobian matrices are scalars and so on.

In this case, it is clear that if all the Jacobian matrices,

now Jacobian scalars, are less than one in absolute value,

then their product goes to zero

exponentially fast as the number of elements in this product tends to infinity.

And on the contrary, if all the Jacobian scalars are more than one in absolute value,

then the product goes to infinity exponentially fast.
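
The scalar argument above can be checked with a few lines of Python. This is only a minimal sketch: the values 0.9 and 1.1 and the helper name `jacobian_product` are illustrative assumptions, and every factor in the product is taken to be the same number.

```python
def jacobian_product(j: float, n: int) -> float:
    """Multiply n copies of the scalar Jacobian j together."""
    product = 1.0
    for _ in range(n):
        product *= j
    return product


for n in (10, 50, 100):
    shrinking = jacobian_product(0.9, n)  # |j| < 1: decays toward zero
    growing = jacobian_product(1.1, n)    # |j| > 1: keeps growing
    print(f"n={n:3d}  0.9^n={shrinking:.3e}  1.1^n={growing:.3e}")
```

Even with factors this close to one, a hundred steps are enough to change the product by several orders of magnitude in either direction.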

As a result, in the first case,

the contributions from the faraway steps go to zero and

the gradient contains only the information about nearby steps.

This makes it difficult to learn long-range dependencies

with a simple recurrent neural network.

This problem is usually called the vanishing gradient problem,

because a lot of elements in the gradient simply vanish and don't affect the training.

In the second case the contributions from

prior steps grow exponentially fast so the gradient itself grows too.

If an input sequence is long

enough the gradient may even become not a number in practice.

This problem is called the exploding gradient

problem and it makes the training very unstable.

The reason is that if the gradient is

a large number, then we make

a long step in the direction of this gradient in the parameter space.

Since we optimize a very complex multimodal function and we use stochastic methods,

we may end up in a very poor point after such a step.

OK. We have discussed the simplified case,

now let's return to the real life.

A recurrent neural network usually contains not just one hidden unit,

but the whole vector of them.

Consequently, the Jacobian matrices are really matrices, not scalars.

We can apply the same reasoning here.

But instead of the absolute value,

we need to use the spectral matrix norm now, which is equal to

the largest singular value of the matrix.

If all the Jacobian matrices in the product have norms which are less than one,

then the gradient vanishes.

And if all the Jacobian matrices have norms which are higher than one,

then the gradient can explode.
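
The matrix version of this reasoning can be sketched with NumPy. The random matrices, the 8 x 8 size, the target norm 0.9, and the helper names below are assumptions made for illustration. The sketch relies on the fact that the spectral norm of a product is at most the product of the spectral norms, so norms below one force the product to shrink, while norms above one only make growth possible.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily


def spectral_norm(matrix: np.ndarray) -> float:
    """Spectral norm, i.e. the largest singular value of the matrix."""
    return float(np.linalg.norm(matrix, ord=2))


def rescale_to_norm(matrix: np.ndarray, target: float) -> np.ndarray:
    """Rescale a matrix so that its spectral norm equals `target`."""
    return matrix * (target / spectral_norm(matrix))


def product_norm(target: float, n_factors: int, size: int = 8) -> float:
    """Spectral norm of a product of random matrices, each with norm `target`."""
    product = np.eye(size)
    for _ in range(n_factors):
        factor = rescale_to_norm(rng.standard_normal((size, size)), target)
        product = product @ factor
    return spectral_norm(product)


a = rng.standard_normal((8, 8))
largest_singular_value = float(np.linalg.svd(a, compute_uv=False)[0])
print(spectral_norm(a), largest_singular_value)  # the two values agree

shrunk = product_norm(0.9, 50)
print(shrunk, 0.9 ** 50)  # the product norm stays below the bound 0.9 ** 50
```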

Now we know the values

of the Jacobian matrices are crucial in training a recurrent neural network.

So let's see what values they have in practice.

As you remember, the hidden units at time step t

can be computed by applying some nonlinear function f to

a linear combination of the inputs at this time step and the hidden units at the previous time step.

Let us denote this linear combination by the preactivation at time step t.

To compute the Jacobian matrix of the hidden units at time step t

with respect to the hidden units at the previous time step,

we can use the chain rule.

So we first compute the Jacobian of h with respect to its preactivation and

then compute the Jacobian matrix

for this preactivation with respect to the previous hidden units.

Since f is an element-wise nonlinearity,

the first Jacobian here is a diagonal matrix with the derivatives of F in the diagonal.

And how to compute the second Jacobian? This is the question for you.

Since the preactivation is a linear combination of some elements,

the second Jacobian consists of the weights

of the hidden units in this linear combination.

So it is equal to the weight matrix W. Now let's look at

both parts of the Jacobian matrix of the hidden units h_t

with respect to the hidden units h_{t-1} separately.
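
The chain-rule factorization described above, a diagonal matrix of derivatives of f multiplied by W, can be checked numerically. The sketch below assumes f is the hyperbolic tangent; the input weight matrix `V`, the bias `b`, the layer sizes, and all function names are hypothetical and introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3

W = rng.standard_normal((hidden_size, hidden_size))  # recurrent weights
V = rng.standard_normal((hidden_size, input_size))   # input weights (assumed name)
b = rng.standard_normal(hidden_size)                 # bias (assumed)


def step(h_prev: np.ndarray, x: np.ndarray) -> np.ndarray:
    """One recurrent step: h_t = tanh(W h_{t-1} + V x_t + b)."""
    return np.tanh(W @ h_prev + V @ x + b)


def analytic_jacobian(h_prev: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Chain rule: diag(f'(a_t)) @ W, where f'(a) = 1 - tanh(a)^2."""
    preactivation = W @ h_prev + V @ x + b
    return np.diag(1.0 - np.tanh(preactivation) ** 2) @ W


def numeric_jacobian(h_prev: np.ndarray, x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Central finite differences, one column per component of h_{t-1}."""
    columns = []
    for i in range(h_prev.size):
        delta = np.zeros_like(h_prev)
        delta[i] = eps
        columns.append((step(h_prev + delta, x) - step(h_prev - delta, x)) / (2 * eps))
    return np.stack(columns, axis=1)


h_prev = rng.standard_normal(hidden_size)
x = rng.standard_normal(input_size)
max_error = float(np.max(np.abs(analytic_jacobian(h_prev, x) - numeric_jacobian(h_prev, x))))
print(max_error)  # small: the two Jacobians agree up to finite-difference error
```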

The first part depends on the type of nonlinearity we use.

The usual choices of nonlinearity for neural networks

are the sigmoid, hyperbolic tangent, and rectified linear unit functions.

As you can see on the left part of the slide,

all these nonlinearities are very flat in a large part of the input space.

So the sigmoid and hyperbolic tangent are

almost constant for both small and large inputs,

and the rectified linear unit is equal to zero for all the negative inputs.

The derivatives of these nonlinearities are very

close to zero in the regions where they are flat.

So as you can see on the right part of this slide,

the derivatives of the sigmoid and hyperbolic tangent are less than one

almost everywhere.

And this may very likely cause

the vanishing gradient problem. The situation

with the rectified linear unit is much better: at least for positive inputs,

its derivative is equal to one.

But the gradients may still vanish because of the zero

derivative in the negative part of the input space.
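
Since the slide with the plots is not part of this transcript, the derivatives can be inspected numerically instead. The sketch below evaluates them on a grid; the grid range and the function names are assumptions made for illustration, and the derivative of the rectified linear unit at exactly zero is set to zero by convention.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_derivative(z: np.ndarray) -> np.ndarray:
    s = sigmoid(z)
    return s * (1.0 - s)


def tanh_derivative(z: np.ndarray) -> np.ndarray:
    return 1.0 - np.tanh(z) ** 2


def relu_derivative(z: np.ndarray) -> np.ndarray:
    # 1 for positive inputs, 0 otherwise (the value at 0 is a convention).
    return (z > 0).astype(float)


z = np.linspace(-6.0, 6.0, 1201)  # grid centered on zero

print(sigmoid_derivative(z).max())      # peaks near z = 0, never above 0.25
print(tanh_derivative(z).max())         # peaks near z = 0, never above 1
print(np.unique(relu_derivative(z)))    # only the values 0 and 1 occur
print(tanh_derivative(np.array([6.0])))  # close to zero in the flat region
```

The sigmoid is the most restrictive of the three here: its derivative never exceeds one quarter, so on its own it shrinks every factor in the product.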

Now let's look at the second part of the Jacobian of h. The weight matrix W is

a parameter of the model, so its norm could be either large or small.

A small norm could aggravate the vanishing gradient problem and

a large norm could cause the exploding gradient problem,

especially in combination with the rectified linear unit nonlinearity.
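
To get a feeling for how the norm of W interacts with the nonlinearity, here is a rough experiment. It is a sketch under several assumptions that do not come from the video: W is a scaled random orthogonal matrix, so its spectral norm equals the scale exactly; inputs and biases are omitted; and the sizes, scales, and function name are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, n_steps = 16, 30

# Random orthogonal matrix: every singular value equals 1,
# so scale * Q has spectral norm exactly `scale`.
Q, _ = np.linalg.qr(rng.standard_normal((hidden_size, hidden_size)))


def jacobian_product_norm(scale: float, activation: str) -> float:
    """Spectral norm of the product of Jacobians dh_t/dh_{t-1} over n_steps steps."""
    W = scale * Q
    h = np.abs(rng.standard_normal(hidden_size))  # arbitrary initial hidden state
    product = np.eye(hidden_size)
    for _ in range(n_steps):
        a = W @ h  # preactivation (inputs and bias omitted for brevity)
        if activation == "tanh":
            h, derivative = np.tanh(a), 1.0 - np.tanh(a) ** 2
        else:  # "relu"
            h, derivative = np.maximum(a, 0.0), (a > 0).astype(float)
        # Newest Jacobian diag(f'(a_t)) @ W is multiplied on the left.
        product = np.diag(derivative) @ W @ product
    return float(np.linalg.norm(product, ord=2))


small = jacobian_product_norm(0.5, "tanh")  # bounded above by 0.5 ** 30
large = jacobian_product_norm(2.0, "relu")  # bounded above by 2.0 ** 30, can be large
print(small, large)
```

With the small scale the product is guaranteed to be tiny, because both factors of every Jacobian have norm at most one half and one respectively. With the large scale and the rectified linear unit nothing caps the growth below 2.0 ** 30, which is the situation the exploding gradient problem refers to.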

OK, let's summarize what we have learned in this video.

Recurrent neural networks have a sequential nature, so they are very deep in time.

Therefore the vanishing and exploding gradient problems may arise during their training.

Actually, these problems are not exclusive to recurrent neural networks;

they also occur in deep feedforward networks.

Vanishing gradients make the learning of long-range dependencies very

difficult and exploding gradients

make the learning process unstable and may even crash it.

In the next videos, we will discuss

different methods that can help us to overcome these issues.