In this video, we will discuss

two main problems that arise in training Recurrent Neural Networks,

the problems of exploding and vanishing gradients.

As you already know,

we can use the backpropagation algorithm to train a recurrent neural network,

but in this case, we backpropagate

gradients not only through layers but also through time.

As an example, in the previous video,

we derived the expression for the gradient of the loss L with

respect to the recurrent weight matrix W. And now we know,

that to compute the gradient of the loss at time step t

with respect to W, we should sum up

the contributions from all the previous time steps to this gradient.

Now let's look at the expression for the gradient more closely.

As you can see, there is a product of Jacobian matrices in each term of the sum.

And if we look at the particular term which corresponds to

the contribution from some time step k to the gradient

at time step t, then we see that the more steps there are between time moments k and t,

the more factors there are in this product.

So the values of Jacobian matrices have

a very strong influence especially on the contributions from faraway steps.

Let's suppose for a moment that we have only one hidden unit in our network.

So h is a scalar.

Then all the elements in the expression for the gradient are also scalars.

So the gradient itself is a scalar, all the Jacobian matrices are scalars, and so on.

In this case, it is clear that if all the Jacobian matrices,

now Jacobian scalars, are less than one in absolute value,

then their product goes to zero

exponentially fast when the number of elements in this product tends to infinity.

And on the contrary, if all the Jacobian scalars are greater than one in absolute value,

then their product goes to infinity exponentially fast.
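
To see this decay and blow-up concretely, here is a tiny sketch; the factors 0.9 and 1.1 are just illustrative values, not taken from the video:

```python
# In the scalar case, each term of the gradient contains a product of
# (t - k) Jacobian "scalars". Repeated multiplication makes that product
# shrink or grow exponentially in the number of factors:
steps = 50
vanishing = 0.9 ** steps   # all factors < 1 in absolute value
exploding = 1.1 ** steps   # all factors > 1 in absolute value
print(vanishing)  # about 0.005: the contribution is almost gone
print(exploding)  # about 117: the contribution blows up
```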

As a result, in the first case,

the contributions from the faraway steps go to zero and

the gradient contains only the information about nearby steps.

Thus it is difficult to learn long-range dependencies

with a simple recurrent neural network.

This problem is usually called the vanishing gradient problem,

because a lot of elements in the gradient simply vanish and don't affect the training.

In the second case the contributions from

faraway steps grow exponentially fast so the gradient itself grows too.

If an input sequence is long enough,

the gradient may even become NaN (not-a-number) in practice.

This problem is called the exploding gradient

problem and it makes the training very unstable.

The reason is that if the gradient is

a large number then we make

a long step in the direction of this gradient in the parameter space.

Since we optimize a very complex multimodal function and we use stochastic methods,

we may end up at a very poor point after such a step.
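
A one-dimensional sketch of this instability; the quadratic loss and the step size here are illustrative assumptions, not anything from the video:

```python
# Minimize L(w) = w**2, whose gradient is 2*w and whose minimum is w = 0.
# When the gradient (or the effective step) is too large for the local
# curvature, each update overshoots the minimum and the iterates diverge.
def grad(w):
    return 2.0 * w

w = 1.0
for _ in range(5):
    w = w - 10.0 * grad(w)  # far too long a step: w -> -19 * w
print(abs(w))  # |w| grows by a factor of 19 at every update
```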

OK. We have discussed the simplified case;

now let's return to real life.

A recurrent neural network usually contains not just one hidden unit,

but the whole vector of them.

Consequently, the Jacobian matrices are really matrices, not scalars.

We can apply the same reasoning here.

But instead of the absolute value,

we need to use the spectral matrix norm, which is equal to

the largest singular value of the matrix.

If all the Jacobian matrices in the product have norms less than one,

then the gradient vanishes.

And if all the Jacobian matrices have norms greater than one,

then the gradient explodes.
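
A small NumPy check of these statements; the matrices here are arbitrary examples:

```python
import numpy as np

# The spectral norm of a matrix equals its largest singular value.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
largest_sv = np.linalg.svd(W, compute_uv=False)[0]
print(np.isclose(largest_sv, np.linalg.norm(W, 2)))  # True: same quantity

# The norm of a product is at most the product of the norms, so if every
# Jacobian in the product has spectral norm < 1, the product vanishes:
J = 0.5 * np.eye(3)                     # spectral norm exactly 0.5
product = np.linalg.multi_dot([J] * 20)
print(np.linalg.norm(product, 2))       # 0.5 ** 20, about 1e-06
```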

Now we know that values

of the Jacobian matrices are crucial in training a recurrent neural network.

So let's see what values they have in practice.

As you remember, the hidden units at time step t

can be computed by applying some nonlinear function f to

a linear combination of the inputs at this time step and the hidden units at the previous time step.

Let's denote this linear combination by preactivation t.

To compute the Jacobian matrix of the hidden units at time step t

with respect to the hidden units in the previous time step,

we can use the chain rule.

So we first compute the Jacobian of h with respect to its preactivation and

then compute the Jacobian matrix

of this preactivation with respect to the previous hidden units.

Since f is an element-wise nonlinearity,

the first Jacobian here is a diagonal matrix with the derivatives of f on the diagonal.

And how to compute the second Jacobian? This is the question for you.

Since the preactivation is a linear combination of some elements,

the second Jacobian consists of the weights

of the hidden units in this linear combination.

So it is equal to the weight matrix W. Now let's look at

both parts of the Jacobian matrix of the hidden units h_t

with respect to the hidden units h_{t-1} separately.
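
The chain rule just described can be checked numerically. This sketch assumes a tanh nonlinearity and omits the input term of the preactivation for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 3))
h_prev = rng.normal(size=3)

# Preactivation and the analytic Jacobian dh_t / dh_{t-1}:
# diag(f'(preactivation)) @ W, with f = tanh and f'(x) = 1 - tanh(x)**2.
a = W @ h_prev
jacobian = np.diag(1.0 - np.tanh(a) ** 2) @ W

# Compare against a central-difference numerical Jacobian.
eps = 1e-6
numeric = np.zeros((3, 3))
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    numeric[:, j] = (np.tanh(W @ (h_prev + e)) - np.tanh(W @ (h_prev - e))) / (2 * eps)
print(np.allclose(jacobian, numeric, atol=1e-5))  # True
```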

The first part depends on the type of nonlinearity we use.

The usual choice of nonlinearity for neural networks

is the sigmoid, hyperbolic tangent, or rectified linear unit function.

As you can see on the left part of the slide,

all these nonlinearities are very flat in a large part of the input space.

So the sigmoid and hyperbolic tangent are

almost constant for both small and large inputs,

and the rectified linear unit is equal to zero for all the negative inputs.

The derivatives of these nonlinearities are very

close to zero in the regions where they are flat.

So as you can see on the right part of this slide

the derivatives of the sigmoid and hyperbolic tangent are less than one

almost everywhere.

And this may very likely cause

the vanishing gradient problem. The situation

with the rectified linear unit is much better: at least for positive inputs,

its derivative is equal to one.

But the gradients may still vanish because of the zero

derivative in the negative part of the input space.
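
The derivative values mentioned above are easy to verify numerically:

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 1001)          # grid that includes x = 0

sigmoid = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sigmoid * (1.0 - sigmoid)     # peaks at 0.25, so always < 1
d_tanh = 1.0 - np.tanh(x) ** 2            # equals 1 only at x = 0
d_relu = (x > 0).astype(float)            # 1 for positive inputs, 0 otherwise

print(d_sigmoid.max())      # 0.25
print(d_tanh.max())         # 1.0, and strictly below 1 everywhere else
print(d_relu[x < 0].max())  # 0.0: gradients through negative inputs die
```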

Now let's look at the second part of the Jacobian of h. The weight matrix is

a parameter of the model, so its norm may be either large or small.

A small norm could aggravate the vanishing gradient problem, and

a large norm could cause the exploding gradient problem,

especially in combination with the rectified linear unit nonlinearity.

OK let's summarize what we have learned in this video.

Recurrent neural networks have a sequential nature, so they are very deep in time.

Therefore, the vanishing and exploding gradient problems may arise during their training.

Actually, these problems are not exclusive to recurrent neural networks;

they also occur in deep feedforward networks.

Vanishing gradients make the learning of long-range dependencies very

difficult. And exploding gradients

make the learning process unstable and may even crash it.

In the next videos, we will discuss

different methods that can help us to overcome these issues.