0:00

[MUSIC]

In this video, we'll discuss how to combat exploding and

vanishing gradients in practice. Let's start with the exploding gradient problem.

In the previous video,

we've learned that this problem occurs when the gradient norms become large,

possibly even NaN, which makes training of the recurrent neural network unstable.

0:22

The resulting instability is actually very easy to detect.

On the slide, you can see the learning curve of some neural network.

The training iterations are on the x axis, and

the loss on the training data is on the y axis.

This particular neural network doesn't suffer from the exploding gradient

problem.

Therefore, the loss decreases with the number of iterations, and

the training is stable.

But sometimes you can see spikes in the learning curve, like this one.

This is a result of the exploding gradient problem.

The gradient explodes, and you make a long step in the parameter space.

As a result, you may end up with a model with quite random weights, and

a high training loss.

In the worst case, the gradient may even become NaN (not a number), and

you may end up with NaNs in the weights of the neural network.

The most common way to combat this problem is gradient clipping.

This technique is very simple, but still very effective.

1:19

If the network suffers from the exploding gradient problem,

then the norm of the gradient of the loss with respect to all the parameters of the network becomes very large.

So let's simply rescale the gradient whenever its norm exceeds some threshold. By doing this,

we don't change the direction of the gradient, we only change its length.
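As a minimal sketch of the clipping rule just described, assuming the gradient is stored as a flat list of floats: the function name `clip_by_norm` and the example values are hypothetical, chosen only for illustration.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale grad so that its L2 norm is at most threshold.

    The direction of the gradient is preserved; only its length changes.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)  # small gradients pass through unchanged

grad = [3.0, 4.0]                      # L2 norm is 5.0
clipped = clip_by_norm(grad, threshold=1.0)
print(clipped)                         # same direction, norm close to 1.0
```

In a real network the same rescaling would be applied to the gradients of all parameters together, using the norm of the whole gradient vector.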

1:36

Actually, we can clip not the norm of the whole gradient vector, but

just the norm of the part which causes the problem. Do you remember which part it is?

Yes, it is the Jacobian matrix of the hidden units at one time step

with respect to the hidden units at the previous time step.

So it is enough to clip just the norm of the Jacobian at each time step.

1:57

Okay, and how do we choose the threshold for gradient clipping?

We can choose it manually: we start with a large threshold and

decrease it until the network no longer suffers from the exploding gradient problem.

Also, we can look at the norm of the gradient

over a sufficient number of training iterations,

and choose the threshold in such a way that we clip unusually large values.
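One possible way to turn recorded gradient norms into a threshold is sketched below. The rule used here (a multiple of the median norm) is an assumption made for illustration, not a rule from the lecture; `suggest_threshold`, `factor`, and the sample norms are all hypothetical.

```python
import statistics

def suggest_threshold(grad_norms, factor=5.0):
    """Hypothetical rule of thumb: a multiple of the median observed norm.

    The median is barely affected by a few exploding steps, so the
    threshold stays near the typical gradient scale.
    """
    return factor * statistics.median(grad_norms)

# Gradient norms logged over some training iterations (made-up numbers).
norms = [0.8, 1.1, 0.9, 1.0, 1.2, 0.7, 1.3, 0.9, 1.0, 25.0]

threshold = suggest_threshold(norms)
unusually_large = [n for n in norms if n > threshold]
print(threshold, unusually_large)
```

With these numbers only the single spike exceeds the threshold, so ordinary updates are left untouched.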

There is another interesting technique which can help us

to overcome the exploding gradient problem.

It is not designed specifically for this purpose, but it still can be helpful.

As you remember, when we do backpropagation through time,

we need to make a forward pass through the entire sequence to compute the loss.

And then we need to make a backward pass through the entire sequence to compute

the gradient.

2:41

If our training sequences are quite long, then this is a very computationally

expensive procedure, and additionally, we may have the exploding gradient problem.

So let's run forward and

backward passes through the chunks of the sequence, instead of the whole sequence.

In this case, we first make forward and

backward passes through the first chunk and store the last hidden state.

Then we can go to the next chunk,

and start forward pass from the hidden state we stored.

Then we make forward and backward passes through the second chunk,

store the last hidden state, and go to the next chunk.

And so on, until we reach the end of the sequence.

So we carry hidden states forward in time forever, but only backpropagate for

some smaller number of steps.
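The chunked procedure can be sketched on a toy model. Everything below is an assumption made for illustration: a scalar RNN `h_t = tanh(w * h_{t-1} + u * x_t)`, a squared-error loss at every step, plain gradient descent after each chunk, and the hypothetical names `tbptt_train`, `chunk_size`, and `lr`.

```python
import math

def tbptt_train(xs, ys, w, u, chunk_size, lr=0.1):
    """One pass over a sequence with truncated backpropagation through time.

    Toy scalar RNN: h_t = tanh(w * h_{t-1} + u * x_t),
    loss_t = 0.5 * (h_t - y_t) ** 2.
    """
    h = 0.0  # hidden state carried forward across chunks
    for start in range(0, len(xs), chunk_size):
        chunk_x = xs[start:start + chunk_size]
        chunk_y = ys[start:start + chunk_size]

        # Forward pass through the chunk, starting from the stored state.
        hs = [h]
        for x in chunk_x:
            hs.append(math.tanh(w * hs[-1] + u * x))

        # Backward pass through this chunk only; hs[0] is a constant here.
        grad_w = grad_u = 0.0
        dh_next = 0.0  # gradient arriving from later steps of the chunk
        for t in reversed(range(len(chunk_x))):
            dh = (hs[t + 1] - chunk_y[t]) + dh_next
            da = dh * (1.0 - hs[t + 1] ** 2)  # derivative of tanh
            grad_w += da * hs[t]
            grad_u += da * chunk_x[t]
            dh_next = da * w

        w -= lr * grad_w
        u -= lr * grad_u
        h = hs[-1]  # store the last hidden state for the next chunk
    return w, u

xs = [0.5, -0.3, 0.8, 0.1, -0.6, 0.4]
ys = [0.2, 0.1, 0.3, 0.2, -0.1, 0.0]
print(tbptt_train(xs, ys, w=0.5, u=0.5, chunk_size=2))
```

Setting `chunk_size` to the sequence length or more gives a single backward pass through the whole sequence, which is ordinary backpropagation through time for this toy model.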

3:22

This algorithm is called truncated backpropagation through time, and

it is much faster than the usual backpropagation through time.

It also doesn't suffer from the exploding gradient problem that much, since we don't

take into account the contributions to the gradient from faraway steps.

But of course, these advantages do not come without a price.

Dependencies that are longer than the chunk size don't affect the training, so

it's much more difficult to learn long-range dependencies with this algorithm.

Now let's speak about the vanishing gradient problem.

From the previous video, we know that this problem occurs when

the contributions to the gradient from faraway steps become small,

which makes the learning of long-range dependencies very difficult.

This problem is more complicated than the exploding gradient problem.

It is difficult to detect the vanishing gradient problem, and

there is no single simple way to overcome it. Let's start with the detection.

In the case of the exploding gradient problem,

we had a clear indication that it occurred.

We saw the spikes in the learning curve, and in the curve with the gradient norm.

But in the case of the vanishing gradient problem, the learning curve and

the norm of the gradient look quite okay.

I mean, from the learning curve,

you may only see that the loss of the network is not that good.

And it's not clear whether this is due to the vanishing gradient problem or

because the task itself is difficult.

4:42

And if you look, for example, at the gradient of the loss at some time step

with respect to faraway units, you may see that the norm of this gradient is small.

But this could be because there are no long-range dependencies in the data.

4:54

So we can be sure that there is a vanishing gradient problem in the network,

only when we overcome it and see that the network works better.

There are a lot of different techniques to deal with the vanishing gradients.

The most common approach is to use specifically designed recurrent

architectures, such as long short-term memory, or LSTM, and

gated recurrent unit, or GRU.

These architectures are very important, so we'll speak about them in a separate

video later this week, and now we will briefly discuss some other ideas.

5:28

As you already know, the Jacobian matrix, which may cause the problem of vanishing

and exploding gradients, depends on the choice of the activation function and

the values of your recurrent weights, W.
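To make the role of the activation function concrete: its derivative enters that Jacobian as a multiplicative factor at every time step. The rough sketch below compares the derivatives of three common activations at a few pre-activation values; the helper names are hypothetical.

```python
import math

def sigmoid_grad(a):
    s = 1.0 / (1.0 + math.exp(-a))
    return s * (1.0 - s)          # never larger than 0.25

def tanh_grad(a):
    return 1.0 - math.tanh(a) ** 2  # 1 at a = 0, decays quickly away from 0

def relu_grad(a):
    return 1.0 if a > 0 else 0.0    # exactly 1 for any positive input

for a in (0.5, 2.0, 5.0):
    print(a, sigmoid_grad(a), tanh_grad(a), relu_grad(a))
```

For positive pre-activations the ReLU factor stays at 1, while the sigmoid and tanh factors shrink as the unit saturates. For negative pre-activations the ReLU factor is 0, so this is a mitigation rather than a guarantee.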

5:41

So if we want to overcome the vanishing gradient problem,

we should try to use the rectified linear unit (ReLU) activation function, which is much

more resistant to this problem. And what about the recurrent weight matrix W?

From linear algebra, you may remember that an orthogonal matrix is a square matrix,

such that its transpose is equal to its inverse.

Orthogonal matrices have many interesting properties, but

the most important for us is that all the eigenvalues of an orthogonal

matrix have absolute value of 1.

This means that no matter how many times we perform repeated matrix multiplication,

the resulting matrix doesn't explode or vanish.
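A small numeric check of this property, using a 2x2 rotation matrix as an example of an orthogonal matrix. The scaled copies (factors 0.9 and 1.1) are hypothetical stand-ins for matrices whose eigenvalues have absolute value below or above 1.

```python
import math

def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def norm_after_steps(m, v, steps):
    """Multiply v by m repeatedly and return the norm of the result."""
    for _ in range(steps):
        v = matvec(m, v)
    return norm(v)

theta = 0.3
q = [[math.cos(theta), -math.sin(theta)],
     [math.sin(theta),  math.cos(theta)]]        # rotation: orthogonal
shrunk = [[0.9 * x for x in row] for row in q]   # eigenvalue magnitude 0.9
grown = [[1.1 * x for x in row] for row in q]    # eigenvalue magnitude 1.1

v = [1.0, 0.0]
print(norm_after_steps(q, v, 100))       # stays close to 1
print(norm_after_steps(shrunk, v, 100))  # shrinks towards zero
print(norm_after_steps(grown, v, 100))   # grows very large
```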

6:21

Therefore, if we initialize the recurrent weight matrix W with an orthogonal matrix,

the second part of the Jacobian doesn't cause the vanishing gradient problem,

at least on the first iterations of the training.
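One way to build such an initial matrix is to orthonormalize random Gaussian vectors with the Gram-Schmidt procedure. This is a minimal standard-library sketch, not the method of any particular framework; `orthogonal_init`, `n`, and `seed` are hypothetical names.

```python
import math
import random

def orthogonal_init(n, seed=0):
    """Return an n x n matrix (list of rows) with orthonormal rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components along the rows chosen so far.
        for r in rows:
            dot = sum(a * b for a, b in zip(v, r))
            v = [a - dot * b for a, b in zip(v, r)]
        length = math.sqrt(sum(a * a for a in v))
        if length > 1e-8:  # skip a (very unlikely) degenerate draw
            rows.append([a / length for a in v])
    return rows

W = orthogonal_init(4)
# Each row has unit length and distinct rows are orthogonal,
# so W times its transpose is (numerically) the identity.
print([[round(sum(a * b for a, b in zip(r1, r2)), 6) for r2 in W] for r1 in W])
```

During training the weights move away from this starting point, which is why the guarantee only holds for the first iterations.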

6:41

There are some approaches that utilize the properties of orthogonal matrices,

not just for a proper initialization, but also for

the parameterization of the weights for the whole training process.

But these methods are out of scope of this course.

The last idea that we will discuss in this video is the idea of using skip

connections.

In a recurrent neural network, we can't carry the contributions to

the gradient through a lot of time steps, because at each step

we need to multiply them by the Jacobian matrix, and as a result, they vanish.

7:12

Let's add shortcuts between hidden states that are separated by more than

one time step.

These shortcuts are usual connections, with their own parameter matrices.

By using them, we create much shorter paths between faraway time steps in the network.

So when we backpropagate the gradients along these short paths,

they vanish more slowly, and we can learn longer dependencies with such a network.
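A back-of-the-envelope illustration of why shorter paths help, under a strong simplifying assumption: every connection on the path, ordinary or shortcut, scales the gradient by the same factor of 0.9. The factor, the distance of 100 steps, and the shortcut length of 10 are hypothetical numbers.

```python
def shortest_path_hops(distance, skip):
    """Fewest connections needed to go `distance` time steps back when
    shortcuts jump `skip` steps at once and ordinary steps cover the rest."""
    return distance // skip + distance % skip

per_hop_factor = 0.9  # assumed shrink factor per connection
distance = 100

plain = per_hop_factor ** shortest_path_hops(distance, skip=1)
with_skips = per_hop_factor ** shortest_path_hops(distance, skip=10)
print(plain, with_skips)
```

Without shortcuts the gradient passes through 100 connections; with shortcuts of length 10 the shortest path has only 10, so far less of the signal is lost along it.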

7:49

You may have heard about the residual network,

which contains identity shortcut connections in each layer.

We can of course use shortcuts in recurrent neural networks as well,

just like the well-known identity shortcuts in feed-forward architectures for

computer vision tasks.

8:15

Gradient clipping is a simple method to combat the exploding gradients.

And truncated backpropagation through time also can help us with this,

in addition to the acceleration of the training.

To overcome the vanishing gradient problem, we can use several methods,

including careful choice of the activation function,

proper initialization of the recurrent weights, and

modification of the network with additional skip connections.