0:00

[MUSIC]

In this video, we'll discuss how to combat exploding and vanishing gradients in practice. Let's start with the exploding gradient problem.

In the previous video, we learned that this problem occurs when the gradient norms become large, possibly even NaN, which makes training of the recurrent neural network unstable.

0:22

The resulting instability is actually very easy to detect.

On the slide, you can see the learning curve of some neural network. The training iterations are on the x axis, and the loss on the training data is on the y axis.

This particular neural network doesn't suffer from the exploding gradient problem. Therefore, the loss decreases with the number of iterations, and the training is stable.

But sometimes you can see spikes in the learning curve, like this one; this is a result of the exploding gradient problem. The gradient explodes, and you make a long step in the parameter space. As a result, you may end up with a model with quite random weights and a high training loss.

In the worst case, the gradient may even become NaN (not a number), and you may end up with NaNs in the weights of the neural network.

The most common way to combat this problem is gradient clipping. This technique is very simple, but still very effective.

1:19

If the network suffers from the exploding gradient problem, we look at the norm of the gradient of the loss with respect to all the parameters of the network. If this norm exceeds some threshold, we simply clip it: we rescale the gradient so that its norm equals the threshold. By doing this, we don't change the direction of the gradient, we only change its length.
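This clipping step can be sketched in a few lines of NumPy. The function name and numbers here are a generic illustration, not any particular framework's API:

```python
import numpy as np

def clip_gradient(grad, threshold):
    """If the L2 norm of `grad` exceeds `threshold`, rescale it to that norm.

    The direction of the gradient is preserved; only its length changes.
    """
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([3.0, 4.0])          # ||g|| = 5
clipped = clip_gradient(g, 1.0)   # same direction, norm clipped to 1
```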

1:36

Actually, we can clip not the norm of the whole gradient vector, but just the norm of the part which causes the problem. Do you remember which part that is? Yes, it is the Jacobian matrix of the hidden units at one time step with respect to the hidden units at the previous time step. So it is enough to clip just the norm of this Jacobian at each time step.

1:57

Okay, and how do we choose the threshold for gradient clipping? We can choose it manually: start with a large threshold and decrease it until the network no longer suffers from the exploding gradient problem. Alternatively, we can look at the norm of the gradient over a sufficient number of training iterations, and choose the threshold in such a way that we clip only unusually large values.
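One way to automate that second heuristic is to log the gradient norms during training and pick a high percentile as the threshold. The numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical gradient norms logged over ten training iterations:
# mostly around 1, with one spike caused by an exploding gradient.
grad_norms = np.array([0.8, 1.1, 0.9, 1.3, 0.7, 1.25, 1.0, 1.2, 0.9, 30.0])

# Choose a threshold that clips only the unusually large values,
# e.g. the 90th percentile of the observed norms.
threshold = np.percentile(grad_norms, 90)
```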

There is another interesting technique which can help us to overcome the exploding gradient problem. It is not designed specifically for this purpose, but it can still be helpful.

As you remember, when we do backpropagation through time, we need to make a forward pass through the entire sequence to compute the loss, and then a backward pass through the entire sequence to compute the gradient.

2:41

If our training sequences are quite long, then this is a very computationally expensive procedure, and additionally, we may have the exploding gradient problem. So let's run forward and backward passes through chunks of the sequence, instead of the whole sequence.

In this case, we first make forward and backward passes through the first chunk and store the last hidden state. Then we go to the next chunk, and start the forward pass from the hidden state we stored. We make forward and backward passes through the second chunk, store the last hidden state, and go to the next chunk, and so on, until we reach the end of the sequence.

So we carry hidden states forward in time forever, but only backpropagate for some smaller number of steps.
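The procedure above can be sketched in NumPy for a toy vanilla RNN. Everything here (the sum-of-squares loss, the sizes, the learning rate) is an illustrative assumption; the key point is that the backward pass stops at each chunk's initial hidden state:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vanilla RNN: h_t = tanh(W h_{t-1} + U x_t), with a made-up loss
# (the sum of squared hidden states) just to have something to differentiate.
H, X = 4, 3                      # hidden and input sizes (arbitrary)
W = rng.normal(0, 0.1, (H, H))   # recurrent weights
U = rng.normal(0, 0.1, (H, X))   # input weights

def chunk_pass(h0, xs):
    """Forward and backward pass through one chunk.

    h0 is treated as a constant, so gradients stop at the chunk boundary:
    this is exactly the truncation in truncated backpropagation through time.
    """
    hs = [h0]
    for x in xs:                              # forward pass over the chunk
        hs.append(np.tanh(W @ hs[-1] + U @ x))
    dW, dh = np.zeros_like(W), np.zeros(H)
    for t in range(len(xs), 0, -1):           # backward pass over the chunk only
        dh = dh + 2 * hs[t]                   # d(loss)/dh_t for the sum-of-squares loss
        da = dh * (1 - hs[t] ** 2)            # backprop through tanh
        dW += np.outer(da, hs[t - 1])
        dh = W.T @ da                         # flows to h_{t-1}; dies at h0
    return dW, hs[-1]

seq = rng.normal(size=(12, X))                # one length-12 training sequence
h = np.zeros(H)                               # hidden state carried across chunks
for start in range(0, len(seq), 4):           # chunks of 4 time steps
    dW, h = chunk_pass(h, seq[start:start + 4])
    W -= 0.01 * dW                            # SGD update after each chunk
```

Note that the hidden state `h` is passed from chunk to chunk, so information still flows forward through the whole sequence; only the gradient is truncated.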

3:22

This algorithm is called truncated backpropagation through time, and it is much faster than the usual backpropagation through time. It also doesn't suffer from the exploding gradient problem that much, since we don't take into account the contributions to the gradient from faraway steps.

But of course, these advantages do not come without a price. Dependencies that are longer than the chunk size don't affect the training, so it's much more difficult to learn long-range dependencies with this algorithm.

Now let's speak about the vanishing gradient problem. From the previous video, we know that this problem occurs when the contributions to the gradient from faraway steps become small, which makes the learning of long-range dependencies very difficult.

This problem is more complicated than the exploding gradient problem: it is difficult to detect, and there is no single simple way to overcome it. Let's start with the detection.

In the case of the exploding gradient problem, we had a clear indication that it occurred: we saw the spikes in the learning curve, and in the curve with the gradient norm. But in the case of the vanishing gradient problem, the learning curve and the norm of the gradient look quite okay.

I mean, from the learning curve, you may only see that the loss of the network is not that good. And it's not clear whether this is due to the vanishing gradient problem, or because the task itself is difficult.

4:42

And if you look, for example, at the gradient of the loss with respect to faraway hidden units, you may see that the norm of this gradient is small. But this could be because there are simply no long-range dependencies in the data.

4:54

So we can be sure that there is a vanishing gradient problem in the network only when we overcome it and see that the network works better.

There are a lot of different techniques to deal with vanishing gradients. The most common approach is to use specifically designed recurrent architectures, such as the long short-term memory, or LSTM, and the gated recurrent unit, or GRU. These architectures are very important, so we'll speak about them in a separate video later this week; for now, we will briefly discuss some other ideas.

5:28

As you already know, the Jacobian matrix, which may cause the problem of vanishing and exploding gradients, depends on the choice of the activation function and on the values of the recurrent weights W.

5:41

So if we want to overcome the vanishing gradient problem, we should try to use the rectified linear unit activation function, which is much more resistant to this problem. And what about the recurrent weight matrix W?

From linear algebra, you may remember that an orthogonal matrix is a square matrix such that its transpose is equal to its inverse. Orthogonal matrices have many interesting properties, but the most important one for us is that all the eigenvalues of an orthogonal matrix have absolute value 1. This means that no matter how many times we perform repeated matrix multiplication, the resulting matrix doesn't explode or vanish.

6:21

Therefore, if we initialize the recurrent weight matrix W with an orthogonal matrix, the second part of the Jacobian doesn't cause the vanishing gradient problem, at least in the first iterations of training.
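A common way to build such an initialization is to take the QR decomposition of a random matrix; a minimal sketch (the helper name is our own) looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal_init(n):
    """Build an n x n orthogonal matrix from the QR decomposition of a random matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    return q * np.sign(np.diag(r))   # sign fix for a uniform distribution

W = orthogonal_init(5)

# The transpose is the inverse, all eigenvalues have absolute value 1,
# so repeated multiplication by W neither explodes nor vanishes:
W_pow = np.linalg.matrix_power(W, 50)
```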

6:41

There are some approaches that utilize the properties of orthogonal matrices not just for a proper initialization, but also for the parameterization of the weights during the whole training process. But these methods are out of the scope of this course.

The last idea that we will discuss in this video is the idea of using skip connections. In a recurrent neural network, we can't carry the contributions to the gradient across many time steps, because at each step we need to multiply them by the Jacobian matrix, and as a result, they vanish.

7:12

Let's add shortcuts between hidden states that are separated by more than one time step. These shortcuts are regular connections with their own parameter matrices. By using them, we create much shorter paths between faraway time steps in the network. So when we backpropagate the gradients along these short paths, they vanish more slowly, and we can learn longer dependencies with such a network.
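A forward pass with such shortcuts might look like the following sketch, where `S` is the hypothetical parameter matrix of a shortcut reaching several time steps back (all sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

H, X, skip = 4, 3, 3             # hidden size, input size, shortcut span (arbitrary)
W = rng.normal(0, 0.1, (H, H))   # usual recurrent weights (h_{t-1} -> h_t)
S = rng.normal(0, 0.1, (H, H))   # shortcut weights (h_{t-skip} -> h_t)
U = rng.normal(0, 0.1, (H, X))   # input weights

xs = rng.normal(size=(10, X))    # a length-10 input sequence
hs = [np.zeros(H)]               # h_0
for t, x in enumerate(xs, start=1):
    pre = W @ hs[t - 1] + U @ x
    if t - skip >= 0:            # add the shortcut from a state `skip` steps back
        pre = pre + S @ hs[t - skip]
    hs.append(np.tanh(pre))
# Gradients can now flow from h_t back to h_{t-skip} through S in one step,
# instead of being multiplied by `skip` Jacobians along the way.
```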

7:49

Yeah, this is similar to residual networks, which contain identity shortcut connections in each layer. We can, of course, also use such shortcuts in recurrent neural networks, just like the well-known identity shortcuts in architectures for computer vision tasks.

8:15

Gradient clipping is a simple method to combat the exploding gradients, and truncated backpropagation through time can also help us with this, in addition to accelerating the training. To overcome the vanishing gradient problem, we can use several methods, including a careful choice of the activation function, a proper initialization of the recurrent weights, and a modification of the network with additional skip connections.