0:02

In this video, we will discuss two main problems that arise in training recurrent neural networks: the problems of exploding and vanishing gradients. As you already know, we can use the backpropagation algorithm to train a recurrent neural network, but in this case we backpropagate gradients not only through layers but also through time.

As an example, in the previous video we derived the expression for the gradient of the loss L with respect to the recurrent weight matrix W. And now we know that to compute the gradient of the loss at time step t with respect to W, we should sum up the contributions from all the previous time steps to this gradient.

Now let's look at the expression for the gradient more closely. As you can see, there is a product of Jacobian matrices in each term of the sum. And if we look at the particular term which corresponds to the contribution from some step k to the gradient at time step t, then we see that the more steps there are between time moments k and t, the more elements there are in this product. So the values of the Jacobian matrices have a very strong influence, especially on the contributions from faraway steps.

Let's suppose for a moment that we have only one hidden unit in our network, so h is a scalar. Then all the elements in the expression for the gradient are also scalars: the gradient itself is a scalar, all the Jacobians are scalars, and so on. In this case, it is clear that if all the Jacobians, now scalars, are less than one in absolute value, then their product goes to zero exponentially fast as the number of elements in the product tends to infinity. And on the contrary, if all the Jacobian scalars are greater than one in absolute value, then the product goes to infinity exponentially fast.
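As a quick sanity check of this scalar case, here is a minimal sketch, assuming for simplicity that every Jacobian scalar in the product has the same absolute value:

```python
# Product of T identical Jacobian "scalars": vanishes or explodes
# exponentially depending on whether the value is below or above one.
def jacobian_product(value, steps):
    return value ** steps

# Jacobians below 1 in absolute value: the product vanishes.
print(jacobian_product(0.9, 100))   # about 2.7e-05
# Jacobians above 1 in absolute value: the product explodes.
print(jacobian_product(1.1, 100))   # about 1.4e+04
```

Even a modest 10% deviation from one, compounded over 100 time steps, changes the contribution by four orders of magnitude.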

As a result, in the first case, the contributions from faraway steps go to zero and the gradient contains only the information about nearby steps. This makes it difficult to learn long-range dependencies with a simple recurrent neural network. This problem is usually called the vanishing gradient problem, because a lot of elements in the gradient simply vanish and don't affect the training.

In the second case, the contributions from faraway steps grow exponentially fast, so the gradient itself grows too. If an input sequence is long enough, the gradient may even become not-a-number in practice. This problem is called the exploding gradient problem, and it makes the training very unstable. The issue is that if the gradient is a large number, then we make a long step in the direction of this gradient in the parameter space. Since we optimize a very complex multi-modal function, and we use stochastic methods, we may end up at a very poor point after such a step.

OK, we have discussed the simplified case; now let's return to real life. A recurrent neural network usually contains not just one hidden unit but a whole vector of them. Consequently, the Jacobians are really matrices, not scalars. We can apply the same reasoning here, but instead of the absolute value, we need to use the spectral matrix norm, which is equal to the largest singular value of the matrix. If all the Jacobian matrices in the product have norms which are less than one, then the gradient vanishes. And if all the Jacobian matrices have norms which are higher than one, then the gradient explodes.
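A small numeric sketch of the matrix case. To keep the demonstration honest we use a single symmetric matrix (so its spectral norm equals its spectral radius and the exponential behavior is exact); the dimension and seed are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def repeated_jacobian_norm(scale, steps, dim=10):
    # Build a symmetric matrix and rescale it so that its spectral norm
    # (largest singular value) equals `scale`.
    A = rng.standard_normal((dim, dim))
    J = (A + A.T) / 2
    J *= scale / np.linalg.norm(J, 2)
    # Spectral norm of the repeated product J @ J @ ... @ J (steps times).
    P = np.linalg.matrix_power(J, steps)
    return np.linalg.norm(P, 2)

print(repeated_jacobian_norm(0.9, 50))   # about 0.9**50 ~ 5e-3, vanishing
print(repeated_jacobian_norm(1.1, 50))   # about 1.1**50 ~ 1.2e+2, exploding
```

For general (non-symmetric) Jacobians the spectral norm only gives an upper bound on the product's norm, but the same exponential tendency holds.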

Now we know that the values of the Jacobian matrices are crucial in training a recurrent neural network, so let's see what values they have in practice. As you remember, the hidden units at time step t can be computed by applying some nonlinear function f to a linear combination of the inputs at the current step and the hidden units at the previous time step. Let us call this linear combination the preactivation at time step t. To compute the Jacobian matrix of the hidden units at time step t with respect to the hidden units at the previous time step, we can use the chain rule.

So we first compute the Jacobian of h with respect to its preactivation, and then compute the Jacobian matrix of this preactivation with respect to the previous hidden units. Since f is an element-wise nonlinearity, the first Jacobian here is a diagonal matrix with the derivatives of f on the diagonal. And how to compute the second Jacobian? This is a question for you. Since the preactivation is a linear combination of some elements, the second Jacobian consists of the weights of the hidden units in this linear combination, so it is equal to the weight matrix W. Now let's look at both parts of the Jacobian matrix of the hidden units at time step t with respect to the hidden units at time step t-1 separately.
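The chain-rule result above can be checked numerically. A minimal sketch, assuming a tanh nonlinearity and small made-up weights (the input term is folded into a bias for brevity), compared against a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4
W = rng.standard_normal((dim, dim)) * 0.5   # hypothetical recurrent weights
b = rng.standard_normal(dim)                # hypothetical bias (input term omitted)
h_prev = rng.standard_normal(dim)

def step(h):
    # One recurrent step: h_t = f(W @ h_{t-1} + b), with f = tanh.
    return np.tanh(W @ h + b)

# Chain rule: Jacobian = diag(f'(preactivation)) @ W.
z = W @ h_prev + b
jacobian = np.diag(1.0 - np.tanh(z) ** 2) @ W

# Central finite differences, one column per hidden unit.
eps = 1e-6
numeric = np.zeros((dim, dim))
for j in range(dim):
    e = np.zeros(dim)
    e[j] = eps
    numeric[:, j] = (step(h_prev + e) - step(h_prev - e)) / (2 * eps)

print(np.max(np.abs(jacobian - numeric)))  # close to zero
```

The two matrices agree up to finite-difference error, confirming the diag(f') times W structure.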

The first part depends on the type of nonlinearity we use. The usual choices of nonlinearity for neural networks are the sigmoid, hyperbolic tangent, and rectified linear unit functions. As you can see on the left part of the slide, all these nonlinearities are very flat in a large part of the input space: the sigmoid and hyperbolic tangent are almost constant for both small and large inputs, and the rectified linear unit is equal to zero for all negative inputs. The derivatives of these nonlinearities are very close to zero in the regions where they are flat. So, as you can see on the right part of the slide, the derivatives of the sigmoid and hyperbolic tangent are less than one almost everywhere, and this may very likely cause the vanishing gradient problem. The situation with the rectified linear unit is much better: at least for positive inputs, its derivative is equal to one. But the gradients may still vanish because of the zero derivative in the negative part of the input space.
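These derivative bounds are easy to verify directly. A short sketch evaluating each derivative on a grid (the interval [-5, 5] is an arbitrary illustrative choice):

```python
import numpy as np

x = np.linspace(-5, 5, 1001)

sigmoid = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sigmoid * (1.0 - sigmoid)   # sigmoid'(x), maximal at x = 0
d_tanh = 1.0 - np.tanh(x) ** 2          # tanh'(x), maximal at x = 0
d_relu = (x > 0).astype(float)          # ReLU'(x): 1 for positive, 0 for negative

print(d_sigmoid.max())      # 0.25
print(d_tanh.max())         # 1.0
print(d_relu[x < 0].max())  # 0.0
```

The sigmoid's derivative never exceeds 0.25 and tanh's never exceeds 1, so long products of such Jacobians tend to shrink; ReLU passes the gradient through unchanged for positive inputs but zeroes it out entirely for negative ones.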

Now let's look at the second part of the Jacobian of h. The weight matrix W is a parameter of the model, so its norm could be either large or small. A small norm could aggravate the vanishing gradient problem, and a large norm could cause the exploding gradient problem, especially in combination with the rectified linear unit nonlinearity.

OK, let's summarize what we have learned in this video. Recurrent neural networks have a sequential nature, so they are very deep in time; therefore the vanishing and exploding gradient problems may arise during their training. Actually, these problems are not exclusive to recurrent neural networks; they also occur in deep feedforward networks. Vanishing gradients make the learning of long-range dependencies very difficult, and exploding gradients make the learning process unstable and may even crash it. In the next videos, we will discuss different methods that can help us overcome these issues.
