0:00

In this video, I'm going to talk about the exploding and vanishing gradients problem,

which is what makes it difficult to train recurrent neural networks.

For many years, researchers in neural networks thought they would never be able

to train these networks to model dependencies over long time periods.

But by the end of this video, I'll describe four different ways in which that

can now be done. To understand why it's so difficult to

train recurrent neural networks, we have to understand a very important difference

between the forward and backward passes in a recurrent neural net.

0:42

In the forward pass, we use squashing functions, like the logistic, to prevent

the activity vectors from exploding. So, if you look at the picture on the

right, each neuron is using a logistic unit shown by that blue curve and it can't

output any value greater than one or less than zero,

so that stops explosions. The backward pass, however, is completely

linear. Most people find this very surprising.

If you double the error derivatives at the final layer of this net, all the

error derivatives will double when you back propagate.
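This linearity of the backward pass is easy to check numerically. Here's a minimal sketch with a hypothetical two-layer logistic net (all the weight names and sizes are made up for illustration): once the forward pass has fixed the slopes, doubling the output-layer error derivatives doubles every backpropagated derivative.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # input-to-hidden weights (hypothetical)
W2 = rng.standard_normal((2, 4))   # hidden-to-output weights (hypothetical)

# Forward pass: the logistic squashes every activity into (0, 1).
x = rng.standard_normal(3)
h = logistic(W1 @ x)               # hidden activities
y = logistic(W2 @ h)               # output activities

def backprop_to_input(dE_dy):
    """Backpropagate output-layer error derivatives to the input.
    The slopes h*(1-h) and y*(1-y) were fixed by the forward pass,
    so this whole map is linear in dE_dy."""
    dE_dz2 = dE_dy * y * (1 - y)   # through the output nonlinearity's tangent
    dE_dh = W2.T @ dE_dz2
    dE_dz1 = dE_dh * h * (1 - h)   # through the hidden nonlinearity's tangent
    return W1.T @ dE_dz1

d = rng.standard_normal(2)
g1 = backprop_to_input(d)
g2 = backprop_to_input(2 * d)
print(np.allclose(g2, 2 * g1))     # True: doubling the deltas doubles every gradient
```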

So, if you look at the red dots that I put on the blue curves,

we'll suppose those are the activity levels of the neurons on the forward pass.

And so, when you back propagate, you're using the gradients of the blue curves at

those red dots. So the red lines are meant to show the

tangents to the blue curves at the red dots.

And, once you finish the forward pass, the slope of that tangent is fixed.

You now back propagate and the back propagation is like going forwards through

a linear system in which the slope of the non-linearity has been fixed.

Of course, each time you back propagate, the slopes will be different because they

were determined by the forward pass. But during the back propagation, it's a

linear system and so it suffers from a problem of linear systems, which is when

you iterate, they tend to either explode or die.

So when we back propagate through many layers, if the weights are small, the

gradients will shrink and become exponentially small. And that means that

when you back propagate through time, gradients for steps many time steps before

the target arrives will be tiny. Similarly, if the weights are big, the

gradients will explode. And that means when you back propagate

through time, the gradients will get huge and wipe out all your knowledge.

3:00

But as soon as we have a recurrent neural network trained on a long sequence, for

example 100 time steps, then if the gradients are growing as we back

propagate, we'll get whatever that growth rate is to the power of 100 and if they're

dying, we'll get whatever that decay is to the power of 100, and so they'll either

explode or vanish. We might be able to initialize the weights

very carefully to avoid this, and more recent work shows that indeed careful

initialization of the weights does make things work much better.
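The power-of-100 effect is easy to see with two hypothetical per-step factors, one slightly above one and one slightly below:

```python
# If the gradient is multiplied by roughly the same factor at each of
# 100 time steps, a factor slightly above 1 explodes and a factor
# slightly below 1 vanishes.  The factors 1.1 and 0.9 are illustrative.
grow, decay, steps = 1.1, 0.9, 100
print(grow ** steps)     # ~13780.6   -> exploding gradients
print(decay ** steps)    # ~2.66e-05  -> vanishing gradients
```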

3:56

Here's an example of exploding and dying gradients for a system that's trying to

learn attractors. So suppose we try and train a recurrent

neural network so that whatever state we start in, it

ends up in one of these two attractor states.

So we're going to learn a blue basin of attraction and a pink basin of attraction.

And if we start anywhere within the blue basin of attraction, we will end up at the

same point. What that means is that small differences

in our initial state make no difference to where we end up.

So the derivative of the final state with respect to changes in the initial state,

is zero. That's vanishing gradients.

When we back propagate through the dynamics of this system, we will discover

there's no gradient with respect to where you start, and the same with the pink basin of

attraction. If, however, we start very close to the

boundary between the two attractors, then a tiny difference in where we start,

one that's on the other side of the watershed, makes a huge difference to where we end

up. That's the exploding gradient problem. And so whenever you're trying to use a

recurrent neural network to learn attractors like this, you're bound to get

vanishing or exploding gradients. It turns out there are at least four

effective ways to train a recurrent neural network.
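The attractor picture can be sketched in one dimension. Here's a toy demo, assuming the hypothetical dynamics x ← tanh(w·x) with w > 1, which has one positive and one negative attractor with the watershed at x = 0: the numerical derivative of the final state with respect to the initial state vanishes inside a basin and explodes at the boundary.

```python
import math

def final_state(x0, w=3.0, steps=50):
    """Iterate the hypothetical dynamics x <- tanh(w * x).
    For w > 1 there are two attractors (one positive, one negative),
    and x = 0 is the watershed between their basins of attraction."""
    x = x0
    for _ in range(steps):
        x = math.tanh(w * x)
    return x

eps = 1e-6
# Deep inside a basin: nearby starts converge to the same attractor,
# so the numerical derivative d(final)/d(initial) is essentially zero.
inside = (final_state(0.5 + eps) - final_state(0.5 - eps)) / (2 * eps)
# Straddling the watershed: the two starts end at opposite attractors,
# so the same derivative is enormous.
boundary = (final_state(eps) - final_state(-eps)) / (2 * eps)
print(inside)    # ~0        -> vanishing gradient
print(boundary)  # ~1e6      -> exploding gradient
```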

5:45

The second method is to use a much better optimizer that can deal with very small

gradients. I'll talk about that in the next lecture.

The real problem in optimization is to detect small gradients that have an even

smaller curvature. Hessian-free optimization, with a version tailored to

neural networks, is good at doing that. The third method really kind of evades the

problem. What we do is we carefully initialize the

input to hidden weights and we very carefully initialize the hidden to hidden

weights, and also feedback weights from the outputs to the hidden units.

And the idea of this careful initialization is to make sure that the

hidden state has a huge reservoir of weakly coupled oscillators.

So if you hit it with an input sequence, it will reverberate for a long time, and

those reverberations are remembering what happened in the input sequence. You then

try and couple those reverberations to the output you want, and so the only thing that

learns in an Echo State Network is the connections between the hidden units and

the outputs. And if the output units are linear, that's

very easy to train. So this hasn't really learned the

recurrent connections. It's used fixed random recurrent connections, but carefully

chosen ones, and then just learned the hidden-to-output connections.
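Here's a minimal Echo State Network sketch along those lines. The specific recipe is an assumption, not from the lecture: the random recurrent weights are rescaled to spectral radius 0.9 (a common heuristic for getting long reverberations without explosion), and only the linear readout is fit, by least squares. All names and the toy delay task are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_res, n_in = 100, 1

# Fixed random weights: input-to-hidden and hidden-to-hidden are never trained.
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.standard_normal((n_res, n_res))
W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()   # spectral radius 0.9 (heuristic)

def run_reservoir(u):
    """Drive the fixed reservoir with the input sequence u; collect states."""
    x = np.zeros(n_res)
    states = []
    for u_t in u:
        # Reverberating hidden state: the reservoir's echo of past inputs.
        x = np.tanh(W_in @ np.atleast_1d(u_t) + W @ x)
        states.append(x)
    return np.array(states)

# Toy task: output the input from 3 steps ago (the reservoir must remember it).
u = rng.uniform(-1, 1, 500)
target = np.roll(u, 3)
X = run_reservoir(u)[50:]            # drop the initial transient
y = target[50:]

# Only the hidden-to-output weights are learned, by linear least squares.
W_out, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ W_out
print(np.mean((pred - y) ** 2))      # small training error on the delay task
```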

7:16

And the final method is to use momentum, but to use momentum with the kind of

initialization that was being used for Echo State Networks and that makes them

work even better. So it was very clever to find out how to

initialize these recurrent networks so they'll have interesting dynamics, but

they work even better if you now modify those dynamics slightly in the direction

that helps with the task at hand.