You've learned about how RNNs work and how they

can be applied to problems like named entity recognition,

as well as to language modeling,

and you saw how backpropagation can be used to train an RNN.

It turns out that one of the problems with

a basic RNN algorithm is that it runs into vanishing gradient problems.

Let's discuss that, and then in the next few videos,

we'll talk about some solutions that will help to address this problem.

So, you've seen pictures of RNNs that look like this.

And let's take a language modeling example.

Let's say you see this sentence,

"The cat, which already ate a bunch of food that was delicious ...,

was full."

And so, to be consistent,

just because cat is singular,

it should be "the cat was," not "the cat were." Compare that with,

"The cats, which already ate a bunch of food that was delicious,

and apples, and pears,

and so on, were full."

So to be consistent,

it should be cat was or cats were.

And this is one example of when language can have very long-term dependencies,

where a word that appears much earlier can

affect what needs to come much later in the sentence.

But it turns out the basic RNN we've seen so far is not

very good at capturing very long-term dependencies.

To explain why, you might remember from

our early discussions of training very deep neural networks,

that we talked about the vanishing gradients problem.

So if this is a very, very deep neural network, say,

100 layers or even much deeper, then you would carry out forward prop,

from left to right and then back prop.

And we said that, if this is a very deep neural network,

then the gradient from this output y,

would have a very hard time propagating back to

affect the weights of these earlier layers,

to affect the computations in the earlier layers.

And an RNN has a similar problem:

you have forward prop going from left to right,

and then back prop,

going from right to left.

And it can be quite difficult,

because of the same vanishing gradients problem,

for the errors associated with

the later time steps to affect the computations that are earlier.
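
To make that concrete, here is a minimal sketch in Python with NumPy, not the course's own code. It assumes a toy tanh RNN whose loss depends only on the last hidden state, and the names `backprop_gradient_norms`, `W_aa`, `W_ax`, and `weight_scale` are hypothetical choices for illustration. It runs forward prop from left to right, then back prop from right to left, and records how large the gradient still is at each earlier time step.

```python
import numpy as np

def backprop_gradient_norms(T=50, n_a=16, weight_scale=0.5, seed=0):
    """Return the norm of dL/da<t> for t = T down to 1 in a toy tanh RNN.

    Assumption: the loss depends only on the final hidden state a<T>,
    and we start back prop from a unit-norm gradient at that state.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical recurrent weights: a scaled orthogonal matrix, so that
    # weight_scale directly controls how much each step shrinks the gradient.
    Q, _ = np.linalg.qr(rng.standard_normal((n_a, n_a)))
    W_aa = weight_scale * Q
    W_ax = rng.standard_normal((n_a, 1))

    # Forward prop, left to right: a<t> = tanh(W_aa a<t-1> + W_ax x<t>)
    a = np.zeros(n_a)
    activations = []
    for _ in range(T):
        x_t = rng.standard_normal(1)
        a = np.tanh(W_aa @ a + W_ax @ x_t)
        activations.append(a)

    # Back prop, right to left, starting from a unit-norm gradient at a<T>.
    grad = np.ones(n_a) / np.sqrt(n_a)
    norms = [np.linalg.norm(grad)]
    for t in range(T - 1, 0, -1):
        # Chain rule through tanh, then through the recurrent weights.
        grad = W_aa.T @ (grad * (1.0 - activations[t] ** 2))
        norms.append(np.linalg.norm(grad))
    return norms

norms = backprop_gradient_norms()
print("gradient norm at the last time step: ", norms[0])
print("gradient norm 10 steps earlier:      ", norms[10])
print("gradient norm at the first time step:", norms[-1])
```

With these assumed settings, each step back in time at least halves the gradient norm, so the error from the last time step has almost no way to influence the earliest computations.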

And so in practice, what this means is,

it might be difficult to get a neural network to realize that it needs

to memorize whether it just saw a singular noun or a plural noun,

so that later on in the sequence it can generate either "was" or "were,"

depending on whether it was singular or plural.

And notice that in English,

this stuff in the middle could be arbitrarily long, right?

So you might need to memorize the singular/plural for

a very long time before you get to use that bit of information.

So because of this problem,

the basic RNN model has mainly local influences,

meaning that the output y^<3> is mainly influenced by values close to y^<3>.

And a value here is mainly influenced by inputs that are somewhere close.

And it's difficult for the output here to be strongly

influenced by an input that was very early in the sequence.

And this is because whatever the output is,

whether this got it right, this got it wrong,

it's just very difficult for the error to

backpropagate all the way to the beginning of the sequence,

and therefore to modify how the neural network

is doing computations earlier in the sequence.

So this is a weakness of the basic RNN algorithm.

One which will be addressed in the next few videos.

But if we don't address it, then RNNs

tend not to be very good at capturing long-range dependencies.

And even though this discussion has focused on vanishing gradients,

you will remember when we talked about very deep neural networks,

that we also talked about exploding gradients.

When doing back prop,

the gradients might not just decrease exponentially,

they may also increase exponentially with the number of layers you go through.
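
As a rough illustration of what "exponentially" means here, the sketch below multiplies a starting gradient by the same factor once per layer. The function name `scaled_gradient` and the factors 0.9 and 1.1 are assumptions picked for the example. A real network multiplies by a different matrix at each layer, so treat this as a simplified picture only.

```python
def scaled_gradient(factor, num_layers, start=1.0):
    """Multiply a starting gradient by the same factor once per layer."""
    grad = start
    for _ in range(num_layers):
        grad *= factor
    return grad

# A factor slightly below 1 shrinks toward zero, a factor slightly above 1 blows up.
for factor in (0.9, 1.1):
    for num_layers in (10, 100, 1000):
        print(factor, num_layers, scaled_gradient(factor, num_layers))
```

Even a factor that is only a little below or above 1 gives a very small or very large number once you go through enough layers.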

It turns out that vanishing gradients tend to be the bigger problem with training RNNs,

although when exploding gradients happen,

it can be catastrophic because

the exponentially large gradients can cause your parameters

to become so large that your neural network parameters get really messed up.

So it turns out that exploding gradients are easier to spot because

the parameters just blow up and you might often see NaNs,

or "not a number" values,

meaning results of a numerical overflow in your neural network computation.
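
Here is a small sketch of how an overflow can turn into a NaN, assuming 32-bit floats and NumPy. The function name `grow_until_nonfinite` and the growth factor of 10 are hypothetical, chosen just for this example. The value keeps growing until it overflows to infinity, and then an operation like infinity minus infinity gives NaN.

```python
import numpy as np

def grow_until_nonfinite(factor=10.0, max_steps=100):
    """Multiply a float32 value by `factor` until it is no longer finite.

    Returns (step, overflowed_value, overflowed_value - overflowed_value),
    or None if the value stays finite for max_steps steps.
    """
    value = np.float32(1.0)
    # Silence NumPy's overflow and invalid-value warnings for this demo.
    with np.errstate(over="ignore", invalid="ignore"):
        for step in range(1, max_steps + 1):
            value = value * np.float32(factor)
            if not np.isfinite(value):
                # inf - inf is undefined, so the result is NaN.
                return step, value, value - value
    return None

step, overflowed, difference = grow_until_nonfinite()
print("overflowed after", step, "steps:", overflowed, "and inf - inf gives", difference)
```

So a quick check like `np.isnan` or `np.isfinite` on your parameters or your loss is one simple way to spot that something has blown up.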

And if you do see exploding gradients,

one solution to that is to apply gradient clipping.

And what that really means,

all that means is look at your gradient vectors,

and if they are bigger than some threshold,

rescale your gradient vectors so that they are not too big.

So they are clipped according to some maximum value.

So if you see exploding gradients,

if your derivatives do explode or you see NaNs,

just apply gradient clipping,

and that's a relatively robust solution that will take care of exploding gradients.
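
One way to sketch gradient clipping in Python with NumPy is shown below. This is one common variant, rescaling by the overall norm of the gradients, and not necessarily the exact version used in this course. The function name `clip_by_global_norm`, the gradient names `dWaa` and `dba`, and the threshold of 1.0 are all assumptions for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a dict of gradient arrays if their combined norm exceeds max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads.values()))
    if total_norm > max_norm:
        # Shrink every gradient by the same factor, so the direction is kept
        # and the combined norm becomes max_norm.
        scale = max_norm / total_norm
        grads = {name: g * scale for name, g in grads.items()}
    return grads, total_norm

# Hypothetical gradients with a combined norm of 5.
grads = {"dWaa": np.array([[3.0, 0.0], [0.0, 4.0]]), "dba": np.zeros(2)}
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print("norm before clipping:", norm_before)
print("clipped dWaa:", clipped["dWaa"])
```

Another variant clips each element of the gradient to a range such as [-threshold, threshold], for example with `np.clip`. Either way, the idea is the same: you cap how big an update step can get.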

But the vanishing gradients problem is much harder to solve

and it will be the subject of the next few videos.

So to summarize, in an earlier course,

you saw how, when training a very deep neural network,

you can run into vanishing gradient or exploding gradient problems, where the derivative

either decreases exponentially or grows

exponentially as a function of the number of layers.

And an RNN, say an RNN processing data over 1,000 time steps,

or over 10,000 time steps,

is basically a 1,000-layer or maybe a 10,000-layer neural network,

and so, it too runs into these types of problems.

Exploding gradients, you could sort of address by just using gradient clipping,

but vanishing gradients will take more work to address.

So what we'll do in the next video is talk about the GRU,

the gated recurrent unit,

which is a very effective solution for addressing

the vanishing gradient problem and will allow

your neural network to capture much longer range dependencies.

So, let's go on to the next video.