
In the last video, you learned about the GRU,

the gated recurrent units,

and how that can allow you to learn very long range connections in a sequence.

The other type of unit that allows you to do this very well

is the LSTM or the long short term memory units,

and this is even more powerful than the GRU. Let's take a look.

Here are the equations from the previous video for the GRU.

And for the GRU,

we had a_t equals c_t,

two gates, the update gate and the relevance gate, and c_tilde_t,

which is a candidate for replacing the memory cell,

and then we used the update gate, gamma_u,

to decide whether or not to update c_t using c_tilde_t.
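As a rough sketch, the GRU equations recapped above might be written in numpy like this (the function name, parameter packing, and dimensions are my own illustrative choices, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x, params):
    """One GRU timestep. Returns (a_t, c_t); for the GRU, a_t equals c_t."""
    Wc, bc, Wu, bu, Wr, br = params
    concat = np.concatenate([c_prev, x])   # [c_{t-1}, x_t]
    gamma_r = sigmoid(Wr @ concat + br)    # relevance gate
    gamma_u = sigmoid(Wu @ concat + bu)    # update gate
    # candidate for replacing the memory cell
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x]) + bc)
    # the update gate decides whether to update c_t using c_tilde_t
    c = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
    return c, c                            # a_t = c_t
```

Note the single gate gamma_u controlling both terms of the cell update; this is the piece the LSTM splits in two.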

The LSTM is an even slightly more powerful and more general version of the GRU,

and is due to Sepp Hochreiter and Jurgen Schmidhuber.

And this was a really seminal paper,

a huge impact on sequence modelling.

I think this paper is one of the more difficult ones to read.

It goes quite deep into the theory of vanishing gradients.

And so, I think more people have learned about the details of LSTM through

maybe other places than from this particular paper even though I think

this paper has had a wonderful impact on the Deep Learning community.

But these are the equations that govern the LSTM.

So, we'll continue to use the memory cell, c,

and the candidate value for updating it, c_tilde_t,

will be this, and so on.

Notice that for the LSTM,

we will no longer have the case that a_t is equal to c_t.

So, this is what we use.

And so, this is just like the equation on the left,

except that now we specifically use a_t minus one instead of c_t minus one.

And we're not using this gamma or this relevance gate.

Although you could have a variation of the LSTM where you put that back in,

the more common version of the LSTM

doesn't bother with that.

And then we will have an update gate, same as before.

So, gamma_u equals sigmoid of W_u applied to a_t minus one, x_t, plus b_u.

And one new property of the LSTM is,

instead of having one update gate control,

both of these terms,

we're going to have two separate terms.

So instead of gamma_u and one minus gamma_u,

we're going to have gamma_u here.

And forget gate, which we're going to call gamma_f.

So, this gate, gamma_f,

is going to be sigmoid of pretty much what you'd

expect: W_f applied to a_t minus one, x_t, plus b_f.

And then, we're going to have a new output gate, which is sigmoid of W_o.

And then again, pretty much what you'd expect, plus b_o.

And then, the update equation for the memory cell will be c_t equals gamma_u asterisk c_tilde_t.

And this asterisk denotes element-wise multiplication.

This is a vector-vector element-wise multiplication,

plus, and instead of one minus gamma_u,

we're going to have a separate forget gate, gamma_f,

times c of t minus one.

So this gives the memory cell the option of keeping

the old value c_t minus one and then just adding to it,

this new value, c_tilde_t,

using separate update and forget gates.

So, the subscripts u, f, and o stand for the update, forget, and output gates.

And then finally, instead of a_t equals c_t,

we have a_t equal to the output gate element-wise multiplied by tanh of c_t.

So, these are the equations that govern the LSTM

and you can tell it has three gates instead of two.

So, it's a bit more complicated and it places the gates into slightly different places.
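These three-gate equations can also be sketched in numpy; again, the function name, parameter packing, and dimensions here are my own illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x, params):
    """One LSTM timestep. Returns (a_t, c_t); note a_t no longer equals c_t."""
    Wc, bc, Wu, bu, Wf, bf, Wo, bo = params
    concat = np.concatenate([a_prev, x])     # [a_{t-1}, x_t], not c_{t-1}
    c_tilde = np.tanh(Wc @ concat + bc)      # candidate value
    gamma_u = sigmoid(Wu @ concat + bu)      # update gate
    gamma_f = sigmoid(Wf @ concat + bf)      # forget gate (replaces 1 - gamma_u)
    gamma_o = sigmoid(Wo @ concat + bo)      # output gate
    c = gamma_u * c_tilde + gamma_f * c_prev # element-wise cell update
    a = gamma_o * np.tanh(c)                 # hidden state output
    return a, c
```

The `*` operations are all element-wise vector multiplications, matching the asterisks in the equations.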

So, here again are the equations governing the behavior of the LSTM.

Once again, it's traditional to explain these things using pictures.

So let me draw one here.

And if these pictures are too complicated, don't worry about it.

I personally find the equations easier to understand than the picture.

But I'll just show the picture here for the intuitions it conveys.

The picture here was very much inspired by a blog post by Chris Olah, titled

Understanding LSTM Networks, and the diagram drawn

here is quite similar to one that he drew in his blog post.

But the key thing to take away from this picture is maybe that you use

a_t minus one and x_t to compute all the gate values.

In this picture, you have a_t minus one,

x_t coming together to compute the forget gate,

to compute the update gates,

and to compute the output gate.

And they also go through a tanh to compute c_tilde_t.

And then these values are combined in

these complicated ways with element-wise multiplies and so on,

to get c_t from the previous c_t minus one.

Now, one element of this that is interesting is when you have a bunch of these units and connect them.

So, that's one of them, and you then connect these temporally,

so it takes the inputs x_1, then x_2, then x_3.

So, you can take these units and just line them up as follows,

where the output a at the previous timestep is the input a at the next timestep,

and similarly for c. I've simplified the diagrams a little bit at the bottom.

And one cool thing you'll notice about this is that

there's this line at the top that shows how,

so long as you set the forget and the update gate appropriately,

it is relatively easy for the LSTM to have

some value c_0 and have that be passed all the way to the right to have your,

maybe, c_3 equals c_0.

And this is why the LSTM,

as well as the GRU,

is very good at memorizing certain values for a long time:

certain real values stored in the memory cell can be maintained even for many, many timesteps.
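A minimal numerical sketch of this property: if the forget gate saturates near one and the update gate near zero, the cell update equation passes c_0 through essentially unchanged, no matter what candidates come along (the gate values and sizes below are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma_u = 1e-6          # update gate saturated near 0
gamma_f = 1.0 - 1e-6    # forget gate saturated near 1
c = np.array([2.0, -3.0])                # c_0: the values to remember
for t in range(100):                     # 100 timesteps
    c_tilde = rng.standard_normal(2)     # whatever candidate shows up
    c = gamma_u * c_tilde + gamma_f * c  # the LSTM cell update equation
# c is still essentially c_0 after 100 steps
```

This additive pass-through along the top line of the diagram is exactly why the LSTM (and the GRU) resists vanishing gradients over long ranges.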

So, that's it for the LSTM.

As you can imagine,

there are also a few variations on this that people use.

Perhaps, the most common one is that instead of just having

the gate values be dependent only on a_t minus one, x_t,

sometimes, people also sneak in there the values c_t minus one as well.

This is called a peephole connection.

Not a great name maybe but you'll see, peephole connection.

What that means is that the gate values may depend not just on a_t minus one and on x_t,

but also on the previous memory cell value,

and the peephole connection can go into all three of these gates' computations.

So that's one common variation you see of LSTMs.

One technical detail is that these are, say, 100-dimensional vectors.

So if you have a 100-dimensional hidden memory cell, then the gates are 100-dimensional as well.

And the, say, fifth element

of c_t minus one affects only the fifth element of the corresponding gates,

so that relationship is one-to-one,

where it's not the case that every element of

the 100-dimensional c_t minus one can affect all elements of the gates.

Instead, the first element of c_t minus one affects the first element of the gates,

the second element affects the second element, and so on.

But if you ever read a paper and see someone talk about the peephole connection,

that's what they mean: that c_t minus one is used to affect the gate value as well.
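As a sketch of one such gate with a peephole, the one-to-one relationship can be expressed as an element-wise product with a per-unit weight vector (the name `pf` and this packaging are my own assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate_peephole(a_prev, x, c_prev, Wf, bf, pf):
    """Forget gate with a peephole connection: pf is a per-unit vector,
    so element i of c_prev only affects element i of the gate."""
    concat = np.concatenate([a_prev, x])
    return sigmoid(Wf @ concat + pf * c_prev + bf)  # pf * c_prev is element-wise
```

Because the peephole term is `pf * c_prev` rather than a full matrix product, changing one element of the memory cell moves only the corresponding element of the gate.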

So, that's it for the LSTM.

When should you use a GRU?

And when should you use an LSTM?

There isn't widespread consensus on this.

And even though I presented GRUs first,

in the history of deep learning,

LSTMs actually came much earlier,

and GRUs were a relatively recent invention,

perhaps derived as a simplification of the more complicated LSTM model.

Researchers have tried both of these models on many different problems,

and on different problems,

different algorithms will win out.

So, there isn't a universally-superior algorithm

which is why I want to show you both of them.

But I feel like when I am using these,

the advantage of the GRU is that it's a simpler model

and so it is actually easier to build a much bigger network,

it only has two gates,

so computationally, it runs a bit faster.

So, it scales better to building somewhat bigger models, but the LSTM

is more powerful and more effective since it has three gates instead of two.
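To make the "two gates versus three" cost difference concrete, here's a rough parameter tally (the sizes and the no-peephole assumption are my own illustrative choices):

```python
def gate_params(n_a, n_x):
    """Parameters for one gate or candidate computation: a weight matrix
    acting on the concatenation [a_{t-1}, x_t], plus a bias vector."""
    return n_a * (n_a + n_x) + n_a

n_a, n_x = 100, 50                          # illustrative sizes
gru_total = 3 * gate_params(n_a, n_x)       # candidate + update + relevance
lstm_total = 4 * gate_params(n_a, n_x)      # candidate + update + forget + output
# The ratio is 4/3, so the GRU does about 25% less gate computation per step.
```

This is the sense in which the GRU is cheaper and easier to scale, while the LSTM spends more parameters per unit.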

If you want to pick one to use,

I think LSTM has been the historically more proven choice.

So, if you had to pick one,

I think most people today will still use the LSTM as the default first thing to try.

Although, I think in the last few years,

GRUs had been gaining a lot of momentum and I feel like more and more teams

are also using GRUs because they're a bit simpler but often work just as well.

It might be easier to scale them to even bigger problems.

So, that's it for LSTMs.

Well, with either GRUs or LSTMs,

you'll be able to build neural networks that can capture much longer-range dependencies.