
You've seen how a basic RNN works. In this video, you'll learn about the Gated Recurrent Unit, a modification to the RNN hidden layer that makes it much better at capturing long-range connections and helps a lot with the vanishing gradient problem. Let's take a look.

You've already seen the formula for computing the activations at time t of an RNN: it's the activation function applied to the parameter Wa times the activations of the previous time-step and the current input, plus ba.
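Written out in the notation used in this course, that's:

```latex
a^{\langle t \rangle} = g\!\left(W_a\left[a^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_a\right)
```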

So I'm going to draw this as a picture. The RNN unit is drawn as a box which inputs a<t-1>, the activation from the last time-step, and also inputs x<t>, and these two go together. After some weights and this type of linear calculation, if g is a tanh activation function, then after the tanh it computes the output activation a<t>. And the output activation a<t> might also be passed to, say, a softmax unit or something that could then be used to output y<t>. So this is a visualization of the RNN unit of the hidden layer of the RNN in terms of a picture.

And I want to show you this picture because we're going to use a similar picture to explain the GRU, or Gated Recurrent Unit. Many of the ideas of the GRU are due to these two papers, by Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho and Yoshua Bengio.

And I'm sometimes going to refer to this sentence, which we saw in the last video, to motivate this. Given a sentence like this, you might need to remember that "cat" was singular to make sure you use "was" rather than "were": "the cat ... was" versus "the cats ... were".

So as we read this sentence from left to right, the GRU unit is going to have a new variable called c, which stands for cell, as in memory cell. What the memory cell does is provide a bit of memory to remember, for example, whether "cat" was singular or plural, so that when it gets much further into the sentence it can still take into consideration whether the subject of the sentence was singular or plural.

And so at time t the memory cell will have some value c<t>. What we'll see is that the GRU unit will actually output an activation value a<t> that's equal to c<t>. For now I want to use the different symbols c and a to denote the memory cell value and the output activation value, even though they are the same. I'm using this notation because when we talk about LSTMs a little bit later, these will be two different values. But for now, for the GRU, c<t> is equal to the output activation a<t>.

So these are the equations that govern the computations of a GRU unit. At every time-step, we're going to consider overwriting the memory cell with a value c tilde of t, which is going to be a candidate for replacing c<t>. We compute it with an activation function, tanh, applied to the parameter matrix Wc times the previous value of the memory cell (which is the activation value) together with the current input value x<t>, plus the bias. So c tilde of t is going to be a candidate for replacing c<t>.
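In symbols, the candidate value is:

```latex
\tilde{c}^{\langle t \rangle} = \tanh\!\left(W_c\left[c^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_c\right)
```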

And then the key, really the important idea, of the GRU is that we have a gate. I'm going to call the gate gamma u; this is the capital Greek letter gamma with subscript u, where u stands for update gate, and this will be a value between zero and one.

To develop your intuition about how GRUs work, think of gamma u, this gate value, as being always zero or one, although in practice you compute it with a sigmoid function applied to this quantity. So remember that the sigmoid function looks like this, and so its value is always between zero and one. For most of the possible ranges of the input, the sigmoid function is either very, very close to zero or very, very close to one. So for intuition, think of gamma u as being either zero or one most of the time.
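Written out, the update gate is:

```latex
\Gamma_u = \sigma\!\left(W_u\left[c^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_u\right)
```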

I chose the letter gamma for this because a gate in a fence looks a bit like this, I guess, and there are a lot of capital gammas in that fence; that's why we'll use gamma u to denote the gate. Gamma is also the Greek letter G, so G for gamma and G for gate.

And then next, the key part of the GRU is this equation: we've come up with a candidate, where we're thinking of updating c using c tilde, and then the gate will decide whether or not we actually update it. The way to think about it is that maybe this memory cell c is going to be set to either zero or one depending on whether the word you're considering, really the subject of the sentence, is singular or plural. Because it's singular, let's say we set this to one; and if it were plural, maybe we would set it to zero. Then the GRU unit would memorize the value of c<t> all the way until here, where this is still equal to one, and so that tells it: oh, it's singular, so choose "was". And the job of the gate, of gamma u, is to decide when to update these values. In particular, when you see the phrase "the cat", you know you're talking about a new concept, namely the subject of the sentence, "cat". So that would be a good time to update this bit. And then when you're done using it, "the cat blah blah blah was full", then you know: okay, I don't need to memorize it anymore, I can just forget it.

So the specific equation we'll use for the GRU is the following: the actual value of c<t> will be equal to this gate times the candidate value, plus one minus the gate times the old value, c<t-1>.
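Written out, that update is:

```latex
c^{\langle t \rangle} = \Gamma_u * \tilde{c}^{\langle t \rangle} + \left(1 - \Gamma_u\right) * c^{\langle t-1 \rangle}
```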

So notice that if the gate, this update value, is equal to one, then it's saying: set the new value of c<t> equal to the candidate value. So that's like over here: set the gate equal to one, so go ahead and update that bit. And then for all of these values in the middle, you should have the gate equal to zero. This is saying: don't update it, don't update it, don't update it, just hang onto the old value. Because if gamma u is equal to zero, then this term would be zero and this one would be one, and so it's just setting c<t> equal to the old value, even as you scan the sentence from left to right. So when the gate is equal to zero, we're saying don't update it, just hang on to the value and don't forget what this value was. That way, even when you get all the way down here, hopefully you've just been setting c<t> equal to c<t-1> all along, and it still memorizes that the cat was singular.

So let me also draw a picture to denote the GRU unit. And by the way, when you look at online blog posts and textbooks and tutorials, these types of pictures are quite popular for explaining GRUs as well as, as we'll see later, LSTM units. I personally find the equations easier to understand than the pictures, so if the picture doesn't make sense, don't worry about it; I'll just draw it in case it helps some of you.

So a GRU unit inputs c<t-1>, the memory cell from the previous time-step, which just happens to be equal to a<t-1>. It takes that as input, and it also takes as input x<t>, and these two things get combined together. With some appropriate weighting and a tanh, this gives you c tilde t, which is a candidate for replacing c<t>; and then, with a different set of parameters and through a sigmoid activation function, this gives you gamma u, which is the update gate. And then finally, all of these things combine together through another operation. I won't write out the formula, but this box here, which I shaded in purple, represents the equation we had down there. So that's what this purple operation represents: it takes as input the gate value, the candidate new value, then the gate value again, and the old value c<t-1>. So it takes as input this, this and this, and together they generate the new value for the memory cell. And so that's c<t>, which equals a<t>. And if you wish, you could also pass this to a softmax or something to make some prediction for y<t>.
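The computation the unit just described performs at one time-step can be sketched in numpy. This is a minimal illustration of the simplified GRU, not a training-ready implementation; the function name, shapes and random weights are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, Wc, bc, Wu, bu):
    """One step of the simplified GRU.
    c_prev: memory cell from the previous time-step, shape (n_c,)
    x_t:    current input, shape (n_x,)
    The weight matrices act on the concatenation [c_prev, x_t]."""
    concat = np.concatenate([c_prev, x_t])
    c_tilde = np.tanh(Wc @ concat + bc)       # candidate for replacing c
    gamma_u = sigmoid(Wu @ concat + bu)       # update gate, values in (0, 1)
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev  # element-wise blend
    return c_t                                # a<t> equals c<t> in this version

# Tiny usage example with made-up dimensions
n_c, n_x = 4, 3
rng = np.random.default_rng(0)
Wc = rng.standard_normal((n_c, n_c + n_x))
Wu = rng.standard_normal((n_c, n_c + n_x))
bc, bu = np.zeros(n_c), np.zeros(n_c)
c = np.zeros(n_c)                 # initial memory cell
x = rng.standard_normal(n_x)      # one input vector
c = gru_step(c, x, Wc, bc, Wu, bu)
```

When the gate comes out near zero for some dimension, the returned c stays near its previous value in that dimension, which is exactly the memorization behavior described above.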

So that is the GRU unit, or at least a slightly simplified version of it. What it is remarkably good at is, through the gate, deciding that when you're scanning the sentence from left to right, say, that's a good time to update one particular memory cell, and then to not change it until you get to the point where you really need to use the memory cell that was set even earlier in the sentence. And because of the sigmoid, the gate is quite easy to set to zero: so long as the quantity it's applied to is a large negative value, then, up to numerical round-off, the update gate will be essentially zero, very, very close to zero. When that's the case, this update equation ends up just setting c<t> equal to c<t-1>, and so this is very good at maintaining the value of the cell. And because gamma u can be so close to zero, it can be 0.000001 or even smaller than that, it doesn't suffer from much of a vanishing gradient problem. When gamma u is so close to zero, this becomes essentially c<t> equals c<t-1>, and the value of c<t> is maintained pretty much exactly, even across many, many time-steps. So this can help significantly with the vanishing gradient problem, and therefore allow a neural network to learn even very long-range dependencies, such as "cat" and "was" being related even if they're separated by a lot of words in the middle.

Now I just want to talk over some more details of how you implement this. In the equations I've written, c<t> can be a vector. So if you have a 100-dimensional hidden activation value, then c<t> can be 100-dimensional, say, and c tilde t would be of the same dimension, and gamma u would also be of the same dimension as the other things I'm drawing in boxes. In that case, these asterisks are actually element-wise multiplications. So here, if the gate gamma u is a 100-dimensional vector, it is really a 100-dimensional vector of bits whose values are mostly zero and one, and that tells you, of this 100-dimensional memory cell, which are the bits you want to update.

Of course, in practice gamma u won't be exactly zero or one; sometimes it takes values in the middle as well, but it is convenient for intuition to think of it as mostly taking on values that are pretty much exactly zero or pretty much exactly one. What these element-wise multiplications do is tell your GRU which dimensions of your memory cell vector to update at every time-step, so you can choose to keep some bits constant while updating other bits. So, for example, maybe you use one bit to remember whether "cat" is singular or plural, and maybe you use some other bits to track that you're talking about food; and because you're talking about eating and about food, you'd later expect to talk about whether the cat is full. You can use different bits and change only a subset of the bits at every point in time.
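To make the element-wise behavior concrete, here is a small made-up example with a 4-dimensional cell, where the gate is treated as exactly zero or one per dimension (the idealized intuition above; real gates take intermediate values):

```python
import numpy as np

c_prev  = np.array([1.0, -0.5,  0.3, 0.8])  # old memory cell values
c_tilde = np.array([0.0,  0.9, -0.2, 0.1])  # candidate replacement values
gamma_u = np.array([0.0,  1.0,  0.0, 1.0])  # update only dimensions 1 and 3

# Element-wise update: gated blend of candidate and old values
c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev
# Dimensions 0 and 2 keep their old values (1.0 and 0.3);
# dimensions 1 and 3 take the candidate values (0.9 and 0.1).
```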

You now understand the most important ideas of the GRU. What I'm presenting in this slide is actually a slightly simplified GRU unit; let me describe the full GRU unit. To do that, let me copy the three main equations, this one, this one and this one, to the next slide.

So here they are. For the full GRU unit, I'm going to make one change, which is that for the first equation, the one calculating the candidate new value for the memory cell, I'm just going to add one term. Let me push that a little bit to the right, and I'm going to add one more gate. So this is another gate, gamma r; you can think of r as standing for relevance. This gate gamma r tells you how relevant c<t-1> is to computing the next candidate for c<t>. And gamma r is computed pretty much as you'd expect, with a new parameter matrix Wr applied to c<t-1> and the same input x<t>, plus br.
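Putting it all together, the full GRU in this course's notation is:

```latex
\begin{aligned}
\tilde{c}^{\langle t \rangle} &= \tanh\!\left(W_c\left[\Gamma_r * c^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_c\right) \\
\Gamma_u &= \sigma\!\left(W_u\left[c^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_u\right) \\
\Gamma_r &= \sigma\!\left(W_r\left[c^{\langle t-1 \rangle}, x^{\langle t \rangle}\right] + b_r\right) \\
c^{\langle t \rangle} &= \Gamma_u * \tilde{c}^{\langle t \rangle} + \left(1 - \Gamma_u\right) * c^{\langle t-1 \rangle} \\
a^{\langle t \rangle} &= c^{\langle t \rangle}
\end{aligned}
```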

As you can imagine, there are multiple ways to design these types of neural networks. So why do we have gamma r? Why not use the simpler version from the previous slides? It turns out that over many years researchers have experimented with many, many different possible versions of how to design these units, trying to have longer-range connections and longer-range effects, and also to address the vanishing gradient problem. The GRU is one of the most commonly used versions that researchers have converged to and found robust and useful for many different problems. If you wish, you could try to invent new versions of these units, but the GRU is a standard one that's just commonly used, although you can imagine that researchers have tried other versions that are similar but not exactly the same as what I'm writing down here.

The other common version is called an LSTM, which stands for Long Short Term Memory, which we'll talk about in the next video. But GRUs and LSTMs are the two specific instantiations of this set of ideas that are most commonly used. Just one note on notation: I tried to define a consistent notation to make these ideas easier to understand. If you look at the academic literature, you sometimes see people using an alternative notation, with h tilde, u, r and h referring to these quantities. But I tried to use a more consistent notation between GRUs and LSTMs, as well as using the consistent notation gamma to refer to the gates, to hopefully make these ideas easier to understand.

So that's it for the GRU, the Gated Recurrent Unit. This is one of the ideas in RNNs that has enabled them to become much better at capturing very long-range dependencies, and has made RNNs much more effective. Next, as I briefly mentioned, the other most commonly used variation of this class of ideas is something called the LSTM unit, the Long Short Term Memory unit. Let's take a look at that in the next video.
