0:00

In the last video, you saw how the attention model allows a neural network to pay attention to only part of an input sentence while it's generating a translation, much like a human translator might. Let's now formalize that intuition into the exact details of how you would implement an attention model.

So, same as in the previous video, let's assume you have an input sentence and you use a bidirectional RNN, bidirectional GRU, or bidirectional LSTM to compute features on every word. In practice, GRUs and LSTMs are often used for this, with LSTMs perhaps being more common.

And so for the forward recurrence, you have the forward activation at the first time step, the backward activation at the first time step, the forward activation at the second time step, the backward activation, and so on, all the way through the forward fifth time step and the backward fifth time step. We had a vector of all zeros here at time zero, and technically we can also have, I guess, a backward activation at the sixth time step that is a vector of all zeros.

And then, to simplify the notation going forward: at every time step, even though you have features computed from both the forward recurrence and the backward recurrence in the bidirectional RNN, I'm just going to use a(t) to represent both of these concatenated together. So a(t) is going to be the feature vector for time step t. Although, to be consistent with the notation we're using elsewhere, I'm going to call this t_prime; I'm going to use t_prime to index into the words in the French sentence.
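As a rough sketch of that concatenation step (all names and sizes here are made up for illustration, not from the lecture), the feature vector a(t_prime) could be formed like this in numpy:

```python
import numpy as np

# Hypothetical sizes: Tx input words, n_a hidden units per direction.
Tx, n_a = 5, 4
rng = np.random.default_rng(0)

# Stand-ins for the forward and backward RNN activations at each time step.
a_forward = rng.standard_normal((Tx, n_a))
a_backward = rng.standard_normal((Tx, n_a))

# a(t_prime) concatenates the forward and backward features for word t_prime.
a = np.concatenate([a_forward, a_backward], axis=-1)
print(a.shape)  # (5, 8): one 2*n_a-dimensional feature vector per input word
```

So each input word ends up with a single feature vector that summarizes its context from both directions.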

Next, we have our forward-only, single-direction RNN with state s to generate the translation. And so at the first time step, it should generate y1, and it will have as input some context C. If you want to index it with time, I guess you could write C1, but sometimes I just write C without the superscript one.

And this will depend on the attention parameters, so alpha(1,1), alpha(1,2), and so on, which tell us how much attention to pay. So these alpha parameters tell us how much the context depends on the features, or the activations, we're getting from the different time steps.

And so the way we define the context is actually as a weighted sum of the features from the different time steps, weighted by these attention weights. So, more formally, the attention weights will satisfy this: they will all be non-negative, so each is zero or positive, and they'll sum to one. We'll see later how to make sure this is true. And we will have the context, or the context at time one (I'll often drop that superscript), which is going to be the sum over t_prime, over all the values of t_prime, of this weighted sum of these activations: c(1) = sum over t_prime of alpha(1, t_prime) times a(t_prime). So this term here is the attention weight, and this term here comes from the activations up here. So alpha(t, t_prime) is the amount of attention that y(t) should pay to a(t_prime). So in other words, when you're generating the t-th output word, how much should you be paying attention to the t_prime-th input word?
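To make that weighted sum concrete, here is a minimal numpy sketch; the feature values and attention weights below are made up for illustration:

```python
import numpy as np

Tx, n_feat = 5, 8
rng = np.random.default_rng(1)
a = rng.standard_normal((Tx, n_feat))  # features a(t_prime) for each input word

# Attention weights for one output step: non-negative and summing to one.
alpha = np.array([0.7, 0.1, 0.1, 0.05, 0.05])

# The context c(1) is the attention-weighted sum of the input features.
context = (alpha[:, None] * a).sum(axis=0)
print(context.shape)  # (8,): one vector, same size as a single feature vector
```

Note the context has the same dimensionality as one feature vector; the weights only change how much each input word contributes.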

So that's one step of generating the output, and then at the next time step, you generate the second output, and it's again a weighted sum, where now you have a new set of attention weights that define a new weighted sum. That generates a new context, which is also an input, and that allows you to generate the second word. Only now this weighted sum becomes the context of the second time step: the sum over t_prime of alpha(2, t_prime) times a(t_prime).

So using these context vectors, C1, C2, and so on, this network up here looks like a pretty standard RNN sequence with the context vectors as input, and we can just generate the translation one word at a time.

We have also defined how to compute the context vectors in terms of these attention weights and those features of the input sentence. So the only remaining thing to do is to define how to actually compute these attention weights. Let's do that on the next slide.

So just to recap, alpha(t, t_prime) is the amount of attention you should pay to a(t_prime) when you're trying to generate the t-th word in the output translation.

So let me just write down the formula, and then we'll talk about how this works. This is the formula you could use to compute alpha(t, t_prime): you first compute these terms e(t, t_prime), and then use essentially a softmax, alpha(t, t_prime) = exp(e(t, t_prime)) divided by the sum over t_prime of exp(e(t, t_prime)), to make sure that these weights sum to one if you sum over t_prime. So for every fixed value of t, these things sum to one if you're summing over t_prime, and using this softmax normalization just ensures that property.
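That softmax step can be sketched as follows; this is a generic, numerically stable softmax, not code from the course, and the scores below are made-up values:

```python
import numpy as np

def attention_weights(e):
    """Softmax over t_prime: turns scores e(t, t_prime) into weights
    alpha(t, t_prime) that are non-negative and sum to one."""
    e = e - e.max()              # subtract the max for numerical stability
    exp_e = np.exp(e)
    return exp_e / exp_e.sum()

e = np.array([2.0, 0.5, -1.0, 0.0, 1.0])  # hypothetical scores for one output step
alpha = attention_weights(e)
print(alpha)  # non-negative weights summing to 1 (up to floating point)
```

Whatever values the scores e take, the resulting weights are a valid attention distribution over the input words.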

Now, how do we compute these terms e? Well, one way to do so is to use a small neural network, as follows. So s(t-1) is the neural network state from the previous time step. Here is the network we have: if you're trying to generate y(t), then s(t-1) is the hidden state from the previous step that feeds into s(t), and that's one input to a very small neural network, usually with one hidden layer, because you need to compute these a lot. And then a(t_prime), the features from time step t_prime, is the other input.

And the intuition is: if you want to decide how much attention to pay to the activation at t_prime, then the thing this seems like it should depend on the most is your own hidden state activation from the previous time step. You don't have the current state activation yet, because the context feeds into that, so you haven't computed it. But look at whatever hidden state you have for this RNN that's generating the output translation, and then, for each of the positions, each of the words, look at their features.

So it seems pretty natural that alpha(t, t_prime) and e(t, t_prime) should depend on these two quantities. But we don't know what the function is, so one thing you could do is just train a very small neural network to learn whatever this function should be, and trust backpropagation and gradient descent to learn the right function.
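A sketch of that small scoring network is below. It uses a common one-hidden-layer form; all shapes, parameter names, and the tanh/dot-product structure are illustrative assumptions, and in practice the weights W, b, and v would be learned by gradient descent along with the rest of the model:

```python
import numpy as np

def score(s_prev, a_tprime, W, b, v):
    """e(t, t_prime) = v . tanh(W [s_prev; a_tprime] + b): a small
    one-hidden-layer network scoring one (output step, input word) pair."""
    x = np.concatenate([s_prev, a_tprime])  # both inputs to the small network
    h = np.tanh(W @ x + b)                  # single hidden layer
    return v @ h                            # scalar unnormalized score

rng = np.random.default_rng(2)
n_s, n_feat, n_hidden = 6, 8, 10            # hypothetical sizes
W = rng.standard_normal((n_hidden, n_s + n_feat))
b = np.zeros(n_hidden)
v = rng.standard_normal(n_hidden)

s_prev = rng.standard_normal(n_s)           # decoder state s(t-1)
a_tprime = rng.standard_normal(n_feat)      # encoder features a(t_prime)
e = score(s_prev, a_tprime, W, b, v)
print(float(e))                             # one unnormalized attention score
```

Running this scoring network for every t_prime and then applying the softmax from before gives the attention weights for one output step.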

And it turns out that if you implement this whole model and train it with gradient descent, the whole thing actually works. This little neural network does a pretty decent job of telling you how much attention y(t) should pay to a(t_prime), and this formula makes sure that the attention weights sum to one. Then, as you chug along generating one word at a time, this neural network pays attention to the right parts of the input sentence, and it learns all this automatically using gradient descent.

Now, one downside to this algorithm is that it does take quadratic time, or quadratic cost, to run. If you have Tx words in the input and Ty words in the output, then the total number of these attention parameters is going to be Tx times Ty, and so this algorithm runs in quadratic cost. Although in machine translation applications, where neither the input nor the output sentence is usually that long, maybe quadratic cost is actually acceptable, although there is also some research work on trying to reduce this cost.
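The quadratic count is easy to see if you lay all the weights out as a Ty-by-Tx matrix, one softmax per output word; the sentence lengths here are arbitrary:

```python
import numpy as np

Tx, Ty = 10, 12                       # input and output sentence lengths
rng = np.random.default_rng(3)
e = rng.standard_normal((Ty, Tx))     # one score per (output word, input word) pair

# One softmax per row: each output word's weights over the Tx input positions.
alpha = np.exp(e - e.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)

print(alpha.size)  # 120 = Tx * Ty attention weights in total
```

Doubling both sentence lengths quadruples the number of weights, which is where the quadratic cost comes from.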

Now, so far I've been describing the attention idea in the context of machine translation. Without going too much into detail, this idea has been applied to other problems as well, such as image captioning. In the image captioning problem, the task is to look at a picture and write a caption for that picture. So in this paper, cited at the bottom, by Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio, they showed that you could have a very similar architecture look at a picture, and pay attention to only parts of the picture at a time, while writing a caption for it. So if you're interested, I encourage you to take a look at that paper as well.

And you get to play with all this and more in the programming exercise. Whereas machine translation is a very complicated problem, in the programming exercise you get to implement and play with the attention model yourself on the date normalization problem. So the problem is to take an input date like this, and this one is actually the date of the Apollo Moon landing, and normalize it into a standard format; or to take a date like this and have a neural network, a sequence-to-sequence model, normalize it to this format. This, by the way, is the birthday of William Shakespeare, or at least it's believed to be.

And what you'll see in the programming exercise is that you can train a neural network to input dates in any of these formats and have it use an attention model to generate a normalized format for these dates.

One other thing that's sometimes fun to do is to look at visualizations of the attention weights. So here's a machine translation example, where we've plotted in different colors the magnitudes of the different attention weights. I don't want to spend too much time on this, but you'll find that for corresponding input and output words, the attention weights tend to be high, suggesting that when the model is generating a specific output word, it's usually paying attention to the correct words in the input. And all of this, including learning where to pay attention, was learned using backpropagation with an attention model.

So that's it for the attention model, really one of the most powerful ideas in deep learning. I hope you enjoy implementing and playing with these ideas yourself later in this week's programming exercises.
