
Hey. The attention mechanism is a super powerful technique in neural networks.

So let us cover it first with some pictures and then with some formulas.

Just to recap, we have an encoder with states h and a decoder with states s.

Now, let us imagine that we want to produce the next decoder state.

So we want to compute sj. How can we do this?

In the previous video,

we just used the v vector,

which was the information about the whole encoded input sentence.

And instead of that,

we could do something better.

We can look into all states of the encoder with some weights.

So these alphas denote some weights

that tell us whether it is important to look there or here.

How can we compute these alphas?

Well, we want them to be probabilities, and also,

we want them to capture some similarity between

our current moment in the decoder and different moments in the encoder.

This way, we'll look into more similar places,

and they will give us the most important information to go next with our decoding.

If we speak about the same thing with the formulas,

we will say that, now,

instead of just one v vector that we had before,

we will have vj, which is different for different positions of the decoder.

And this vj vector will be computed as a weighted average of the encoder states.

And the weights will be computed with a softmax because they need to be probabilities.

And this softmax will be applied to the similarities of encoder and decoder states.
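As a sanity check, here is a minimal numpy sketch of this computation (toy random vectors; dot-product similarity is assumed for the scores just for illustration):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: turns scores into probabilities.
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))   # five encoder states h_i, hidden size 4
s_j = rng.normal(size=4)      # current decoder state s_j

scores = h @ s_j              # similarities sim(h_i, s_j), here a dot product
alphas = softmax(scores)      # attention weights: a probability distribution
v_j = alphas @ h              # context vector: weighted average of encoder states
```

The softmax guarantees the alphas are non-negative and sum to one, so v_j really is a weighted average of the encoder states.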

Now, do you have any ideas how to compute those similarities?

I have a few.

So papers actually have tried lots and lots of different options,

and there are just three options for you to try to memorize.

Maybe the easiest option is at the bottom.

Let us just take the dot product of the encoder and decoder states.

It will give us some understanding of their similarity.

Another way is to say,

maybe we need some weights there,

some matrix that we need to learn,

and it can help us to capture the similarity better.

This thing is called multiplicative attention.

And maybe we just do not want to think at all about how to compute it.

We just want to say, "Well,

neural network is something intelligent.

Please do it for us."

And then we just take a one-layer

neural network and say that it needs to predict these similarities.

So you see there that you have h and s multiplied by some matrices and summed.

That's why it is called additive attention.

And then you have some non-linearity applied to this.
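The three scoring options can be sketched in a few lines of numpy (toy dimensions and random matrices; the parameter names W, W1, W2, and v are just illustrative stand-ins for learned weights):

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
h_i = rng.normal(size=d)      # one encoder state
s_j = rng.normal(size=d)      # one decoder state

# 1. Dot product: no parameters at all.
score_dot = h_i @ s_j

# 2. Multiplicative attention: one learned matrix W.
W = rng.normal(size=(d, d))
score_mult = s_j @ W @ h_i

# 3. Additive attention: a one-layer network with a non-linearity,
#    where h and s are multiplied by matrices, summed, and passed through tanh.
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))
v = rng.normal(size=d)
score_add = v @ np.tanh(W1 @ h_i + W2 @ s_j)
```

Each score is a single number measuring how similar this encoder state is to this decoder state; the softmax over these scores then produces the attention weights.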

These are three options,

and you can have also many more options.

Now, let us put all the things together,

just again to understand how attention works.

You have your conditional language modeling task.

You try to predict the y sequence given the x sequence.

And now, you encode your x sequence to some vj vector,

which is different for every position.

This vj vector is used in the decoder.

It is concatenated with the current input of the decoder.

And this way, the decoder is aware of

all the information that it needs, the previous state,

the current input, and now,

this specific context vector,

computed especially for this current state.
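A rough sketch of one decoder step with this concatenation (a plain tanh RNN cell with made-up weight names, just to show the wiring):

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)
y_emb = rng.normal(size=d)       # embedding of the current decoder input token
v_j = rng.normal(size=d)         # context vector computed for this position
s_prev = rng.normal(size=d)      # previous decoder state

# A plain tanh RNN cell whose input is the concatenated [input; context] vector,
# so the new state sees the previous state, the current input, and the context.
W_in = 0.1 * rng.normal(size=(d, 2 * d))
W_rec = 0.1 * rng.normal(size=(d, d))
s_j = np.tanh(W_in @ np.concatenate([y_emb, v_j]) + W_rec @ s_prev)
```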

Now, let us see where the attention works.

So neural machine translation had lots of problems with long sentences.

You can see that the BLEU score for long sentences is much lower,

though it is really okay for short ones.

Neural machine translation with attention can solve this problem,

and it performs really nicely even for long sentences.

Well, this is really intuitive because attention helps to

focus on different parts of the sentence when you do your predictions.

And for long sentences,

it is really important because, otherwise,

you have to encode the whole sentence into just one vector,

and this is obviously not enough.

Now, to better understand those alpha ij weights that we have learned with the attention,

let us try to visualize them.

These weights can be visualized as I-by-J matrices.

Let's say, what is the most promising place in

the encoder for every place in the decoder?

So with the light dot here,

you can see those words that are aligned.

So you see this is a very close analogy to word alignments that we have covered before.

We just learn that these words are somehow similar,

relevant, and we should look at these ones to translate them into another language.

And this is also a good place to note

that we can use some techniques from traditional methods,

from word alignments, and incorporate them into neural machine translation.

For example, priors for word alignments can

really help here for neural machine translation.

Now, do you think that this attention technique is really

similar to how humans translate real sentences?

I mean, humans also look into some places and then translate those places.

They have some attention.

Do you see any differences?

Well, actually there is one important difference here.

So humans save time with attention because

they look only to those places that are relevant.

On the contrary, here,

we waste time because to guess what is the most relevant place,

we first need to check out all the places and compute

similarities for all the encoder states.

And then just say, "Okay,

this piece of the encoder is the most meaningful."

Now, the last story for this video is how to

make this attention save time, not waste time.

It is called local attention,

and the idea is rather simple.

We say, let us first try to predict what is the best place to look at.

And then after that,

we will look only into some window around this place.

And we will not compute similarities for the whole sequence.

Now, first, how can you predict the best place?

One easy way would be to say, "You know what?

Those matrices should be strictly diagonal,

and the place for position J should be J."

Well, for some languages with different word orders,

this might work really badly, and then you can try to predict the position instead.

How do you do this?

You take a sigmoid of something complicated.

This sigmoid gives you a value between zero and one.

And then you scale this by the length of the input sentence I.

So you see that this will be indeed something in between zero and I,

which means that you will get some position in the input sentence.
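This position prediction can be sketched like this (following the described sigmoid-times-length form; the weight names W_p and v_p are illustrative, not from any particular implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

I = 20                                  # length of the input sentence
rng = np.random.default_rng(3)
s_j = rng.normal(size=4)                # current decoder state
W_p = 0.1 * rng.normal(size=(4, 4))     # illustrative learned parameters
v_p = 0.1 * rng.normal(size=4)

# Sigmoid gives a value in (0, 1); scaling by I maps it to a position in (0, I).
a_j = I * sigmoid(v_p @ np.tanh(W_p @ s_j))
```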

Now, what is inside that sigmoid?

Well, you see a current decoder state sj,

and you just apply some transformations as usual in neural networks.

Anyway, so when you have this aj position,

you can just see that you need to look only into this window and

compute similarities for attention alphas as usual,

or you can also try to use some Gaussian to say that

actually those words that are in the middle of the window are even more important.

So you can just multiply some Gaussian priors

by those alpha weights that we were computing before.
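A toy sketch of this Gaussian reweighting inside the window (the center position, window size, and sigma are made-up numbers; uniform base weights stand in for the usual attention alphas):

```python
import numpy as np

I, D = 20, 4                     # sentence length and window half-width
a_j = 11.3                       # predicted center position (made-up value)
positions = np.arange(max(0, int(a_j) - D), min(I, int(a_j) + D + 1))

# Gaussian prior centered at a_j, so words in the middle of the window
# matter more; sigma = D/2 is one common choice.
sigma = D / 2.0
gauss = np.exp(-((positions - a_j) ** 2) / (2 * sigma ** 2))

# Uniform base weights stand in for the attention alphas inside the window.
alphas = np.full(len(positions), 1.0 / len(positions))
alphas = alphas * gauss
alphas = alphas / alphas.sum()   # renormalize back to probabilities
```

Note that similarities are computed only for the window, not for the whole sequence, which is exactly where the time savings come from.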

Now, I want to show you the comparison of different methods.

You can see here that we have global attention and local attention.

And for local attention, we have monotonic predictions and predictive approach.

And the last one performs the best.

Do you remember what is inside the brackets here?

These are different ways to compute similarities for attention weights.

So you remember dot product and multiplicative attention?

And, also, you could have location-based attention,

which is even simpler.

It says that we should just take sj and use it to compute those weights.

This is all for that presentation,

and I am looking forward to seeing you in the next one.