Word Alignments Models.

This is a super important subtask in machine translation,

because different languages have different word order,

and we need to learn that from data.

So, we need to build some word alignment models and this video is exactly about them.

Let us go a bit more formal and let us

see what are the notations of every object that we have here.

So, e is the sentence e1,

e2 and so on and f is another sentence.

So, the length of e sentence is I and the length of f sentence is J.

Now, I need to learn some alignments between them,

which I denote by a.

And importantly, you'll see that I say that e is source and f is target.

Why do I say so?

Well, usually, we talked about

machine translation system from French to English or from foreign to English.

Why do I say now that it is vice versa?

This is because we applied base rule.

So, if you remember,

we did this to have our decoupled model about language and about translation.

And now, to build the system that translates from f to e,

we need to model the probability of f given e. Now,

what about word alignments?

How do we represent them?

So, the matrix of word alignments is one nice way to do this.

You have one sentence and another sentence,

and you have zeros or ones in the matrix.

So, you'll know which words correspond to each other.

Now, how many matrices do we have here?

Well, actually, it is a huge amount of matrices.

So, imagine you have two options in every element of the matrix and then,

you have the size of the matrix which is I multiplied by J,

so the number of possible matrices would be

two to the power of the size of the matrix and that's a lot.

So, let us do some constraints,

some simplifications to deal with this.

And what we do is we say that every target word is

allowed to be aligned only to one source word, okay?

Like here. So, this is a valid example.

Now, what would be the notation here.

So, we will have a1 which will

represent the number of the source word which is aligned to the first target word.

So, this is appetite and this is the second word.

Now, what would be a2?

So, a2 will be equal to three because we have

comes matched to [inaudible] which is the third word in our source.

Now, we can proceed and do the same for a4 and five, a6.

That's it. So, this is our notation.

Please keep it in mind not to get lost.

Now, let us build the probabilistic model.

Actually, this and the next slide will be about the sketch of the whole process.

So, we are going to build the probabilistic model and figure out how we learned that.

After that, we'll go into deeper details of this probabilistic model.

So, stay with me. We have our sentences,

e and f. So,

this is our observable variables.

Now, we have also word alignments.

We do not see them, but we need to model them somehow.

So, this is hidden variables.

And we have parameters of the model and this is actually the most creative step.

So, we need somehow to decide how do we parameterized

our model to have some meaningful generative story.

And if we have too many parameters,

probably, it will be difficult to train that.

If we have too less parameters,

probably it will be not general enough to describe all the data.

So, this is the moment that we will discuss in more details later.

But for now, let's just say that we have some probabilistic model of

f and a given e and theta. What do we do next?

Well, you should know that in all these situations,

we do likelihood maximization.

So, we take our data,

we write down the probability to see our data and we try to maximize this.

Now, one complicated thing with this

is that we do not see everything that we need to model.

So, we can model the probabilities of f and a,

but we don not see a.

That's why we need to sum over all possible word alignments.

And on the left-hand side,

you have the probability of f given all the rest things,

which is called incomplete data.

Likelihood maximization for incomplete data

means that there are some hidden variables that you do not see.

And this is a very bad situation.

So, imagine you have a logarithm.

So, you take logarithm and you have logarithm of the sum.

And you don't know how to maximize these,

how to take derivatives and how to get your maximum likelihood estimations.

But actually, we have already seen this case two times in our course.

So, one was Hidden Markov Model and another was topic models.

In both those cases,

we had some hidden variables and we have these incomplete data.

And in both cases we used EM-algorithm.

So, EM-algorithm just to recap,

is an iterative process that has E-step and M-step.

The E-step is about estimates for your hidden variables.

So, the E-step will be,

what are the best alignments that we can produce right now given our perimeters?

And the M-step is vice versa.

Given our best guess for word alignments,

what would be the updates for parameters that maximize our likelihood?

This is also so interesting to go into the exact formulas of EM-algorithm.

Better, let us discuss generative model because it is really creative thing.

Well, let us start with generating the length of the sentence.

So, J would be the length of the target sentence.

Once we could generate this,

let us say that we have independent susception by the target words.

So, we have this product by J which denotes the word in our target sentence.

Every word will be not modeled yet.

So first, real model the alignment for every position.

And then, we will model the exact word given that alignment.

So, if you are scared with this formula,

you can look into just green parts.

This is the most important thing.

You model alignments and you model words given these alignments.

All the other things that you see on the right

would be just everything that we know to condition on that.

And this is too much to condition on that because we will have well too much parameters.

So, we need to do some assumptions.

So, we need to say that not all those things are important in this conditions.

The first IBM model is the first attempt to simplify this generative story.

So, what it says is,

let us forget about the priors for word alignments,

let us have just a uniform prior.

And this prior will know nothing about the positions,

but it will have just one constant to tune. So, this is awesome.

Now, the translation table will be also very simple.

So, we will say that the probability to generate the word,

given that it is aligned to some source word,

is just the translation probability of that word given the source word.

So, how does that look like?

This is the translation table.

So, once we know that the word is aligned to some source word,

we just take this probability out of it.

So, this is a very simple model,

but it has a very big drawback.

It doesn't take into account the position of

your word to produce the alignment to this word.

So, the second IBM model tries to make better and it says,

"Okay, let us take J,

the position of the target word and let us use it to produce aj.",

the alignment for this target word.

Now, how many parameters do we have to learn to build this position-based prior.

Well, actually a lot of parameters.

So, you know what, you have I multiplied by J,

which will be the size of the matrix of probabilities and it is not all.

Apart of this matrix,

you will also have different matrices for different I and J.

So, you cannot just use one and the same matrix for all kind of sentences.

You just share this matrix across all sentences with given lengths.

But for sentences with different lengths,

you have different matrix.

So, this is a lot of parameters and to try to improve on that,

we can say, "Well,

the matrix is usually very close to diagonal.

What if we model it as a diagonal matrix?"

This is what Chris Dyer said in 2013.

So, this model has only one perimeter that says,

how is the probability mass spread around the diagonal?

And this is nice because it is still has some priors about positions,

but it has not too many parameters.

Now, I have the last example for you for alignments.

So, actually, you already know how to build this,

you just don't remember that.

We had Hidden Markov Models in our course and Hidden Markov Models

can help to build some transition probabilities. Why do we need it here?

So, imagine these couple of sentences and the phrase in

the beginning of the sentence can be aligned to the phrase in the end of the sentence.

But sometimes, inside this phrase,

you just go word-by-word so you do not have any gaps. And this is nice.

It means that you need to learn these and to

use this information that the previous word was

aligned to position five and maybe that means

that the next word should be aligned to position six.

So, this is what Hidden Markov Model can make for you.

So, you model the probability of the next alignment given the previous alignment.

So now, let us sum up what we have in this video.

So, we have covered IBM models,

which is a nice word-based technique to build machine translation systems.

And actually, there are lots of problems with them that we did not cover.

And there are IBM Model Number three and four and five that

can try to cope with the problem of fertility, for example,

saying that we need to explicitly model

how many output words are produced by each source word,

or that we need to explicitly deal with spurious word.

This are the words that just appear from nowhere in the target language.