
There are some similarities between the sequence-to-sequence machine translation model and the language models that you worked with in the first week of this course, but there are some significant differences as well. Let's take a look. So, you can think of machine translation as building a conditional language model.

Here's what I mean. In language modeling, this was the network we built in the first week, and this model allows you to estimate the probability of a sentence. That's what a language model does. You can also use this to generate novel sentences. When we were writing x1 and x2 here, in this example x2 would be equal to y1; that is, the previous output y1 is just fed back in as the next input. But x1, x2, and so on were not important, so just to clean this up for this slide, I'm going to cross these off. x1 could be the vector of all zeros, and x2, x3, and so on are just the previous outputs you are generating. So that was the language model.

The machine translation model looks as follows, and I am going to use a couple of different colors, green and purple, to denote respectively the encoder network in green and the decoder network in purple. You notice that the decoder network looks pretty much identical to the language model that we had up there. So the machine translation model is very similar to the language model, except that instead of always starting off with the vector of all zeros, it has an encoder network that figures out some representation for the input sentence, and it starts off the decoder network with that representation of the input sentence rather than with the vector of all zeros.
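The difference can be sketched structurally. This is a minimal toy sketch, not a real network: the `encode` and `decode` functions below are hypothetical stand-ins, and the only point is where the decoder's initial state comes from.

```python
def encode(source_sentence):
    """Hypothetical encoder: fold the input words into one state vector."""
    state = [0.0] * 4
    for word in source_sentence:
        # Stand-in for a learned RNN update step.
        state = [(s + len(word)) * 0.5 for s in state]
    return state

def decode(initial_state):
    """Hypothetical decoder: a language model seeded with initial_state."""
    # A plain language model would be seeded with all zeros instead.
    return f"decoder starts from state {initial_state}"

french = ["Jane", "visite", "l'Afrique", "en", "septembre"]
print(decode(encode(french)))   # translation model: seeded by the encoder
print(decode([0.0] * 4))        # plain language model: seeded with zeros
```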

So that's why I call this a conditional language model: instead of modeling the probability of any sentence, it is now modeling the probability of, say, the output English translation, conditioned on some input French sentence. In other words, you're trying to estimate the probability of an English translation. Like, what's the chance that the translation is "Jane is visiting Africa in September," conditioned on the input French sentence "Jane visite l'Afrique en septembre"? So this is really the probability of an English sentence conditioned on an input French sentence, which is why it is a conditional language model.
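Concretely, the decoder assigns a sentence this probability by the chain rule: P(y|x) = P(y1|x) · P(y2|x, y1) · … and so on. Here is a small sketch with made-up per-step probabilities standing in for the decoder's softmax outputs.

```python
import math

# Hypothetical per-step conditionals P(y_t | x, y_1..y_{t-1}) for one
# candidate translation; a real model reads these off the decoder's softmax.
step_probs = [0.40, 0.35, 0.30, 0.45, 0.50]

# The whole-sentence probability is the product of the per-word
# conditionals; summing log-probabilities avoids numerical underflow.
log_p = sum(math.log(p) for p in step_probs)
p_sentence = math.exp(log_p)
print(p_sentence)  # same value as multiplying the probabilities directly
```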

Now, if you want to apply this model to actually translate a sentence from French into English, given this input French sentence, the model might tell you the probabilities of different corresponding English translations. So x is the French sentence "Jane visite l'Afrique en septembre," and this now tells you what is the probability of different English translations of that French input.

And what you do not want is to sample outputs at random. If you sample words from this distribution, P(y|x), maybe one time you get a pretty good translation: "Jane is visiting Africa in September." But maybe another time you get a different translation: "Jane is going to be visiting Africa in September," which sounds a little awkward but is not a terrible translation, just not the best one. And sometimes, just by chance, you get, say, others: "In September, Jane will visit Africa." And maybe, just by chance, you sometimes sample a really bad translation: "Her African friend welcomed Jane in September."

So when you're using this model for machine translation, you're not trying to sample at random from this distribution. Instead, what you would like is to find the English sentence y that maximizes that conditional probability. So in developing a machine translation system, one of the things you need to do is come up with an algorithm that can actually find the value of y that maximizes this term over here.
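In a toy setting with only a handful of candidate translations and made-up scores, finding the y that maximizes this term is just an argmax; the real difficulty, discussed below, is that you can never list all candidates.

```python
# Hypothetical model scores P(y | x) for a few candidate translations
# of "Jane visite l'Afrique en septembre".
candidates = {
    "Jane is visiting Africa in September.": 0.30,
    "Jane is going to be visiting Africa in September.": 0.25,
    "In September, Jane will visit Africa.": 0.20,
    "Her African friend welcomed Jane in September.": 0.01,
}

# The translation we output is the argmax over candidates.
best = max(candidates, key=candidates.get)
print(best)  # "Jane is visiting Africa in September."
```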

The most common algorithm for doing this is called beam search, and it's something you'll see in the next video.

But before moving on to describe beam search, you might wonder: why not just use greedy search? So what is greedy search? Well, greedy search is an algorithm from computer science which says: to generate the first word, just pick whatever is the most likely first word according to your conditional language model, which here is your machine translation model. Then, after having picked the first word, you pick whatever second word seems most likely, then the third word that seems most likely, and so on. This algorithm is called greedy search.
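A sketch of greedy search, assuming a hypothetical `next_word_probs` function standing in for the model's conditional distribution:

```python
def next_word_probs(prefix):
    """Hypothetical conditional distribution P(next word | words so far).
    A real system would query the decoder network here."""
    table = {
        (): {"Jane": 0.9, "In": 0.1},
        ("Jane",): {"is": 0.8, "will": 0.2},
        ("Jane", "is"): {"going": 0.55, "visiting": 0.45},
    }
    return table.get(tuple(prefix), {"<eos>": 1.0})

# Greedy search: at every step, commit to the single most likely next word.
sentence = []
while True:
    probs = next_word_probs(sentence)
    word = max(probs, key=probs.get)
    if word == "<eos>":
        break
    sentence.append(word)
print(sentence)  # ['Jane', 'is', 'going']: greedy locks in "going"
```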

And what you would really like is to pick the entire sequence of words, y1, y2, up to yTy, that maximizes the joint probability of that whole thing. It turns out that the greedy approach, where you pick the best first word, then the best second word given that, and then the best third word, doesn't really work.

To demonstrate that, let's consider the following two translations. The first one is a better translation, so hopefully, in our machine translation model, it will say that P(y|x) is higher for the first sentence; it's just a better, more succinct translation of the French input. The second one is not a bad translation; it's just more verbose and has more unnecessary words.

But if the algorithm has picked "Jane is" as the first two words, then because "going" is a more common English word, the chance of "Jane is going," given the French input, might actually be higher than the chance of "Jane is visiting," given the French sentence. So it's quite possible that if you just pick the third word based on whatever maximizes the probability of just the first three words, you end up choosing option number two. But this ultimately results in a less optimal sentence, a less good sentence as measured by this model for P(y|x). I know this was maybe a slightly hand-wavy argument, but it is an example of a broader phenomenon: if you want to find the sequence of words, y1, y2, all the way up to the final word, that together maximizes the probability, it's not always optimal to just pick one word at a time.
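The hand-wavy argument can be made numeric with made-up conditionals: the third-word conditional favors "going", yet the shorter "visiting" sentence has the higher joint probability.

```python
import math

# Hypothetical per-step conditional probabilities under the model.
# Third entry: "visiting" (0.45) loses to "going" (0.55) at step 3...
p_visiting_steps = [0.9, 0.8, 0.45, 0.9, 0.9, 0.9]
p_going_steps = [0.9, 0.8, 0.55, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]

# ...but the joint probability of the whole "visiting" sentence wins.
p_visiting = math.prod(p_visiting_steps)
p_going = math.prod(p_going_steps)
print(p_visiting > p_going)  # True
```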

And, of course, the total number of combinations of words in the English sentence is exponentially large. If you have just 10,000 words in your dictionary and you're contemplating translations that are up to ten words long, then there are 10,000 to the tenth power possible sentences that are ten words long, picking words from a vocabulary, or dictionary, of size 10,000. So this is just a huge space of possible sentences, and it's impossible to rate them all, which is why the most common thing to do is to use an approximate search algorithm.
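The count from the lecture, as a quick check:

```python
vocab_size = 10_000  # dictionary size from the lecture
length = 10          # sentence length in words

n_sentences = vocab_size ** length  # every ten-word sequence over the vocab
print(n_sentences == 10 ** 40)      # True: far too many to score one by one
```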

And what an approximate search algorithm does is try, though it won't always succeed, to pick the sentence y that maximizes that conditional probability. Even though it's not guaranteed to find the value of y that maximizes this, it usually does a good enough job.

So, to summarize: in this video, you saw how machine translation can be posed as a conditional language modeling problem. But one major difference between this and the earlier language modeling problems is that rather than wanting to generate a sentence at random, you want to find the most likely English sentence, the most likely English translation. But the set of all English sentences of a certain length is too large to exhaustively enumerate, so we have to resort to a search algorithm. With that, let's go on to the next video, where you'll learn about the beam search algorithm.
