Hey, and welcome back.

This is what you already saw at the end of our previous video.

So, just to remind you:

we have some sequences, and we are going to predict the probabilities of these sequences.

So we learned that with a bigram language model,

you can factorize the probability of a sequence into a product of terms.

So these are the probabilities of the next word,

given the previous words.

Now, take a moment to see whether everything is okay with the indices on this slide.

Well, you can notice that i can be equal to 0 or to k plus 1,

and it goes out of range of our sequence.

But that's okay because if you remember our previous video,

we discussed that we should have some fake tokens

in the beginning of the sequence and in the end of the sequence.

So this i equal to 0 and to k plus 1 will correspond exactly to these fake tokens.

So everything is good here.
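As a small sketch of this factorization (the token names "&lt;s&gt;" and "&lt;/s&gt;" for the fake start and end tokens, and the function names, are my own choices for illustration):

```python
# A minimal sketch of the bigram factorization, assuming we already
# have a function bigram_prob(prev, word) -> P(word | prev).
# The fake tokens "<s>" and "</s>" play the roles of i = 0 and i = k + 1.

def sequence_prob(words, bigram_prob):
    """Probability of a sequence under a bigram model."""
    padded = ["<s>"] + words + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded, padded[1:]):
        prob *= bigram_prob(prev, word)  # P(w_i | w_{i-1})
    return prob
```

With the padding in place, every term in the product is well-defined, including the first word (conditioned on the start token) and the end-of-sequence token itself.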

Let us move forward.

This is just a generalization.

This is the n-gram language model.

So the only difference here is that the history gets longer.

So we condition not only on the previous words but on

the whole sequence of n minus 1 previous words.

So just take note of the notation here.

This is just a brief way to show that we have a sequence of n minus one words.
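In symbols, this factorization can be written as follows (using the shorthand for the sequence of n minus one previous words):

```latex
P(w_1 \dots w_k) = \prod_{i} P\bigl(w_i \mid w_{i-n+1}^{\,i-1}\bigr),
\qquad
w_{i-n+1}^{\,i-1} = w_{i-n+1}, \dots, w_{i-1}
```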

Great. We have some intuition about how to estimate these probabilities.

So you remember that we can just count some n-grams and normalize these counts.

But, now, I want to give you some intuition,

not only just intuition but mathematical justification.

Well, we have some probabilistic model,

and we have some data.

And we want to learn the parameters of this model.

What do we do in this case?

So what you do is likelihood maximization,

and W train here denotes my training data.

So this is just a concatenation of all the training sequences that I have,

giving a total of big M tokens.

Now, I take the logarithm of

this probability because it is easier to optimize the sum of logarithms,

rather than the product of probabilities.

And I just write down the probability of my data according to my model.

Okay? So if I'm not too lazy,

I would take the derivatives of this likelihood,

and I would also think about constraints,

such as normalization and non-negativity of my parameters.

And I would arrive at exactly

the formulas that you see at the bottom of this slide.

So these counts and normalization of these counts have mathematical justification,

which is likelihood maximization.

So this is just the likelihood maximization estimates.
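A minimal sketch of these maximum likelihood estimates for the bigram case (counting bigrams and normalizing by the count of the previous word; the function name is my own):

```python
from collections import Counter

def train_bigram_mle(tokens):
    """Maximum likelihood bigram estimates: count and normalize.

    P(w | prev) = count(prev, w) / count(prev)
    """
    unigram = Counter(tokens)
    bigram = Counter(zip(tokens, tokens[1:]))
    return {(prev, w): c / unigram[prev] for (prev, w), c in bigram.items()}
```

This is exactly "count the n-grams and normalize the counts": the estimate for each next word is its bigram count divided by the count of the conditioning word.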

Awesome. We now can train our language model.

Now, can we see some example of how it works?

This is a model trained on Shakespeare corpus.

So you can see that unigram model and bigram model give something meaningful,

and 3-gram model and 4-gram model are probably even better.

So you can see that the model actually generates some text, which resembles Shakespeare.

Now, I have a question for you.

How would you choose the best n here?

Do you have any intuition or maybe the procedure to find the best n for your model?

Well, for this case,

I would say that 5-gram models are usually the best for language modeling,

but it is really, really dependent on your data and on your certain task.

So the general question is how do we decide which model is better?

How do we evaluate and compare our models?

So one way to go is to do extrinsic evaluation.

So, for example, we can have

some machine translation system or speech recognition system,

any final application, and we can measure the quality of this application.

This is a good way, but sometimes we do not have

time or resources to build the whole application.

Okay? So we want also to have some intrinsic evaluation,

which means just to evaluate the language model itself.

And one way that people use all the time is called perplexity.

It is called holdout perplexity.

Why? Because we have some data, and usually,

we hold out some data to compute perplexity later.

So this is holdout data.

This is just another way to say that we need a train split and a test split.

So what is perplexity?

Well, you know what the likelihood is.

So, here, I just write down the likelihood for my test data,

and perplexity is super similar.

So perplexity has just likelihood in the denominator.
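Written out, with big M being the total number of test tokens, the perplexity is the inverse likelihood normalized per token:

```latex
\mathrm{Perplexity}(w_{\text{test}})
  = p(w_{\text{test}})^{-\frac{1}{M}}
  = \left( \prod_{i=1}^{M} p\bigl(w_i \mid w_{i-n+1}^{\,i-1}\bigr) \right)^{-\frac{1}{M}}
```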

You can be curious why exactly this formula.

Well it is really related to entropy,

but we are not going into details right now.

So the thing that we need to know is that the lower perplexity is, the better.

Why? Because the greater likelihood is, the better.

So the likelihood shows whether our model is surprised with our text or not,

whether our model predicts exactly the same test data that we have in real life.

So perplexity has also this intuition.

And, remember, the lower perplexity, the better.
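As a sketch of the computation for the bigram case (assuming the bigram probabilities are already available as a dict; the representation is my own choice):

```python
import math

# A sketch of perplexity for a bigram model, assuming bigram
# probabilities are given as a dict: probs[(prev, word)] = P(word | prev).
# A single missing (zero-probability) bigram makes perplexity infinite.

def perplexity(test_tokens, probs):
    log_likelihood = 0.0
    m = len(test_tokens) - 1  # number of predicted tokens
    for prev, word in zip(test_tokens, test_tokens[1:]):
        p = probs.get((prev, word), 0.0)
        if p == 0.0:
            return math.inf  # zero probability: the whole product is zero
        log_likelihood += math.log(p)
    return math.exp(-log_likelihood / m)
```

Working in log space here mirrors the trick from the likelihood: a sum of logarithms is easier to handle than a product of small probabilities.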

Let us try to compute perplexity for some small toy data.

So this is some toy train corpus and toy test corpus.

What is the perplexity here?

Well, we should start by computing the probabilities of our model.

So I compute some probability,

and I get zero.

It means that the probability of the whole test data is also zero,

and the perplexity is infinite.

And that's definitely not what we like.

How can I fix that? What can we do with it?

Well, there is actually a very simple way to fix that.

So let us say that we have some vocabulary.

Actually, we build this vocabulary beforehand,

just from some frequencies,

or we just take it from somewhere.

And after that, we substitute all out-of-vocabulary tokens,

both in the train set and in the test set, with a special <UNK> token.

Okay. So then we compute our probabilities as usual for

all vocabulary tokens and for the <UNK> token because

we also see this <UNK> token in the training data.

Right? And this works because now,

when we look at our test data,

we see there only vocabulary tokens and the <UNK> token,

and we can compute probabilities for all of them, and that's okay.

So now we have no out-of-vocabulary words.

We have fixed that.

Let's try to compute perplexity again.

So this is the toy data.

What is the perplexity?

The probability of some tokens is still zero because

we do not see those bigrams in our train data,

which means the probability of the whole test set is zero.

The perplexity is infinite,

and this is again not what we like.

So for this case,

we need to use some smoothing techniques.

And this is exactly what our next video is about.