Hey, do you remember the motivation from our previous video?

So, natural language is highly variable.

It means that if we train a model on training data,

it is very likely that when we apply this model to the test data,

we will get some zeros.

For example, some bigrams will not occur in the test data.

So, what can we do about those zeros?

Could we perhaps just substitute them with ones?

Well, actually, we cannot do this.

And the reason is that in this way,

you will not get a correct probability distribution,

because it will not sum to one.

Instead, we can do another simple thing.

We can add one to all the counts,

even those that are not zeros.

And then, we will add V,

the number of words in our vocabulary, to the denominator.

In this way, we will get a correct probability distribution, and it will have no zeros.

Now, what have we just done?

The idea is very simple.

We need to somehow pull the probability mass

from the frequent n-grams to infrequent ones.

And this is actually the core idea behind all smoothing techniques.

So, in the rest of the video,

we will discover what is the best way to pull the probability

mass from frequent n-grams to infrequent n-grams.

Another rather simple approach would be to add not one but

some constant k. And we can tune this constant using our test data.

This is called Add-k smoothing.

All these approaches are sometimes called Laplacian smoothing,

which is perhaps the simplest and most popular kind of smoothing.
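To make the formula concrete, here is a minimal sketch of an add-k bigram estimator in Python. The toy corpus and the value of k are hypothetical, purely for illustration; with k = 1 this is exactly Add-one (Laplace) smoothing.

```python
from collections import Counter

def add_k_bigram_prob(w1, w2, bigrams, unigrams, vocab_size, k=1.0):
    """Add-k estimate of P(w2 | w1): (c(w1, w2) + k) / (c(w1) + k * V).

    With k = 1 this reduces to Add-one (Laplace) smoothing."""
    return (bigrams[(w1, w2)] + k) / (unigrams[w1] + k * vocab_size)

# Hypothetical toy corpus.
tokens = "the cat sat on the mat".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
V = len(unigrams)  # vocabulary size

p_seen = add_k_bigram_prob("the", "cat", bigrams, unigrams, V)    # seen bigram
p_unseen = add_k_bigram_prob("cat", "the", bigrams, unigrams, V)  # unseen, but nonzero now
```

Note that for any fixed context the probabilities over the whole vocabulary still sum to one, which is exactly what a naive "replace zeros with ones" would break.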

Let us try to see something more complicated.

So, sometimes we would like to use longer n-grams.

It would be nice to use them, but we might not have enough data to do this.

So, the idea is,

what if we try to use longer n-grams first, and then,

if we do not have enough data to estimate their counts,

we become less greedy and go for shorter n-grams?

Katz backoff is an implementation of this idea.

So, let us start for example,

with a five-gram language model.

If the count is greater than zero,

then awesome, go for it.

If it's not greater than zero,

then let us be less greedy and go for a four-gram language model.

And again, if the count is greater than zero,

then we go for it,

else we go to a trigram language model.

So that is simple but I have a question for you.

Why do we have those alphas there, and also a tilde over the P in the if branch?

The reason is that we still need to care about the probabilities.

Those alpha constants are the discounts that make sure

that the probabilities of all sequences will sum to one in our model.
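The back-off logic can be sketched as follows. This is a simplified variant with a fixed alpha (sometimes called "stupid backoff"), not the full Katz estimator: real Katz backoff uses discounted probabilities and computes alpha per context so that everything stays normalized. The toy corpus and alpha = 0.4 are hypothetical.

```python
from collections import Counter

def backoff_prob(ngram, counts, alpha=0.4):
    """Use the longest n-gram with a nonzero count; otherwise back off.

    `counts` maps word tuples of any length to their corpus counts.
    The fixed alpha is a simplification: Katz backoff computes it
    per context to keep probabilities summing to one."""
    if len(ngram) == 1:
        total = sum(c for g, c in counts.items() if len(g) == 1)
        return counts[ngram] / total  # plain unigram estimate
    if counts[ngram] > 0:
        return counts[ngram] / counts[ngram[:-1]]  # relative frequency
    # Not enough data: be less greedy and drop the leftmost word.
    return alpha * backoff_prob(ngram[1:], counts, alpha)

# Hypothetical toy corpus: count all 1-, 2-, and 3-grams.
tokens = tuple("the cat sat on the mat".split())
counts = Counter(tokens[i:i + n]
                 for n in (1, 2, 3)
                 for i in range(len(tokens) - n + 1))

p_trigram = backoff_prob(("the", "cat", "sat"), counts)  # trigram is seen
p_backoff = backoff_prob(("sat", "the", "cat"), counts)  # falls back to the bigram
```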

The same idea can be implemented in a different way.

So, Interpolation smoothing says:

let us just have a mixture of all these n-gram models for different n.

For example, we will have unigram,

bigram, and trigram language models, and we will weight them with

some lambda coefficients, and these lambda coefficients will sum to one,

so we will still get a proper probability distribution.

And how can we find these lambdas?

Well, we can just tune them using some test or development set,

if we are afraid of overfitting.

Optionally, those lambdas can also depend on

some context in more sophisticated schemes.
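A minimal sketch of this mixture for a trigram model, assuming hypothetical lambda values and a toy corpus (in practice the lambdas are tuned on a held-out set):

```python
from collections import Counter

def interpolated_prob(w1, w2, w3, counts, lambdas=(0.5, 0.3, 0.2)):
    """Linear interpolation of trigram, bigram, and unigram estimates.

    The lambdas must sum to one; the values here are hypothetical."""
    l_tri, l_bi, l_uni = lambdas
    total = sum(c for g, c in counts.items() if len(g) == 1)
    p_uni = counts[(w3,)] / total
    p_bi = counts[(w2, w3)] / counts[(w2,)] if counts[(w2,)] else 0.0
    p_tri = counts[(w1, w2, w3)] / counts[(w1, w2)] if counts[(w1, w2)] else 0.0
    return l_tri * p_tri + l_bi * p_bi + l_uni * p_uni

# Hypothetical toy corpus: count all 1-, 2-, and 3-grams.
tokens = tuple("the cat sat on the mat".split())
counts = Counter(tokens[i:i + n]
                 for n in (1, 2, 3)
                 for i in range(len(tokens) - n + 1))

p = interpolated_prob("the", "cat", "sat", counts)
```

Because the lambdas sum to one and each component is a probability distribution, the mixture is still a proper distribution over the next word.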

Okay. We are doing great and we have just two methods left.

The first one, is called Absolute discounting.

Just to recap, the motivation for all our methods in this video is

to pull the probability mass from frequent n-grams to infrequent n-grams.

So, to what extent should we pull this probability mass?

The answer to this question can be given by a nice experiment that was carried out in 1991.

Let us stick to bigrams for now,

and let us see what happens if you count the number of

occurrences of bigrams in your training data and, after that,

you count the average number of occurrences of the same bigrams in the test data.

Those numbers are really correlated.

So, you can see that if you just subtract 0.75 from your training data counts,

you will get very good estimates for

your test data, and this is a somewhat magical property.

So, this is just a property of the language that we can try to use.

The way we use it is the following:

we subtract this discount D, which is 0.75,

or maybe a value tuned on some held-out data,

from the counts to model the probabilities of our frequent n-grams.

So, this is how we pull the mass, and 0.75 is the extent of the pull.

Now, to give probability to infrequent terms,

we use the unigram distribution here.

So on the right-hand side,

you will see some weight that makes sure the normalization is fine,

multiplied by the unigram distribution.
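The formula just described can be sketched for bigrams as follows; the toy corpus and D = 0.75 are as in the slide, and the weight lambda(w1) is chosen so that the subtracted mass is exactly redistributed:

```python
from collections import Counter

def absolute_discount_prob(w1, w2, counts, d=0.75):
    """Absolute discounting for bigrams:

    P(w2 | w1) = max(c(w1, w2) - d, 0) / c(w1) + lam(w1) * P_uni(w2),

    where lam(w1) redistributes the subtracted mass over the
    unigram distribution."""
    total = sum(c for g, c in counts.items() if len(g) == 1)
    p_uni = counts[(w2,)] / total
    c_w1 = counts[(w1,)]
    # Each distinct bigram type starting with w1 donated d to the
    # redistributed mass.
    n_types = sum(1 for g in counts if len(g) == 2 and g[0] == w1)
    lam = d * n_types / c_w1
    return max(counts[(w1, w2)] - d, 0) / c_w1 + lam * p_uni

# Hypothetical toy corpus: count all 1- and 2-grams.
tokens = tuple("the cat sat on the mat".split())
counts = Counter(tokens[i:i + n]
                 for n in (1, 2)
                 for i in range(len(tokens) - n + 1))

p = absolute_discount_prob("the", "cat", counts)
```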

Now, can we do maybe something better than just a unigram distribution there?

And this is the idea of the Kneser-Ney smoothing.

So, let us see this example.

"This is the malt" or "this is the Kong"?

So, the word "Kong" might be even more frequent than the word "malt", but the thing is

that it can only occur in the bigram "Hong Kong".

So, the word "Kong" is not very flexible

in terms of the different contexts that can go before it.

And this is why we should not prefer this word here to continue our phrase.

On the contrary, the word "malt" is not that popular,

but it can go nicely with many different contexts.

So, this idea is formalized with the formula in the top of this slide.

Let us make the probability of a word proportional

to how many different contexts can go just before the word.

So, if you take your absolute discounting model and, instead of the

unigram distribution, use this nice continuation distribution, you will get Kneser-Ney smoothing.
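Putting the two pieces together, here is a bigram-level Kneser-Ney sketch: the absolute-discounting formula above, with the unigram distribution swapped for the continuation distribution. The toy corpus is hypothetical (it does not contain the "Hong Kong" example), and this is the simple bigram case rather than the full recursive estimator.

```python
from collections import Counter

def continuation_prob(word, counts):
    """Continuation probability: the fraction of distinct bigram types
    that end with `word`, i.e. how many different contexts it follows."""
    bigram_types = [g for g in counts if len(g) == 2]
    return sum(1 for g in bigram_types if g[1] == word) / len(bigram_types)

def kneser_ney_prob(w1, w2, counts, d=0.75):
    """Absolute discounting with the unigram distribution replaced by
    the continuation distribution (bigram Kneser-Ney sketch)."""
    c_w1 = counts[(w1,)]
    n_types = sum(1 for g in counts if len(g) == 2 and g[0] == w1)
    lam = d * n_types / c_w1
    return max(counts[(w1, w2)] - d, 0) / c_w1 + lam * continuation_prob(w2, counts)

# Hypothetical toy corpus: count all 1- and 2-grams.
tokens = tuple("the cat sat on the mat".split())
counts = Counter(tokens[i:i + n]
                 for n in (1, 2)
                 for i in range(len(tokens) - n + 1))

p = kneser_ney_prob("the", "cat", counts)
```

A word like "Kong" would get a low continuation probability here even if it were frequent, because it would end only one bigram type.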

Awesome. We have just covered several smoothing techniques, from simple ones,

like Add-one smoothing, to really advanced techniques, like Kneser-Ney smoothing.

Actually, Kneser-Ney smoothing is a really strong baseline in language modeling.

So, in the next lessons we will also cover

neural language models, and we will see that it is not so easy to beat this baseline.