0:00

[SOUND] Now let's look at another behavior of the mixture model, and

in this case let's look at the response to data frequencies.

So what you are seeing now is basically the likelihood function for

the two-word document, and we know that in this case the solution is to give text

a probability of 0.9 and the a probability of 0.1.
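
For reference, the likelihood function being described can be written out roughly as follows. This is a sketch rather than a copy of the slide: it assumes the two-word document is "text the", that each component is chosen with probability 0.5, and that the background model gives the a probability of 0.9 and text a probability of 0.1 (the 0.1 for text is an assumption, not stated in this segment).

```latex
p(d \mid \Lambda) =
  \bigl[\,0.5\, p(\text{text} \mid \theta_d) + 0.5 \times 0.1\,\bigr]
  \times
  \bigl[\,0.5\, p(\text{the} \mid \theta_d) + 0.5 \times 0.9\,\bigr],
\qquad
p(\text{text} \mid \theta_d) + p(\text{the} \mid \theta_d) = 1
```

The two bracketed terms always add up to 1 under these assumptions, so their product is largest when both equal 0.5, which gives text a probability of 0.9 and the a probability of 0.1.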

Now it's interesting to

think about a scenario where we start adding more words to the document.

So what would happen if we add many the's to the document?

0:41

Now this would change the game, right?

So, how?

Well, picture, what would the likelihood function look like now?

Well, it starts with the likelihood function for the two words, right?

As we add more words, we know that

we just have to multiply the likelihood function by

additional terms to account for the additional

occurrences of the.

Since in this case,

all the additional terms are the, we're going to just multiply by this term.

Right? For the probability of the.

1:12

And if we have another occurrence of the, we'd multiply again by the same term,

and so on and so forth.

We add as many terms as the number of the's that we add to the document, d'.

Now this obviously changes the likelihood function.
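
To make the multiplication concrete, here is a minimal Python sketch of the likelihood for a document with one occurrence of text and any number of occurrences of the. The function name `likelihood`, its parameter names, and the background value 0.1 for text are assumptions made for illustration; the 0.5 mixing weight and the background value 0.9 for the are the numbers used in the lecture.

```python
def likelihood(p_the_d, n_the, n_text=1, lam_b=0.5, p_the_b=0.9, p_text_b=0.1):
    """Likelihood of a document with n_text 'text' tokens and n_the 'the' tokens.

    p_the_d  : p(the | theta_d); p(text | theta_d) is 1 - p_the_d
    lam_b    : probability of choosing the background model
    p_the_b  : background probability of 'the'
    p_text_b : background probability of 'text' (assumed value)
    """
    p_text_d = 1.0 - p_the_d
    # Each word occurrence contributes one mixture term to the product.
    term_text = (1 - lam_b) * p_text_d + lam_b * p_text_b
    term_the = (1 - lam_b) * p_the_d + lam_b * p_the_b
    return term_text ** n_text * term_the ** n_the


if __name__ == "__main__":
    # Keep the old solution p(the | theta_d) = 0.1 and add more the's.
    for n_the in (1, 2, 3):
        print(n_the, likelihood(0.1, n_the))
```

With the old solution of 0.1, both mixture terms equal 0.5, so every extra the simply multiplies the likelihood by another 0.5.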

So what's interesting is now to think about how would that change our solution?

So what's the optimal solution now?

1:48

But the question is how we should change it.

In general, the probabilities must sum to one.

So we know we must take away some probability mass from one word and

add that probability mass to the other word.

The question is which word should have its probability reduced and

which word should have a larger probability.

And in particular, let's think about the probability of the.

Should it be increased to be more than 0.1?

Or should we decrease it to less than 0.1?

What do you think?

2:19

Now you might want to pause the video for a moment to think more about

this question,

because it has to do with understanding an important behavior of a mixture model,

and indeed, of other maximum likelihood estimators.

Now if you look at the formula for a moment, then you will see that

now the objective function is more influenced by the than by text.

Before, each word contributed one term.

So now, as you can imagine, it would make sense to actually

assign a smaller probability to text,

to make room for a larger probability for the.

Why? Because the is repeated many times.

If we increase it a little bit, it will have a more positive impact,

whereas a slight decrease of text will have a relatively small impact

because it occurred just once, right?

So this means there is another behavior that we observe here.

That is, high-frequency words tend to get high probabilities

from all the distributions.

And, this is no surprise at all,

because after all, we are maximizing the likelihood of the data.

So the more a word occurs, the more sense it makes to give such a word

a higher probability, because the impact on the likelihood function would be greater.

This is in fact a very general phenomenon of all maximum likelihood estimators.

But in this case, we can see as we see more occurrences of a term,

it also encourages the unknown distribution theta sub d

to assign a somewhat higher probability to this word.
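
The effect of word frequency on the estimate can be checked numerically. The sketch below is only an illustration under assumed settings: it uses a simple grid search rather than any particular estimation algorithm, the names `log_likelihood` and `best_p_the` are made up for this example, and the background probability 0.1 for text is an assumption.

```python
import math


def log_likelihood(p_the_d, n_the, n_text=1, lam_b=0.5, p_the_b=0.9, p_text_b=0.1):
    """Log-likelihood of a document with n_text 'text' and n_the 'the' tokens."""
    term_text = (1 - lam_b) * (1 - p_the_d) + lam_b * p_text_b
    term_the = (1 - lam_b) * p_the_d + lam_b * p_the_b
    return n_text * math.log(term_text) + n_the * math.log(term_the)


def best_p_the(n_the, steps=1000, **kwargs):
    """Grid search for the p(the | theta_d) that maximizes the log-likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: log_likelihood(p, n_the, **kwargs))


if __name__ == "__main__":
    # One occurrence of 'text', and an increasing number of the's.
    for n_the in (1, 2, 3, 5, 10):
        print(n_the, best_p_the(n_the))
```

Under these assumptions the search returns 0.1 when the occurs once, and a larger value each time more the's are added, which matches the behavior described here.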

4:07

Now it's also interesting to think about the impact of the probability of theta sub B,

the probability of choosing one of the two component models.

So far we've been assuming that each model is equally likely,

and that gives us 0.5.

But you can again look at this likelihood function and try to picture what would

happen if we increase the probability of choosing a background model.

Now you will see that these terms for the

would have a different form, where the probability would be

4:40

even larger because the background has a high probability for the word and

the coefficient in front of 0.9 which is now 0.5 would be even larger.

When this is larger, the overall result would be larger.

And that also makes it less important for

theta sub d to increase the probability of the,

because it's already very large.

So the impact here of increasing the probability of the is somewhat

regulated by this coefficient, the 0.5.

If it's larger for the background,

then it becomes less important to increase the value.

So this means the behavior here,

which is that high-frequency words tend to get high probabilities, is affected or

regularized somewhat by the probability of choosing each component.

The more likely a component is to be chosen,

the more important it is for it to have higher values for these frequent words.

If a component has a very small probability of being chosen, then the incentive is less.
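
The regulating role of the mixing weight can also be illustrated with a small sweep. This is a sketch under assumptions, not the lecture's own procedure: the document is taken to contain one text and three the's, the background probability of text is assumed to be 0.1, and the helper names `log_likelihood` and `best_p_the_for_weight` are hypothetical.

```python
import math


def log_likelihood(p_the_d, lam_b, n_the=3, n_text=1, p_the_b=0.9, p_text_b=0.1):
    """Log-likelihood of the toy document for a given background weight lam_b."""
    term_text = (1 - lam_b) * (1 - p_the_d) + lam_b * p_text_b
    term_the = (1 - lam_b) * p_the_d + lam_b * p_the_b
    return n_text * math.log(term_text) + n_the * math.log(term_the)


def best_p_the_for_weight(lam_b, steps=1000):
    """Grid search for the best p(the | theta_d) at a fixed background weight."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda p: log_likelihood(p, lam_b))


if __name__ == "__main__":
    # Sweep the probability of choosing the background model.
    for lam_b in (0.2, 0.5, 0.8):
        print(lam_b, best_p_the_for_weight(lam_b))
```

In this toy setting, the more weight the background model gets, the lower the best probability of the in theta sub d becomes, because the background already explains most occurrences of the.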

So to summarize, we have just discussed the mixture model.

And we discussed the estimation problem of the mixture model, and

in particular we discussed some general behavior of the estimator,

which means we can expect our estimator to capture these intuitions.

First, every component model

attempts to assign high probabilities to highly frequent words in the data.

And this is to collaboratively maximize likelihood.

Second, different component models tend to bet high probabilities on different words.

And this is to avoid a competition or waste of probability.

And this would allow them to collaborate more efficiently to maximize

the likelihood.

6:33

Third, the probability of choosing each component regulates the collaboration and

the competition between component models.

It would allow some component models to respond more to the change,

for example, of the frequency of a data point in the data.

6:53

We also talked about the special case of fixing one component to a background

word distribution, right?

And this distribution can be estimated by using a collection of documents,

a large collection of English documents, by using just one distribution and

then we'll just have normalized frequencies of terms to

give us the probabilities of all these words.
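
As a rough illustration of estimating the background distribution by normalized term frequencies, here is a short Python sketch. The function name `estimate_background`, the whitespace tokenization, and the three-document toy collection are all assumptions made for this example; a real background model would be estimated from a large collection.

```python
from collections import Counter


def estimate_background(documents):
    """Estimate one word distribution from a collection by normalized counts."""
    # Lowercase and split on whitespace (a deliberately simple tokenizer).
    counts = Counter(word for doc in documents for word in doc.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy collection standing in for a large set of English documents.
toy_collection = [
    "the text mining course",
    "the data and the text",
    "the the the",
]

background = estimate_background(toy_collection)

if __name__ == "__main__":
    # Print words from most to least probable.
    for word, prob in sorted(background.items(), key=lambda item: -item[1]):
        print(word, round(prob, 3))
```

Even in this tiny collection, the ends up with a much higher probability than text, which is the property the fixed background component relies on.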

Now when we use such a specialized mixture model,

we showed that we can effectively get rid of the background words in the other component.

7:27

This is also an example of imposing a prior on the model parameters.

The prior here basically means one model must be exactly the same as the background

language model. If you recall what we talked about in Bayesian estimation,

this prior will allow us to favor a model that is consistent with our prior.

In fact, if it's not consistent, we're going to say the model is impossible,

so it has a zero prior probability.

That effectively excludes such a scenario.

This is also an issue that we'll talk more about later.

[MUSIC]