0:00

[SOUND] Now let's look at another behavior of the mixture model, and in this case let's look at the response to data frequencies. So what you are seeing now is basically the likelihood function for the two-word document, and we know that in this case the solution is: "text" gets a probability of 0.9 and "the" gets a probability of 0.1. Now it's interesting to think about a scenario where we start adding more words to the document. So what would happen if we add many "the"s to the document?

0:41

Now this would change the game, right? So, how? Well, picture what the likelihood function would look like now. Well, it starts with the likelihood function for the two words, right? As we add more words, we know that we just have to multiply the likelihood function by additional terms to account for the additional occurrences. Since in this case all the additional words are "the", we're going to just multiply by this term, right? The term for the probability of "the".

1:12

And if we have another occurrence of "the", we'd multiply again by the same term, and so on and so forth. We add as many terms as the number of "the"s that we add to the document d'. Now this obviously changes the likelihood function. So what's interesting is now to think about how that would change our solution. So what's the optimal solution now?
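As a concrete sketch (my own code, not part of the lecture, assuming the numbers on the slide: equal mixing weights of 0.5 and a background with p(the) = 0.9, p(text) = 0.1), the likelihood for one "text" plus n occurrences of "the" is just a product with one factor per word:

```python
def mixture_likelihood(p_the_d, n_the, p_bg=0.5):
    """Likelihood of a document containing one 'text' and n_the 'the's
    under a two-component mixture: an unknown topic model theta_d
    (two-word vocabulary, so p(text|theta_d) = 1 - p(the|theta_d))
    plus a fixed background with p(the) = 0.9, p(text) = 0.1."""
    p_d = 1.0 - p_bg
    # one factor per word: p(theta_d) p(w|theta_d) + p(theta_B) p(w|theta_B)
    factor_text = p_d * (1.0 - p_the_d) + p_bg * 0.1
    factor_the = p_d * p_the_d + p_bg * 0.9
    return factor_text * factor_the ** n_the

# the slide's solution p(the|theta_d) = 0.1 for the original two-word document
print(mixture_likelihood(0.1, 1))  # ≈ 0.25
```

Each extra "the" multiplies in one more copy of the same factor, which is exactly the repeated term described above.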

1:48

But the question is how we should change it. The probabilities in general must sum to one, so we know we must take away some probability mass from one word and add that probability mass to the other word. The question is which word should have a reduced probability and which word should have a larger probability. And in particular, let's think about the probability of "the". Should it be increased to more than 0.1? Or should we decrease it to less than 0.1? What do you think?

2:19

Now you might want to pause the video for a moment to think more about this question, because it has to do with understanding an important behavior of a mixture model, and indeed of any maximum likelihood estimator. Now if you look at the formula for a moment, you will see that the objective function is now influenced more by "the" than by "text", because "the" contributes many more factors. So, as you can imagine, it would make sense to assign a smaller probability to "text" in order to make room for a larger probability for "the". Why? Because "the" is repeated many times, so if we increase its probability a little bit, it will have a larger positive impact, whereas a slight decrease of "text" will have a relatively small impact, because it occurred just once, right? So this means there is another behavior that we observe here: high-frequency words get high probabilities from all the distributions. And this is no surprise at all, because after all, we are maximizing the likelihood of the data. So the more a word occurs, the more sense it makes to give such a word a higher probability, because the impact on the likelihood function will be greater. This is in fact a very general phenomenon of all maximum likelihood estimators. And in this case, we can see that as we see more occurrences of a term, it also encourages the unknown distribution theta sub d to assign a somewhat higher probability to this word.
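To make this effect concrete, here is a small numeric check (my own sketch, not from the lecture, again assuming the slide's numbers: equal mixing weights and a background with p(the) = 0.9, p(text) = 0.1). A grid search for the maximum-likelihood p(the | theta_d) shows it growing as we add more "the"s:

```python
def likelihood(p_the_d, n_the):
    # document: one 'text' plus n_the copies of 'the'
    # equal mixing weights; background: p(the) = 0.9, p(text) = 0.1
    return ((0.5 * (1.0 - p_the_d) + 0.5 * 0.1)
            * (0.5 * p_the_d + 0.5 * 0.9) ** n_the)

def best_p_the(n_the):
    # grid search for the maximum-likelihood p(the | theta_d)
    grid = [i / 1000 for i in range(1001)]
    return max(grid, key=lambda p: likelihood(p, n_the))

for n in (1, 2, 4, 10):
    print(n, best_p_the(n))  # starts at 0.1 and rises toward 1 as n grows
```

With a single "the" the optimum is the lecture's 0.1; with more occurrences the estimator steadily shifts mass from "text" to "the".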

4:07

Now it's also interesting to think about the impact of the probability of theta sub B, the probability of choosing one of the two component models. Now, we've so far been assuming that each model is equally likely, and that gives us 0.5. But you can again look at this likelihood function and try to picture what would happen if we increase the probability of choosing the background model.

4:40

You will see that the terms for "the" now take a different form, where their value would be even larger, because the background has a high probability for this word, and the coefficient in front of 0.9, which is now 0.5, would be even larger. When this coefficient is larger, the overall result would be larger.

And that also makes it less important for theta sub d to increase the probability of "the", because it's already very large. So the impact of increasing the probability of "the" is somewhat regulated by this coefficient, the probability of choosing the component. If it's larger for the background, then it becomes less important to increase the value. So this means the behavior here, in which high-frequency words tend to get high probabilities, is affected, or regularized, somewhat by the probability of choosing each component. The more likely a component is to be chosen, the more important it is for it to have higher probabilities for these frequent words. If it has a very small probability of being chosen, then the incentive is less.
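This regulating effect can also be checked numerically (my own sketch with the slide's numbers, not the lecture's code; note the direction shown here holds for this small document, where "the" is still less frequent than in the background, so a more likely background lowers theta sub d's incentive to inflate p(the)):

```python
def likelihood(p_the_d, n_the, bg_weight):
    # one 'text' plus n_the 'the's; background: p(the) = 0.9, p(text) = 0.1
    w = 1.0 - bg_weight  # probability of choosing theta_d
    return ((w * (1.0 - p_the_d) + bg_weight * 0.1)
            * (w * p_the_d + bg_weight * 0.9) ** n_the)

def best_p_the(n_the, bg_weight):
    # grid search for the maximum-likelihood p(the | theta_d)
    grid = [i / 1000 for i in range(1001)]
    return max(grid, key=lambda p: likelihood(p, n_the, bg_weight))

# with 4 'the's, a more likely background already explains them,
# so theta_d has less incentive to give 'the' a high probability
print(best_p_the(4, 0.5))  # ≈ 0.7
print(best_p_the(4, 0.8))  # ≈ 0.4
```

Raising the background's choice probability from 0.5 to 0.8 roughly halves the estimated p(the | theta_d) in this toy setting.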

So to summarize, we have just discussed the mixture model. We discussed the estimation problem of the mixture model, and in particular we discussed some general behaviors of the estimator, and that means we can expect our estimator to capture these intuitions. First, every component model attempts to assign high probabilities to highly frequent words in the data, and this is to collaboratively maximize the likelihood. Second, different component models tend to bet high probabilities on different words, and this is to avoid competition, or waste of probability, and this would allow them to collaborate more efficiently to maximize the likelihood.

6:33

So, the probability of choosing each component regulates the collaboration and the competition between the component models. It would allow some component models to respond more to a change, for example, in the frequency of a word like "the" in the data.

6:53

We also talked about the special case of fixing one component to a background word distribution, right? And this distribution can be estimated by using a collection of documents, a large collection of English documents, using just one distribution, and then the normalized frequencies of terms give us the probabilities of all these words. Now, when we use such a specialized mixture model, we showed that we can effectively get rid of the background words in the other component.

7:27

This is also an example of imposing a prior on the model parameters, and the prior here basically means one model must be exactly the same as the background language model. If you recall what we talked about in Bayesian estimation, this prior will allow us to favor a model that is consistent with our prior. In fact, if a model is not consistent with the prior, we're going to say the model is impossible, so it has a zero prior probability. That effectively excludes such a scenario. This is also an issue that we'll talk more about later.

[MUSIC]