0:00

[SOUND] >> This lecture is about an overview of statistical language models, which cover probabilistic topic models as special cases.

In this lecture we're going to give an overview of statistical language models. These models are general models that cover probabilistic topic models as special cases. So first off, what is a statistical language model?

0:31

A statistical language model is basically a probability distribution over word sequences. So, for example, we might have a distribution that gives "today is Wednesday" a probability of 0.001. It might give "today Wednesday is", which is a non-grammatical sequence, a very, very small probability, as shown here.

0:54

And similarly, another sentence, "the eigenvalue is positive", might get a probability of 0.00001. So as you can see, such a distribution is clearly context dependent; it depends on the context of the discussion. Some word sequences might have higher probabilities than others, but the same sequence of words might have different probabilities in different contexts.

1:33

And that just means we can view text data as data observed from such a model. For this reason, we call such a model a generative model. So now, given a model, we can sample sequences of words. For example, based on the distribution that I have shown here on this slide, we might sample a sequence like "today is Wednesday", because it has a relatively high probability; we might often get such a sequence. We might also get "the eigenvalue is positive" sometimes, with a smaller probability, and very, very occasionally we might get "today Wednesday is", because its probability is so small.

2:24

So in general, in order to characterize such a distribution, we must specify probability values for all these different sequences of words. Obviously, that's impossible, because it's impossible to enumerate all the possible sequences of words. So in practice, we will have to simplify the model in some way. The simplest language model is called the unigram language model. In such a model, we simply assume that the text is generated by generating each word independently.

3:24

So for such a model, we have as many parameters as the number of words in our vocabulary. So here we assume we have N words, so we have N probabilities, one for each word, and they sum to 1. We now assume that our text is a sample drawn according to this word distribution. That just means we're going to draw a word each time, and eventually we'll get a text.
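The drawing process just described is easy to sketch in code. Here is a minimal Python sketch of a unigram language model; the vocabulary and probabilities are made-up illustrative numbers, not the actual values from the slide:

```python
import random

# A toy unigram language model: one probability per word.
# These numbers are invented for illustration only.
unigram_lm = {
    "today": 0.10,
    "is": 0.25,
    "wednesday": 0.05,
    "eigenvalue": 0.01,
    "the": 0.35,
    "positive": 0.24,
}
assert abs(sum(unigram_lm.values()) - 1.0) < 1e-9  # probabilities sum to 1

def sample_text(lm, length, seed=None):
    """Generate `length` words by drawing each word independently from lm."""
    rng = random.Random(seed)
    words = list(lm)
    weights = [lm[w] for w in words]
    return rng.choices(words, weights=weights, k=length)

print(" ".join(sample_text(unigram_lm, 8, seed=42)))
```

Because each draw is independent, high-probability words will dominate the output, and the order in which words appear carries no meaning.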

4:06

And some other words, like "eigenvalue", might have small probabilities, etc. But with this, we can actually also compute the probability of every sequence, even though our model only specifies the probabilities of individual words. And this is because of the independence assumption. So specifically, we can compute the probability of "today is Wednesday".

4:34

It's just the product of the probability of "today", the probability of "is", and the probability of "Wednesday". For example, I show some fake numbers here, and when you multiply these numbers together, you get the probability of "today is Wednesday". So as you can see, with N probabilities, one for each word, we can actually characterize the probability distribution over all kinds of sequences of words.
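Under the independence assumption, this sequence probability is just a product. A minimal sketch, using placeholder probabilities in the spirit of the slide's fake numbers:

```python
import math

# Placeholder unigram probabilities (not the actual slide values).
p = {"today": 0.002, "is": 0.05, "wednesday": 0.001}

def sequence_prob(lm, words):
    """P(w1 ... wn) = p(w1) * p(w2) * ... * p(wn) under a unigram model."""
    prob = 1.0
    for w in words:
        prob *= lm[w]
    return prob

# 0.002 * 0.05 * 0.001 = 1e-07 (up to floating-point rounding)
print(sequence_prob(p, ["today", "is", "wednesday"]))

# The product is the same for any reordering of the words.
assert math.isclose(sequence_prob(p, ["wednesday", "is", "today"]),
                    sequence_prob(p, ["today", "is", "wednesday"]))
```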

And so, this is a very simple model; it ignores the word order. So it may not be sufficient for some problems, such as speech recognition, where you may care about the order of words. But it turns out to be quite sufficient for many tasks that involve topic analysis, and that's also what we're interested in here.

So when we have a model, we generally have two problems that we can think about. One is: given a model, how likely are we to observe a certain kind of data? That is, we are interested in the sampling process. The other is the estimation process, that is, to estimate the parameters of the model given some observed data; we're going to talk about that in a moment. Let's first talk about sampling. So here I show two examples of word distributions, or unigram language models. The first one has higher probabilities for words like "text", "mining", "association", etc.

6:23

So in this case, if we ask what the probability of generating a particular document is, then we will likely see text that looks like a text mining paper. Of course, the text that we generate by drawing words from this distribution is unlikely to be coherent. Still, the probability of generating a text mining [INAUDIBLE] published in a top conference is non-zero, assuming that no word has a zero probability in the distribution. And that just means we can essentially generate all kinds of text documents, including very meaningful text documents.

7:07

Now, the second distribution, shown on the bottom, has high probabilities for different words: "food", [INAUDIBLE], "healthy", [INAUDIBLE], etc. So this clearly indicates a different topic; in this case, it's probably about health. So if we sample words from such a distribution, then the probability of observing a text mining paper would be very, very small.
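This can be checked numerically by scoring the same short document under two toy word distributions. Everything here, the vocabularies and the probabilities, is invented for illustration:

```python
import math

# Two toy unigram distributions over a shared mini-vocabulary (each sums to 1).
text_mining_topic = {"text": 0.30, "mining": 0.25, "clustering": 0.15,
                     "food": 0.05, "healthy": 0.05, "the": 0.20}
health_topic      = {"text": 0.05, "mining": 0.05, "clustering": 0.05,
                     "food": 0.30, "healthy": 0.35, "the": 0.20}

def log_likelihood(lm, words):
    """Sum of log word probabilities; higher means the text is more likely under lm."""
    return sum(math.log(lm[w]) for w in words)

doc = ["text", "mining", "clustering", "the", "text"]
ll_tm = log_likelihood(text_mining_topic, doc)
ll_health = log_likelihood(health_topic, doc)
print(ll_tm > ll_health)  # True: the text-mining topic explains this document far better
```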

7:41

So that just means different distributions tend to generate different kinds of text. Now let's look at the estimation problem. In this case, we're going to assume that we have observed the data, so we know exactly what the text data looks like. Let's assume we have a text mining paper; in fact, it's the abstract of a paper, so the total number of words is 100. And I've shown some counts of individual words here.

8:48

What about "query"? Well, your guess probably would depend on how many times we have observed this word in the text data, right? If you think about it for a moment, and if you are like many others, you would have guessed that, well, "text" has a probability of 10 out of 100, because I've observed "text" 10 times in a text that has a total of 100 words. Similarly, "mining" has 5 out of 100. And "query" has a relatively small probability; it's observed just once, so it's 1 out of 100.
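This count-based guess is simple to compute. A sketch using the counts mentioned in the lecture (10 occurrences of "text", 5 of "mining", 1 of "query" in a 100-word abstract; the remaining 84 words are lumped into a single hypothetical placeholder entry):

```python
from collections import Counter

# Counts from the 100-word abstract; "<other>" is a hypothetical placeholder
# standing in for all remaining words.
counts = Counter({"text": 10, "mining": 5, "query": 1, "<other>": 84})
total = sum(counts.values())  # 100

# Relative-frequency estimate: count(w) / total.
estimate = {w: c / total for w, c in counts.items()}

print(estimate["text"])    # 0.1
print(estimate["mining"])  # 0.05
print(estimate["query"])   # 0.01
```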

Right, so that, intuitively, is a reasonable guess. But the question is: is this our best guess, or best estimate, of the parameters?

9:37

Of course, in order to answer this question, we have to define what we mean by "best". In this case, it turns out that our guesses are indeed the best in some sense, and this is called the maximum likelihood estimate. It is the best in the sense that it gives the observed data the maximum probability.

10:01

That means if you change the estimate somehow, even slightly, then the probability of the observed text data will be somewhat smaller. And this is called the maximum likelihood estimate.
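One can verify this claim numerically: the likelihood of the observed counts is highest at the relative-frequency estimate, and any small perturbation lowers it. A sketch with the lecture's counts (the "<other>" entry and the perturbation amount are arbitrary choices for illustration):

```python
import math

# Observed counts; "<other>" lumps the remaining 84 words together.
counts = {"text": 10, "mining": 5, "query": 1, "<other>": 84}
total = sum(counts.values())

def log_likelihood(theta, counts):
    """Log probability of the observed counts under word distribution theta."""
    return sum(c * math.log(theta[w]) for w, c in counts.items())

# Maximum likelihood estimate: relative frequencies.
mle = {w: c / total for w, c in counts.items()}

# Shift a little probability mass from "<other>" to "text" (still sums to 1).
perturbed = dict(mle)
perturbed["text"] += 0.01
perturbed["<other>"] -= 0.01

print(log_likelihood(mle, counts) > log_likelihood(perturbed, counts))  # True
```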

[MUSIC]
