0:00

[SOUND] >> This

lecture is about the Overview of Statistical Language Models,

which cover proper models as special cases.

In this lecture we're going to give

a overview of Statical Language Models.

These models are general models that cover

probabilistic topic models as a special cases.

So first off, what is a Statistical Language Model?

0:31

A Statistical Language Model is basically a probability distribution

over word sequences.

So, for example, we might have a distribution that gives,

today is Wednesday a probability of .001.

It might give today Wednesday is, which

is a non-grammatical sentence, a very, very small probability as shown here.

0:54

And similarly another sentence,

the eigenvalue is positive might get the probability of .00001.

So as you can see such a distribution clearly is Context Dependent.

It depends on the Context of Discussion.

Some Word Sequences might have higher probabilities than others but the same

Sequence of Words might have different probability in different context.

1:33

And that just means we can view text data as data observed from such a model.

For this reason, we call such a model as Generating Model.

So, now given a model we can then assemble sequences of words.

So, for example, based on the distribution that I have shown here on this slide,

when matter it say assemble a sequence like today is Wednesday

because it has a relative high probability.

We might often get such a sequence.

We might also get the item value as positive sometimes

with a smaller probability and very, very occasionally we might

get today is Wednesday because it's probability is so small.

2:24

So in general, in order to categorize such a distribution we must specify probability

values for all these different sequences of words.

Obviously, it's impossible to specify that because it's

impossible to enumerate all of the possible sequences of words.

So in practice, we will have to simplify the model in some way.

So, the simplest language model is called the Unigram Language Model.

In such a case, it was simply a the text

is generated by generating each word independently.

3:24

So for such a model,

we have as many parameters as the number of words in our vocabulary.

So here we assume we have n words, so we have n probabilities.

One for each word.

And then some to 1.

So, now we assume that our text is a sample

drawn according to this word distribution.

That just means, we're going to draw a word each time and

then eventually we'll get a text.

4:06

And some other words like eigenvalue might have a small probability, etcetera.

But with this, we actually can also compute the probability of

every sequence, even though our model only specify the probabilities of words.

And this is because of the independence.

So specifically, we can compute the probability of today is Wednesday.

4:34

Because it's just a product of the probability of today,

the probability of is, and probability of Wednesday.

For example, I show some fake numbers here and when you

multiply these numbers together you get the probability that today's Wednesday.

So as you can see, with N probabilities, one for each word, we actually

can characterize the probability situation over all kinds of sequences of words.

And so, this is a very simple model.

Ignore the word order.

So it may not be, in fact, in some problems, such as for speech recognition,

where you may care about the order of words.

But it turns out to be quite sufficient for

many tasks that involve topic analysis.

And that's also what we're interested in here.

So when we have a model, we generally have two problems that we can think about.

One is, given a model, how likely are we to observe a certain kind of data points?

That is, we are interested in the Sampling Process.

The other is the Estimation Process.

And that, is to think of the parameters of a model given,

some observe the data and we're going to talk about that in a moment.

Let's first talk about the sampling.

So, here I show two examples of Water Distributions or Unigram Language Models.

The first one has higher probabilities for

words like a text mining association, it's separate.

6:23

So in this case, if we ask the question about

what is the probability of generating a particular document.

Then, we likely will see text that looks like a text mining paper.

Of course, the text that we generate by drawing words.

This distribution is unlikely coherent.

Although, the probability of generating attacks mine

[INAUDIBLE] publishing in the top conference is

non-zero assuming that no word has a zero probability in the distribution.

And that just means, we can essentially generate all kinds of

text documents including very meaningful text documents.

7:07

Now, the second distribution show,

on the bottom, has different than what was high probabilities.

So food [INAUDIBLE] healthy [INAUDIBLE], etcetera.

So this clearly indicates a different topic.

In this case it's probably about health.

So if we sample a word from such a distribution,

then the probability of observing a text mining paper would be very, very small.

7:41

So that just means, given a particular distribution, different than the text.

Now let's look at the estimation problem now.

In this case, we're going to assume that we have observed the data.

I will know exactly what the text data looks like.

In this case, let's assume we have a text mining paper.

In fact, it's abstract of the paper, so the total number of words is 100.

And I've shown some counts of individual words here.

8:48

What about query?

Well, your guess probably would be dependent on

how many times we have observed this word in the text data, right?

And if you think about it for a moment.

And if you are like many others, you would have guessed that,

well, text has a probability of 10 out of 100 because I've observed

the text 10 times in the text that has a total of 100 words.

And similarly, mining has 5 out of 100.

And query has a relatively small probability, just observed for once.

So it's 1 out of 100.

Right, so that, intuitively, is a reasonable guess.

But the question is, is this our best guess or best estimate of the parameters?

9:37

Of course, in order to answer this question,

we have to define what do we mean by best, in this case,

it turns out that our guesses are indeed the best.

In some sense and this is called Maximum Likelihood Estimate.

And it's the best thing that, it will give the observer data our maximum probability.

10:01

Meaning that, if you change the estimate somehow, even slightly,

then the probability of the observed text data will be somewhat smaller.

And this is called a Maximum Likelihood Estimate.

[MUSIC]