0:22

So this is a slide that you have seen earlier, where we discussed the problems with using a term as a topic.

So, to solve these problems, intuitively we need to use more words to describe the topic. This will address the problem of lack of expressive power: when we have more words to describe the topic, we can describe complicated topics.

To address the second problem, we need to introduce weights on words. This is what allows us to distinguish subtle differences in topics and to introduce semantically related words in a fuzzy manner.

Finally, to solve the problem of word ambiguity, we need to split an ambiguous word, so that we can disambiguate its topic.

1:15

It turns out that all of this can be done by using a probabilistic topic model. And that's why we're going to spend a lot of lectures talking about this topic.

So the basic idea here is to improve the representation of a topic from one term to a word distribution. What you see now is the old representation, where we represented each topic with just one word, one term, or one phrase. But now we're going to use a word distribution to describe the topic. So here you see that, for Sports, we're going to use a word distribution over, theoretically speaking, all the words in our vocabulary.

1:54

So, for example, the high-probability words here are sports, game, basketball, football, play, star, etc. These are sports-related terms. And of course it would also give a non-zero probability to some other word like travel, which might be related to sports in general, though not so strongly related to this topic.
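As a minimal sketch of this idea, a topic can be stored as a mapping from each vocabulary word to a probability. The words and numbers below are illustrative only, not the actual values from the slide:

```python
# A topic as a word distribution: every vocabulary word gets a probability,
# and the probabilities sum to one. Values here are made up for illustration.
sports = {
    "sports": 0.30, "game": 0.20, "basketball": 0.15, "football": 0.10,
    "play": 0.10, "star": 0.08, "travel": 0.05, "science": 0.02,
}

# The highest-probability words characterize the topic...
top_words = sorted(sports, key=sports.get, reverse=True)[:3]
print(top_words)  # ['sports', 'game', 'basketball']

# ...while loosely related words still keep a small non-zero probability.
assert abs(sum(sports.values()) - 1.0) < 1e-9
assert sports["travel"] > 0
```

Note how a word like travel keeps a small but non-zero probability, exactly the "fuzzy" inclusion of related words described above.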

2:48

You can also see, as a very special case, that if the probability mass is concentrated entirely on just one word, say sports, this representation degenerates to the simple representation of a topic as just one word.

3:04

But as a distribution, this topic representation can, in general, involve many words to describe a topic, and can model subtle differences in the semantics of a topic. Similarly, we can model Travel and Science with their respective distributions. In the distribution for Travel we see top words like attraction, trip, flight, etc.

3:31

Whereas in Science we see scientist, spaceship, telescope, genomics, and other science-related terms. Now, that doesn't mean sports-related terms will necessarily have zero probabilities for Science. In general, we can imagine all of these words having non-zero probabilities; it's just that, for a particular topic, some words have very, very small probabilities.

3:58

Now you can also see there are some words that are shared by these topics. When I say shared, it just means that even with some probability threshold, you can still see one word occurring in multiple topics. In this case I mark them in black. So you can see travel, for example, occurs in all three topics here, but with different probabilities. It has the highest probability for the Travel topic, 0.05, but much smaller probabilities for Sports and Science, which makes sense. And similarly, you can see that star also occurs in Sports and Science with reasonably high probabilities, because it might actually be related to those two topics.

Â So with this replantation it addresses the three problems that I mentioned earlier.

Â First, it now uses multiple words to describe a topic.

Â So it allows us to describe a fairly complicated topics.

Â Second, it assigns weights to terms.

Â So now we can model several differences of semantics.

Â And you can bring in related words together to model a topic.

Â Third, because we have probabilities for the same word in different topics,

Â we can disintegrate the sense of word.

Â In the text to decode it's underlying topic,

Â to address all these three problems with this new way of representing a topic.

Â So now of course our problem definition has been refined just slightly.

Â The slight is very similar to what you've seen before except we have

Â added refinement for what our topic is.

Â Now each topic is word distribution, and for each word distribution we know

Â that all the probabilities should sum to one with all the words in the vocabulary.

Â So you see a constraint here.

Â And we still have another constraint on the topic coverage, namely pis.

Â So all the Pi sub ij's must sum to one for the same document.
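These two constraints can be sketched as a quick validity check. The topic names and all numbers below are hypothetical placeholders:

```python
# Each topic theta_i is a word distribution whose probabilities sum to one.
# Each document has coverage weights pi_ij over the topics, also summing to one.
topics = {  # the theta_i's (illustrative values)
    "sports": {"game": 0.6, "star": 0.3, "travel": 0.1},
    "travel": {"trip": 0.5, "flight": 0.3, "travel": 0.2},
}
coverage = {  # the pi_ij's: one row of topic weights per document
    "doc1": {"sports": 0.7, "travel": 0.3},
    "doc2": {"sports": 0.1, "travel": 0.9},
}

# First constraint: each word distribution sums to one over the vocabulary.
for name, theta in topics.items():
    assert abs(sum(theta.values()) - 1.0) < 1e-9, name

# Second constraint: each document's topic coverage sums to one.
for doc, pi in coverage.items():
    assert abs(sum(pi.values()) - 1.0) < 1e-9, doc

print("both constraints hold")
```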

5:59

So how do we solve this problem? Well, let's look at it as a computation problem, so we clearly specify its input and output, as illustrated here on this slide. The input, of course, is our text data: C is our collection, but we also generally assume we know the number of topics, k, or we hypothesize a number and then try to mine k topics, even though we don't know the exact topics that exist in the collection. And V is the vocabulary, the set of words that determines which units will be treated as the basic units for analysis. In most cases we'll use words as the basis for analysis, and that means each word is a unit.

7:18

Now, of course, there may be many different ways of solving this problem. In theory, you can write a [INAUDIBLE] program to solve this problem, but here we're going to introduce a general way of solving it called a generative model. This is, in fact, a very general idea, and it's a principled way of using statistical modeling to solve text mining problems. Here I dimmed the picture that you have seen before in order to show the generation process.

So the idea of this approach is to first design a model for our data: we design a probabilistic model of how the data are generated. Of course, this is based on our assumptions; the actual data aren't necessarily generated this way. That gives us a probability distribution of the data, shown on this slide, given a particular model and parameters, denoted by lambda. So lambda is a set that consists of all the parameters we're interested in. These parameters in general control the behavior of the probabilistic model, meaning that if you set the parameters to different values, the model will give some data points higher probabilities than others.

Â Now in this case of course, for our text mining problem or

Â more precisely topic mining problem we have the following plans.

Â First of all we have theta i's which is a word distribution snd then we have

Â a set of pis for each document.

Â And since we have n documents, so we have n sets of pis, and each set the pi up.

Â The pi values will sum to one.

Â So this is to say that we first would pretend we already

Â have these word distributions and the coverage numbers.

Â And then we can see how we can generate data by using such distributions.

Â So how do we model the data in this way?

Â And we assume that the data are actual symbols

Â drawn from such a model that depends on these parameters.
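Under this generative assumption, each word of a document would be drawn by first picking a topic according to the document's pi weights and then picking a word from that topic's theta. Here is a hedged sketch with made-up distributions, just to make the sampling process concrete:

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

thetas = {  # topic word distributions (illustrative values)
    "sports": (["game", "star", "play"], [0.6, 0.3, 0.1]),
    "science": (["scientist", "telescope", "star"], [0.5, 0.3, 0.2]),
}
pi = (["sports", "science"], [0.8, 0.2])  # coverage weights for one document

def generate_word():
    # 1) choose a topic according to the document's coverage pi
    topic = random.choices(pi[0], weights=pi[1])[0]
    # 2) choose a word from that topic's word distribution theta
    words, probs = thetas[topic]
    return random.choices(words, weights=probs)[0]

# A pseudo-document of 10 words drawn from the assumed model.
doc = [generate_word() for _ in range(10)]
print(doc)
```

This is only the assumed data-generation story; fitting the model means running this story in reverse, from observed words back to the parameters.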

Â Now one interesting question here is to

Â 9:32

think about how many parameters are there in total?

Â Now obviously we can already see n multiplied by K parameters.

Â For pi's.

Â We also see k theta i's.

Â But each theta i is actually a set of probability values, right?

Â It's a distribution of words.

Â So I leave this as an exercise for

Â you to figure out exactly how many parameters there are here.
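As a starting point for that exercise, here is one way to count the raw probability values; n, k, and the vocabulary size V below are made-up numbers, and remember that the sum-to-one constraints make some of these values redundant:

```python
n, k, V = 1000, 10, 20000   # documents, topics, vocabulary size (hypothetical)

pi_values = n * k           # one coverage value per document-topic pair
theta_values = k * V        # one probability per topic-word pair
total_values = pi_values + theta_values
print(total_values)         # raw count, before accounting for constraints

# Each of the n coverage distributions and each of the k word distributions
# must sum to one, so one value in each is determined by the others,
# and the number of *free* parameters is smaller:
free = n * (k - 1) + k * (V - 1)
print(free)
```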

Now, once we set up the model, we can fit the model to our data, meaning that we can estimate the parameters, or infer the parameters, based on the data. In other words, we would like to adjust these parameter values until we give our data set the maximum probability. As I just said, depending on the parameter values, some data points will have higher probabilities than others. What we're interested in here is: what parameter values will give our data set the highest probability?

I also illustrate the problem with the picture that you see here. On the x-axis I illustrate lambda, the parameters, as a one-dimensional variable. That's an oversimplification, obviously, but it suffices to show the idea. The y-axis shows the probability of the observed data. This probability obviously depends on the setting of lambda, and that's why it varies as you change the value of lambda. What we're interested in here is finding the lambda star that maximizes this probability.
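The search for lambda star can be sketched on a toy one-parameter model: a vocabulary of two words, where lambda is the probability of the first word. A simple grid search, with hypothetical word counts, finds the setting that gives the observed data the highest (log-)likelihood:

```python
import math

# Hypothetical observed counts for a two-word vocabulary {"the", "data"}.
count_the, count_data = 70, 30

def log_likelihood(lam):
    # p("the") = lam, p("data") = 1 - lam
    return count_the * math.log(lam) + count_data * math.log(1 - lam)

# Grid search: evaluate the likelihood at many settings of lambda
# and keep the one that gives the data the highest probability.
grid = [i / 1000 for i in range(1, 1000)]
lambda_star = max(grid, key=log_likelihood)
print(lambda_star)  # 0.7, i.e. 70 / (70 + 30)
```

For this toy model the maximum-likelihood estimate is just the observed word frequency; real topic models need cleverer estimation, but the objective, maximize the probability of the observed data over lambda, is the same.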

11:10

So this would then be our estimate of the parameters. And note that these parameters are precisely what we hoped to discover from the text data. So we treat these parameters as the outcome, or the output, of the data mining algorithm.

So this is the general idea of using a generative model for text mining. First, we design a model with some parameters and fit it to the data as well as we can. After we have fit the data, we recover some specific parameter values, and those are the output of the algorithm; we treat them as the knowledge discovered from the text data. By varying the model, of course, we can discover different knowledge.

Â So to summarize, we introduced a new way of representing topic,

Â namely representing as word distribution and this has the advantage of using

Â multiple words to describe a complicated topic.It also allow us to assign

Â weights on words so we have more than several variations of semantics.

Â We talked about the task of topic mining, and answers.

Â When we define a topic as distribution.

Â So the importer is a clashing of text articles and a number of topics and

Â a vocabulary set and the output is a set of topics.

Â Each is a word distribution and

Â also the coverage of all the topics in each document.

Â And these are formally represented by theta i's and pi i's.

Â And we have two constraints here for these parameters.

Â The first is the constraints on the worded distributions.

Â In each worded distribution the probability of all the words

Â must sum to 1, all the words in the vocabulary.

Â The second constraint is on the topic coverage in each document.

Â A document is not allowed to recover a topic outside of the set of topics that

Â we are discovering.

Â So, the coverage of each of these k topics would sum to one for a document.

Â We also introduce a general idea of using a generative model for text mining.

Â And the idea here is, first we're design a model to model the generation of data.

Â We simply assume that they are generative in this way.

Â And inside the model we embed some parameters that we're interested in

Â denoted by lambda.
