0:13

In this lecture we're going to continue talking about how to do text categorization and cover discriminative approaches. This is a slide that you have seen in the discussion of the Naive Bayes classifier, where we showed that although the Naive Bayes classifier tries to model the generation of text data from each category, we can actually use Bayes' rule to eventually rewrite the scoring function as you see on this slide. And this scoring function is basically a weighted combination of many word features, where the feature values are word counts, and the feature weights are the logs of the probability ratios of each word under the two category distributions.
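As a concrete sketch, that scoring function can be computed as a weighted sum of word counts. The word distributions below are made-up toy values, not from the lecture:

```python
import math

def nb_score(doc_counts, p_w_theta1, p_w_theta2):
    """Score a document as a weighted combination of word-count features.
    Each weight is the log ratio of the word's probability under the two
    category distributions (toy inputs; real ones come from training)."""
    score = 0.0
    for word, count in doc_counts.items():
        score += count * math.log(p_w_theta1[word] / p_w_theta2[word])
    return score  # a positive score favors theta_1, a negative one theta_2

# hypothetical word distributions for the two categories
p1 = {"good": 0.5, "bad": 0.1, "movie": 0.4}
p2 = {"good": 0.1, "bad": 0.5, "movie": 0.4}
print(nb_score({"good": 2, "movie": 1}, p1, p2))
```

Note that the score is linear in the counts, which is exactly why it can be read as a feature-vector model.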

0:57

Now this kind of scoring function can actually be a general scoring function, where we represent text data as a feature vector. Of course, the features don't all have to be words; the features can be other signals that we want to use. And we mentioned that this is precisely similar to logistic regression. So in this lecture we're going to introduce some discriminative classifiers. They try to model the conditional distribution of labels given the data directly, rather than using Bayes' rule to compute that indirectly, as we have seen in Naive Bayes. So the general idea of logistic regression is to model the dependency of a binary response variable Y on some predictors that are denoted as X. So here we have also changed the notation to X for feature values.

2:13

And here we use the notation of an X vector, which is more common when we introduce such discriminative algorithms. So X is our input: it's a vector with n features, and each feature has a value x sub i here. And we're going to model the dependency of this binary response variable on these features. So in our categorization problem, where we have two categories theta 1 and theta 2, we can use the Y value to denote the two categories: when Y is 1, it means the category of the document is the first class, theta 1. Now, the goal here is to model the conditional probability of Y given X directly, as opposed to modeling the generation of X and Y as in the case of Naive Bayes. And another advantage of this kind of approach is that it allows many features other than words to be used in this vector, since we're not modeling the generation of this vector, and we can plug in any signals that we want. So this is potentially advantageous for doing text categorization. More specifically, in logistic regression, we assume the functional form of Y depending on X is the following. And this is very closely related to the log odds that I introduced in Naive Bayes, or the log of the probability ratio of the two categories that you have seen on the previous slide.

5:22

And so in logistic regression, we're basically assuming that the probability of Y equals 1 given X depends on this linear combination of all these features. So this is just one of many possible ways of modeling the dependency, but this particular form has been quite useful and it also has some nice properties.
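Written out for reference (the slide itself is not shown here, so treat the exact symbols as my transcription), the assumed form is that the log odds are linear in the features:

```latex
\log \frac{P(Y = 1 \mid \mathbf{x})}{1 - P(Y = 1 \mid \mathbf{x})}
  = \beta_0 + \sum_{i=1}^{n} \beta_i x_i
```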

5:47

So if we rewrite this equation to actually express the probability of Y given X in terms of X, by getting rid of the logarithm, we get this functional form, and this is called the logistic function. It's a transformation of X into Y, as you see.
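A minimal sketch of this logistic function in code, assuming the parameter list holds a bias weight beta_0 followed by one weight per feature:

```python
import math

def logistic(x, betas):
    """P(Y=1|x) = 1 / (1 + exp(-(beta_0 + sum_i beta_i * x_i))).
    betas[0] is the intercept; betas[1:] pair up with the features."""
    s = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
    return 1.0 / (1.0 + math.exp(-s))

print(logistic([0.0, 0.0], [0.0, 1.0, 1.0]))  # 0.5 when the linear score is 0
```

The output is always between 0 and 1, which is what lets us interpret it as a probability.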

6:41

So as in all cases of modeling, we would be interested in estimating the parameters. And in fact, in all of these machine learning approaches, once you set up the model and set up an objective function to optimize, the next step is to compute the parameter values. In general, we're going to adjust these parameter values to optimize the performance of the classifier on the training data. So in our case, let's assume we have training data here, pairs of X i and Y i, where each pair is basically a feature vector X and a known label Y for that X, and Y is either 1 or 0. So in our case we are interested in maximizing this conditional likelihood.

7:31

The conditional likelihood here is basically to model Y given the observed X. So it's not modeling X itself; rather, we're going to model Y conditioned on X. Note that this is a conditional probability of Y given X, and this is also precisely what we want for classification. Now, the likelihood function is just a product over all the training cases, and in each case, this is the probability of observing that particular training case under the model. So given a particular X i, how likely are we to observe the corresponding Y i? Of course, Y i could be 1 or 0, and in fact, the functional form here varies depending on whether Y i is 1 or 0. If it's 1, it takes this form, and that's basically the logistic regression function. But what about if it's 0? Well, if it's 0, then we have to use a different form, and that's this one.
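Putting the two cases together, the conditional likelihood described above can be sketched as a log-likelihood computation (using logs so the product over training cases becomes a sum; the data pairs and beta values here are hypothetical):

```python
import math

def log_likelihood(data, betas):
    """Conditional log-likelihood of labeled pairs (x_i, y_i):
    add log p(Y=1|x) when y is 1, and log (1 - p(Y=1|x)) when y is 0."""
    ll = 0.0
    for x, y in data:
        s = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
        p = 1.0 / (1.0 + math.exp(-s))
        ll += math.log(p) if y == 1 else math.log(1.0 - p)
    return ll
```

With all betas set to zero, every prediction is 0.5, so each training case contributes log(0.5); training moves the betas to make the observed labels more probable than that.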

8:55

And you can easily see this. Now the key point here is that the functional form depends on the observed Y i: when it's 1, the form is different than when it's 0. And if you think about it, when we want to maximize this probability, we basically want this probability to be as high as possible when the label is 1, that is, when the document is in category 1. But if the document is not, we're going to maximize this other value, and what happens is that this actually makes the first value as small as possible, because the two sum to 1: maximizing one is equivalent to minimizing the other.

10:00

So as on other occasions, when you compute the maximum likelihood estimate, basically you'll find a set of beta values that maximize this conditional likelihood.

10:12

And this, again, gives us a standard optimization problem. In this case, it can also be solved in many ways. Newton's method is a popular way to solve this problem, and there are other methods as well. But in the end, we will get a set of beta values. Once we have the beta values, then we have a scoring function to help us classify a document.
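The lecture points to Newton's method; as a simpler illustration of the same optimization, here is a plain gradient-ascent sketch on the conditional log-likelihood (the learning rate and step count are arbitrary choices, not from the lecture):

```python
import math

def fit_logistic(data, lr=0.1, steps=1000):
    """Fit beta values by gradient ascent on the conditional
    log-likelihood. The gradient for each weight is the sum over
    training cases of (y - p) times the corresponding feature."""
    n = len(data[0][0])
    betas = [0.0] * (n + 1)  # betas[0] is the intercept
    for _ in range(steps):
        grad = [0.0] * (n + 1)
        for x, y in data:
            s = betas[0] + sum(b * xi for b, xi in zip(betas[1:], x))
            p = 1.0 / (1.0 + math.exp(-s))
            grad[0] += y - p
            for j, xj in enumerate(x):
                grad[j + 1] += (y - p) * xj
        betas = [b + lr * g for b, g in zip(betas, grad)]
    return betas
```

Newton's method would also use second derivatives and converge in far fewer iterations, but it arrives at the same maximizer of the conditional likelihood.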

10:39

So what's the function? Well, it's this one. See, once we have all the beta values, they are all known. All we need is to compute the x i values for the document and then plug them in. That will give us an estimated probability that the document is in category 1.

10:59

Okay, so much for logistic regression. Let's also introduce another discriminative classifier called K-Nearest Neighbors. Now in general, I should say there are many such approaches, and a thorough introduction to all of them is clearly beyond the scope of this course; you should take a machine learning course or read more about machine learning to learn about them. Here, I just want to include a basic introduction to some of the most commonly used classifiers, since you might use them often for text categorization. So the second classifier is called K-Nearest Neighbors. In this approach, we're also going to estimate the conditional probability of the label given the data, but in a very different way. The idea is to keep all the training examples, and then once we see a text object that we want to classify, we're going to find the K examples in the training set that are most similar to this text object. Basically, this is to find the neighbors of this text object in the training data set. So once we've found the neighborhood, that is, the objects that are close to the object we are interested in classifying, and let's say we have found the K nearest neighbors (that's why this method is called K-Nearest Neighbors), then we're going to assign the category that's most common among these neighbors. Basically, we're going to allow these neighbors to vote for the category of the object that we're interested in classifying.
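The find-the-neighbors-and-vote idea can be sketched as follows, assuming cosine similarity as the similarity function (the lecture leaves that choice open) and simple numeric feature vectors with string labels:

```python
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(x, training, k):
    """Find the k training examples most similar to x and let them
    vote; training is a list of (feature_vector, label) pairs."""
    neighbors = sorted(training, key=lambda ex: cosine(x, ex[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Note that there is no training step at all: the "model" is simply the stored labeled examples plus the similarity function.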

12:43

This approach can also be improved by considering the distance of a neighbor from the current object. Basically, we can assume a close neighbor should have more say about the category of the object, so we can give such a neighbor more influence in the vote, and we can discount the votes of more distant neighbors based on the distances.
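One simple way to realize this weighting, sketched here, is to let each neighbor's similarity score serve as its vote weight (the similarity function is passed in as a parameter, so any choice works):

```python
def knn_classify_weighted(x, training, k, sim):
    """Distance-weighted K-NN: a closer neighbor contributes a larger
    vote, using its similarity to x as the vote weight."""
    neighbors = sorted(training, key=lambda ex: sim(x, ex[0]), reverse=True)[:k]
    votes = {}
    for vec, label in neighbors:
        votes[label] = votes.get(label, 0.0) + sim(x, vec)
    return max(votes, key=votes.get)
```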

13:06

But the general idea is to look at the neighborhood, and then try to assess the category based on the categories of the neighbors. Intuitively, this makes a lot of sense. But mathematically, this can also be regarded as a way to directly estimate the conditional probability of the label given the data, that is, p of Y given X.

13:28

Now I'm going to explain this intuition in a moment, but before we proceed, let me emphasize that we do need a similarity function here in order for this to work. Note that in the Naive Bayes classifier, we did not need a similarity function, and in logistic regression, we did not talk about a similarity function either, but here we explicitly require one. Now this similarity function is actually a good opportunity for us to inject our insights about the features. Basically, effective features are those that make objects in the same category look more similar while distinguishing objects in different categories. So the design of this similarity function is closely tied to the design of the features in logistic regression and other classifiers. So let's illustrate how K-NN works. Suppose we have a lot of training instances here, and I've colored them differently to show the different categories. Now suppose we have a new object in the center that we want to classify. According to this approach, we start by finding the neighbors. Now, let's first think of a special case of finding just one neighbor, the closest neighbor.

14:53

Now in this case, let's assume the closest neighbor is the box filled with diamonds. So then we're going to say, well, since this object is in the diamond category, let's say, we're going to assign the same category to our text object. But let's also look at another possibility of finding a larger neighborhood, so let's think about the four nearest neighbors.

15:26

In this case, we're going to include a lot of other solid filled boxes in red or pink, right? So now we're going to notice that among the four neighbors, there are three neighbors in a different category. So if we take a vote, then we'll conclude the object is actually of a different category. So this illustrates both how K-Nearest Neighbors works and some potential problems of this classifier. Basically, the result might depend on K, and indeed, K is an important parameter to optimize. Now, you can intuitively imagine that if we have a lot of neighbors around this object, then we'd be okay, because we'd have a lot of neighbors to help us decide the category. But if we have only a few, then the decision may not be reliable. So on the one hand, we want to find more neighbors, right? Then we have more votes. But on the other hand, as we try to find more neighbors, we risk getting neighbors that are not really similar to this instance; they might actually be far away. So although you get more neighbors, those neighbors aren't necessarily so helpful, because they are not very similar to the object. So the parameter K still has to be set empirically, and typically you can optimize such a parameter by using cross-validation. Basically, you're going to separate your training data into two parts, and then you're going to use one part to help you choose the parameter K here, or some other parameter in other classifiers. And then you're going to assume the value that works well on your training data will actually also be the best for your future data.
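The two-part split just described can be sketched as a simple hold-out validation; a full cross-validation would average accuracy over several splits, but this shows the mechanism (the `classify` argument stands for any K-NN classifier of the form used here):

```python
def choose_k(training, candidate_ks, classify):
    """Pick K by hold-out validation: split the labeled data in two,
    use the first half as the neighbor pool, and keep whichever K
    labels the held-out half most accurately."""
    split = len(training) // 2
    train, held_out = training[:split], training[split:]
    best_k, best_acc = candidate_ks[0], -1.0
    for k in candidate_ks:
        correct = sum(1 for x, y in held_out if classify(x, train, k) == y)
        acc = correct / len(held_out)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k
```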

17:23

So as I mentioned, K-NN can actually be regarded as estimating the conditional probability of Y given X, and that's why we put it in the category of discriminative approaches. So the key assumption that we make in this approach is that the conditional distribution of the label given the document, for example the probability of theta i given document d, is locally smooth. That just means we're going to assume that this probability is the same for all the documents in this region R here. Suppose we draw a neighborhood; since the data instances in this neighborhood are very similar, we're going to assume that the conditional distribution of the label given the data is roughly the same within it. That is, if two documents are very similar, then the probability of theta i given d would also be similar. So that's a key assumption, and it's the assumption that allows us to use all of this machinery. But in reality, whether this is true would of course depend on how we define similarity, because the neighborhood is largely determined by our similarity function. If our similarity function groups together objects that do follow similar distributions, then this assumption is okay; but if our similarity function cannot capture that, obviously this assumption would be a problem, and then the classifier would not be accurate.

18:59

Okay, let's proceed with this assumption. Then what we are saying is that, in order to estimate the probability of a category given a document, we can try to estimate the probability of the category given the entire region. Now, this has the benefit, of course, of bringing in additional data points to help us estimate this probability.

19:22

And this is precisely the idea of K-NN. Basically, we can now use the known categories of all the documents in this region to estimate this probability. And I have even given a formula here, where you can see we just count the documents of each category in this region and then normalize by the total number of documents in the region. So the numerator that you see here, c of theta i and R, is the count of documents in region R with category theta i. Since these are training documents, we know their categories, so we can simply count how many documents of each category appear in the region. And the denominator is just the total number of training documents in this region. So this gives us a rough estimate of which category is most popular in this neighborhood, and we're going to assign the most popular category to our data object, since it falls into this region.
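This counting estimate, c(theta i, R) divided by the total number of training documents in R, can be sketched directly (the category names are hypothetical):

```python
from collections import Counter

def estimate_category_probs(region_labels):
    """Estimate p(theta_i | R) for each category by counting the
    labeled training documents in region R and normalizing by the
    total number of documents in the region."""
    counts = Counter(region_labels)
    total = len(region_labels)
    return {cat: c / total for cat, c in counts.items()}

# e.g. 3 of the 4 neighbors in the region are sports documents
print(estimate_category_probs(["sports", "sports", "politics", "sports"]))
```

Picking the category with the highest estimated probability is then exactly the majority vote among the K nearest neighbors.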

[MUSIC]