0:02

In this video, we will discuss how to adapt linear methods for classification problems.

And let's start with the simplest classification problem, binary classification.

Here we have only two values for the target: minus one and one,

negative class and positive class. So, essentially,

the linear model calculates the dot product between w,

the weight vector, and x, the feature vector.

This dot product is a real value,

and we should somehow transform it to minus one or one,

and to do that, we can just take the sign of the dot product.

So the linear classifier looks like the sign of w transposed x.

It has d parameters, one for each feature.

And if you remember,

we agreed that there is also a constant feature that has a value of one on every example,

so we don't have to explicitly include a bias in our model:

the coefficient of this constant feature will be the bias.

So actually, there are d+1 parameters in this model,

and geometrically it looks like this.
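As a sketch, the prediction rule above can be written in a few lines of Python. The weight values here are made up for illustration; the constant feature is appended to handle the bias, so a problem with d = 2 features has 3 parameters.

```python
import numpy as np

def predict_binary(w, x):
    """Binary linear classifier: the sign of the dot product w^T x.

    The bias is handled by appending a constant feature equal to 1,
    so w has d + 1 components.
    """
    x = np.append(x, 1.0)                 # constant feature for the bias
    return 1 if np.dot(w, x) >= 0 else -1

# Hypothetical weights for a 2-feature problem (d = 2, so 3 parameters).
w = np.array([1.0, -2.0, 0.5])
print(predict_binary(w, np.array([3.0, 1.0])))  # dot product 3 - 2 + 0.5 > 0, so class 1
```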

Suppose that we have two features,

so the axes on this graph correspond to our features,

and we denote the negative class by red points and the positive class by blue points.

The linear model tries to find some line that separates the blue points from the red points.

And as we know from geometry,

the sign of the dot product indicates on which side of the line the point lies.

So, if we have a positive dot product, then the point lies on the positive side of this line,

and if the product is negative, then the point lies on the negative side of our line.

Okay.

Let's switch to the multi-class classification problem with K classes, 1 to K. In this case,

we should use some more complicated techniques to build our classifier.

One of the most popular approaches is to build a separate classifier for each class.

So, for example, for the first class,

we'll have a linear model, a linear classifier, that separates points of the first class from all other points.

So essentially, we try to fit a model so that points of the first class lie on the positive side of this hyperplane,

and points from all other classes lie on the negative side of this hyperplane.

And the dot product of this model is essentially a score:

the higher the score, the more confident the model is that this point belongs to the first class.

Then we build such a model for every class, so we have K linear models,

and each model calculates a score,

and then we assign a new example to the class that has the largest score, the class with the highest confidence.

For example, if we have three classes,

and our score vector looks like (7, -7.5, 10),

then we assign our example to the third class,

because the third component of the score vector is the largest.
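This one-vs-all scheme is a small computation: K dot products followed by an argmax. A minimal sketch, where the weight matrix W is a stand-in for the K trained classifiers:

```python
import numpy as np

def predict_multiclass(W, x):
    """One-vs-all prediction: K linear models, one per class.

    W is a (K, d) matrix whose k-th row is the weight vector of the
    classifier for class k; we return the class with the largest score.
    """
    scores = W @ x                  # K dot products, one score per class
    return int(np.argmax(scores))   # index of the most confident class

# With the score vector from the example (classes indexed from 0 here):
scores = np.array([7.0, -7.5, 10.0])
print(int(np.argmax(scores)))  # 2 -- the third class has the largest score
```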

Okay. Now we have the model,

and we should somehow learn it, so we need a loss function.

Let's start with the simplest loss function,

accuracy loss, and to define it,

we'll need some notation: Iverson brackets.

These are just basic square brackets,

and we write some logical statement inside the brackets.

If the statement is true, then the value of the brackets is one,

and if the statement is false, then the value of the brackets is zero.

So, now let's define the accuracy metric.

Let's take an example xi, find the prediction a of xi,

compare it to the true class yi,

and write the equality of the predicted class and the true class in Iverson brackets.

So, the value of the bracket will be one if we guessed the class correctly,

and zero if we misclassified the point.

And then we just average these brackets over all our data points, over our whole training set.

So, essentially, accuracy is just the ratio of correctly classified points in our training set.
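In code, this average of Iverson brackets is a one-liner; a sketch with made-up labels:

```python
import numpy as np

def accuracy(predictions, targets):
    """Accuracy: the average of the Iverson brackets [a(x_i) == y_i].

    (predictions == targets) is an array of booleans (the brackets:
    1 for a correct guess, 0 for a mistake); its mean is the ratio
    of correctly classified points.
    """
    return np.mean(predictions == targets)

# Toy example: 3 of 4 points classified correctly.
print(accuracy(np.array([1, -1, 1, 1]), np.array([1, -1, -1, 1])))  # 0.75
```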

This metric is good and can be easily interpreted,

but it has two large disadvantages.

First, we'll learn in the next videos that we need a gradient to optimize our loss function effectively,

and accuracy doesn't have gradients with respect to the model parameters,

so we cannot optimize it; we cannot train the model with the accuracy score.

Second, this metric doesn't take into account the confidence of our model in its prediction.

So actually, we have the dot product of the weight vector and the feature vector, w and x,

and the larger the score, the more confident the model is in its prediction.

If this dot product has a positive sign and a large value, then the model is confident.

But if the sign is positive and the value is close to zero, then the model is not confident.

And we want not only a model that makes correct decisions, that guesses the classes correctly, but a confident model,

and it's known from machine learning that models with high confidence generalize better.

Okay. Accuracy doesn't fit, so we need some other loss function.

Maybe we can use mean squared error.

Suppose that we have some example, xi,

and it belongs to the positive class, to class one,

and consider the squared loss on this example.

We take the dot product between w and x, compare it to one, and take the square of this difference.

So, if our model predicts one, then the guess is correct and the loss is zero.

If our model gives a prediction between zero and one,

then it is not confident in its decision, and we penalize it for low confidence.

If the model gives a value lower than zero, then it misclassifies this point,

so we give it an even larger penalty.

That's okay, but if the model predicts a value larger than one, then we penalize it too.

We penalize it for high confidence, and that's not very good:

we should give a small or zero loss for high-confidence decisions.

Okay. So we can just take one branch of our squared loss,

penalize for low confidence and for misclassification, and give zero loss for high confidence.

Actually, there are many loss functions that look like this one,

and each of them leads to its own classification method.
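One concrete loss with this shape (a sketch; the video does not pin down a specific formula) is the one-sided squared loss on the margin m = y * w^T x: zero for confident correct predictions, a growing penalty for low confidence and for mistakes.

```python
def one_sided_squared_loss(score, y):
    """One branch of the squared loss, written on the margin m = y * score.

    m >= 1: confident and correct, zero loss.
    0 <= m < 1: correct but not confident, small penalty.
    m < 0: misclassified, even larger penalty.
    """
    margin = y * score
    return max(0.0, 1.0 - margin) ** 2

print(one_sided_squared_loss(1.5, 1))   # 0.0  -- confident and correct
print(one_sided_squared_loss(0.5, 1))   # 0.25 -- correct but not confident
print(one_sided_squared_loss(-1.0, 1))  # 4.0  -- misclassified
```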

We'll discuss one of the most important methods for us, logistic regression.

And to talk about it, we should first find a way to

convert the scores from our linear classifiers to probabilities, to a distribution.

So, we have some vector of scores z, whose components are w transposed x,

so these are the scores for each of our classes.

Dot products can have any sign and any magnitude,

so we cannot interpret them as a probability distribution, and we should somehow change that.

We'll do it in two steps.

In the first step, we take the first component of our vector and take e to the power of this component.

We do the same to the second component, et cetera, up to the last component.

So, after this step, we have a vector e to the power of z that has only positive coordinates.

Now we only need to normalize these components to get a distribution.

To do that, we just sum all the components of this e-to-the-z vector and divide each component by the sum.

After that, we get a vector sigma of z that is normalized and has only non-negative components,

so we can interpret it as a probability distribution.

This transform is called the softmax function, the softmax transform.

Consider an example with three classes, with scores (7, -7.5, 10).

If we apply the softmax transform to this vector,

then we get the vector sigma of z with components 0.05, approximately zero, and 0.95.

So the third component was the largest before the transform,

and it has the largest probability after the softmax transform.
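The two steps, exponentiate then normalize, can be sketched directly (subtracting the maximum score first is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def softmax(z):
    """Softmax transform: e to the power of each score, then normalize."""
    e = np.exp(z - np.max(z))  # shift by max(z) for numerical stability
    return e / e.sum()

p = softmax(np.array([7.0, -7.5, 10.0]))
print(np.round(p, 2))  # approximately [0.05, 0.00, 0.95]
```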

Okay. Now we have an approach to transform our scores to probabilities;

these are the predicted probabilities of the classes.

And now we need some target vector, the vector that we want our probabilities to be equal to.

Of course, we want the probability of the true class to be one,

and the probabilities of all other classes to be equal to zero.

So, we form a target vector b,

which is just a binary vector of size K, where K is the number of classes;

it has a one in the component that corresponds to the true class of the current example,

and zeros in all other coordinates.

Now, we have the target vector b and the vector of predicted class probabilities sigma of z,

and we should somehow measure the distance between these probability distributions.

To do that, we can use cross entropy.

Essentially, cross entropy is just minus the log of the predicted probability of the true class.

We can also write it as minus the sum over k of the indicator that our class y equals k,

multiplied by the log of the predicted probability of class k. Let's look at some examples of cross entropy.

Suppose that we have three classes,

and our example belongs to the first class, so y equals one.

Suppose that we have some model that predicts a probability of one for the first class,

and zero probabilities for the second and third classes.

This model makes the correct guess, and the cross entropy is zero,

because it's a perfect model for us.

If we have a model that predicts 0.5 for the first class and 0.25 for the two other classes,

then the cross entropy is approximately 0.7, so there is some loss here.

But if we have a model that assigns a probability of one to the second class and zero probability to the first class,

then the cross entropy equals plus infinity, because we multiply one by the logarithm of zero.

So, cross entropy gives a very high penalty to models that are confident in wrong decisions.
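The three examples above can be checked numerically with a minimal sketch (classes indexed from 0 here, so the true class y = 1 becomes index 0):

```python
import numpy as np

def cross_entropy(p, y):
    """Cross entropy: minus the log of the predicted probability of the
    true class y. Equals -sum_k [y == k] * log(p_k)."""
    with np.errstate(divide="ignore"):  # log(0) -> -inf without a warning
        return -np.log(p[y])

print(cross_entropy(np.array([1.0, 0.0, 0.0]), 0))    # zero -- perfect model
print(cross_entropy(np.array([0.5, 0.25, 0.25]), 0))  # ~0.69, some loss
print(cross_entropy(np.array([0.0, 1.0, 0.0]), 0))    # inf -- confident and wrong
```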

Okay. Now we can just sum the cross entropies over all examples from our training set,

and that will be our loss function.

It's quite complicated, and we cannot find an analytical solution for this problem,

so we need some numerical method to optimize it, and we'll discuss such methods in the following videos.

So in this video, we discussed how to apply linear models to classification problems,

both binary classification and multi-class classification,

and discussed what the loss for classification problems should look like.

One of the most important methods for learning linear classifiers is logistic regression,

and we discussed what its loss looks like.

In the next video, we'll talk about gradient descent,

a numerical method that optimizes any differentiable loss function.
