Hi there, I'm Pavel. Today we're going to discuss how linear classifiers handle large data sets. Let me present a simple classification task with a picture. Say you have dots of two different colors, and you need to separate them with a linear surface. Predicting which color a dot belongs to, the first or the second, can stand for many real tasks. For example, inferring the gender of a visitor to your site, so you can show ads that interest women or men. Or recognizing whether a picture shows a kitten or a dog, which may not be important, but is extremely exciting.

Here is another example: you work in a financial company that gives out loans. Before you give a loan to somebody, you first need to assess their ability to pay it back. Your company must be sure that the client is not a fraudster. In this video, I will teach you how to solve the problem of credit risk assessment. You will also learn how to use logistic regression on large data sets and how to estimate its accuracy.

There are thousands of banks worldwide that make money by giving loans and issuing credit cards to people and companies. We have a data set containing information on how people pay loans back. If a credit department fails to forecast correctly whether people will pay back their loans or go into default, the bank will lose a lot of money. On the other hand, if it estimates the credit risks correctly, the bank can make a lot of money. So the main task of the credit department is to forecast the credit risk accurately, namely the probability that a person will not repay the loan.

Let's load the data that we have, the credit risk data. The data has 25 columns, which is too wide to show on one screen, so I will break it into pieces. First, there is general information about each person: the first column is a person identifier, the second one is the credit card limit.
The third column is gender; the fourth provides information about education (the higher the value, the more educated the person); the fifth column shows marital status, single or married; and the sixth gives the age. Next comes the person's credit history with the bank: six columns containing the repayment status, including the latest payments. Some columns have negative figures, which means the person repaid the loan and made an extra payment on the card, overpaying it. If a person has zero or a negative value on their account, this is a good client. If the value is positive, the client owes money to the bank, and this is a warning sign.

The following six columns show the total amount of each person's debt, the billed amount on the credit card over several months. You can see that people have different amounts to pay: one person has 3,000 and another 50,000 per month. We have data for six months, from April to September. The last set of columns shows how the person has been paying the debts over those six months. Someone paid 1,000 rubles, another person paid 2,000, and one person made a payment of 36,000 rubles to the bank.

Finally, the most important column determines whether the person will make their loan payment next month or not: 1 means the person will miss a payment and still owe the money, and 0 means that everything will be fine and there will be no default on the loan.

Let's move to classification. At first glance, the task of a classifier is to predict which class we should expect, 1 or 0. But beyond that, the classifier can answer a second question: what is the predicted probability that someone will not repay their credit line? The probability is more important for financial institutions than just a 0 or 1. Using it, a bank can decide what steps to take next. If the default probability is 0, no steps need to be taken.
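To make the shape of such a table concrete, here is a tiny sketch that parses an inline sample with Python's standard csv module. The column names and the values are illustrative assumptions, not the actual headers or rows of the data set discussed above.

```python
import csv
import io

# A tiny made-up sample in the spirit of the credit-risk table described above.
# Column names and values are illustrative assumptions, not the real data set.
sample = """ID,LIMIT,SEX,EDUCATION,MARRIAGE,AGE,DEFAULT
1,20000,2,2,1,24,1
2,120000,2,2,2,26,0
3,90000,2,2,2,34,0
"""

rows = list(csv.DictReader(io.StringIO(sample)))
# Convert string values to numbers, as we will need to do before training.
people = [{k: int(v) for k, v in row.items()} for row in rows]
print(people[0]["DEFAULT"])  # label: 1 = will miss next payment, 0 = will pay
```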
If the default probability is 30%, perhaps the person is just forgetful and needs a reminder. If the probability is about 80%, the bank needs to talk with the person and ask about the overdue payments. Maybe the bank should offer a lower interest rate on the loan, and then everything will be fine. Finally, if the probability is close to 100%, there is most likely nothing else to do, but the bank can sue to get its money back. This is an unpleasant situation, but lending is a business that implies credit risk for banks.

We can use logistic regression to estimate the probability. It may sound strange that a classifier is for some reason called a regression. The truth is that the main purpose of logistic regression is to estimate a probability. A probability is a continuous quantity, so estimating it can also be called regression.

Let's go back to our first picture: we have two different types of dots, and we try to split them up using a linear surface. The surface separates the 0s from the 1s, the good clients from those who don't pay their loans back. The perpendicular to the surface points in its normal direction. The farther a person is from this surface, the more confidently we can predict the client's ability to pay back. Logistic regression smooths this into a probability, gradually changing from 0, for people who will definitely repay, to 1, for people who will not repay at all.

Let's introduce some notation. We describe our features with a vector x that includes x1, x2, ..., xn, where the xi are the different features. We also append a constant 1 to make the calculations simpler. The target variable takes the value 0 or 1. The weight vector w is at the same time perpendicular to our dividing surface; it holds the weights that should be multiplied by x. By the way, if we multiply x by this weight vector, we get the projection of x onto the vector w.
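The notation above can be made concrete in a few lines of plain Python (a sketch of the idea, not the Spark implementation): the product w·x gives a signed distance to the separating surface, and the sigmoid turns that distance into a probability.

```python
import math

def dot(w, x):
    # Signed distance (up to scaling) of point x from the separating surface
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    # Maps (-inf, +inf) monotonically into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

w = [0.5, -1.2, 0.3]   # example weight vector (last entry pairs with the constant 1)
x = [2.0, 1.0, 1.0]    # example feature vector with the constant term appended

z = dot(w, x)          # positive: the point lies on the "1" side of the surface
p = sigmoid(z)         # so the estimated default probability is above 0.5
print(z, p)
```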
This projection gives us the distance to the separating surface and the side on which our dot is located. The farther the dot is from the surface, the greater the absolute value of the product of w and x will be. If the dot is on the side of the red dots, the value will be positive; if it's on the side of the blue dots, the value will be negative.

We predict the probability using the so-called sigmoid function, which takes the distance from the separating surface as its parameter. Look at the sigmoid function, with the distance to the surface along one axis and the sigmoid value along the other. As the distance tends to minus infinity, the value of the sigmoid tends to 0, and as it tends to plus infinity, the value of the sigmoid tends to 1. This function is very convenient, as it monotonically converts a distance into a probability.

I'm not going to show you the logistic regression error function, because you studied it last week. To find its minimum on large data sets, you can use the second-order method called L-BFGS that you are already familiar with, so let's move on to the solving process.

First we need to prepare our data: all the values that are stored as strings need to be converted into numbers using a SQL transform. Then we gather all the feature values into one feature vector using a vector assembler. We apply the assembler to get the features vector, and then drop the remaining columns using the select method. The new data set should have only two columns: the assembled feature vector and the labels that we will be estimating. Next we split our data into testing and training sets: 30 percent goes to testing and 70 percent to training. Now we train the logistic regression using the fit method. Then we apply the trained model to our data using the transform method. We build predictions for both the test and the training samples; let's see what has happened.
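To show what fit and transform compute under the hood, here is a minimal pure-Python sketch: a logistic regression trained on a toy data set. Note one deliberate simplification: Spark minimizes the logistic loss with L-BFGS, while this sketch uses plain per-sample gradient descent just to keep the code short. The data, learning rate, and epoch count are all illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(X, y, lr=0.5, epochs=500):
    """Train weights by per-sample gradient descent on the logistic loss.

    Spark uses L-BFGS for this minimization; gradient descent is used
    here only to keep the sketch short."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            grad = p - yi  # derivative of the logistic loss w.r.t. the margin
            w = [wj - lr * grad * xj for wj, xj in zip(w, xi)]
    return w

def transform(w, X):
    """Return (rawPrediction, probability, prediction) per row,
    mirroring the three columns Spark's transform method adds."""
    out = []
    for xi in X:
        raw = sum(wj * xj for wj, xj in zip(w, xi))  # signed distance to surface
        p = sigmoid(raw)                             # probability of class 1
        out.append((raw, p, 1 if p > 0.5 else 0))    # threshold at 0.5
    return out

# Toy 2-D data with a constant 1 appended to every feature vector.
X = [[0.0, 0.0, 1.0], [0.2, 0.1, 1.0], [1.0, 1.0, 1.0], [0.9, 1.2, 1.0]]
y = [0, 0, 1, 1]

w = fit(X, y)
preds = [pred for _, _, pred in transform(w, X)]
print(preds)  # the two clusters are linearly separable: [0, 0, 1, 1]
```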
The transform method has added three new columns to our data. The first one is rawPrediction. What is that? It is the distance from our object to the separating surface; the distance has both a positive and a negative direction. We can see that the first value and the second value are equal except for the minus sign.

The second column is the probability, namely the rawPrediction substituted into the sigmoid function. We can see that for each person the probability of paying off the loan is more than 0.5. For the first row the probability equals 0.7, and if we sum the first and the second probabilities in each row, we get 1: the total probability over all classes is always 1.0.

Finally, the last column, the prediction, is very simple to get: we take the probability of the positive class and check whether it is more than 0.5 or less. Let's look at the predictions. For the first five records, the predicted probabilities of default were all less than 0.5, so all the predictions equal 0. According to our algorithm, the first five people on the list will repay their loans. If we compare our predictions with the labels that we originally had, we see that this is not always the case: the people in rows three and four actually didn't repay their loans.

Let's evaluate the quality of this classifier using the binary classification evaluator. With it, we estimate the area under the ROC curve on both the testing and the training samples. The area under the ROC curve is above 72%. Is that bad or good? Look at the ROC curve: this is roughly what 72% of the area under the curve looks like. It's much better than a random prediction, but at the same time it's much worse than a quality of 90% or 95%. In other words, the quality is not perfect, but we can still predict. So in this video, you have learned about credit risk assessment, which is very important for all financial institutions in the world.
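The area under the ROC curve has a simple interpretation: it is the fraction of (defaulter, non-defaulter) pairs in which the defaulter received the higher predicted probability. Here is a minimal pure-Python sketch of that rank-based computation; the toy labels and scores are made up for illustration, and Spark's evaluator computes the same quantity on the real probability column.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. This rank-based definition equals the
    area under the ROC curve."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]               # 1 = actually defaulted
scores = [0.1, 0.4, 0.35, 0.8]      # predicted default probabilities
print(auc(labels, scores))          # 0.75: one of the four pairs is ranked wrongly
```

A perfect ranking gives 1.0, and a random one about 0.5, which is why the 72% in the lecture counts as clearly better than chance but far from excellent.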
In addition, you have learned how to train logistic regression on a large data set and check its performance. That would be all for today. I'm Pavel; come back here and you will never regret it.