Hello everyone, my name is Pavel, and today we'll talk about how machine learning algorithms are trained on big data. What exactly happens when you call the fit function? So far, this topic has remained a black box for us; now we will open this box and see what's inside. My goal is to explain to you the intuition behind the training methods that are used at the moment. In this video, I will talk about learning algorithms and show you the three most popular methods that work with distributed data sets. The first is the analytical solution of the problem, the second is the method of gradient descent, and the third is the method of stochastic gradient descent. Here we go: here's our linear regression, with which we are trying to predict the values of labels for different sets of features. We try to do this in an optimal way, but what does optimal mean? When observing how well the linear regression works, we measure its quality with the root mean square error, RMSE; here it is. To draw the line that makes this error as small as possible, let's find the minimum of the RMSE. However, we cannot do this without math, so it's time to introduce some mathematical notation. Let's label the features x and the labels y; we will mark the number of an example in the training set with an upper index i. x is a vector, and y is a real number, from minus infinity to plus infinity. We want to predict how many bikes will be rented today. The prediction is a linear combination of the input features with some weights, w0 through wn. Let's append a 1 to the vector x and move w0 into the weight vector. Then we can write our equation in a short form: the prediction ŷ is the dot product of the feature vector and the weight vector. We will search for the optimal weights through the minimum of this function, which is called the error function. What kind of function is it? In fact, it's the same mean square error, just without the square root and the division by n, so it has its minimum at the same point as the RMSE.
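The prediction and the error function described above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up bike-rental numbers; the feature names, the weight values, and the data itself are all hypothetical, not from the video:

```python
import numpy as np

# Hypothetical bike-rental data: 3 examples, 2 features each
# (say, temperature and an is_weekend flag).
X = np.array([[20.0, 0.0],
              [25.0, 1.0],
              [15.0, 0.0]])
y = np.array([120.0, 180.0, 80.0])  # bikes rented

# Prepend a column of ones so the bias w0 becomes part of the weight vector.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

w = np.array([10.0, 5.0, 30.0])   # some weight vector [w0, w1, w2]

y_hat = X1 @ w                    # prediction: dot product per example
mse = np.mean((y_hat - y) ** 2)   # the error function (mean square error)
rmse = np.sqrt(mse)               # RMSE has its minimum at the same weights
```

Dropping the square root (and, if you like, the division by n) does not move the minimum, which is why training minimizes the simpler error function instead of the RMSE directly.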
Our task is to find the vector of parameters that attains this minimum. The problem can be solved in several ways. The very first and most obvious method is the analytical solution. Let's show how the error function depends on the parameters in the one-dimensional case: the parameter is plotted along the x axis, and the value of the error function along the y axis. Here is the minimum of the function; at the minimum point, the derivative is zero for each of the coordinates. The derivative can be found analytically, simply by differentiating our function and equating the result to zero; it will have this form. In fact, we have obtained a system of linear equations, which can be solved like a school exercise. Everything goes fine, the solution works, and the topic of linear regression could be closed, but what are the disadvantages? The main problem is that the complexity of the solution depends cubically on the number of features. If you have 10 features, you will need on the order of 1,000 operations, maybe a few thousand. If you have 100 features, on the order of 1 million operations. And if you have 1,000 features, which is far from the limit in machine learning, the number of operations reaches a billion, which is quite a lot. The second problem is that this method is not universal: it is suitable for fitting linear regression, but it doesn't work for classification. We would like, of course, to find a silver bullet and use it to solve different problems. And the silver bullet has already been found; it's called gradient descent. For a function of many variables, we can take the derivative with respect to each variable and compose the gradient vector. This vector has one remarkable property: at any point where it is computed, it points in the direction where the function grows fastest. But it also has another property.
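The analytical solution described above, setting the gradient to zero and solving the resulting linear system, is known as the normal equations. Here is a minimal sketch on synthetic data; the dimensions and the true weight vector are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
# Bias column of ones plus d random features.
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_w  # noiseless labels, so the exact solution is recoverable

# Setting the derivative of the error to zero gives the linear system
# (X^T X) w = X^T y.  Solving it costs O(d^3) in the number of features d,
# which is the cubic blow-up discussed above.
w = np.linalg.solve(X.T @ X, X.T @ y)
```

With noiseless labels the solver recovers the true weights exactly; with real, noisy data it returns the weights minimizing the mean square error.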
If you add a minus sign to it, the vector will point in the direction where the function decreases most quickly, and this can be used to find our solution. We take our current value, starting from some random point, and subtract the gradient from it. The gradient is multiplied by a factor that controls the rate of convergence. If this multiplier is too large, we will always overshoot the minimum; if it's too small, we will never reach it, and the algorithm will finish when pigs fly. That's how it works. And here is an example of a two-dimensional surface; the circles I have drawn are called level lines. Let our error be 10,000 on the outermost level line, 5,000 on the second, 1,000 on the third, 500 on the fourth, and 300 on the fifth. With each iteration, our point, our weight vector, gets closer and closer to the minimum. How is this calculated in practice? We have our data frame with features and labels; the features are x, as I have already mentioned, and the labels are y. For each example, for each coordinate, we can calculate the gradient value separately. Then we combine everything to get the full value of the gradient and make one gradient step with it. What are the advantages of this method? It is universal, and its complexity grows linearly with the number of features. But there is one problem: the method can get stuck in a local minimum, and its speed of convergence is not the greatest. Therefore, another method was invented, or rather an improvement of the previous one, which is called stochastic gradient descent. Gradient descent looks like this; in the stochastic gradient method, we start by shuffling all the data that we have. Then, for each point, we compute the same prediction error, and this error gives us our gradient, with which we make a gradient step, subtracting it from our current weight vector.
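The full gradient descent loop just described can be sketched as follows. This is an illustrative toy, not the video's exact code: the data is synthetic, and the learning rate and iteration count are arbitrary choices that happen to converge here:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # bias + features
true_w = np.array([1.0, 2.0, -0.5])
y = X @ true_w

w = np.zeros(d + 1)   # start from some point (here: zeros)
lr = 0.1              # the multiplier controlling the rate of convergence
for _ in range(500):
    # Gradient of the mean square error, combined over the whole data set.
    grad = 2.0 / n * X.T @ (X @ w - y)
    w -= lr * grad    # one gradient step: subtract the scaled gradient
```

Note that every iteration touches all n examples once, which is exactly why plain gradient descent needs many full passes over the data to learn.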
That's the method I am talking about; here it is. What advantages does it have? It converges much faster, because in ordinary gradient descent one pass over the data gives us just a single approximation step, and we need several iterations to learn. The stochastic gradient, in the same single pass over the whole data set, has time to make many small approximation steps, hence the acceleration of convergence. The second advantage is that the chance of reaching the global minimum, and not a local one, is much higher. If the function has a local minimum, gradient descent can get stuck in it; the stochastic gradient can jump out of it, thanks to its random walk. And the third advantage is online training: data can come to us gradually, and such an approach will take into account how the bike rental business evolves over time. And the last thing: there was another improvement of the stochastic gradient. The fact is that almost no one uses the stochastic gradient on single points. Instead, the stochastic gradient is computed on small parts of the complete data frame, which are called mini-batches. We calculate our gradient on one mini-batch and update our weight vector with it; then we compute the gradient on another mini-batch, again update the weight vector, and so on. That's all I wanted to tell you today. You have learned how the problem of learning is formulated, how it's solved analytically, and how it can be solved by gradient and stochastic gradient descent; I hope it was useful. In the next video, I will tell you about even more advanced machine learning techniques on big data that converge even faster and give even more accuracy. Stay with us.
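As a closing illustration, the mini-batch procedure described in this video can be sketched like this. Again a hedged toy, assuming synthetic noiseless data; the batch size, learning rate, and epoch count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 1000, 2
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])  # bias + features
true_w = np.array([0.5, 1.5, -2.0])
y = X @ true_w

w = np.zeros(d + 1)
lr, batch = 0.05, 32
for epoch in range(20):
    idx = rng.permutation(n)             # shuffle the data before each pass
    for start in range(0, n, batch):
        b = idx[start:start + batch]     # one mini-batch of the data frame
        # Gradient of the MSE on this mini-batch only.
        grad = 2.0 / len(b) * X[b].T @ (X[b] @ w - y[b])
        w -= lr * grad                   # update the weight vector
```

One pass over the data now makes n/batch updates instead of one, which is the source of the faster convergence, and the batch-to-batch noise is what lets the method escape shallow local minima.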