Hi, my name is Vladimir. In the following videos, we'll talk about two important topics: how to simplify your model and how to simplify the data at hand. Namely, we're going to talk about regularization and a few unsupervised techniques, PCA decomposition and K-means clustering. While unsupervised learning is considered to be more of an art than a science, regularization is one of the key concepts in machine learning, and we're going to start with it.

Let's first understand why simplicity is so important and formally define what we usually mean by a simple model. When building a supervised machine learning model, the general principle of keeping things simple can be expressed in the following way: always choose the simplest model among those with the same training sample error. Moreover, simple models actually tend to have a smaller test error, which we'll discuss a bit later. Now, let's think about the meaning of "simple" with a few examples. The simplicity of a linear model, which is effectively a list of its weights, can be defined as the number of non-zero weights. In the case of decision trees, the simpler model is the one containing fewer nodes and fewer splits. Going forward, we'll be talking more about linear models, as they were extensively discussed in the previous lesson.

Among the most important reasons for using simpler models is their relative inability to over-fit. Just to remind you, over-fitting is a situation where a model fits the training data sample too well and makes rather poor predictions on the test sample. Another important reason is easier maintenance of simple models. For example, they are easier to interpret, and interpretation is crucial for some applications. Also, models are sometimes used to make tens of thousands of predictions per second, and the efficiency of evaluation becomes really important. So, I hope this was convincing enough, and now you're ready to think about how you can achieve it. Let's be more specific and think of a particular task.
You have a number of features and a binary outcome variable y which you want to predict. One of the basic ideas that comes to mind is to restrict the number of features allowed for the modelling. The simplest way is to make use of statistical tests that verify the hypothesis of independence of each feature and y in turn. The main statistical tests for this task are the chi-squared test and the F-test. So one can simply disregard the features deemed independent from y. But the problem here is that you can't go further and choose a good subset among the rest of the variables, because the results of statistical tests are incomparable: you simply can't say which dependency is stronger.

It turns out there is a better way of constraining a model, which works by modifying the learning process. To understand how it works, let's remember how models are actually learned. Consider a logistic regression model for our task. The learning boils down to minimizing the logistic loss function, L(w), on the training sample. Intuitively, what you want is to constrain the process and make it converge towards a simpler solution, even if the value of L(w) is suboptimal. To achieve that, let's introduce an additional, so-called regularization term to our task, by simply adding it to L(w). This term depends on the model weights and should penalize the complexity of the model. There are numerous ways to choose this function, but probably the first that comes to mind is just the squared length of the w vector. Note that we deliberately don't include the intercept w0 here. Using this regularizer is called L2 regularization, and lambda here controls the extent to which regularization is applied. One good property of L2 regularization is smoothness, which allows us to apply the previously discussed optimization methods to the modified loss L'(w). A good way to think about regularization is that it now becomes expensive to have large coefficients in the model.
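As a minimal illustration (not Spark's implementation), the penalized objective described above can be sketched in plain Python: the logistic loss plus lambda times the squared length of w, with the intercept w0 deliberately excluded from the penalty. The function name and the toy data are hypothetical, chosen just for this sketch.

```python
import math

def l2_regularized_loss(w, X, y, lam):
    """L'(w) = logistic loss over the sample + lam * squared length of w.

    w[0] is the intercept and is excluded from the penalty term.
    """
    loss = 0.0
    for xi, yi in zip(X, y):            # yi is 0 or 1
        z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))
        p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
        loss -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    penalty = lam * sum(wj * wj for wj in w[1:])  # skip intercept w[0]
    return loss + penalty

# Hypothetical toy sample: two features, binary outcome.
X = [(0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [0, 1, 1]

# With lam > 0 the same weights cost more, and larger weights cost much more:
base = l2_regularized_loss([0.0, 1.0, -1.0], X, y, lam=0.0)
penalized = l2_regularized_loss([0.0, 1.0, -1.0], X, y, lam=0.1)
```

With lam = 0 this is ordinary logistic loss; increasing lam shifts the optimum toward smaller coefficients, which is exactly the "large coefficients become expensive" intuition.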
So large weights will be associated with those features whose benefit outweighs the penalty term. This, at least intuitively, looks like a reasonable thing to do, and it turns out that in many applications it really helps with generalization. To be precise, you can often find a good value of lambda that increases the accuracy on the test sample.

Even though having small weights is good, it's even better to have more weights exactly equal to zero. One of the most common ways to achieve that is to use an L1 regularization term. The only difference from the previous case is that now you're using the absolute values of the weights instead of their squares. I guess it's not really obvious why such a function would guarantee sparsity of the solution. The reason comes from the fact that our function now has no derivative at 0, so 0 becomes a kind of special point in our optimization task. Needless to say, vanilla gradient descent methods need adaptation to be able to deal with this task. There are quite a few ways to make this adaptation. Without going too deep into details, Spark uses a solution based on so-called proximal operators. It boils down to using a gradient descent update coupled with a special projection operation. The projection makes weights that deviate not too far from 0 become exactly 0 again, which implies sparsity in the resulting model.

To go further, there is a useful generalization, which is called the elastic net. It's basically a convex combination of the previous two regularization terms. Setting alpha to 0 or 1, you get vanilla L2 or L1 correspondingly.

So what else is important to know about regularization? It turns out that the way the features are scaled can significantly affect the result. Imagine, for example, two features, one of which has a wide range of values going all the way from 0 to 100, while the second one changes only from 0 to 1. Obviously, equal modifications of the corresponding weights have very different effects on the prediction.
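The proximal-operator idea can be sketched in a few lines of plain Python. For the L1 term, the proximal operator is soft-thresholding: you take a plain gradient step on the smooth part of the loss, then apply a projection that shrinks every weight and snaps the small ones to exactly 0. The one-dimensional toy objective below is purely illustrative, not Spark's actual solver.

```python
def soft_threshold(w, t):
    """Proximal operator of t*|w|: shrink toward 0, snap to 0 if |w| <= t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def prox_gradient_step(w, grad, lr, lam):
    # Plain gradient step on the smooth loss, then the L1 projection.
    return soft_threshold(w - lr * grad, lr * lam)

# Toy task: minimize 0.5 * (w - 0.3)**2 + lam * |w|.
# For a strong enough lam, the optimum is exactly 0 -- sparsity.
w = 1.0
lam = 0.5
for _ in range(100):
    grad = w - 0.3                      # derivative of the smooth part only
    w = prox_gradient_step(w, grad, lr=0.5, lam=lam)
# w ends up exactly 0.0, not just small.
```

This is why L1 produces exact zeros while L2 only shrinks: the L2 penalty's gradient vanishes near 0, so weights approach 0 asymptotically, whereas the soft-thresholding projection sets them to 0 in one step.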
What is wrong here is that both these coefficients get penalized in the same way. To overcome this obstacle, we can simply scale the features. There are two common ways: standardization and min-max scaling. The former implies division by the standard deviation and optional mean subtraction. The latter is a linear transformation that makes the minimum value and maximum value become 0 and 1. Another thing that I haven't mentioned yet is the way the lambda and alpha values should be set. I guess the answer is simple: they should be treated as hyperparameters of the model and fitted against the data.

Now, I suggest you take a look at the example considered in the previous lecture and add regularization to the model. Here you can see the Spark code that fits the linear regression model to the bicycle data. Note that I explicitly set the value of the elastic net parameter and the regularization parameter to 0, so you get an unregularized model. The score of this model on the test sample is around 0.75. Note that some weight parameters go as high as 10,000. To set up L2 regularization, you should leave the value of alpha equal to zero while increasing the value of regParam, which stands for lambda. Here you can see that with lambda equal to 100, you get a different solution with a slightly better score. This time the values of the weights are at most 3,400, which is an expected outcome of using L2. Next, let's use L1 regularization, setting alpha equal to 1 and regParam=50. The resulting score is the same as for the initial model, though you can see that two weights have ended up being exactly 0, which, again, is an expected outcome.

So in this video, we've learned what L1 and L2 regularization are and the reasons for using them. It's important to remember that L1 not only helps fight over-fitting, but also introduces sparsity, which is a useful property.
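The two scalings can be sketched in plain Python (this is just an illustration; Spark ML ships its own feature-scaling transformers, and the feature values below are made up):

```python
import math

def standardize(xs, subtract_mean=True):
    """Divide by the standard deviation, optionally subtracting the mean first."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    shift = mean if subtract_mean else 0.0
    return [(x - shift) / std for x in xs]

def min_max_scale(xs):
    """Linear transformation mapping min(xs) -> 0 and max(xs) -> 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

wide = [0.0, 25.0, 50.0, 100.0]   # a feature ranging over 0..100
narrow = [0.0, 0.25, 0.5, 1.0]    # a feature ranging over 0..1
# After either scaling, both features live on comparable scales,
# so the regularization term penalizes their weights fairly.
```

After scaling, a unit change in any feature corresponds to a comparable change in the prediction, so penalizing all weights with the same lambda becomes a fair constraint.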
The concept of regularization is really important, as it often applies to other classes of models as well, such as support vector machines or neural networks.