0:07

Now we'll look at another way to estimate w and b for a linear model, called ridge regression. Ridge regression uses the same least-squares criterion, but with one difference: during the training phase, it adds a penalty for feature weights, the w_i values, that are too large, as shown in the equation here. Mathematically, large weights mean that the sum of their squared values is large.
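The equation itself isn't reproduced in this transcript; a standard way to write the ridge objective (notation assumed to match the lecture's slides) is the ordinary sum of squared errors plus the squared-weight penalty:

```latex
RSS_{ridge}(w, b) \;=\; \sum_{i=1}^{N} \bigl( y_i - (w \cdot x_i + b) \bigr)^2 \;+\; \alpha \sum_{j=1}^{p} w_j^2
```

The first term is the familiar least-squares criterion; the second grows when any of the p feature weights is large.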

Once ridge regression has estimated the w and b parameters for the linear model, the prediction of y values for new instances is exactly the same as in least-squares: you just plug in your input feature values, the x_i's, and compute the sum of the weighted feature values plus b with the usual linear formula.

So why would something like ridge regression be useful? This addition of a penalty term to a learning algorithm's objective function is called regularization. Regularization is an extremely important concept in machine learning. It's a way to prevent overfitting, and thus improve the likely generalization performance of a model, by restricting the model's possible parameter settings. Usually the effect of this restriction from regularization is to reduce the complexity of the final estimated model.

So how does this work with linear regression? The addition of the sum of squared parameter values, shown in the box, to the least-squares objective means that models with larger feature weights (w) add more to the objective function's overall value. Because our goal is to minimize the overall objective function, the regularization term acts as a penalty on models with lots of large feature weight values. In other words, all things being equal, if ridge regression finds two possible linear models that predict the training data values equally well, it will prefer the linear model that has a smaller overall sum of squared feature weights. The practical effect of using ridge regression is to find the feature weights w_i that fit the data well in a least-squares sense, and that set lots of the feature weights to values that are very small.

We don't see this effect with a single-variable linear regression example, but for regression problems with dozens or hundreds of features, the accuracy improvement from using regularized linear regression like ridge regression can be significant. The amount of regularization to apply is controlled by the alpha parameter. Larger alpha means more regularization and simpler linear models, with weights closer to zero. The default setting for alpha is 1.0. Notice that setting alpha to zero corresponds to the special case of ordinary least-squares linear regression that we saw earlier, which minimizes the total squared error.

In scikit-learn, you use ridge regression by importing the Ridge class from sklearn.linear_model, and then using that estimator object just as you would for least-squares. The one difference is that you can specify the amount of the ridge regression regularization penalty, which is called the L2 penalty, using the alpha parameter.
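As a hedged sketch of that pattern (the data here is synthetic and the variable names are ours, not the notebook's), we can fit Ridge and ordinary least-squares side by side and observe the shrinkage the penalty produces:

```python
# Synthetic illustration: ridge regression vs. ordinary least-squares.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(100, 20)                         # 20 features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.randn(100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linreg = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=20.0).fit(X_train, y_train)   # alpha sets the L2 penalty

# The penalty shrinks the weights: ridge's sum of squared coefficients
# is smaller than ordinary least-squares'.
print(np.sum(ridge.coef_ ** 2) < np.sum(linreg.coef_ ** 2))  # True
```

Apart from the alpha argument, the fit/predict/score workflow is identical to plain LinearRegression.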

Here, we're applying ridge regression to the crime dataset. You'll notice that the results are not that impressive: the R-squared score on the test set is pretty comparable to what we got for least-squares regression. However, there is something we can do in applying ridge regression that will improve the results dramatically. So now is the time for a brief digression about the need for feature preprocessing and normalization.

Let's stop and think for a moment, intuitively, about what ridge regression is doing. It's regularizing the linear regression by imposing that sum-of-squares penalty on the size of the w coefficients. So the effect of increasing alpha is to shrink the w coefficients toward zero and toward each other. But if the input variables, the features, have very different scales, then when this shrinkage of the coefficients happens, input variables with different scales will have different contributions to the L2 penalty, because the L2 penalty is a sum of squares of all the coefficients. So transforming the input features so they're all on the same scale means the ridge penalty is, in some sense, applied more fairly to all features, without unduly weighting some more than others just because of a difference in scales.

More generally, you'll see as we proceed through the course that feature normalization is important for a number of different learning algorithms beyond just regularized regression. These include k-nearest neighbors, support vector machines, neural networks, and others. The type of feature preprocessing and normalization that's needed can also depend on the data.

For now, we're going to apply a widely used form of feature normalization called MinMax scaling, which will transform all the input variables so they're all on the same scale between zero and one. To do this, we compute the minimum and maximum values for each feature on the training data, and then apply the minmax transformation for each feature as shown here.
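The transformation formula isn't reproduced in this transcript; the standard minmax formula for a feature value x_j (notation assumed) is:

```latex
x_j' \;=\; \frac{x_j - x_j^{\min}}{x_j^{\max} - x_j^{\min}}
```

so that the training-set minimum for each feature maps to 0 and the maximum maps to 1.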

Here's an example of how it works with two features. Suppose we have one feature, "height", whose values fall in a fairly narrow range between 1.5 and 2.5 units, but a second feature, "width", has a much wider range, between 5 and 10 units. After applying minmax scaling, the values for both features are transformed so that they are on the same scale, with the minimum value getting mapped to zero, the maximum value getting mapped to one, and everything else getting transformed to a value between those two extremes.

To apply minmax scaling in scikit-learn, you import the MinMaxScaler object from sklearn.preprocessing. To prepare the scaler object for use, you create it and then call the fit method using the training data X_train. This will compute the min and max feature values for each feature in the training dataset. Then, to apply the scaler, you call its transform method and pass in the data you want to rescale. The output will be the scaled version of the input data. In this case, we want to scale the training data and save it in a new variable called X_train_scaled, and the test data, saving that into a new variable called X_test_scaled. Then we just use these scaled versions of the feature data instead of the original feature data. Note that it can be more efficient to perform fitting and transforming in a single step on the training set, by using the scaler's fit_transform method as shown here.
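A small sketch of that scaler workflow, using the height/width example from earlier (the three sample points are illustrative, not from the notebook):

```python
# MinMax scaling: fit on training data, then transform.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1.5,  5.0],
                    [2.0,  7.5],
                    [2.5, 10.0]])   # columns: height, width

scaler = MinMaxScaler()
scaler.fit(X_train)                          # learn per-feature min and max
X_train_scaled = scaler.transform(X_train)   # map each feature into [0, 1]
# equivalent single step: X_train_scaled = scaler.fit_transform(X_train)

# Per column: min -> 0.0, midpoint -> 0.5, max -> 1.0
print(X_train_scaled)
```

Any test data would then be rescaled with the same fitted scaler via scaler.transform(X_test), never with a new fit.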

There's one last but very important point here about how to apply minmax scaling, or any kind of feature normalization, in a learning scenario with training and test sets. You may have noticed two things here. First, we're applying the same scaler object to both the training and the test data. Second, we're fitting the scaler object on the training data and not on the test data. These are both critical aspects of feature normalization. If you don't apply the same scaling to training and test sets, you'll end up with more or less random data skew, which will invalidate your results. If you prepare the scaler or other normalization method by showing it the test data instead of the training data, you get a phenomenon called data leakage, where the training phase has information that has leaked from the test set: for example, the distribution of extreme values for each feature in the test data, which the learner should never have access to during training. This in turn can cause the learning method to give unrealistically good estimates on that same test set. We'll look more at the phenomenon of data leakage later in the course.

One downside to performing feature normalization is that the resulting model, with its transformed features, may be harder to interpret. Again, in the end, the type of feature normalization that's best to apply depends on the dataset, the learning task, and the learning algorithm to be used. We'll continue to touch on this issue throughout the course.

Okay, let's return to ridge regression now that we've added the code for minmax scaling of the input features. We can see the significant effect of minmax scaling on the performance of ridge regression. After the input features have been properly scaled, ridge regression achieves a significantly better model fit, with an R-squared value on the test set of about 0.6. That's much better than without scaling, and much better now than ordinary least-squares. In fact, if you apply the same minmax scaling with ordinary least-squares regression, you should find that it doesn't change the outcome at all. In general, regularization works especially well when you have relatively small amounts of training data compared to the number of features in your model. Regularization becomes less important as the amount of training data you have increases.

We can see the effect of varying the amount of regularization on the scaled training and test data, using different settings for alpha, in this example.
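A hedged sketch of such an alpha sweep, on made-up synthetic data (the lecture's actual plot uses the minmax-scaled crime dataset, so the scores here won't match it):

```python
# Sweep ridge's alpha on scaled data and record test-set R^2 for each value.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X = rng.uniform(0, 100, size=(80, 30))                 # 30 features, raw scales
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + 5 * rng.randn(80)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)    # same scaler, fit on train only

scores = []
for alpha in [0.01, 1, 20, 1000]:
    model = Ridge(alpha=alpha).fit(X_train_s, y_train)
    scores.append(model.score(X_test_s, y_test))       # test-set R^2 per alpha
    print(f"alpha={alpha}: R^2 = {scores[-1]:.2f}")
```

Plotting scores against alpha typically shows the peak at some intermediate value, with very small and very large alphas both doing worse.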

The best R-squared value on the test set is achieved with an alpha setting of around 20. Significantly larger or smaller values of alpha both lead to significantly worse model fit. This is another illustration of the general relationship between model complexity and test set performance that we saw earlier in this lecture, where there's often an intermediate, best value of a model complexity parameter that leads to neither underfitting nor overfitting.

Another kind of regularized regression that you could use instead of ridge regression is called lasso regression. Like ridge regression, lasso regression adds a regularization penalty term to the ordinary least-squares objective that causes the model's w coefficients to shrink toward zero. Lasso regression uses a slightly different regularization term, called an L1 penalty, instead of ridge regression's L2 penalty, as shown here. The L1 penalty looks kind of similar to the L2 penalty, in that it computes a sum over the coefficients, but it's a sum of the absolute values of the w coefficients instead of a sum of squares. And the results are noticeably different: with lasso regression, a subset of the coefficients is forced to be precisely zero. This is a kind of automatic feature selection, since features with a weight of zero are essentially ignored completely in the model. This sparse solution, where only a subset of the most important features are left with non-zero weights, also makes the model easier to interpret in cases where there are more than a few input variables.

Like ridge regression, the amount of regularization for lasso regression is controlled by the parameter alpha, which by default is 1.0. Also like ridge regression, the purpose of using lasso regression is to estimate the w and b model coefficients. Once that's done, the prediction formula is the same as for ordinary least-squares: you just use the linear model.

In general, lasso regression is most helpful if you think there are only a few variables that have a medium or large effect on the output variable. Otherwise, if there are lots of variables that contribute small or medium effects, ridge regression is typically the better choice.

Let's take a look at lasso regression in scikit-learn using the notebook, with our Communities and Crime regression dataset. To use lasso regression, you import the Lasso class from sklearn.linear_model, and then just use it as you would an estimator like Ridge. With some datasets you may occasionally get a convergence warning, in which case you can set the max_iter attribute to a larger value: typically at least 20,000, or possibly more.
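As a hedged sketch of that pattern (synthetic data; the alpha and max_iter values here are illustrative, not the notebook's):

```python
# Lasso on data where only 2 of 50 features carry signal.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.randn(100, 50)                          # 50 features, 2 informative
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.randn(100)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
lasso = Lasso(alpha=1.0, max_iter=20000).fit(X_train, y_train)

# The L1 penalty zeroes out most weights -- automatic feature selection.
n_nonzero = np.sum(lasso.coef_ != 0)
print(n_nonzero)   # a small number, far below 50
```

Comparing `lasso.coef_` against a Ridge fit on the same data shows the key difference: ridge shrinks all 50 weights but leaves them non-zero, while lasso discards most of them outright.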

Increasing the max_iter parameter will increase the computation time accordingly. In this example, we're applying lasso to a minmax-scaled version of the crime dataset, as we did for ridge regression. You can see that with alpha set to 2.0, only 20 features with non-zero weights remain, because with lasso regularization, most of the features have weights of exactly zero. I've listed the features with non-zero weights from the output, in order of their descending magnitude.

Although we need to be careful in interpreting any results for data on a complex problem like crime, the lasso regression results do help us see some of the strongest relationships between the input variables and the outcome for this particular dataset. For example, looking at the top five features with non-zero weights found by lasso regression, we can see that location factors like the percentage of people in dense housing, which indicates urban areas, and socioeconomic variables like the fraction of vacant houses in an area, are positively correlated with crime, while other variables, like the percentage of families with two parents, are negatively correlated.

Finally, we can see the effect of tuning the regularization parameter alpha for lasso regression. As we saw with ridge regression, there's an optimal range for alpha that gives the best test set performance, neither underfitting nor overfitting. Of course, this best alpha value will be different for different datasets, and depends on various other factors such as the feature preprocessing methods being used.

Let's suppose for a moment that we had a set of two-dimensional data points with features x0 and x1. Then we could transform each data point by adding additional features that are the three unique multiplicative combinations of x0 and x1: x0 squared, x0 times x1, and x1 squared. So we've transformed our original two-dimensional points into a set of five-dimensional points that rely only on the information in the two-dimensional points. Now we can write a new regression problem that tries to predict the same output variable y-hat, but using these five features instead of two. The critical insight here is that this is still a linear regression problem: the features are just numbers within a weighted sum. So we can use the same least-squares techniques to estimate the five model coefficients for these five features that we used in the simpler two-dimensional case.
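A minimal check of the transformation just described, using scikit-learn's PolynomialFeatures (with the bias column suppressed so we get exactly the five features listed above):

```python
# Degree-2 polynomial expansion of a single 2-D point.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])           # one point: x0 = 2, x1 = 3
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(X_poly)  # [[2. 3. 4. 6. 9.]] -> x0, x1, x0^2, x0*x1, x1^2
```

The expanded array can be fed to any linear regressor exactly as if it were ordinary feature data.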

Now, why would we want to do this kind of transformation? Well, this is called a polynomial feature transformation, which we can use to transform a problem into a higher-dimensional regression space. In effect, adding these extra polynomial features gives us a much richer set of complex functions that we can fit to the data. So you can think of this intuitively as allowing polynomials to be fit to the training data instead of simply a straight line, but using the same least-squares criterion that minimizes mean squared error. We'll see later that this approach of adding new features like polynomial features is also very effective with classification, and we'll look at this kind of transformation again with kernelized support vector machines.

When we add these new polynomial features, we're adding to the model's ability to capture interactions between the different variables, by adding them as features to the linear model. For example, as a theoretical example, it may be that housing prices vary as a quadratic function of both the lot size that a house sits on and the amount of taxes paid on the property. A simple linear model could not capture this nonlinear relationship, but by adding nonlinear features like polynomials to the linear regression model, we can capture this nonlinearity.

More generally, we can use other types of nonlinear feature transformations beyond just polynomials. This is beyond the scope of this course, but technically these are called nonlinear basis functions for regression, and they are widely used. Of course, one side effect of adding lots of new features, especially when we're taking every possible combination of K variables, is that these more complex models have the potential for overfitting. So in practice, polynomial regression is often done with a regularized learning method like ridge regression.

Here's an example of polynomial regression using scikit-learn. There's already a handy class called PolynomialFeatures in the sklearn.preprocessing module that will generate these polynomial features for us. This example shows three regressions on a more complex regression dataset that happens to have some quadratic interactions between variables. The first regression just uses least-squares regression without the polynomial feature transformation. The second regression creates the PolynomialFeatures object with degree set to two, and then calls the fit_transform method of the PolynomialFeatures object on the original X_F1 features, to produce the new polynomially transformed features X_F1_poly. The code then calls ordinary least-squares linear regression. You can see indications of overfitting on this expanded feature representation, as the model's R-squared score on the training set is close to one but much lower on the test set. So the third regression shows the effect of adding regularization via ridge regression on this expanded feature set. Now the training and test R-squared scores are basically the same, with the test set score of the regularized polynomial regression performing the best of all three regression methods.
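The three regressions can be sketched as follows; this is a hedged stand-in on synthetic data with quadratic interactions, since the notebook's actual dataset and variable names aren't reproduced in this transcript:

```python
# Three regressions: plain OLS, OLS on polynomial features, ridge on polynomial features.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = rng.randn(150, 7)
y = X[:, 0] ** 2 + X[:, 1] * X[:, 2] + 0.5 * rng.randn(150)   # quadratic truth

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1) plain least-squares on the original features
lin = LinearRegression().fit(X_train, y_train)

# 2) least-squares on degree-2 polynomial features (prone to overfit)
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
lin_poly = LinearRegression().fit(X_train_poly, y_train)

# 3) ridge regression on the same expanded features (regularized)
ridge_poly = Ridge(alpha=1.0).fit(X_train_poly, y_train)

print(lin.score(X_test, y_test),        # poor: the true function is quadratic
      lin_poly.score(X_test_poly, y_test),
      ridge_poly.score(X_test_poly, y_test))
```

On data like this, the polynomial models easily beat the plain linear fit, and the regularized version is the most robust of the three.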
