In this video, we'll be talking about data normalization,

an important technique to understand in data pre-processing.

When we take a look at the used car data set,

we notice in the data that the feature length ranges from 150 to 250,

while the features width and height range from 50 to 100.

We may want to normalize these variables so that the range of the values is consistent.

This normalization can make some statistical analyses easier down the road.

By making the ranges consistent between variables,

normalization enables a fair comparison between the different features,

so that no feature dominates simply because of its scale.

It is also important for computational reasons.

Here is another example that will help you understand why normalization is important.

Consider a data set containing two features, age and income,

where age ranges from 0 to 100,

while income ranges from 20,000 to 500,000.

Income is about 1,000 times larger than age.

So, these two features are in very different ranges.

When we do further analysis,

like linear regression for example,

the attribute income will intrinsically

influence the result more due to its larger values.

But this doesn't necessarily mean it is more important as a predictor.

So, the nature of the data biases

the linear regression model to weigh income more heavily than age.

To avoid this, we can normalize

these two variables into values that range from zero to one.

Compare the two tables at the right.

After normalization, both variables now

have a similar influence on the models we will build later.
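
As a rough sketch of what that comparison looks like in code, here is a small pandas example. The age and income values are made up for illustration, and min-max scaling is assumed as the way to bring both columns into the zero-to-one range.

```python
import pandas as pd

# Hypothetical sample values, chosen only to illustrate the two scales.
people = pd.DataFrame({
    "age": [20, 30, 40],
    "income": [100000, 20000, 500000],
})

# Assumed method: min-max scaling applied to every column at once.
# pandas aligns the per-column min and max with the matching columns.
normalized = (people - people.min()) / (people.max() - people.min())

print(people)
print(normalized)
```

Before scaling, income is thousands of times larger than age. After scaling, both columns lie between zero and one.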

There are several ways to normalize data.

I will just outline three techniques.

The first method, called simple feature scaling, just

divides each value by the maximum value for that feature.

For non-negative data, this makes the new values range between zero and one.
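
Written as a formula, simple feature scaling looks like this. The symbol names x_new, x_old and x_max are assumed here for illustration.

```latex
x_{\text{new}} = \frac{x_{\text{old}}}{x_{\max}}
```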

The second method, called min-max, takes

each value X_old, subtracts the minimum value of that feature from it,

then divides by the range of that feature.

Again, the resulting new values range between zero and one.
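
As a formula, min-max scaling can be written as follows, where x_min and x_max (names assumed here) are the minimum and maximum of the feature, and their difference is the range.

```latex
x_{\text{new}} = \frac{x_{\text{old}} - x_{\min}}{x_{\max} - x_{\min}}
```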

The third method is called z-score or standard score.

In this formula, for each value you subtract mu, which is the average of the feature,

and then divide by the standard deviation sigma.

The resulting values hover around zero,

and typically range between negative three and positive three but can be higher or lower.
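
The z-score formula, using mu for the average and sigma for the standard deviation as described, can be written like this.

```latex
x_{\text{new}} = \frac{x_{\text{old}} - \mu}{\sigma}
```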

Following our earlier example,

we can apply each normalization method to the length feature.

First, we use the simple feature scaling method,

where we divide it by the maximum value in the feature.

Using the pandas method max,

this can be done in just one line of code.
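
A minimal sketch of that one line might look like the following. The data frame here is a small made-up stand-in for the used car data set, and the new column name length_scaled is an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the used car data; the values are made up.
df = pd.DataFrame({"length": [168.8, 171.2, 176.6, 192.7, 150.0]})

# Simple feature scaling: divide each value by the column maximum.
df["length_scaled"] = df["length"] / df["length"].max()

print(df)
```

The result is stored in a new column so the original values stay available for comparison.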

Here's the min-max method on the length feature.

We subtract the minimum of that column from each value,

then divide by the range of that column,

the max minus the min.
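
Here is a minimal sketch of the min-max method in pandas. The data frame is again a made-up stand-in for the used car data set, and the names col_min, col_max and length_minmax are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the used car data; the values are made up.
df = pd.DataFrame({"length": [168.8, 171.2, 176.6, 192.7, 150.0]})

# Minimum and maximum of the column; their difference is the range.
col_min = df["length"].min()
col_max = df["length"].max()

# Min-max scaling: subtract the minimum, then divide by the range.
df["length_minmax"] = (df["length"] - col_min) / (col_max - col_min)

print(df)
```

The smallest length maps to zero and the largest maps to one.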

Finally, we apply the z-score method on the length feature to normalize the values.

Here we apply the mean and std methods on the length feature.

The mean method will return the average value of the feature in the data set,

and the std method will return the standard deviation of the feature in the data set.
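
A minimal sketch of the z-score method in pandas could look like this. The data frame is a made-up stand-in for the used car data set, and the column name length_z is an assumption for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the used car data; the values are made up.
df = pd.DataFrame({"length": [168.8, 171.2, 176.6, 192.7, 150.0]})

# Z-score: subtract the column mean, then divide by the standard deviation.
# Note: pandas std() computes the sample standard deviation (ddof=1) by default.
df["length_z"] = (df["length"] - df["length"].mean()) / df["length"].std()

print(df)
```

The new column has an average of about zero and a standard deviation of about one.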