In this video and in the video after this one, I want to tell you about some of the practical tricks for making gradient descent work well. In this video, I want to tell you about an idea called feature scaling. Here's the idea. If you have a problem where you have multiple features, and you make sure that the features are on a similar scale, by which I mean make sure that the different features take on similar ranges of values, then gradient descent can converge more quickly. Concretely, let's say you have a problem with two features, where x1 is the size of the house and takes on values between, say, zero and two thousand, and x2 is the number of bedrooms, and maybe that takes on values between one and five. If you plot the contours of the cost function J(theta), then the contours may look like this, where, let's see, J(theta) is a function of the parameters theta0, theta1 and theta2. I'm going to ignore theta0, so let's forget about theta0 and pretend J is a function of only theta1 and theta2. But if x1 can take on a much larger range of values than x2, it turns out that the contours of the cost function J(theta) can take on this very, very skewed elliptical shape, except that with a 2000-to-5 ratio, it can be even more skewed. So these very tall and skinny ellipses, or these very tall, skinny ovals, can form the contours of the cost function J(theta). And if you run gradient descent on this cost function, gradient descent may end up taking a long time: it can oscillate back and forth before it finally finds its way to the global minimum. In fact, you can imagine that if these contours are exaggerated even more, so that you draw incredibly tall, skinny contours, and it can be even more extreme than this, then gradient descent just has a much harder time making its way, meandering around, and it can take a long time to find its way to the global minimum. In these settings, a useful thing to do is to scale the features.
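To make the slow convergence concrete, here is a minimal sketch that runs the same batch gradient descent twice: once on raw house sizes and bedroom counts, and once on those features divided by 2000 and 5. The five houses, the prices, the function name `gradient_descent_steps`, the stopping tolerance and the step-size rule (one over the trace of the Hessian, picked only so that neither run diverges) are all assumptions made up for illustration. Like the contour picture, it ignores theta0.

```python
def gradient_descent_steps(features, targets, tol=1e-3, max_steps=20000):
    """Batch gradient descent on J(theta) = 1/(2m) * sum((theta . x - y)^2)
    with two parameters and no intercept. Returns the number of steps taken
    before J drops below tol, or max_steps if it never does."""
    m = len(targets)
    # Assumed step size: 1 / trace of the Hessian (1/m) * X^T X.
    # This is a conservative choice that keeps both runs stable.
    trace = sum(x1 * x1 + x2 * x2 for x1, x2 in features) / m
    alpha = 1.0 / trace
    theta1, theta2 = 0.0, 0.0
    for step in range(max_steps):
        errors = [theta1 * x1 + theta2 * x2 - y
                  for (x1, x2), y in zip(features, targets)]
        cost = sum(e * e for e in errors) / (2 * m)
        if cost < tol:
            return step
        grad1 = sum(e * x1 for e, (x1, _) in zip(errors, features)) / m
        grad2 = sum(e * x2 for e, (_, x2) in zip(errors, features)) / m
        # Simultaneous update of both parameters.
        theta1 -= alpha * grad1
        theta2 -= alpha * grad2
    return max_steps


# Made-up training set: size in square feet, number of bedrooms.
sizes = [2000, 1500, 1000, 800, 400]
bedrooms = [5, 3, 3, 2, 1]
# Made-up prices that are an exact linear function of the two features.
prices = [0.1 * s + 20 * b for s, b in zip(sizes, bedrooms)]

raw_features = list(zip(sizes, bedrooms))
scaled_features = [(s / 2000, b / 5) for s, b in zip(sizes, bedrooms)]

raw_steps = gradient_descent_steps(raw_features, prices)
scaled_steps = gradient_descent_steps(scaled_features, prices)
print("steps on raw features:   ", raw_steps)
print("steps on scaled features:", scaled_steps)
```

On this toy data the raw run should use up the whole step budget without reaching the tolerance, while the scaled run should stop well before the cap. The exact counts depend on the made-up data and on the step-size rule, so treat them as an illustration of the trend rather than a benchmark.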
Concretely, if you instead define the feature x1 to be the size of the house divided by two thousand, and define x2 to be maybe the number of bedrooms divided by five, then the contours of the cost function J can become much less skewed, so the contours may look more like circles. And if you run gradient descent on a cost function like this, then gradient descent, you can show mathematically, can find a much more direct path to the global minimum, rather than taking a much more convoluted path where you're sort of trying to follow a much more complicated trajectory to get to the global minimum. So, by scaling the features so that they take on similar ranges of values (in this example, we end up with both features, x1 and x2, between zero and one), you can wind up with an implementation of gradient descent that can converge much faster. More generally, when we're performing feature scaling, what we often want to do is get every feature into approximately a -1 to +1 range. Concretely, your feature x0 is always equal to 1, so that's already in that range, but you may end up dividing other features by different numbers to get them into this range. The numbers -1 and +1 aren't too important. So, if you have a feature x1 that winds up being between 0 and 3, that's not a problem. If you end up having a different feature that winds up being between -2 and +0.5, again, this is close enough to -1 and +1 that, you know, that's fine. It's only if you have a different feature, say x3, that ranges from -100 to +100, that this is a very different range of values than -1 to +1. So, this might be a poorly scaled feature. And similarly, if your features take on a very, very small range of values, so if x4 takes on values between -0.0001 and +0.0001, then again this takes on a much smaller range of values than the -1 to +1 range.
And again I would consider this feature poorly scaled. So the range of values, you know, can be bigger than plus one or smaller than plus one, but just not much bigger, like plus 100 here, or too much smaller, like 0.0001 over there. Different people have different rules of thumb. But the one that I use is that if a feature takes on a range of values from, say, -3 to +3, I usually think that should be just fine, but if it takes on much larger values than +3 or -3, then I start to worry. And if it takes on values from, say, -1/3 to +1/3, you know, I think that's fine too, or 0 to 1/3, or -1/3 to 0. I guess those are typical ranges of values that I consider okay. But if it takes on a much tinier range of values, like x4 here, then again I start to worry. So, the take-home message is: don't worry if your features are not exactly on the same scale or exactly in the same range of values. So long as they're all close enough to this, gradient descent should work okay. In addition to dividing by the maximum value when performing feature scaling, sometimes people will also do what's called mean normalization. And what I mean by that is that you want to take a feature xi and replace it with xi minus mu i to make your features have approximately zero mean. And obviously we don't want to apply this to the feature x0, because the feature x0 is always equal to one, so it cannot have an average value of zero. But concretely, for the other features, if the size of the house takes on values between 0 and 2000, and if, you know, the average size of a house is equal to 1000, then you might use this formula: set the feature x1 to the size minus the average value, divided by 2000. And similarly, if your houses have one to five bedrooms and if on average a house has two bedrooms, then you might use this formula to mean normalize your second feature x2.
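Written out, the two formulas being pointed to here would look roughly like this. This is a reconstruction from the spoken description, using the averages of 1000 and 2 from the example and the same divisors, 2000 and 5, that the earlier scaling used:

```latex
x_1 = \frac{\text{size} - 1000}{2000},
\qquad
x_2 = \frac{\#\text{bedrooms} - 2}{5}
```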
In both of these cases, you therefore wind up with features x1 and x2 that can take on values roughly between -0.5 and +0.5. That's not exactly true, since x2 can actually be slightly larger than 0.5, but it's close enough. And the more general rule is that you might take a feature x1 and replace it with x1 minus mu1, over s1, where, to define these terms, mu1 is the average value of x1 in the training set, and s1 is the range of values of that feature. By range, I mean, let's say, the maximum value minus the minimum value. Or, for those of you who know what the standard deviation of a variable is, setting s1 to be the standard deviation of the variable would be fine, too. But taking, you know, this max minus min would be fine. And similarly for the second feature, x2, you replace x2 with this sort of thing: subtract the mean of the feature and divide by the range of values, meaning the max minus min. And this sort of formula will get your features, you know, maybe not exactly, but maybe roughly into these sorts of ranges. And by the way, for those of you that are being super careful, technically, if we're taking the range as max minus min, this five here will actually become a four. So if the max is 5 and the min is 1, then the range of values is actually equal to 5 minus 1, which is 4. But all of these are approximate, and any value that gets the features into anything close to these sorts of ranges will do fine. The feature scaling doesn't have to be too exact in order to get gradient descent to run quite a lot faster. So, now you know about feature scaling, and if you apply this simple trick, it can make gradient descent run much faster and converge in a lot fewer iterations. That was feature scaling. In the next video, I'll tell you about another trick to make gradient descent work well in practice.
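As a rough sketch of that general rule, the helper below replaces each value of one feature with (x - mu) / s, where s is either max minus min or the standard deviation. The function name `mean_normalize`, the `use_std` flag, and the choice of the population standard deviation over the sample one are all assumptions for illustration. The lecture only says that either the range or the standard deviation would be fine.

```python
import statistics


def mean_normalize(values, use_std=False):
    """Return a new list in which each value x is replaced by (x - mu) / s.

    mu is the mean of `values`. s is max - min by default, or the
    population standard deviation when use_std is True (an assumed choice).
    """
    mu = statistics.mean(values)
    s = statistics.pstdev(values) if use_std else max(values) - min(values)
    if s == 0:
        # A constant feature (such as x0 = 1) has nothing to scale.
        raise ValueError("feature is constant; leave it unscaled")
    return [(x - mu) / s for x in values]


# Hypothetical bedroom counts covering the 1-to-5 range from the example.
bedrooms = [1, 2, 3, 4, 5]
by_range = mean_normalize(bedrooms)               # divides by 5 - 1 = 4
by_std = mean_normalize(bedrooms, use_std=True)   # divides by the std dev
print(by_range)
print(by_std)
```

With these five values the mean is 3 and the range is 4, so the range-based version lands between -0.5 and +0.5. The version using the standard deviation gives a somewhat wider spread, which is still well within the rough -3 to +3 rule of thumb.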