0:00

There's an algorithm called momentum, or gradient descent with momentum, that almost always works faster than the standard gradient descent algorithm. In one sentence, the basic idea is to compute an exponentially weighted average of your gradients, and then use that gradient to update your weights instead. In this video, let's unpack that one-sentence description and see how you can actually implement this.

As an example, let's say that you're trying to optimize a cost function which has contours like this, where the red dot denotes the position of the minimum. Maybe you start gradient descent here, and if you take one iteration of gradient descent, you might end up heading there. But now you're on the other side of this ellipse, and if you take another step of gradient descent, maybe you end up doing that. And then another step, another step, and so on. And you see that gradient descent will take a lot of steps, slowly oscillating toward the minimum. This up-and-down oscillation slows down gradient descent and prevents you from using a much larger learning rate. In particular, if you were to use a much larger learning rate, you might end up overshooting and diverging like so. And so the need to prevent the oscillations from getting too big forces you to use a learning rate that's not itself too large.

Another way of viewing this problem is that on the vertical axis you want your learning to be a bit slower, because you don't want those oscillations. But on the horizontal axis, you want faster learning.

Â 1:58

On each iteration, or more specifically, during iteration t, you would compute the usual derivatives dW, db. I'll omit the superscript square bracket l's, but you compute dW, db on the current mini-batch. This works with batch gradient descent as well: if your current mini-batch is your entire training set, it works just fine. And then what you do is compute vdW = beta * vdW + (1 - beta) * dW. So this is similar to when we were previously computing v = beta * v + (1 - beta) * theta_t.

Â 2:57

Right, so it's computing a moving average of the derivatives for W that you're getting. And then you similarly compute vdb = beta * vdb + (1 - beta) * db. And then you would update your weights: W gets updated as W minus the learning rate times, instead of dW, the moving average vdW. And similarly, b gets updated as b minus alpha times vdb. So what this does is smooth out the steps of gradient descent.
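As a rough sketch of the update we just wrote down (the function name and default values here are illustrative, not from the course), one iteration could look like this in NumPy:

```python
import numpy as np

def momentum_step(W, b, dW, db, vdW, vdb, alpha=0.01, beta=0.9):
    """One step of gradient descent with momentum.

    vdW and vdb are the exponentially weighted averages of the
    gradients, initialized to zeros of the same shape as W and b.
    """
    vdW = beta * vdW + (1 - beta) * dW   # moving average of dW
    vdb = beta * vdb + (1 - beta) * db   # moving average of db
    W = W - alpha * vdW                  # update with vdW, not dW
    b = b - alpha * vdb                  # update with vdb, not db
    return W, b, vdW, vdb
```

You would carry vdW and vdb forward from one iteration to the next, starting them at zero.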

Â 3:48

If you average out these gradients, you find that the oscillations in the vertical direction will tend to average out to something closer to zero. So in the vertical direction, where you want to slow things down, this averages out positive and negative numbers, so the average will be close to zero. Whereas in the horizontal direction, all the derivatives point to the right, so the average in the horizontal direction will still be pretty big. So that's why with this algorithm, after a few iterations, you find that gradient descent with momentum ends up taking steps with much smaller oscillations in the vertical direction, but which move more quickly in the horizontal direction. And so this allows your algorithm to take a more straightforward path, or to damp out the oscillations in its path to the minimum.
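To make this averaging argument concrete, here is a small numerical sketch (the alternating gradient is a toy stand-in for the oscillations in the picture): the vertical component, whose sign flips each iteration, averages out to near zero, while the horizontal component stays large.

```python
import numpy as np

beta = 0.9
v = np.zeros(2)  # moving average: [vertical, horizontal] gradient components
for t in range(50):
    vertical = 1.0 if t % 2 == 0 else -1.0  # sign flips every iteration
    grad = np.array([vertical, 1.0])        # horizontal part always points right
    v = beta * v + (1 - beta) * grad

print(v)  # vertical component near zero, horizontal component near 1
```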

One intuition for this momentum, which works for some people but not everyone, is that you're trying to minimize a bowl-shaped function. This is really the contours of a bowl (I guess I'm not very good at drawing). As you minimize this type of bowl-shaped function, you can think of these derivative terms as providing acceleration to a ball that you're rolling downhill. And these momentum terms you can think of as representing the velocity.

Â 5:20

And so imagine that you have a bowl, and you take a ball, and the derivative imparts acceleration to this little ball as it rolls down the hill. So it rolls faster and faster because of the acceleration. And Beta, because it's a number a little bit less than one, plays the role of friction, and it prevents your ball from speeding up without limit. So rather than gradient descent taking every single step independently of all previous steps, now your little ball can roll downhill, accelerate down this bowl, and therefore gain momentum. I find that this ball-rolling-down-a-bowl analogy seems to work for some people who enjoy physics intuitions, but it doesn't work for everyone. So if this analogy of a ball rolling down a bowl doesn't work for you, don't worry about it.

Finally, let's look at some details on how you implement this.

6:22

Here's the algorithm, and so you now have two hyperparameters: the learning rate alpha, as well as this parameter Beta, which controls your exponentially weighted average.

The most common value for Beta is 0.9. Just as we were averaging over roughly the last ten days' temperature before, this is averaging over roughly the last ten iterations' gradients. And in practice, Beta = 0.9 works very well. Feel free to try different values and do some hyperparameter search, but 0.9 appears to be a pretty robust value. Well, and how about bias correction? Do you want to take vdW and vdb and divide them by 1 minus Beta to the t? In practice, people don't usually do this, because after just ten iterations your moving average will have warmed up and is no longer a biased estimate. So in practice, I don't really see people bothering with bias correction when implementing gradient descent with momentum.

And of course, this process initializes vdW = 0. Note that this is a matrix of zeros with the same dimension as dW, which has the same dimension as W. And vdb is also initialized to a vector of zeros, with the same dimension as db, which in turn has the same dimension as b.

Finally, I just want to mention that if you read the literature on gradient descent with momentum, you'll often see it with the 1 minus Beta term omitted, so you end up with vdW = beta * vdW + dW. The net effect of using this version in purple is that vdW ends up being scaled by a factor of 1 over 1 minus Beta. And so when you're performing these gradient descent updates, alpha just needs to change by a corresponding factor of 1 minus Beta.
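Here is a small numerical check of this point (with an illustrative constant gradient): the version without the 1 minus Beta term produces identical updates once alpha is rescaled by 1 minus Beta.

```python
import numpy as np

beta, alpha = 0.9, 0.01
dW = np.array([1.0, 2.0])  # pretend the gradient is constant across iterations

v1 = np.zeros(2)  # version with the (1 - beta) term
v2 = np.zeros(2)  # literature version with the (1 - beta) term omitted
for _ in range(20):
    v1 = beta * v1 + (1 - beta) * dW
    v2 = beta * v2 + dW
    step1 = alpha * v1                # update using the first version
    step2 = alpha * (1 - beta) * v2   # rescaled alpha for the second version

print(np.allclose(step1, step2))  # True: identical updates after rescaling alpha
```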

In practice, both of these will work just fine; it just affects what's the best value of the learning rate alpha. But I find that this particular formulation is a little less intuitive, because one impact is that if you end up tuning the hyperparameter Beta, then this affects the scaling of vdW and vdb as well, and so you may end up needing to retune the learning rate alpha too. So I personally prefer the formulation that I have written here on the left, rather than leaving out the 1 minus Beta term. I tend to use the formula on the left, the printed formula with the 1 minus Beta term. But for both versions, Beta = 0.9 is a common choice of hyperparameter. It's just that alpha, the learning rate, would need to be tuned differently for these two versions.

So that's it for gradient descent with momentum. This will almost always work better than the straightforward gradient descent algorithm without momentum. But there are still other things we could do to speed up your learning algorithm. Let's continue talking about these in the next couple of videos.
