0:00

In this video we're going to look at the momentum method for improving the learning speed when doing gradient descent in a neural network.

The momentum method can be applied to full batch learning, but it also works for mini

batch learning. It's very widely used.

And probably the commonest recipe for learning big neural nets is to use stochastic gradient descent with mini-batches combined with momentum.

I'm going to start with the intuition behind the momentum method.

0:42

The ball starts off stationary and so initially it will follow the direction of

steepest descent. It will follow the gradient.

But as soon as it's got some velocity it'll no longer go in the same direction

as the gradient. Its momentum will make it keep going in

the previous direction. Obviously, we want it eventually to get to a low point on the surface, so we want it to lose energy.

So we need to introduce a bit of viscosity.

That is, we make its velocity die off gently on each update.

1:15

What the momentum method does, is it damps oscillations in directions of high

curvature. So if you look at the red starting point,

and then look at the green point we get to after two steps, they have gradients that

are pretty much equal and opposite. As a result, the gradient across the

ravine has cancelled out. But the gradient along the ravine has not

cancelled out. Along the ravine, we're going to keep

building up speed, and so, after the momentum method has settled down, it'll

tend to go along the bottom of the ravine, accumulating velocity as it goes, and if

you're lucky, that'll make you go a whole lot faster than if you just used steepest descent.

The equations of the momentum method are

fairly simple. We say that the velocity vector at time t is just the velocity vector at time t minus one, attenuated a bit; time here indexes the weight updates. So it's the velocity vector that we got after mini-batch t minus one, multiplied by some number like 0.9, which is really a viscosity, or related to viscosity. But unfortunately I called it momentum, so we now call alpha the momentum.

And then we add in the effect of the current gradient, which is to make us go downhill by some learning rate times the gradient that we have at time t. That'll be our new velocity at time t. We then make our weight change at time t equal to that velocity. The velocity can actually be expressed in terms of previous weight changes, as shown on the slide, but I'll leave it to you to follow the math.
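The update just described can be sketched in a few lines of code. This is a minimal illustration; the toy quadratic loss and the hyperparameter values are assumptions for the example, not from the lecture:

```python
import numpy as np

# v(t) = alpha * v(t-1) - eps * grad(t); the weight change at time t is v(t).
# alpha is the "momentum" (really a viscosity), eps is the learning rate.
def momentum_step(w, v, grad, alpha=0.9, eps=0.01):
    v = alpha * v - eps * grad   # attenuate the old velocity, add the gradient effect
    w = w + v                    # the weight change equals the new velocity
    return w, v

# Toy usage: a quadratic bowl E(w) = 0.5 * w**2, whose gradient is just w.
w, v = np.array([5.0]), np.array([0.0])
for _ in range(200):
    w, v = momentum_step(w, v, grad=w)
print(float(w[0]))  # close to the minimum at 0
```

Note that the velocity, not the raw gradient, is what moves the weights; the gradient only nudges the velocity on each update.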

3:07

The behavior of the momentum method is very intuitive.

On an error surface that's just a plane, the ball will reach some terminal velocity, at which the gain in velocity that comes from the gradient is balanced by the multiplicative attenuation of velocity due to the momentum term, which is really a viscosity. If that momentum term is close to one,

then it'll be going down much faster than a simple gradient descent method would.

3:40

So the terminal velocity, the velocity you get at time infinity, is the gradient times the learning rate, multiplied by this factor of one over one minus alpha.
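That factor of one over one minus alpha is easy to verify numerically. A quick sketch, with illustrative values:

```python
# On a plane the gradient g is constant, so the velocity settles where
# alpha * v - eps * g = v, i.e. at v = -eps * g / (1 - alpha).
def terminal_velocity(alpha, eps=0.01, g=1.0, steps=5000):
    v = 0.0
    for _ in range(steps):
        v = alpha * v - eps * g  # the momentum update with a constant gradient
    return v

print(terminal_velocity(0.9))   # about -0.1, i.e. 10x the plain step eps*g
print(terminal_velocity(0.99))  # about -1.0, i.e. 100x the plain step
```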

So if alpha is 0.99, you'll go 100 times as fast as you would with the learning

rate alone. You have to be careful in setting

momentum. At the very beginning of learning, if you

make the initial random weights quite big, there may be very large gradients.

You have a bunch of weights that are completely no good for the task you're doing, and it may be very obvious how to change these weights to make things a lot better. You don't want a big momentum, because you're going to quickly change them to make things better.

And then you're going to start on the hard problem of finding out how to get just the

right relative values of different weights.

So you have sensible feature detectors. So it pays at the beginning of learning to

have a small momentum. It's probably better to have 0.5 than zero, because 0.5 will average out the sloshing in obvious ravines.

Once the large gradients have disappeared and you've reached the sort of normal phase of learning, where you're stuck in a ravine and need to go along the bottom of it without sloshing to and fro sideways, you can smoothly raise the momentum to its final value. Or you could raise it in one step, but that might start an oscillation. You might think: why didn't we just use a bigger learning rate? But what you'll discover is that using a

small learning rate and a big momentum allows you to get away with an overall

learning rate that's much bigger than you could have had if you used learning rate

alone with no momentum. If you use a big learning rate by itself, you'll get big divergent oscillations across the ravine.
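The schedule suggested above, a small momentum early that is smoothly raised to its final value, could be sketched like this. The warmup length and the endpoint values here are illustrative assumptions, not from the lecture:

```python
# Linear ramp from a small initial momentum to the final value over the first
# `warmup` weight updates, then hold it there.
def momentum_at(t, warmup=500, start=0.5, final=0.99):
    if t >= warmup:
        return final
    return start + (final - start) * t / warmup

print(momentum_at(0))     # 0.5
print(momentum_at(250))   # halfway up the ramp: 0.745
print(momentum_at(1000))  # 0.99
```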

5:34

Very recently Ilya Sutskever has discovered that there's a better type of

momentum. The standard momentum method works by

first computing the gradient at the current location.

It combines that with its stored memory of previous gradients, which is in the

velocity of the ball. And then it takes a big jump in the

direction of the current gradient combined with previous gradients.

So that's its accumulated gradient direction.

6:06

Ilya Sutskever has found that it works better in many cases to use a form of

momentum suggested by Nesterov who was trying to optimize convex functions, where

we first make a big jump in the direction of the previous accumulating gradient, and

then we measure the gradient where we end up and make a correction.

It's very, very similar, and you need a picture to really understand the

difference. One way of thinking about what's going on is this: in the standard momentum method, you add in the current gradient and then you gamble on this big jump. In the Nesterov method, you use your

previously accumulated gradient, you make the big jump and then you correct yourself

at the place you've got to.

So here's the picture, where we first make the jump and then make a correction. Here is a step in the direction of the accumulated gradient. This depends on the gradient that we accumulated in our previous iterations. We take that step. We then measure the gradient, and go downhill in the direction of the gradient, like that. We then combine that little correction step with the big jump we made to get our new accumulated gradient.

7:33

We then take that accumulated gradient, and we attenuate it by some number like 0.9 or 0.99. We multiply it by that number, and then we take our next big jump in the direction of that accumulated gradient, like that. Then again, at the place where we end up, we measure the gradient and go downhill. That corrects any errors we made, and gives us our new accumulated gradient.

Now if you compare that with the standard

momentum method, the standard momentum method starts with the accumulated gradient, like that initial brown vector. But then it measures the gradient where it is, so it measures the gradient at its current location, and it adds that to the brown vector, so that it makes a jump like this big blue vector.

That is just the brown vector plus the current gradient.

It turns out, if you're going to gamble, it's much better to gamble and then make a

correction, than to make a correction and then gamble.
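The two updates can be put side by side in code. This is a minimal sketch; the ravine-shaped toy gradient and all hyperparameter values are illustrative assumptions:

```python
import numpy as np

def standard_momentum_step(w, v, grad, alpha=0.9, eps=0.01):
    # Standard momentum: measure the gradient at the current location,
    # combine it with the stored velocity, then jump.
    v = alpha * v - eps * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, alpha=0.9, eps=0.01):
    # Nesterov momentum: first jump in the direction of the accumulated
    # velocity, measure the gradient where you land, then correct.
    v = alpha * v - eps * grad(w + alpha * v)
    return w + v, v

# Toy ravine: much higher curvature across (dim 0) than along (dim 1).
grad = lambda w: np.array([10.0, 0.1]) * w
w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(300):
    w, v = nesterov_step(w, v, grad)
print(np.abs(w).max())  # small: close to the minimum at (0, 0)
```

The only difference between the two functions is where the gradient is evaluated: at the current weights, or at the weights after the big jump.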