0:00

In this video, we're going to look at a method that was developed in the late

1980s by Robbie Jacobs and then improved by a number of other people.

The idea is that each connection in the neural net should have its own adaptive

learning rate, which we set empirically by observing what happens to the weight on

that connection when we update it. So if the gradient for the weight keeps reversing its

sign, we turn down the learning rate.

And if the gradient stays consistent, we turn up the learning rate.

So, let's start by thinking about why having separate adaptive learning rates on each

connection is a good idea. The problem is that, in a deep

multilayer net, the appropriate learning rates can vary widely

between different weights, especially between weights in different layers.

So if, for example, we start with small weights, the gradients are often much

smaller in the initial layers than in the later layers.

1:00

Another factor that causes you to want different learning rates for different weights is the

fan-in of the unit. The fan-in determines the size of the

overshoot effects that you get when you simultaneously change many of the

different incoming weights to fix up the same error.

It may be that the unit didn't get enough input, and when you change all these weights

at the same time to fix up the error, it now gets too much input.

Obviously, that effect is going to be bigger if there's a bigger fan-in.

So, the net in the diagram on the right has more or less the same fan-in for both

layers, but that's very different in some nets.

1:41

So, the idea is that we're going to use a global learning rate which we set by

hand, and then we're going to multiply it by a local gain that is determined

empirically for each weight. A simple way to determine what those local

gains should be is to start with a local gain of one for every weight.

So, initially, we're going to change the weight, wij, by the learning rate

times the gain gij, which starts at one, times the error derivative for that weight.
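Written out as a formula, the update described here might look like the following. The symbols ε for the global learning rate and E for the error are notational assumptions, and the minus sign assumes we are going downhill on the error:

```latex
\Delta w_{ij} = -\varepsilon \, g_{ij} \, \frac{\partial E}{\partial w_{ij}},
\qquad g_{ij} = 1 \ \text{initially}
```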

Then, what we're going to do is we're going to adapt gij.

2:15

We're going to increase gij if the gradient for the weight does not change

sign. And we're going to use small additive

increases and multiplicative decreases. So, if the gradient for the weight at time

t has the same sign as the gradient for the weight at time t minus one, where t

refers to weight updates, then when you take that product, it'll be positive,

because you'll either get two negative gradients or two positive gradients, and then what

we're going to do is increase gij by a small additive amount.

If the gradients have opposite signs, we're going to decrease gij. And because

we want to damp down gij quickly if it's already big, we're going to decrease it

multiplicatively. That ensures that big gains will decay

very rapidly if oscillations start. It's interesting to ask what would happen

if the gradient was totally random. So, on each update of the weights, pick a

random gradient. Then, you'll get an equal number of

increases and decreases, because the gradient will equally often have the same sign as the

previous gradient or the opposite sign. And so, you'll get a bunch of additive

0.05 increases, and multiplicative 0.95 decreases, and they have an equilibrium

point which is when the gain is one. If the gain's bigger than one, then

multiplying by 0.95 will reduce it by more than adding 0.05 increases it. If the gain's smaller

than one, adding 0.05 will increase it more than multiplying by 0.95 decreases

it. So, with random gradients, the gain will hover

around one. And if the gradient is consistently in the

same direction, the gain can get much bigger than one.

If the gradient is consistently in opposite directions, which means we're

oscillating across a ravine, the gain can get much smaller than one.
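To make the three cases concrete, here is a minimal sketch of the gain rule for a single weight. It uses the additive 0.05 increase and multiplicative 0.95 decrease mentioned above. The function names `update_gain` and `gain_trace`, the sequence lengths, and the random seed are assumptions made for illustration.

```python
import random

def update_gain(gain, grad, prev_grad, up=0.05, down=0.95):
    """Return the new local gain for one weight."""
    if grad * prev_grad > 0:
        return gain + up      # same sign: small additive increase
    return gain * down        # otherwise: multiplicative decrease

def gain_trace(gradients, gain=1.0):
    """Apply update_gain along a gradient sequence; return every gain visited."""
    trace = [gain]
    for prev_grad, grad in zip(gradients, gradients[1:]):
        gain = update_gain(gain, grad, prev_grad)
        trace.append(gain)
    return trace

# Gradient always has the same sign: 20 additive increases, about 2.0.
consistent = gain_trace([1.0] * 21)

# Gradient flips sign on every update: 20 decreases, 0.95 ** 20.
oscillating = gain_trace([1.0, -1.0] * 10 + [1.0])

# Random signs: the gain should wander around the equilibrium at one.
rng = random.Random(0)
random_signs = [rng.choice([-1.0, 1.0]) for _ in range(20001)]
random_trace = gain_trace(random_signs)
mean_random_gain = sum(random_trace) / len(random_trace)

print(consistent[-1], oscillating[-1], mean_random_gain)
```

The random case works because the expected change in the gain is 0.5 × 0.05 − 0.5 × 0.05 × g, which is zero exactly when g is one.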

4:11

There are a number of tricks for making the adaptive learning rates work better.

It's important to limit the size of the gains.

A reasonable range is 0.1 to 10, or 0.1 to 100.

You don't want the gains to get huge because then you can easily get into an

instability and they won't die down fast enough, and you'll destroy all the

weights. Adaptive learning rates were designed

for full batch learning. You can also apply them with mini-batches,

but they had better be pretty big mini-batches.

That'll ensure that changes in the sign of the gradient aren't due to the

sampling error of a mini-batch; they are really due to being on the other side of

the ravine. There's nothing to prevent you combining

adaptive learning rates with momentum. So, Jacobs suggests that, instead of using

the agreement in sign between the current gradient and the previous gradient, you

use the agreement in sign between the current gradient and the velocity for that

weight, that is, the accumulated gradient. And, if you do that, you get a nice

combination of the advantages of momentum, and the advantages of adaptive learning

rates. So, adaptive learning rates only deal with

axis-aligned effects, whereas momentum doesn't care about the

alignment of the axes. Momentum can deal with these diagonal

ellipses and go in that diagonal direction quickly, which adaptive learning

rates can't do.
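As a rough illustration of how these tricks might fit together, the sketch below keeps one gain per weight, limits the gains to the range 0.1 to 10, and adapts them by comparing the sign of the current gradient with the accumulated gradient. Several details are assumptions rather than anything stated in the lecture: the function name `adaptive_momentum_step`, the learning rate and momentum values, scaling the gradient by the gain before it is accumulated, and leaving a gain unchanged when the product is exactly zero (for example on the very first update).

```python
import numpy as np

def adaptive_momentum_step(w, grad, acc, gains, lr=0.01, momentum=0.9,
                           up=0.05, down=0.95, min_gain=0.1, max_gain=10.0):
    """One update for an array of weights; returns (w, acc, gains).

    acc is the accumulated (momentum-smoothed) gradient for each weight.
    """
    # Compare the current gradient with the accumulated gradient, per weight.
    agreement = grad * acc
    gains = np.where(agreement > 0, gains + up,               # additive increase
                     np.where(agreement < 0, gains * down,    # multiplicative decrease
                              gains))                         # zero product: unchanged (assumption)
    # Limit the size of the gains so they cannot blow up.
    gains = np.clip(gains, min_gain, max_gain)
    # Accumulate the gain-scaled gradient, then step downhill along it.
    acc = momentum * acc + gains * grad
    w = w - lr * acc
    return w, acc, gains

# Two weights: the first keeps its gradient sign, the second flips it.
w = np.array([1.0, -2.0])
acc = np.zeros(2)
gains = np.ones(2)
w, acc, gains = adaptive_momentum_step(w, np.array([0.5, -0.5]), acc, gains)
w, acc, gains = adaptive_momentum_step(w, np.array([0.5, 0.5]), acc, gains)
print(gains)  # first gain grew additively, second shrank multiplicatively
```

Because each gain only rescales its own coordinate, this still only helps with axis-aligned effects; any speed-up along a diagonal direction comes from the momentum term.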