0:00

In this video, I'm first going to introduce a method called rprop, that is

used for full batch learning. It's like Robbie Jacobs' method, but not

quite the same. I'm then going to show how to extend rprop

so that it works for mini-batches. This gives you the advantages of rprop and it

also gives you the advantage of mini-batch learning, which is essential for large,

redundant data sets. The method that we end up with, called

rmsprop, is currently my favorite method as a sort of basic method for learning the

weights in a large neural network with a large redundant data set.

I'm now going to describe rprop which is an interesting way of trying to deal with

the fact that gradients vary widely in their magnitudes.

1:13

For issues like escaping from plateaus with very small gradients this is a great

technique, because even with tiny gradients we'll take quite big steps.

We couldn't achieve that by just turning up the learning rate because then the

steps we took for weights that had big gradients would be much too big.

Rprop combines the idea of just using the sign of the gradient with the idea of

making the step size depend on which weight it is.

So to decide how much to change your weight, you don't look at the magnitude of

the gradient, you just look at the sign of the gradient.

But you do look at the step size you decided on for that weight.

And that step size adapts over time, again without looking at the magnitude of

the gradient. So we increase the step size for a weight

multiplicatively, for example by a factor of 1.2,

if the signs of the last two gradients agree.

This is like Robbie Jacobs' adaptive weights method, except that we're going to

do a multiplicative increase here. If the signs of the last two gradients

disagree, we decrease the step size multiplicatively, and in this case, we'll

make that more powerful than the increase, so that we can die down faster than we

grow. We need to limit the step sizes.

Mike Shuster's advice was to limit them between 50 and a millionth.

I think it depends a lot on what problem you're dealing with.

If for example you have a problem with some tiny inputs, you might need very big

weights on those inputs for them to have an effect.

I suspect that if you're not dealing with that kind of problem, having an upper

limit on the weight changes that's much less than 50 would be a good idea.
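
The rprop rule described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the function name `rprop_step` and its parameter names are hypothetical, the decrease factor of 0.5 is an assumption (the lecture only says the decrease should be stronger than the 1.2 increase), and refinements found in published rprop variants, such as undoing the last step after a sign change, are left out.

```python
def rprop_step(weights, grads, prev_grads, step_sizes,
               inc=1.2, dec=0.5, min_step=1e-6, max_step=50.0):
    """One full-batch rprop update. Returns (new_weights, new_step_sizes).

    inc=1.2 and the limits of 50 and a millionth come from the lecture;
    dec=0.5 is an assumed value for the "more powerful" decrease.
    """
    new_weights, new_steps = [], []
    for w, g, g_prev, step in zip(weights, grads, prev_grads, step_sizes):
        agreement = g * g_prev
        if agreement > 0:
            # Last two gradients have the same sign: grow the step size.
            step = min(step * inc, max_step)
        elif agreement < 0:
            # Signs disagree: shrink the step size.
            step = max(step * dec, min_step)
        # Only the sign of the gradient decides the direction of the move.
        sign = (g > 0) - (g < 0)
        new_weights.append(w - sign * step)
        new_steps.append(step)
    return new_weights, new_steps


# Two weights whose gradients differ by six orders of magnitude
# still move by the same distance.
w, s = rprop_step([0.0, 0.0], [1e-4, 1e2], [1e-4, 1e2], [0.1, 0.1])
print(w, s)
```

Because the gradient magnitude never enters the update, a weight sitting on a plateau moves as far as one with a huge gradient, which is the property the lecture is after.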

So one question is, why doesn't rprop work with mini-batches?

People have tried it, and found it hard to get it to work.

You can get it to work with very big mini-batches, where you use much more

conservative changes to the step sizes. But it's difficult.

So the reason it doesn't work is that it violates the central idea behind

stochastic gradient descent, which is that when we have a small

learning rate, the gradient gets effectively averaged over successive

mini-batches. So consider a weight that gets a gradient

of +0.01 on nine mini-batches, and then a gradient of -0.09 on the tenth mini-batch.

What we'd like is for those gradients to roughly average out so that the weight

stays where it is. Rprop won't give us that.

Rprop would increment the weight nine times by whatever its current step size

is, and decrement it only once. And that would make the weight get much

bigger. We're assuming here that the step sizes

adapt much slower than the time scale of these mini batches.
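
The ten-mini-batch example can be checked with a few lines of arithmetic. This is only a sketch: the learning rate `lr` and the fixed sign-based step `step` are made-up values, and the step size is held constant to match the assumption that it adapts slowly. Which direction counts as an increment depends on the sign convention; what matters is that nine moves go one way and only one goes the other.

```python
# Gradients for one weight on ten successive mini-batches (from the lecture).
grads = [0.01] * 9 + [-0.09]

lr = 0.1     # assumed small learning rate for plain gradient descent
step = 0.01  # assumed fixed per-weight step size for the sign-based update


def sign(x):
    return (x > 0) - (x < 0)


w_sgd = 0.0
w_sign = 0.0
for g in grads:
    w_sgd -= lr * g           # uses the magnitude, so the gradients average out
    w_sign -= step * sign(g)  # uses only the sign, as rprop does

print("gradient descent net change:", w_sgd)
print("sign-based net change:", w_sign)
```

Plain gradient descent ends up essentially where it started, while the sign-based update drifts by eight step sizes in one direction.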

So the question is, can we combine the robustness that you get from rprop by just

using the sign of the gradient, the efficiency that you get from

mini-batches, and the averaging of gradients over

mini-batches that allows mini-batches to combine gradients in the right way?

That leads to a method which I'm calling rmsprop,

and you can consider it to be a mini-batch version of rprop. rprop is equivalent to

using the gradient but also dividing by the magnitude of the

gradient. And the reason it has problems with

mini-batches is that we divide the gradient by a different magnitude for each

mini batch. So the idea is that we're going to force

the number we divide by to be pretty much the same for nearby mini-batches. We do

that by keeping a moving average of the squared gradient for each weight.

So MeanSquare(w, t) means this moving average for weight w at time t,

where time is an index of weight updates.

Time increments by one each time we update the weights. The numbers I put in of 0.9

and 0.1 for computing the moving average are just examples, but they're reasonably

sensible examples. So the mean square is the previous mean

square times 0.9, plus the value of the squared gradient for

that weight at time t, times 0.1.

We then take that mean square and take its square root,

which is why it has the name RMS. And then we divide the gradient by that

RMS, and make an update proportional to that.
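
The update just described might look like this in code. It is a minimal sketch under stated assumptions: the name `rmsprop_step`, the learning rate `lr`, and the small constant `eps` (added only to avoid dividing by zero) are not from the lecture; the 0.9 and 0.1 weights on the moving average are.

```python
import math


def rmsprop_step(weights, grads, mean_squares, lr=0.001, decay=0.9, eps=1e-8):
    """One mini-batch rmsprop update. Returns (new_weights, new_mean_squares).

    lr and eps are assumed values; decay=0.9 matches the lecture's example.
    """
    new_weights, new_mean_squares = [], []
    for w, g, ms in zip(weights, grads, mean_squares):
        # Moving average of the squared gradient for this weight:
        # 0.9 * previous mean square + 0.1 * current squared gradient.
        ms = decay * ms + (1.0 - decay) * g * g
        # Divide the gradient by the root of the mean square (the RMS)
        # and make an update proportional to that.
        new_weights.append(w - lr * g / (math.sqrt(ms) + eps))
        new_mean_squares.append(ms)
    return new_weights, new_mean_squares


# Starting from a zero mean square, gradients of very different magnitude
# produce nearly the same step for their weights.
w, ms = rmsprop_step([0.0, 0.0], [2.0, 2000.0], [0.0, 0.0], lr=0.01)
print(w, ms)
```

Because the mean square changes slowly, nearby mini-batches are divided by nearly the same number, which is what lets their gradients average out.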

5:57

That makes the learning work much better. Notice that we're not adapting the

learning rate separately for each connection here.

This is a simpler method where we simply, for each connection, keep a running

average of the root mean square gradient and divide by that.

There are many further developments one could make for rmsprop. You could combine

it with standard momentum. My experiments so far suggest that doesn't

help as much as momentum normally does, and that needs more investigation.

You could combine rmsprop with Nesterov momentum, where you first make the

jump and then make a correction. And Ilya Sutskever has tried that recently

and got good results. He's discovered that it works best if the

rms of the recent gradients is used to divide the correction term we make rather

than the large jump you make in the direction of the accumulated corrections.

Obviously you could combine rmsprop with adaptive learning rates on each connection

which would make it much more like rprop. That just needs a lot more investigation.

I just don't know at present how helpful that will be.

And then there is a bunch of other methods related to rmsprop that have a lot in

common with it. Yann LeCun's group has an interesting

paper called No More Pesky Learning Rates that came out this year.

And some of the terms in that look like rmsprop, but it has many other terms.

I suspect, at present, that most of the advantage that comes from this complicated

method recommended by Yann LeCun's group comes from the fact that it's similar to

rmsprop. But I don't really know that.

So a summary of the learning methods for neural networks goes like this.

If you've got a small data set, say 10,000 cases or fewer,

or a big data set without much redundancy, you should consider using a full batch

method. There are full batch methods adapted from the

optimization literature, like nonlinear conjugate gradient, L-BFGS, or

Levenberg-Marquardt. And one advantage of using those methods

is they typically come with a package. And when you report the results in your

paper you just have to say, I used this package and here's what it did.

You don't have to justify all sorts of little decisions.

Alternatively you could use the adaptive learning rates I described in another

video or rprop, which are both essentially full batch methods but they are methods

that were developed for neural networks. If you have a big redundant data set it's

essential to use mini batches. It's a huge waste not to do that.

The first thing to try is just standard gradient descent with momentum.

You're going to have to choose a global learning rate, and you might want to write

a little loop to adapt that global learning rate based on whether the

gradient has changed sign. But to begin with, don't go for anything

as fancy as adapting individual learning rates for individual weights.
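
One way to read the "little loop" for the global learning rate is sketched here. Everything specific in it is an assumption rather than something the lecture states: the function names, the factors 1.05 and 0.7, the momentum value, and the choice to treat a negative dot product between successive gradient vectors as "the gradient has changed sign".

```python
def adapt_global_lr(lr, grads, prev_grads, up=1.05, down=0.7):
    """Nudge one global learning rate (hypothetical rule, assumed factors)."""
    dot = sum(g * p for g, p in zip(grads, prev_grads))
    if dot > 0:
        return lr * up    # gradients still agree: speed up a little
    if dot < 0:
        return lr * down  # gradient has flipped: slow down
    return lr


def momentum_step(weights, grads, velocity, lr, momentum=0.9):
    """Standard gradient descent with momentum for one mini-batch."""
    velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    weights = [w + v for w, v in zip(weights, velocity)]
    return weights, velocity


# Toy usage on f(w) = w0^2 + 10 * w1^2, standing in for a mini-batch loss.
w, v, lr = [1.0, 1.0], [0.0, 0.0], 0.01
prev_g = [0.0, 0.0]
for _ in range(50):
    g = [2.0 * w[0], 20.0 * w[1]]
    lr = adapt_global_lr(lr, g, prev_g)
    w, v = momentum_step(w, g, v, lr)
    prev_g = g
print(w, lr)
```

The point of the sketch is only that a single number is being adapted; individual weights still share one learning rate.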

The next thing to try is rmsprop. That's very simple to implement if you do

it without momentum, and in my experiments so far, that seems to work as well as

gradient descent with momentum, maybe better.

9:11

You can also consider all sorts of ways of improving rmsprop by adding momentum or

adaptive step sizes for each weight, but that's still basically uncharted

territory. Finally, you could find out whatever Yann

LeCun's latest recipe is and try that. He's probably the person who's tried the

most different ways of getting stochastic gradient descent to work well, and so it's

worth keeping up with whatever he's doing. One question you might ask is why is there

no simple recipe? We have been messing around with neural

nets, including deep neural nets, for more than 25 years now, and you would think

that we would have come up with an agreed way of doing the learning.

There are really two reasons, I think, why there isn't a simple recipe.

9:58

First, neural nets differ a lot. Very deep networks, especially ones that

have narrow bottlenecks in them, which I'll come to in later lectures, are very

hard things to optimize and they need methods that can be very sensitive to very

small gradients. Recurrent nets are another special case,

they're typically very hard to optimize, if you want them to notice things that

happened a long time in the past and change the weights based on these things

that happened a long time ago. Then there are wide, shallow networks, which

are quite different in flavor and are used a lot in practice.

They often can be optimized with methods that are not very accurate,

because we stop the optimization early, before it starts overfitting.

So for these different kinds of networks, there are very different methods that are

probably appropriate. The other consideration is that tasks

differ a lot. Some tasks require very accurate weights.

Some tasks don't require weights to be very accurate at all.