0:00

In this video, we're going to look at stochastic gradient descent learning for a neural network, particularly the mini-batch version, which is probably the most widely used learning algorithm for large neural networks.

We've seen this before, but let's start with a reminder of what the error surface looks like for a linear neuron. The error surface is a surface that lies in a space where the horizontal axes correspond to the weights of the neural net and the vertical axis corresponds to the error it makes. For a linear neuron with a squared error, that surface always forms a quadratic bowl: the vertical cross-sections are parabolas, and the horizontal cross-sections are ellipses. For multi-layer, non-linear nets the error surface is much more complicated, but as long as the weights aren't too big, it's a smooth error surface, and locally it's well approximated by a fraction of a quadratic bowl. It might not be the bottom of the bowl, but there's a piece of quadratic bowl that will fit the local error surface very well.
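
To see why the surface is exactly quadratic in the linear case, it helps to write the error out. For a linear neuron with inputs x_n, targets t_n, and weights w, the squared error is the standard expression

    E(\mathbf{w}) = \tfrac{1}{2} \sum_n \left( t_n - \mathbf{w}^{\top} \mathbf{x}_n \right)^2

which is a quadratic function of the weights, so horizontal slices at constant error are ellipses and vertical slices through weight space are parabolas.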

If we look at the convergence speed when we do full-batch learning on an error surface that's a quadratic bowl, the obvious thing to do is to go downhill; that will reduce the error. But the problem is that the direction of steepest descent does not point at the place we want to get to. As you can see in the ellipse, the direction of steepest descent is almost at right angles to the direction we want to go in. The gradient is very big across the ellipse, which is the direction in which we only want to travel a small distance, and the gradient is very small along the ellipse, which is the direction in which we want to travel a large distance. It's precisely the wrong way around.
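
Here is a minimal sketch of that behaviour (illustrative code, not from the lecture; the curvatures and learning rate are made up): steepest descent on an elongated quadratic bowl oscillates across the steep direction while creeping along the shallow one.

    import numpy as np

    # Quadratic bowl E(w) = 1/2 * w^T H w with very unequal curvatures:
    # steep across the ravine (20), shallow along it (1).
    H = np.diag([20.0, 1.0])
    w = np.array([1.0, 10.0])   # start far out along the shallow direction
    eps = 0.09                  # learning rate (illustrative)

    for step in range(10):
        grad = H @ w            # gradient of the quadratic
        w = w - eps * grad      # steepest-descent step
        print(step, w)          # w[0] flips sign each step; w[1] shrinks slowly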

Now, you might think that studying linear systems like this is not a good idea if you want to optimize big non-linear nets, but even for these non-linear, multi-layer nets, this kind of problem arises. A very similar problem comes up even though the error surfaces aren't globally quadratic bowls: locally, they have the same kind of properties. They tend to be very curved in some directions and very uncurved in other directions.

2:19

So the way the learning goes wrong if you use a big learning rate is that you slosh to and fro in the directions in which the error surface is very curved. We could call that sloshing across a ravine. And with a learning rate that's too big, you'll actually diverge. What we want to achieve is to go quickly along the ravine, in directions that have small but very consistent gradients, and to move slowly in directions with big but very inconsistent gradients. That is, if you go in such a direction for a short distance, the gradient will reverse sign.
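
To spell out why a learning rate that's too big diverges (this is the standard analysis of gradient descent on a quadratic, filled in here rather than quoted from the lecture): with the update w ← w − ε ∂E/∂w, the weight component along an eigendirection of the bowl with curvature λ evolves as

    w_{t+1} = (1 - \varepsilon\lambda)\, w_t

so it shrinks only if |1 − ελ| < 1, that is, only if ε < 2/λ. In a very curved direction λ is large, so a learning rate that is fine along the ravine can make you diverge across it.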

3:00

Before we go into how we achieve that, I need to talk a little bit about stochastic gradient descent and the motivation for using it. If you have a data set that's highly redundant, then if you compute the gradient for a weight on the first half of the data set, you'll get almost exactly the same answer as if you compute the gradient on the second half. So it's a complete waste of time to compute the gradient on the whole data set. You'd be much better off computing the gradient on a subset of the data, updating the weights, and then computing the gradient for the updated weights on the remaining data. We can take that to the extreme and compute the gradient on a single training case, update the weights, and then compute the gradient on the next training case using those new weights. That's called online learning.
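
As a sketch, online learning for a linear neuron might look like this (illustrative Python; the data and learning rate are made up for the example):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))          # 1000 training cases, 3 inputs each
    t = X @ np.array([0.5, -0.2, 0.1])      # targets from a linear teacher
    w = np.zeros(3)
    eps = 0.01                              # learning rate

    # Online learning: update the weights after every single training case,
    # using the squared-error gradient for a linear neuron.
    for x_n, t_n in zip(X, t):
        grad = (w @ x_n - t_n) * x_n
        w = w - eps * grad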

In general, we don't want to go quite that far. It's usually better to use small mini-batches, typically 10, 100, or even 1,000 examples. One advantage of a small mini-batch is that less computation is used for actually updating the weights, because you do that less often than with online learning. Another advantage is that when you compute the gradient, you can compute it for a whole bunch of cases in parallel. Most computers are very good at matrix-matrix multiplies, and a mini-batch lets you apply the weights to a whole bunch of training cases at the same time to figure out the activities going into the next layer for all of those training cases. That gives you a matrix-matrix multiply, which is very efficient, especially on a graphics processing unit.
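
Concretely, the forward pass for one layer over a whole mini-batch is a single matrix-matrix multiply (a sketch; the shapes and the choice of logistic units are illustrative):

    import numpy as np

    batch = np.random.randn(100, 784)   # 100 training cases, 784 inputs each
    W = np.random.randn(784, 500)       # weights into a layer of 500 units
    b = np.zeros(500)

    # One matrix-matrix multiply gives the total inputs to the next layer
    # for all 100 cases at once; this is what GPUs are very good at.
    z = batch @ W + b
    activities = 1.0 / (1.0 + np.exp(-z))   # logistic units (illustrative)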

One point about using mini-batches: you wouldn't want a mini-batch in which the answer is always the same, followed by another mini-batch with a different answer that's always the same. That would cause the weights to slosh around unnecessarily. The ideal, if you have say ten classes, would be a mini-batch of say 10 or 100 examples with exactly the same number of examples from each class. One way to approximate that is simply to take all your data, put it in random order, and grab random mini-batches; see the sketch below. What you must avoid are mini-batches that are very uncharacteristic of the whole data set because they consist of a single class.
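
A minimal sketch of that shuffle-and-slice scheme (illustrative code, reusing the linear-neuron setup from the earlier snippet):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3))          # training inputs
    t = X @ np.array([0.5, -0.2, 0.1])      # targets
    w = np.zeros(3)
    eps, batch_size = 0.05, 100

    # Shuffle once so every mini-batch is a random sample of all classes,
    # then do one weight update per mini-batch.
    order = rng.permutation(len(X))
    for i in range(0, len(X), batch_size):
        idx = order[i:i + batch_size]
        Xb, tb = X[idx], t[idx]
        grad = Xb.T @ (Xb @ w - tb) / len(idx)   # average gradient over the batch
        w = w - eps * grad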

So basically, there are two types of learning algorithm for neural nets. There are full-gradient algorithms, where you compute the gradient from all of the training cases; once you've done that, there are a lot of clever ways to speed up learning, such as non-linear versions of a method called conjugate gradient. The optimization community has been studying the general problem of how to optimize smooth non-linear functions for many years. Multi-layer neural networks are pretty untypical of the kinds of problems they study, so the methods they developed may need a lot of modification to work for multi-layer neural networks. But when you have highly redundant, large training sets, it's nearly always better to use mini-batch learning. The mini-batches may need to be quite big, but that's not so bad, because big mini-batches are more computationally efficient.

6:30

I'm now going to describe a basic mini-batch gradient descent learning algorithm. This is what most people would use when they start training a big neural net on a big, redundant data set. You start by guessing an initial learning rate. You then look to see whether the network is learning satisfactorily, or whether the error keeps getting worse or oscillates wildly; if that happens, you reduce the learning rate. You also look to see whether the error is falling too slowly. You expect the error to fluctuate a bit if you measure it on a validation set, because the gradient computed on a mini-batch is just a rough estimate of the overall gradient, so you don't want to reduce the learning rate every time the error rises. What you're hoping is that the error will fall fairly consistently. And if it is falling fairly consistently but very slowly, you can probably increase the learning rate. Once you've got that working, you can write a simple program to automate that way of adjusting the learning rate.
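
A sketch of what such an automated adjustment might look like (a hypothetical heuristic in the spirit of the recipe above, not an algorithm from the lecture; the thresholds and factors are made up):

    # Adjust the learning rate from recent validation errors:
    # halve it if the error is clearly getting worse, nudge it up
    # if the error is falling consistently but very slowly.
    def adjust_learning_rate(eps, errors, patience=5, slow=1e-4):
        if len(errors) <= patience:
            return eps                       # not enough history yet
        recent, earlier = errors[-1], errors[-1 - patience]
        if recent > earlier:                 # rising or oscillating upward
            return eps * 0.5
        if earlier - recent < slow:          # falling, but very slowly
            return eps * 1.1
        return eps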

One thing that nearly always helps towards the end of learning with mini-batches is to turn down the learning rate. That's because you're going to get fluctuations in the weights caused by the fluctuations in the gradients that come from the mini-batches, and you'd like a final set of weights that's a good compromise. So when you turn down the learning rate, you smooth away those fluctuations and get a final set of weights that's good for many mini-batches.
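
One simple way to implement that turn-down (an illustrative choice; the lecture doesn't prescribe a particular schedule) is to start decaying the rate once you decide learning is nearly done:

    # Anneal the learning rate over the final phase of training to smooth
    # away the mini-batch fluctuations in the weights (schedule illustrative).
    def annealed_rate(eps0, step, anneal_start):
        if step < anneal_start:
            return eps0
        return eps0 / (1.0 + 0.1 * (step - anneal_start))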