0:00

In this video, we are going to look at a number of issues that arise when using

stochastic gradient descent with mini-batches.

There is a large number of tricks that make things work much better.

These are the kind of black art of neural networks.

And I'm going to go over some of the main tricks in this video.

The first issue I want to talk about is initializing the weights in a neural

network. If two hidden units have exactly the same

biases and the same incoming and outgoing weights, then they can never become

different from one another. Because they would always get exactly the

same gradient. So, to allow them to learn different

feature detectors, you need to start them off different from one another.

We do this by using small random weights to initialize the weights.

That breaks the symmetry. Those small random weights shouldn't

all necessarily be the same size as each other.

So if you've got a hidden unit that has a very big fan-in, if you use quite big

weights it'll tend to saturate it, so you can afford to use much smaller weights for

a hidden unit that has a big fan-in. If you have a hidden unit with a very

small fan-in, then you want to use bigger weights.
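
A minimal sketch of fan-in dependent initialization, assuming NumPy. The function name `init_weights`, the Gaussian distribution, and the exact scale of 1/sqrt(fan-in) are assumptions made for illustration; the lecture only says that units with a bigger fan-in should get smaller random weights.

```python
import numpy as np

def init_weights(fan_in, fan_out, rng):
    """Small random weights whose typical size shrinks as fan-in grows."""
    # Assumption: scale = 1 / sqrt(fan_in), so the total input to a unit
    # has roughly unit spread when the inputs have unit variance.
    scale = 1.0 / np.sqrt(fan_in)
    return rng.normal(0.0, scale, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W_small_fan = init_weights(10, 50, rng)     # units with a small fan-in
W_big_fan = init_weights(1000, 50, rng)     # units with a big fan-in

# Total input to each hidden unit for unit-variance random inputs
z_small = rng.normal(size=(2000, 10)) @ W_small_fan
z_big = rng.normal(size=(2000, 1000)) @ W_big_fan

print("weight std:", W_small_fan.std(), W_big_fan.std())
print("total-input std:", z_small.std(), z_big.std())
```

With this scaling the individual weights are much smaller for the big fan-in units, while the total input to a unit has a similar spread in both cases, so neither kind of unit starts out saturated.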

And since the weights are random, the total input scales with the square root of the number

of the weights. And so a good principle is to make the

size of the initial weights be inversely proportional to the square root of the

fan-in. We can also scale the learning rates for

the weights the same way. One thing that has a surprisingly big

effect on the speed with which a neural network will learn is shifting the

inputs. That is adding a constant to each of the

components of the inputs. It seems surprising that that could make

much difference. But when you're using steepest descent,

shifting an input value by adding a constant can make a very big difference.

It usually helps to shift each component of the input, so that averaged over all of

the training data, it has a value of zero. That is, make sure its mean is zero.
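
A minimal sketch of shifting the inputs, assuming NumPy. The data set and the variable names are hypothetical; the only idea taken from the lecture is that each input component should have zero mean over the training data.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training set: 500 cases, 3 input components with large offsets
X_train = rng.normal(loc=[100.0, -3.0, 0.5], scale=1.0, size=(500, 3))

# Mean of each component, averaged over all of the training data
train_mean = X_train.mean(axis=0)

# Shift every component so that its training-set mean is zero
X_centered = X_train - train_mean

# New cases are shifted by the same training-set mean, not by their own mean
x_new = np.array([101.0, -2.0, 0.0])
x_new_centered = x_new - train_mean
```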

2:05

So suppose we have a little neural net like this, just a linear neuron with two

weights. And suppose we have some training cases.

The first training case is where the inputs are 101 and 101, and you should give

an output of two. And the second one says when the inputs are

101 and 99, you should output a zero. And I'm using color here to indicate which

training case I'm talking about. If you look at the error surface you get for

those two training cases, it looks like this.

The green line is the line along which the weights will satisfy the first training

case, and the red line is the line along which the weights will satisfy the second

training case. And what we notice is that they're almost

parallel, and so when you combine them, you get a very elongated ellipse.

One way to think about what's going on here is that, because we're using a

squared error measure, we get a parabolic trough along the red line.

The red line is the bottom of this parabolic trough that tells us the squared

error we'll be getting on the red case. And there's another parabolic trough with

the green line along its bottom. And it turns out, although this may

surprise your spatial intuition, that if you add together two parabolic troughs,

you get a quadratic bowl. An elongated quadratic bowl, in this

case. So that's where that error surface came

from. Now, look what happens, if we subtract a

hundred from each of those two input components.

We get a completely different error surface.

In this case, it's a circle, which is ideal.

The green line is the line along which the weights add to two.

We're going to take the first weight, and multiply it by one.

We're going to take the second weight and multiply it by one.

And we need to get two. So the weights better add to two.

The red line is the line along which the two weights are equal.

Because we're going to take the first weight, and multiply it by one.

And we're going to take the second weight, and multiply it by minus one.

So if the weights are equal, we'll be able to get that zero that we need.
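
The two error surfaces can be compared numerically. For a linear neuron with squared error, the curvature of the error surface is given by the matrix X^T X, where the rows of X are the training inputs, and the ratio of its largest to smallest eigenvalue says how elongated the bowl is. The sketch below assumes NumPy; the function name `elongation` is made up for illustration.

```python
import numpy as np

def elongation(X):
    """Ratio of largest to smallest curvature of the squared-error bowl."""
    eigvals = np.linalg.eigvalsh(X.T @ X)   # eigenvalues in ascending order
    return eigvals[-1] / eigvals[0]

X_raw = np.array([[101.0, 101.0],
                  [101.0, 99.0]])
targets = np.array([2.0, 0.0])

# Subtract a hundred from each input component: rows become (1, 1) and (1, -1)
X_shifted = X_raw - 100.0

ratio_raw = elongation(X_raw)           # very large: a long, thin ellipse
ratio_shifted = elongation(X_shifted)   # 1: a circle

# Weights that satisfy both shifted training cases
w_shifted = np.linalg.solve(X_shifted, targets)

print(ratio_raw, ratio_shifted, w_shifted)
```

For the shifted inputs, the weights that satisfy both cases add to two and are equal, so both weights are one.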

4:25

If you're thinking about what happens not with the inputs but with the hidden units,

it makes sense to have hidden units that are hyperbolic tangents that go between

minus one and one. The hyperbolic tangent is simply twice the

logistic, minus one. And the reason that makes sense is because

then the activities of the hidden units are roughly mean zero and that should make

the learning faster in the next level. Of course, that's only true if the inputs

to the hyperbolic tangents are distributed sensibly around zero.
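
A quick numerical check of the relation between the two functions, assuming NumPy. Stated precisely, tanh(z) = 2 * logistic(2z) - 1, so the logistic is also evaluated at twice the input. The sketch also shows that, for inputs spread evenly around zero, tanh outputs average to zero while logistic outputs average to one half.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Inputs distributed symmetrically around zero
z = np.linspace(-5.0, 5.0, 101)

tanh_from_logistic = 2.0 * logistic(2.0 * z) - 1.0

mean_tanh = np.tanh(z).mean()       # roughly zero
mean_logistic = logistic(z).mean()  # roughly one half

print(np.allclose(tanh_from_logistic, np.tanh(z)), mean_tanh, mean_logistic)
```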

5:01

But in that respect, a hyperbolic tangent is better than a logistic.

However, there are other respects in which a logistic is better.

For example, a logistic gives you a rug to sweep things under.

It gives an output of zero, and if you make the input even smaller than it was,

the output is still zero. So fluctuations in big negative inputs are

ignored by the logistic. For the hyperbolic tangent you have to go

out to the end of its plateaus before it can ignore anything.

5:30

Another thing that makes a big difference is scaling the inputs.

When we use steepest descent, scaling the input values is a very simple thing to

do. We transform them so that each component

of the input has unit variance over the whole training set.

So it has a typical value of one or minus one. So, again, if we take this simple net with

two weights and we look at the error surface when the first component is very small and

the second component is much bigger, we get an error surface that is an

ellipse with very high curvature in the direction in which the input component is big,

because small changes in the weight make a big difference to the output,

and very low curvature in the direction in which the input component is small, because

small changes to the weight hardly make any difference to the error.

The color here is indicating which axis we're using, not which training example

we're using, as it did in the previous slide.

If we simply change the variance of the inputs, that is, just rescale them to

make the first component ten times as big and the second component ten times as

small, we now get a nice circular error surface.
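
A minimal sketch of scaling the inputs to unit variance, assuming NumPy. The data are hypothetical: one component with a spread of about 0.1 and one with a spread of about 10. As before, the elongation of the squared-error surface of a linear neuron is measured by the eigenvalue ratio of X^T X; the function name `elongation` is made up for illustration.

```python
import numpy as np

def elongation(X):
    """Ratio of largest to smallest curvature of the squared-error bowl."""
    eigvals = np.linalg.eigvalsh(X.T @ X)   # eigenvalues in ascending order
    return eigvals[-1] / eigvals[0]

rng = np.random.default_rng(0)
# Hypothetical inputs: first component very small, second much bigger
X = rng.normal(size=(1000, 2)) * np.array([0.1, 10.0])

# Divide each component by its standard deviation over the training set
train_std = X.std(axis=0)
X_scaled = X / train_std

ratio_before = elongation(X)         # around 10,000: a very elongated ellipse
ratio_after = elongation(X_scaled)   # close to 1: nearly circular

print(ratio_before, ratio_after)
```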

6:49

Shifting and scaling the inputs is a very simple thing to do, but there's something that's a

bit more complicated that actually works even better, because it's

guaranteed to give you a circular error surface,

at least for a linear neuron. What we do is we try and decorrelate the

components of the input vectors. In other words, you take two components

and look at how they're correlated with one another over the whole training set.

If you remember the earlier example, the number of portions of chips

and the number of portions of ketchup might be highly correlated.

We want to try and get rid of those correlations.

That will make learning much easier. There are actually many ways to decorrelate

things. For those of you who know about principal

components analysis, a very sensible thing to do is to apply

principal components analysis, remove the components that have the

smallest eigenvalues, which already achieves some dimensionality reduction,

and then scale the remaining components by dividing them by the square roots of their

eigenvalues. For a linear system, that will give you a

circular error surface. If you don't know about principal

components, we'll cover it later in the course.
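
The recipe just described can be sketched as follows, assuming NumPy. The function name `pca_whiten`, the parameter `n_keep`, and the chips, ketchup and fish data are all hypothetical; this is one straightforward way to do it, not necessarily how it would be done in practice on large data.

```python
import numpy as np

def pca_whiten(X, n_keep):
    """Decorrelate inputs with PCA, keep the top n_keep components,
    and divide each kept component by the square root of its eigenvalue."""
    X_centered = X - X.mean(axis=0)
    cov = X_centered.T @ X_centered / len(X_centered)
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_keep]      # largest eigenvalues first
    top_vals, top_vecs = eigvals[order], eigvecs[:, order]
    return (X_centered @ top_vecs) / np.sqrt(top_vals)

rng = np.random.default_rng(0)
# Hypothetical data in which chips and ketchup are highly correlated
chips = rng.normal(size=1000)
ketchup = 0.9 * chips + 0.3 * rng.normal(size=1000)
fish = rng.normal(size=1000)
X = np.column_stack([chips, ketchup, fish])

Z = pca_whiten(X, n_keep=2)

# Covariance of the transformed inputs: the identity matrix
cov_Z = Z.T @ Z / len(Z)
print(np.round(cov_Z, 6))
```

Because the transformed components are uncorrelated and have unit variance, a linear neuron trained on `Z` with squared error has a circular error surface.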

8:05

Once you've got a circular error surface, the gradient points straight towards the

minimum, so learning is really easy. Now, let's talk about a few of the common

problems that people encounter. One thing that can happen is if you start

with a learning rate that's much too big, you drive the hidden units either to be

firmly on, or firmly off. That is, the incoming weights are very big

and positive or very big and negative. Their state no longer depends on the

input, and of course that means that error derivatives coming from the output won't affect

them, because they are on the plateaus where the derivative is basically zero.

And so learning will stop. Because people are expecting to see local

minima, when learning stops they say, oh, I'm at a local minimum and the error's

terrible. So there are these really bad local

minima. Usually that's not true.

Usually it's because you got stuck out on the end of a plateau.
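
To see why a saturated unit stops learning, here is a small sketch assuming NumPy and assuming the hidden unit is a logistic unit. The slope of the logistic is y * (1 - y), which is what any backpropagated error derivative gets multiplied by; the total-input values of 10 and -10 are arbitrary examples of a unit that is firmly on or firmly off.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_slope(z):
    y = logistic(z)
    return y * (1.0 - y)

slope_mid = logistic_slope(0.0)     # 0.25: the unit responds to its input
slope_on = logistic_slope(10.0)     # tiny: firmly on, out on a plateau
slope_off = logistic_slope(-10.0)   # tiny: firmly off, out on a plateau

print(slope_mid, slope_on, slope_off)
```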

9:02

A second problem occurs if you are classifying things and you're using

either a squared error or a cross-entropy error.

The best guessing strategy is normally to make the output unit equal to the

proportion of the time that it should be one.
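
A small numerical check of this guessing strategy, assuming NumPy. The targets are hypothetical, with the output supposed to be one 20% of the time. The sketch tries a grid of constant outputs and picks the one with the lowest error under each measure.

```python
import numpy as np

# Hypothetical targets: the unit should be one 20% of the time
targets = np.array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0], dtype=float)
base_rate = targets.mean()

# Constant guesses that ignore the input entirely
candidates = np.linspace(0.01, 0.99, 99)

squared_error = [np.mean((targets - y) ** 2) for y in candidates]
cross_entropy = [
    -np.mean(targets * np.log(y) + (1 - targets) * np.log(1 - y))
    for y in candidates
]

best_for_squared = candidates[np.argmin(squared_error)]
best_for_entropy = candidates[np.argmin(cross_entropy)]

print(base_rate, best_for_squared, best_for_entropy)
```

Under both error measures the best constant output equals the proportion of ones in the targets.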

9:20

The network will fairly quickly find that strategy and so the error will fall

quickly, but particularly if the network has many layers it may take a long time

before it improves much on that. Because to improve over the guessing

strategy, it has to get sensible information from the input through all the

hidden layers to the output and that could take a long time to learn if you start

with small weights. So again, you learn quickly and then the

error stops decreasing, and it looks like a local minimum, but actually it's another

plateau. I mentioned earlier that towards the end

of learning, you should turn down the learning rate.

You should also be careful about turning down the learning rate too soon.

When you turn down the learning rate, you reduce the random fluctuations in the error

due to the different gradients on different mini-batches.

But of course you also reduce the rate of learning.

So if you look at the red curve you see that when we turn the learning rate down

we got a great win. The error fell but after that we get

slower learning. And if we do that too soon, we're gonna

lose relative to the green curve. So don't turn down the learning rate too

soon or too much. I'm now gonna talk about four main ways to

speed up mini-batch learning a lot. The previous things I talked about were

kind of a bag of tricks for making things work better.

And these are four methods all explicitly designed to make the learning go much

faster. I'm now gonna talk about a method called

momentum. In this method we don't use the gradient

to change the position of the weights. That is, if you think of the weights as a

ball on the error surface, standard gradient descent uses the gradient to

change the position of that ball. You simply multiply the gradient by a

learning rate and change the position of the ball by that vector.

In the momentum method, we use the gradient to accelerate this ball.

That is, the gradient changes its velocity.

And then the velocity is what changes the position of the ball.

The reason that's different is because the ball can have momentum.

That is, it remembers previous gradients in its velocity.
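
A minimal sketch of the difference between the two update rules, assuming NumPy. The error surface is a hypothetical elongated quadratic bowl, and the learning rate of 0.01, the momentum coefficient of 0.9, and the function names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical elongated bowl: E(w) = 0.5 * sum(CURVATURE * w**2)
CURVATURE = np.array([1.0, 100.0])

def error(w):
    return 0.5 * np.sum(CURVATURE * w ** 2)

def gradient(w):
    return CURVATURE * w

def run_plain(w, lr=0.01, steps=100):
    """Standard gradient descent: the gradient changes the position."""
    for _ in range(steps):
        w = w - lr * gradient(w)
    return w

def run_momentum(w, lr=0.01, momentum=0.9, steps=100):
    """Momentum: the gradient changes the velocity, which changes the position."""
    v = np.zeros_like(w)
    for _ in range(steps):
        v = momentum * v - lr * gradient(w)   # velocity remembers old gradients
        w = w + v
    return w

w0 = np.array([1.0, 1.0])
err_plain = error(run_plain(w0))
err_momentum = error(run_momentum(w0))

print(err_plain, err_momentum)
```

On this particular bowl, with the same learning rate and number of steps, the run with momentum ends at a lower error because the velocity builds up along the direction of low curvature.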

11:43

A second method for speeding up mini-batch learning is to use a separate

adaptive learning rate for each parameter. And then to slowly adjust that learning

rate based on empirical measurements. And the obvious empirical measurement is:

are we continuing to make progress by changing the weights in the same direction,

or does the gradient keep oscillating around so that the sign of the gradient

keeps changing? If the sign of the gradient keeps changing,

what we're going to do is reduce the learning rate and if it keeps staying the

same, we're going to increase the learning rate.
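
One way to sketch this, assuming NumPy, is to keep a multiplicative gain for each parameter on top of a shared learning rate. The function name `update_gains`, the additive increase of 0.05, and the multiplicative decrease of 0.95 are assumptions made for illustration; the lecture only specifies increasing the rate when the sign stays the same and reducing it when the sign changes.

```python
import numpy as np

def update_gains(gains, grad, prev_grad, up=0.05, down=0.95):
    """Raise a parameter's gain if its gradient kept its sign, else lower it."""
    same_sign = grad * prev_grad > 0
    return np.where(same_sign, gains + up, gains * down)

lr = 0.01                            # shared base learning rate (assumed value)
w = np.array([0.3, -0.7])
gains = np.ones_like(w)

prev_grad = np.array([0.5, -0.2])    # gradient on the previous mini-batch
grad = np.array([0.4, 0.3])          # first sign unchanged, second sign flipped

gains = update_gains(gains, grad, prev_grad)   # becomes [1.05, 0.95]
w = w - lr * gains * grad                      # per-parameter step size

print(gains, w)
```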

12:16

A third method is what I now call rmsprop, and what we do in this method is we divide

the learning rate for each weight by a running average of the magnitudes of the recent gradients for that weight,

so that if the gradients are big you divide by a large number, and if the

gradients are small you divide by a small number.

That will deal very nicely with a wide range of different gradients.

It's actually a mini-batch version of just using the sign of the gradient, which is a

method called rprop that was designed for full-batch learning.
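
A minimal sketch of an rmsprop-style update, assuming NumPy. The running average here is of the squared gradient, and the step is divided by its square root; the decay of 0.9, the learning rate, the small constant `eps`, and the function name `rmsprop_step` are assumptions made for illustration.

```python
import numpy as np

def rmsprop_step(w, mean_square, grad, lr=0.001, decay=0.9, eps=1e-8):
    """Divide each weight's step by a running RMS of its recent gradients."""
    mean_square = decay * mean_square + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(mean_square) + eps)
    return w, mean_square

w = np.zeros(2)
mean_square = np.zeros(2)

# Two weights whose gradients differ by five orders of magnitude
grad = np.array([100.0, 0.001])

w_new, mean_square = rmsprop_step(w, mean_square, grad)
step = w_new - w

print(step)
```

Both weights move by almost the same amount even though their gradients are very different in size, which is the sense in which this resembles using only the sign of the gradient.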

The final way of speeding up learning, which is what optimization people would

naturally recommend, is to use full-batch learning

and a fancy method that takes curvature information into account,

to adapt that method to work for neural nets,

and then maybe to try and adapt it some more so it works with mini-batches.

I am not going to talk about that in this lecture.