0:00

In the previous video, we've put together almost all the pieces you need in order to

implement and train in your network.

There's just one last idea I need to share with you,

which is the idea of random initialization.

0:13

When you're running an algorithm of gradient descent, or

also the advanced optimization algorithms, we need to pick some initial value for

the parameters theta.

So for the advanced optimization algorithm,

it assumes you will pass it some initial value for the parameters theta.

0:29

Now let's consider a gradient descent.

For that, we'll also need to initialize theta to something, and

then we can slowly take steps to go downhill using gradient descent.

To go downhill, to minimize the function j of theta.

So what can we set the initial value of theta to?

Is it possible to set the initial value of theta to the vector of all zeros?

Whereas this worked okay when we were using logistic regression,

initializing all of your parameters to zero actually does not work

when you are trading on your own network.

Consider trading the follow Neural network, and

let's say we initialize all the parameters of the network to 0.

And if you do that, then what you, what that means is that at the initialization,

this blue weight, colored in blue is gonna equal to that weight, so they're both 0.

And this weight that I'm coloring in in red,

is equal to that weight, colored in red, and also this weight,

which I'm coloring in green is going to equal to the value of that weight.

And what that means is that both of your hidden units, A1 and A2,

are going to be computing the same function of your inputs.

And thus you end up with for every one of your training examples,

you end up with A 2 1 equals A 2 2.

1:46

And moreover because I'm not going to show this in too much detail, but

because these outgoing weights are the same you can also show

that the delta values are also gonna be the same.

So concretely you end up with delta 1 1, delta 2 1 equals delta 2 2,

and if you work through the map further, what you can show is that the partial

derivatives with respect to your parameters will satisfy the following,

that the partial derivative of the cost function with respected to

breaking out the derivatives respect to these two blue waves in your network.

You find that these two partial derivatives are going to be equal to

each other.

2:31

And so what this means is that even after say one greater descent update,

you're going to update, say, this first blue rate was learning rate times this,

and you're gonna update the second blue rate with some learning rate times this.

And what this means is that even after one created the descent update, those two blue

rates, those two blue color parameters will end up the same as each other.

So there'll be some nonzero value, but this value would equal to that value.

And similarly,

even after one gradient descent update, this value would equal to that value.

There'll still be some non-zero values,

just that the two red values are equal to each other.

And similarly, the two green ways.

Well, they'll both change values, but

they'll both end up with the same value as each other.

So after each update, the parameters corresponding to the inputs going into

each of the two hidden units are identical.

That's just saying that the two green weights are still the same, the two red

weights are still the same, the two blue weights are still the same, and what that

means is that even after one iteration of say, gradient descent and descent.

You find that your two headed units are still computing exactly the same

functions of the inputs.

You still have the a1(2) = a2(2).

And so you're back to this case.

And as you keep running greater descent, the blue waves,, the two blue waves,

will stay the same as each other.

The two red waves will stay the same as each other and

the two green waves will stay the same as each other.

3:56

And what this means is that your neural network really can compute

very interesting functions, right?

Imagine that you had not only two hidden units, but

imagine that you had many, many hidden units.

Then what this is saying is that all of your headed units are computing

the exact same feature.

All of your hidden units are computing the exact same function of the input.

And this is a highly redundant representation

because you find the logistic progression unit.

It really has to see only one feature because all of these are the same.

And this prevents you and your network from doing something interesting.

4:41

Concretely, the problem was saw on the previous slide is something

called the problem of symmetric ways, that's the ways are being the same.

So this random initialization is how we perform symmetry breaking.

So what we do is we initialize each value of theta to a random number between

minus epsilon and epsilon.

So this is a notation to b numbers between minus epsilon and plus epsilon.

So my weight for my parameters are all going to be

randomly initialized between minus epsilon and plus epsilon.

The way I write code to do this in octave is I've said Theta1 should be equal

to this.

So this rand 10 by 11,

that's how you compute a random 10 by 11 dimensional matrix.

All the values are between 0 and 1, so these are going to be

raw numbers that take on any continuous values between 0 and 1.

And so if you take a number between zero and

one, multiply it by two times INIT_EPSILON then minus INIT_EPSILON,

then you end up with a number that's between minus epsilon and plus epsilon.

5:45

And the so that leads us, this epsilon here has nothing to

do with the epsilon that we were using when we were doing gradient checking.

So when numerical gradient checking,

there we were adding some values of epsilon and theta.

This is your unrelated value of epsilon.

We just wanted to notate init epsilon just to distinguish it

from the value of epsilon we were using in gradient checking.

And similarly if you want to initialize theta2 to a random 1 by 11

matrix you can do so using this piece of code here.

6:16

So to summarize, to create a neural network what you should

do is randomly initialize the waves to small values close to zero,

between -epsilon and +epsilon say.

And then implement back propagation, do great in checking,

and use either great in descent or 1b advanced optimization algorithms to try to

minimize j(theta) as a function of the parameters theta starting from just

randomly chosen initial value for the parameters.

And by doing symmetry breaking,

which is this process, hopefully great gradient descent or

the advanced optimization algorithms will be able to find a good value of theta.