0:00

In the previous video, we talked about a cost function for the neural network. In this video, let's start to talk about an algorithm for trying to minimize that cost function. In particular, we'll talk about the back propagation algorithm.

0:13

Here's the cost function that we wrote down in the previous video. What we'd like to do is try to find parameters theta that minimize J of theta. In order to use either gradient descent or one of the advanced optimization algorithms, what we need to do is write code that takes as input the parameters theta and computes J of theta and these partial derivative terms. Remember that the parameters in the neural network are these things, theta superscript (l) subscript ij; those are real numbers, and so these are the partial derivative terms we need to compute. In order to compute the cost function J of theta, we just use this formula up here, and so what I want to do for most of this video is focus on talking about how we can compute these partial derivative terms.

Let's start by talking about the case when we have only one training example. So imagine, if you will, that our entire training set comprises only one training example, which is a pair (x, y). I'm not going to write (x1, y1); I'll just write this one training example as (x, y). Let's step through the sequence of calculations we would do with this one training example.

1:25

The first thing we do is apply forward propagation in order to compute what the hypothesis actually outputs given the input. Concretely, we call a(1) the activation values of the first layer, which is the input layer, so I'm going to set that to x. Then we compute z(2) equals theta(1) times a(1), and a(2) equals g, the sigmoid activation function, applied to z(2). This gives us our activations for the first hidden layer, that is, for layer two of the network, and we also add the bias terms.
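As a sketch of this forward pass in Python with NumPy (not the lecture's own code; the layer sizes and the values of the theta matrices here are made-up illustrations):

```python
import numpy as np

def sigmoid(z):
    """The activation function g."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical network: 3 inputs, 5 hidden units, 4 outputs.
rng = np.random.default_rng(0)
Theta1 = rng.standard_normal((5, 4))  # maps layer 1 (+ bias) -> layer 2
Theta2 = rng.standard_normal((4, 6))  # maps layer 2 (+ bias) -> layer 3

x = rng.standard_normal(3)            # one training input

a1 = np.concatenate(([1.0], x))            # a(1) = x, with bias term added
z2 = Theta1 @ a1                           # z(2) = theta(1) a(1)
a2 = np.concatenate(([1.0], sigmoid(z2)))  # a(2) = g(z(2)), plus bias term
z3 = Theta2 @ a2
a3 = sigmoid(z3)                           # the hypothesis output h(x)
```

Here each `np.concatenate(([1.0], ...))` is the "add those bias terms" step from the transcript.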

2:34

The intuition of the back propagation algorithm is that for each node we're going to compute a term delta superscript (l) subscript j that's going to somehow represent the error of node j in layer l. So recall that a superscript (l) subscript j denotes the activation of the j-th unit in layer l, and so this delta term is, in some sense, going to capture the error in the activation of that node, that is, how much we might wish the activation of that node were slightly different. Concretely, taking an example:

5:01

This term g prime of z(3), formally, is the derivative of the activation function g evaluated at the input values given by z(3). If you know calculus, you can try to work it out yourself and see that you can simplify it to the same answer that I get.

5:16

But I'll just tell you pragmatically what that means. What you do to compute this g prime, these derivative terms, is just a(3) dot times (1 minus a(3)), where a(3) is the vector of activations.
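This element-wise identity is easy to check numerically. Below, `g_prime` is the pragmatic formula from the transcript, and `numeric` is a central-difference approximation of the derivative of g at z(3) (the values of z(3) are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z3 = np.array([-1.5, 0.0, 2.0])  # illustrative values
a3 = sigmoid(z3)

# The pragmatic formula: g'(z(3)) = a(3) .* (1 - a(3))
g_prime = a3 * (1.0 - a3)

# Check against a numerical derivative of g at z(3)
eps = 1e-6
numeric = (sigmoid(z3 + eps) - sigmoid(z3 - eps)) / (2.0 * eps)
```

The two agree to within the accuracy of the finite-difference approximation, which is the simplification the lecture refers to.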

5:58

...and this expression is mathematically equal to the derivative of the activation function g, which I'm denoting by g prime. And finally,

6:09

that's it, and there is no delta(1) term, because the first layer corresponds to the input layer, and those are just the features we observed in our training set, so there's no error associated with them. It's not like, you know, we really want to try to change those values.

6:30

The name back propagation comes from the fact that we start by computing the delta term for the output layer, then we go back a layer and compute the delta terms for the third hidden layer, and then we go back another step to compute delta(2), and so we're sort of propagating the errors backwards through the network.
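The backward sweep for a hypothetical four-layer network might be sketched as follows (this is an illustration, not the lecture's code; the layer sizes, weights, and label y are all made up, and note that the bias component is dropped when propagating the error back):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
# Hypothetical 4-layer network: 3 -> 5 -> 5 -> 4 units
Theta1 = rng.standard_normal((5, 4))
Theta2 = rng.standard_normal((5, 6))
Theta3 = rng.standard_normal((4, 6))

x = rng.standard_normal(3)
y = np.array([0.0, 1.0, 0.0, 0.0])  # illustrative target label

# Forward propagation (bias terms prepended at each layer)
a1 = np.concatenate(([1.0], x))
z2 = Theta1 @ a1
a2 = np.concatenate(([1.0], sigmoid(z2)))
z3 = Theta2 @ a2
a3 = np.concatenate(([1.0], sigmoid(z3)))
z4 = Theta3 @ a3
a4 = sigmoid(z4)

# Backward sweep: start at the output layer...
delta4 = a4 - y
# ...then go back one layer at a time; [1:] drops the bias component,
# and the last factor is g'(z) = g(z) .* (1 - g(z))
delta3 = (Theta3.T @ delta4)[1:] * sigmoid(z3) * (1.0 - sigmoid(z3))
delta2 = (Theta2.T @ delta3)[1:] * sigmoid(z2) * (1.0 - sigmoid(z2))
# No delta1: the input layer has no error term.
```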

6:51

Finally, the derivation is surprisingly complicated, surprisingly involved, but if you just do these few steps of computation it is possible

7:31

and, computing these delta terms, you can pretty quickly compute these partial derivative terms for all of your parameters.

8:39

So, as we'll see in a second, these deltas are going to be used as accumulators that slowly add things up in order to compute these partial derivatives.
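The accumulator update for one example can be written as an outer product of the next layer's error term with the current layer's activations. The shapes and values below are made up for illustration:

```python
import numpy as np

# One accumulator per Theta matrix, initialized to zero
# (hypothetical shape: 5 units in layer 2, 3 inputs + bias in layer 1)
Delta1 = np.zeros((5, 4))

# For one training example: delta2 is the layer-2 error (without bias),
# a1 the layer-1 activations (with bias). The accumulator update
# Delta(1)_ij += a(1)_j * delta(2)_i is exactly an outer product:
delta2 = np.array([0.1, -0.2, 0.05, 0.3, -0.1])  # illustrative values
a1 = np.array([1.0, 0.5, -1.0, 2.0])
Delta1 += np.outer(delta2, a1)
```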

8:49

Next, we're going to loop through our training set. So we'll say for i equals 1 through m, and in the i-th iteration we're going to be working with the training example (x(i), y(i)).

9:12

...the i-th training example, and then we're going to perform forward propagation to compute the activations for layer two, layer three, and so on, up to the final layer, layer capital L. Next, we're going to use the output label y(i) from the specific example we're looking at to compute the error term delta(L) for the output layer. So delta(L) is the hypothesis output minus the target label.

9:41

And then we're going to use the back propagation algorithm to compute delta(L-1), delta(L-2), and so on, down to delta(2). Once again, there is no delta(1), because we don't associate an error term with the input layer.
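Putting the loop together for a hypothetical three-layer network, a sketch (not the lecture's own implementation; the sizes and data are illustrative) of forward propagation, the backward sweep, and the accumulator updates, followed by dividing by m to get the unregularized partial derivatives:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
# Hypothetical network: 3 inputs, 5 hidden units, 4 outputs; m = 10 examples
Theta1 = rng.standard_normal((5, 4))
Theta2 = rng.standard_normal((4, 6))
X = rng.standard_normal((10, 3))
Y = (rng.random((10, 4)) > 0.5).astype(float)
m = X.shape[0]

Delta1 = np.zeros_like(Theta1)  # accumulators, initialized to zero
Delta2 = np.zeros_like(Theta2)

for i in range(m):
    # Forward propagation for example (x(i), y(i))
    a1 = np.concatenate(([1.0], X[i]))
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))
    a3 = sigmoid(Theta2 @ a2)
    # Output-layer error, then propagate back (no delta1)
    delta3 = a3 - Y[i]
    delta2 = (Theta2.T @ delta3)[1:] * sigmoid(z2) * (1.0 - sigmoid(z2))
    # Accumulate
    Delta2 += np.outer(delta3, a2)
    Delta1 += np.outer(delta2, a1)

# Unregularized partial derivatives of the cost with respect to the thetas
D1 = Delta1 / m
D2 = Delta2 / m
```

For the sigmoid output with a cross-entropy cost, these D terms match a numerical gradient check, which is a good way to verify an implementation.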

11:13

...that is exactly the partial derivative of the cost function with respect to each of your parameters, and so you can use those in either gradient descent or one of the advanced optimization algorithms.

11:39

But both in the programming assignment writeup and later in this video, we'll give you a summary of this, so that we have all the pieces of the algorithm together and you know exactly what you need to implement if you want to use back propagation to compute the derivatives of your neural network's cost function with respect to those parameters.