My cost function J(theta) is now going to be the following.

It is -1 over m times a sum of a term similar to what we have for

logistic regression, except that we also have the sum from k equals 1 through K.

This summation is basically a sum over my K output units.

So if I have four output units,

that is if the final layer of my neural network has four output units,

then this is a sum from k equals one through four of

basically the logistic regression algorithm's cost function but

summing that cost function over each of my four output units in turn.

And so you notice in particular that this applies to y_k and h(x)_k,

because we're basically taking the k-th output unit, and comparing that to the value

of y_k, which is one of those vectors saying which class it should be.
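The sum over the K output units can be sketched in code as follows (a minimal sketch; the names `unregularized_cost`, `H`, and `Y` are my own, and I assume the labels y are one-hot vectors):

```python
import numpy as np

def unregularized_cost(H, Y):
    """Cross-entropy cost, summed over all K output units.

    H: (m, K) array of network outputs, H[i, k] = h_Theta(x^(i))_k.
    Y: (m, K) array of one-hot label vectors, Y[i, k] = y_k^(i).
    """
    m = H.shape[0]
    # For every example and every output unit k, accumulate the usual
    # logistic-regression cost term, then average over the m examples
    # (with the leading minus sign).
    return -np.sum(Y * np.log(H) + (1 - Y) * np.log(1 - H)) / m
```

With a single example whose true class is the first of two outputs, each of the two output units contributes one logistic cost term to the total.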

And finally, the second term here is the regularization term,

similar to what we had for logistic regression.
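Written out in full (my reconstruction from the description above, using m training examples, K output units, L layers, and s_l units in layer l), the cost function is:

```latex
J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K}
  \left[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k
       + \big(1 - y_k^{(i)}\big) \log\Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big) \right]
  + \frac{\lambda}{2m} \sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}}
    \big(\Theta_{ji}^{(l)}\big)^2
```

Note that the inner regularization sum runs over i from 1 rather than 0, which is the convention of skipping the bias terms.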

This summation term looks really complicated, but all it's doing is it's

summing over these terms Theta_ji^(l) for all values of i, j, and l.

Except that we don't sum over the terms corresponding to these bias values,

like we had for logistic regression.

Concretely, we don't sum over the terms corresponding to where i is equal to 0.

So that is because when we're computing the activation of a neuron,

we have terms like these:

Theta_i0 x_0

plus Theta_i1 x_1, plus and so on,

where I guess I could put a superscript 2 there if this is the first hidden layer.

And so the values with a zero there

correspond to something that multiplies into an x_0 or an a_0.

And so this is kind of like a bias unit, and by analogy to what we were doing for

logistic regression, we won't sum over those terms in our regularization term,

because we don't want to regularize them and shrink their values toward zero.

But this is just one possible convention, and even if you were to sum over i

equals 0 up to s_l, it would work about the same and doesn't make a big difference.

But maybe this convention of not regularizing the bias term

is just slightly more common.
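The regularization convention just described can be sketched as follows (again a sketch under my own naming; `Thetas` is a list of the network's weight matrices, with column 0 holding the weights that multiply the bias units):

```python
import numpy as np

def regularization_term(Thetas, lam, m):
    """Sum of squared weights over all layers, excluding the bias weights.

    Thetas: list of weight matrices, one per layer; the matrix for layer l
            has shape (s_{l+1}, s_l + 1), with column 0 multiplying the
            bias unit x_0 or a_0.
    lam:    regularization strength lambda.
    m:      number of training examples.
    """
    # Skip column 0 of every matrix: those are the bias weights,
    # which by this convention we don't regularize.
    return (lam / (2.0 * m)) * sum(np.sum(T[:, 1:] ** 2) for T in Thetas)
```

Replacing `T[:, 1:]` with `T` would give the other convention, summing from i equals 0, which as noted above works about the same.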