where g could be a nonlinear function that may not be the sigmoid function.

So for example, the sigmoid function goes between zero and one. An activation function that almost always works better than the sigmoid function is the tanh function, or the hyperbolic tangent function.

So this is z, this is a, and this is a = tanh(z), which goes between plus 1 and minus 1.

The formula for the tanh function is tanh(z) = (e^z − e^(−z)) / (e^z + e^(−z)), that is, e to the z minus e to the negative z over their sum.
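As a quick sketch (assuming NumPy; the helper name is just for illustration), this formula can be checked against NumPy's built-in tanh:

```python
import numpy as np

def tanh_from_formula(z):
    """tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))."""
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

z = np.linspace(-3, 3, 7)
# Agrees with NumPy's built-in np.tanh to floating-point precision
assert np.allclose(tanh_from_formula(z), np.tanh(z))
```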

And it's actually, mathematically, a shifted version of the sigmoid function. So it's like the sigmoid function, but shifted so that it now crosses the (0, 0) point, and rescaled so that it goes between minus 1 and plus 1.
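That shift-and-rescale relationship is exactly tanh(z) = 2σ(2z) − 1, which can be verified numerically (a small sketch assuming NumPy):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# tanh is the sigmoid shifted to cross (0, 0) and rescaled to (-1, 1):
# tanh(z) = 2 * sigmoid(2z) - 1
z = np.linspace(-4, 4, 9)
assert np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)
```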

And it turns out that for hidden units, if you let the function g(z) be equal to tanh(z), this almost always works better than the sigmoid function, because with values between plus 1 and minus 1, the activations that come out of your hidden layer are closer to having a 0 mean.

And so, just as you might sometimes center the data to have 0 mean when you train a learning algorithm, using a tanh instead of a sigmoid function kind of has the effect of centering your data, so that the mean of your data is closer to 0 rather than maybe 0.5.

And this actually makes learning for the next layer a little bit easier.
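This centering effect is easy to see empirically. A minimal sketch (assuming NumPy, with synthetic zero-mean pre-activations) compares the mean activation under tanh versus sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
z = rng.standard_normal(100_000)  # zero-mean pre-activations

# tanh keeps the mean of the activations near 0;
# sigmoid pushes it near 0.5.
print(np.tanh(z).mean())
print(sigmoid(z).mean())
```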

We'll say more about this in the second course when we talk about optimization

algorithms as well.

But one takeaway is that I

pretty much never use the sigmoid activation function anymore.

The tanh function is almost always strictly superior.

The one exception is for the output layer, because if y is either 0 or 1, then it makes sense for y hat to be a number that's between 0 and 1, rather than between minus 1 and 1.

So the one exception where I would use the sigmoid

activation function is when you are using binary classification,

in which case you might use the sigmoid activation function for the output layer.

So g(z^[2]) here is equal to σ(z^[2]).

And so what you see in this example is where you might have a tanh

activation function for the hidden layer, and sigmoid for the output layer.
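A forward pass with this mix of activations can be sketched as follows (assuming NumPy; the layer sizes, batch size, and variable names are hypothetical, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_x, n_h, m = 3, 4, 5                    # input size, hidden size, batch size
X = rng.standard_normal((n_x, m))
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((1, n_h)) * 0.01
b2 = np.zeros((1, 1))

A1 = np.tanh(W1 @ X + b1)                # g^[1] = tanh for the hidden layer
A2 = sigmoid(W2 @ A1 + b2)               # g^[2] = sigmoid for the output layer
```

Hidden activations land in (−1, 1), while the output stays in (0, 1), suitable for a binary label.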

So the activation functions can be different for different layers.

And sometimes, to note that the activation functions are different for different layers, we might use these square bracket superscripts as well, to indicate that g^[1] may be different from g^[2]. And again, the superscript square bracket [1] refers to this hidden layer, and the superscript square bracket [2] refers to the output layer.