When you build your neural network, one of the choices you get to make is what activation function to use in the hidden layers, as well as at the output units of your neural network. So far, we've just been using the sigmoid activation function, but sometimes other choices can work much better. Let's take a look at some of the options.

In the forward propagation steps for the neural network, we had these two steps where we use the sigmoid function. That sigmoid is called an activation function, and here's the familiar sigmoid function, a = 1 / (1 + e^(-z)). In the more general case, we can have a different function g(z), which I'm going to write here, where g could be a nonlinear function that may not be the sigmoid function. For example, the sigmoid function goes between zero and one. An activation function that almost always works better than the sigmoid function is the tanh function, or the hyperbolic tangent function. So if this is z and this is a, then a = tanh(z), and this goes between +1 and -1. The formula for the tanh function is a = (e^z - e^(-z)) / (e^z + e^(-z)). It's actually, mathematically, a shifted version of the sigmoid function: it looks like the sigmoid function, just like that, but shifted so that it now crosses the (0, 0) point and goes between minus one and plus one.

It turns out that for hidden units, if you let the function g(z) be equal to tanh(z), this almost always works better than the sigmoid function, because with values between plus one and minus one, the mean of the activations that come out of your hidden layer is closer to zero. Just as you might sometimes center your data to have zero mean when you train a learning algorithm, using a tanh instead of a sigmoid function kind of has the effect of centering your data, so that the mean of your data is close to zero rather than, say, 0.5. This actually makes learning for the next layer a little bit easier. We'll say more about this in the second course when we talk about optimization algorithms as well. But one takeaway is that I pretty much never use the sigmoid activation function anymore; the tanh function is almost always strictly superior. The one exception is the output layer, because if y is either zero or one, then it makes sense for y hat, the number you want to output, to be between zero and one rather than between -1 and 1. So the one exception where I would use the sigmoid activation function is when you're doing binary classification, in which case you might use the sigmoid activation function for the output layer. So g(z^[2]) here is equal to sigmoid(z^[2]). What you see in this example is that you might have a tanh activation function for the hidden layer and a sigmoid for the output layer. So the activation functions can be different for different layers. And sometimes, to denote that the activation functions are different for different layers, we might use these square bracket superscripts as well, to indicate that g^[1] may be different from g^[2]. Again, the superscript square bracket 1 refers to this hidden layer and the superscript square bracket 2 refers to the output layer.

Now, one of the downsides of both the sigmoid function and the tanh function is that if z is either very large or very small, then the gradient, or the derivative, or the slope of the function becomes very small.
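To make that concrete, here is a minimal NumPy sketch of forward propagation for a two-layer network that uses tanh in the hidden layer and sigmoid at the output, as described above. The parameter names W1, b1, W2, b2 and the toy sizes are illustrative assumptions on my part, not code from the course itself.

```python
import numpy as np

def sigmoid(z):
    # Sigmoid: a = 1 / (1 + e^(-z)), outputs in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    # Forward propagation with g[1] = tanh (hidden layer) and
    # g[2] = sigmoid (output layer, suitable for binary classification).
    Z1 = W1 @ X + b1      # linear step for layer 1
    A1 = np.tanh(Z1)      # tanh activations, roughly centered around 0
    Z2 = W2 @ A1 + b2     # linear step for layer 2
    A2 = sigmoid(Z2)      # output between 0 and 1
    return A1, A2

# Tiny usage example: 3 input features, 4 hidden units, 1 output, 5 examples.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))
W1, b1 = rng.standard_normal((4, 3)) * 0.01, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.01, np.zeros((1, 1))
A1, A2 = forward(X, W1, b1, W2, b2)
print(A2.shape)  # (1, 5); every entry lies between 0 and 1
```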
So if z is very large or very small, the slope of the function ends up being close to zero, and this can slow down gradient descent. One other choice that is very popular in machine learning is what's called the rectified linear unit. The ReLU function looks like this, and the formula is a = max(0, z). So the derivative is one as long as z is positive, and the derivative, or the slope, is zero when z is negative. If you're implementing this, technically the derivative when z is exactly zero is not well defined. But when you implement this on a computer, the odds that you get z exactly equal to 0.000000000000 are very small, so you don't need to worry about it. In practice, you can pretend the derivative when z is equal to zero is either one or zero, and things work just fine, even though the function is technically not differentiable there.

So here are some rules of thumb for choosing activation functions. If your output is a zero-one value, if you're doing binary classification, then the sigmoid activation function is a very natural choice for the output layer. And then for all other units, the ReLU, or the rectified linear unit, is increasingly the default choice of activation function. So if you're not sure what to use for your hidden layer, I would just use the ReLU activation function; that's what you see most people using these days, although sometimes people also use the tanh activation function. One disadvantage of the ReLU is that the derivative is equal to zero when z is negative. In practice this works just fine, but there is another version of the ReLU called the Leaky ReLU. We'll give you the formula on the next slide, but instead of it being zero when z is negative, it just takes a slight slope, like so. This is called the Leaky ReLU. This usually works better than the ReLU activation function, although it's just not used as much in practice. Either one should be fine, although if you had to pick one, I usually just use the ReLU.

The advantage of both the ReLU and the Leaky ReLU is that for a lot of the space of z, the derivative of the activation function, the slope of the activation function, is very different from zero. And so in practice, using the ReLU activation function, your neural network will often learn much faster than when using the tanh or the sigmoid activation function. The main reason is that there's less of this effect of the slope of the function going to zero, which slows down learning. And I know that for half of the range of z, the slope of the ReLU is zero, but in practice, enough of your hidden units will have z greater than zero, so learning can still be quite fast for most training examples.

So let's just quickly recap the pros and cons of the different activation functions. Here's the sigmoid activation function: I would say never use this, except for the output layer if you're doing binary classification, or maybe almost never use this. The reason I almost never use it is that the tanh is pretty much strictly superior. So the tanh activation function is this. And then the default, the most commonly used activation function, is the ReLU, which is this. So if you're not sure what else to use, use this one. And maybe feel free also to try the Leaky ReLU, where a might be max(0.01z, z). So a is the max of 0.01 times z and z, and that gives you this bend in the function. You might ask, why is that constant 0.01? Well, you can also make that another parameter of the learning algorithm.
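As a small illustrative sketch (not the course's own code), here is how the ReLU and Leaky ReLU, and the derivatives described above, might be written in NumPy, with the z = 0 case handled by simply picking one of the two values:

```python
import numpy as np

def relu(z):
    # a = max(0, z)
    return np.maximum(0, z)

def relu_derivative(z):
    # Slope is 1 for z > 0 and 0 for z < 0; at exactly z == 0 we
    # just pick 0 (picking 1 instead would also work fine in practice).
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # a = max(alpha * z, z); alpha = 0.01 is the usual small slope,
    # but it can also be treated as another parameter to tune.
    return np.maximum(alpha * z, z)

def leaky_relu_derivative(z, alpha=0.01):
    # Slope is 1 for z > 0 and alpha for z <= 0.
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0.    0.    0.   0.5  2. ]
print(leaky_relu(z))  # [-0.02 -0.005 0.   0.5  2. ]
```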
Some people say that works even better, but I hardly see people do that. So if you feel like trying it in your application, please feel free to do so; you can see how well it works, and stick with it if it gives you a good result.

So I hope that gives you a sense of some of the choices of activation functions you can use in your neural network. One of the things we'll see in deep learning is that you often have a lot of different choices in how you build your neural network, ranging from the number of hidden units, to the choice of activation function, to how you initialize the weights, which we'll see later. A lot of choices like that. And it turns out that it is sometimes difficult to get good guidelines for exactly what will work best for your problem. So throughout these courses, I'll keep on giving you a sense of what I see in the industry in terms of what's more or less popular. But for your application, with your application's idiosyncrasies, it's actually very difficult to know in advance exactly what will work best. So a common piece of advice would be: if you're not sure which one of these activation functions works best, try them all, and evaluate on a holdout validation set, or a development set, which we'll talk about later, and see which one works better, and then go with that (a minimal sketch of this idea follows at the end of this section). I think that by testing these different choices for your application, you'll be better at future-proofing your neural network architecture against the idiosyncrasies of your problem, as well as against evolutions of the algorithms, rather than if I were to tell you to always use a ReLU activation and not use anything else; that just may or may not apply to whatever problem you end up working on, either in the near future or in the distant future.

All right, so that was the choice of activation functions, and you've seen the most popular activation functions. There's one other question you can sometimes ask, which is: why do you even need to use an activation function at all? Why not just do away with that? Let's talk about that in the next video, where you'll see why neural networks do need some sort of nonlinear activation function.
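As a footnote to the "try them all and compare on a development set" advice above, here is a minimal sketch of what that comparison might look like. The toy dataset, the tiny hand-rolled training loop, and all names here are my own assumptions for illustration, not the course's code or a recommendation of specific hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    # Toy binary classification task: label depends nonlinearly on 2 features.
    X = rng.standard_normal((2, n))
    y = (X[0] * X[1] > 0).astype(float).reshape(1, n)
    return X, y

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each entry: activation g(z) and its derivative g'(z), given z and a = g(z).
activations = {
    "tanh":       (np.tanh,                          lambda z, a: 1 - a**2),
    "relu":       (lambda z: np.maximum(0, z),       lambda z, a: (z > 0).astype(float)),
    "leaky_relu": (lambda z: np.maximum(0.01*z, z),  lambda z, a: np.where(z > 0, 1.0, 0.01)),
}

X_train, y_train = make_data(400)
X_dev, y_dev = make_data(100)

for name, (g, g_prime) in activations.items():
    # 2 -> 8 -> 1 network with a sigmoid output, trained by plain gradient descent.
    W1, b1 = rng.standard_normal((8, 2)) * 0.1, np.zeros((8, 1))
    W2, b2 = rng.standard_normal((1, 8)) * 0.1, np.zeros((1, 1))
    m = X_train.shape[1]
    for _ in range(3000):
        # Forward propagation with the hidden activation under test.
        Z1 = W1 @ X_train + b1
        A1 = g(Z1)
        Z2 = W2 @ A1 + b2
        A2 = sigmoid(Z2)
        # Backpropagation for sigmoid output with cross-entropy loss.
        dZ2 = A2 - y_train
        dW2 = dZ2 @ A1.T / m
        db2 = dZ2.sum(axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * g_prime(Z1, A1)
        dW1 = dZ1 @ X_train.T / m
        db1 = dZ1.sum(axis=1, keepdims=True) / m
        W1 -= 0.5 * dW1; b1 -= 0.5 * db1
        W2 -= 0.5 * dW2; b2 -= 0.5 * db2
    # Evaluate each activation choice on the held-out development set.
    A2_dev = sigmoid(W2 @ g(W1 @ X_dev + b1) + b2)
    acc = ((A2_dev > 0.5) == y_dev).mean()
    print(f"{name}: dev accuracy = {acc:.2f}")
```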