
When you build a neural network, one of the choices you get to make is what activation functions to use in the hidden layers, as well as at the output unit of your neural network. So far we've just been using the sigmoid activation function, but sometimes other choices can work much better. Let's take a look at some of the options.

In the forward propagation steps for the neural network, we have these two steps where we use the sigmoid function, so that sigmoid is called an activation function, and g is the familiar sigmoid function, a = 1 / (1 + e^(-z)). In the more general case, we can have a different function g(z), written right here, where g could be a nonlinear function that may not be the sigmoid function. For example, the sigmoid function goes between 0 and 1. An activation function that almost always works better than the sigmoid function is the tanh function, or hyperbolic tangent function. So this is z, this is a, this is a = tanh(z), and it goes between plus 1 and minus 1. The formula for the tanh function is a = (e^z - e^(-z)) / (e^z + e^(-z)), and it's actually, mathematically, a shifted version of the sigmoid function: take the sigmoid function, shifted so that it now crosses the (0, 0) point, and rescaled so that it goes between minus 1 and plus 1.
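As a quick illustration, here is a minimal NumPy sketch of both functions (the function names are mine, just for illustration); note how, over a symmetric range of z, the tanh outputs average to zero while the sigmoid outputs average to 0.5:

```python
import numpy as np

def sigmoid(z):
    # a = 1 / (1 + e^(-z)), output in (0, 1)
    return 1 / (1 + np.exp(-z))

def tanh(z):
    # a = (e^z - e^(-z)) / (e^z + e^(-z)), output in (-1, 1);
    # np.tanh(z) computes the same thing more stably
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

z = np.linspace(-4, 4, 9)
print(sigmoid(z).mean())  # activations cluster around 0.5
print(tanh(z).mean())     # activations cluster around 0
```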

It turns out that, for hidden units, if you let the function g(z) be equal to tanh(z), this almost always works better than the sigmoid function, because with values between plus 1 and minus 1, the mean of the activations that come out of your hidden layer is closer to having a zero mean. Just as you might center the data when you train a learning algorithm, so that your data has zero mean, using a tanh instead of a sigmoid function kind of has the effect of centering your data, so that the mean of the data is close to zero rather than, maybe, 0.5. This actually makes learning for the next layer a little bit easier. We'll say more about this in the second course, when we talk about optimization algorithms as well. But one takeaway is that I pretty much never use the sigmoid activation function anymore;

the tanh function is almost always strictly superior. The one exception is the output layer, because if y is either 0 or 1, then it makes sense for ŷ to be a number you output that is between 0 and 1, rather than between minus 1 and 1. So the one exception where I would use the sigmoid activation function is when you're doing binary classification, in which case you might use the sigmoid activation function for the output layer, so that g(z^[2]) here is equal to σ(z^[2]).

So what you see in this example is an architecture where you might have a tanh activation function for the hidden layer and a sigmoid for the output layer. The activation functions can be different for different layers, and sometimes, to denote that the activation functions are different for different layers, we might use these square-bracket superscripts as well, to indicate that g^[1] may be different from g^[2], where the superscript [1] refers to this layer and the superscript [2] refers to the output layer.
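As a minimal sketch of this setup, assuming the usual notation from these videos (W1 of shape (n_hidden, n_x), data X with one example per column; the function name is my own):

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    # Hidden layer uses g^[1] = tanh; output layer uses g^[2] = sigmoid.
    Z1 = W1 @ X + b1            # shape (n_hidden, m)
    A1 = np.tanh(Z1)            # tanh keeps hidden activations near zero mean
    Z2 = W2 @ A1 + b2           # shape (1, m)
    A2 = 1 / (1 + np.exp(-Z2))  # sigmoid output in (0, 1) for binary labels
    return A2
```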

Now, one downside of both the sigmoid function and the tanh function is that if z is either very large or very small, then the gradient, or derivative, or slope of the function becomes very small. So if z is very large or z is very small, the slope of the function ends up being close to zero, and this can slow down gradient descent.
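To see this concretely, here is a quick check of the slopes, using the standard derivative formulas σ'(z) = σ(z)(1 - σ(z)) and tanh'(z) = 1 - tanh(z)^2:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1 / (1 + np.exp(-z))
    return s * (1 - s)

for z in [0.0, 5.0, 10.0]:
    print(z, sigmoid_prime(z), 1 - np.tanh(z) ** 2)
# At z = 10, both slopes are below 1e-4, so gradient descent barely moves.
```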

So one other choice that is very popular in machine learning is what's called the rectified linear unit.

The ReLU function looks like this, and the formula is a = max(0, z). So the derivative is 1 as long as z is positive, and the derivative, or the slope, is 0 when z is negative. If you're implementing this, technically the derivative when z is exactly 0 is not well defined, but when you implement it on a computer, the odds that you get z exactly equal to 0.0000000000 are very small, so you don't need to worry about it in practice. You can pretend the derivative when z is equal to 0 is either 1 or 0, and it works just fine; so the fact that ReLU is not differentiable at zero isn't a problem in practice.
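A minimal sketch of ReLU and its derivative, using the convention of treating the derivative at z = 0 as 0 (the choice of 0 here, rather than 1, is my own; either works fine):

```python
import numpy as np

def relu(z):
    # a = max(0, z), applied elementwise
    return np.maximum(0, z)

def relu_derivative(z):
    # Slope is 1 for z > 0 and 0 for z < 0; at exactly z == 0
    # we arbitrarily return 0, which works fine in practice.
    return (z > 0).astype(float)
```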

So here are some rules of thumb for choosing activation functions. If your output is a 0/1 value, if you're doing binary classification, then the sigmoid activation function is a very natural choice for the output layer. And then for all other units, the ReLU, or rectified linear unit, is increasingly the default choice of activation function. So if you're not sure what to use for your hidden layers, I would just use the ReLU activation function; that's what you see most people using these days, although sometimes people also use the tanh activation function.

One disadvantage of the ReLU is that the derivative is equal to zero when z is negative. In practice this works just fine, but there is another version of the ReLU, called the leaky ReLU. We'll give you the formula on the next slide, but instead of being zero when z is negative, it just takes a slight slope, like so; this is called the leaky ReLU. It usually works better than the ReLU activation function, although it's just not used as much in practice. Either one should be fine, although if you had to pick one, I would usually just use the ReLU.

The advantage of both the ReLU and the leaky ReLU is that, for a lot of the space of z, the derivative of the activation function, the slope of the activation function, is very different from zero. So in practice, using the ReLU activation function, your neural network will often learn much faster than when using the tanh or the sigmoid activation function, and the main reason is that there is less of this effect of the slope of the function going to zero, which slows down learning. I know that for half of the range of z the slope of ReLU is zero, but in practice, enough of your hidden units will have z greater than zero, so learning can still be quite fast for most training examples.

So let's just quickly recap the pros and cons of the different activation functions. Here's the sigmoid activation function: I would say never use this, except for the output layer if you're doing binary classification, or maybe almost never use it. And the reason I almost never use it is that the tanh is pretty much strictly superior. So the tanh activation function is this one. And then the default, the most commonly used activation function, is the ReLU, which is this one. So if you're not sure what else to use, use this one, and maybe feel free also to try the leaky ReLU, where a = max(0.01z, z), the max of 0.01 times z and z, which gives you this bend in the function.

You might ask: why is that constant 0.01? Well, you can also make that another parameter of the learning algorithm, and some people say that works even better, but I rarely see people do that. So if you feel like trying it in your application, please feel free to do so; you can see how well it works, and stick with it if it gives you a good result.
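A minimal sketch of the leaky ReLU, with the slope for negative z exposed as a parameter (defaulting to the 0.01 mentioned above) so you can experiment with it:

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    # a = max(alpha * z, z): the identity for z > 0,
    # and a small slope alpha for z < 0 instead of flat zero
    return np.maximum(alpha * z, z)
```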

I hope that gives you a sense of some of the choices of activation functions you can use in your neural network. One of the themes we'll see in deep learning is that you often have a lot of different choices in how you build your neural network, ranging from the number of hidden units, to the choice of activation function, to how you initialize the weights, which we'll see later. There are a lot of choices like that, and it turns out that it is sometimes difficult to get good guidelines for exactly what will work best for your problem. Throughout these three courses I'll keep giving you a sense of what I see in the industry, in terms of what's more or less popular, but for your application, with your application's idiosyncrasies, it's actually very difficult to know in advance exactly what will work best.

So a concrete piece of advice would be: if you're not sure which one of these activation functions works best, try them all, and then evaluate them on a holdout validation set, or a development set, which we'll talk about later, and see which one works better; then go with that.
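As a self-contained illustration of this advice (the synthetic data, network size, and hyperparameters here are all my own assumptions, not from the video), here is a NumPy sketch that trains the same one-hidden-layer network with each hidden activation and compares accuracy on a held-out development set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny synthetic binary classification problem: 2 features, XOR-like labels.
X = rng.normal(size=(2, 300))
Y = (X[0] * X[1] > 0).astype(float).reshape(1, -1)
X_tr, Y_tr = X[:, :200], Y[:, :200]      # training set
X_dev, Y_dev = X[:, 200:], Y[:, 200:]    # holdout / development set

# Each candidate activation g comes with its derivative g'.
ACTS = {
    "tanh": (np.tanh, lambda z: 1 - np.tanh(z) ** 2),
    "relu": (lambda z: np.maximum(0, z), lambda z: (z > 0).astype(float)),
    "leaky_relu": (lambda z: np.maximum(0.01 * z, z),
                   lambda z: np.where(z > 0, 1.0, 0.01)),
}

def train_and_eval(g, g_prime, iters=3000, lr=0.5, n_h=8):
    # One hidden layer with activation g, sigmoid output, batch gradient descent.
    W1 = rng.normal(size=(n_h, 2)) * 0.1
    b1 = np.zeros((n_h, 1))
    W2 = rng.normal(size=(1, n_h)) * 0.1
    b2 = np.zeros((1, 1))
    m = X_tr.shape[1]
    for _ in range(iters):
        Z1 = W1 @ X_tr + b1
        A1 = g(Z1)
        A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
        dZ2 = A2 - Y_tr                     # from cross-entropy loss + sigmoid
        dZ1 = (W2.T @ dZ2) * g_prime(Z1)
        W2 -= lr * (dZ2 @ A1.T) / m
        b2 -= lr * dZ2.mean(axis=1, keepdims=True)
        W1 -= lr * (dZ1 @ X_tr.T) / m
        b1 -= lr * dZ1.mean(axis=1, keepdims=True)
    # Accuracy on the dev set decides which activation we keep.
    A1 = g(W1 @ X_dev + b1)
    A2 = 1 / (1 + np.exp(-(W2 @ A1 + b2)))
    return ((A2 > 0.5) == Y_dev).mean()

for name, (g, gp) in ACTS.items():
    print(f"{name}: dev accuracy = {train_and_eval(g, gp):.2f}")
```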

And I think that by testing these different choices for your application, you'll be better at future-proofing your neural network architecture against the idiosyncrasies of your problem, as well as against evolutions of the algorithms, rather than if I were to tell you to always use a ReLU activation and never use anything else, which may or may not apply to whatever problem you end up working on, either in the near future or the distant future.

All right, so that was the choice of activation functions; you've seen the most popular ones. There's one other question that sometimes gets asked: why do you even need to use an activation function at all? Why not just do away with it? Let's talk about that in the next video, where you'll see why neural networks do need some sort of nonlinear activation function.