0:00

In the rise of deep learning,

one of the most important ideas has been an algorithm called batch normalization,

created by two researchers, Sergey Ioffe and Christian Szegedy.

Batch normalization makes your hyperparameter search problem much easier and

makes your neural network much more robust: a much bigger range of

hyperparameters will work well, and it will also enable you to much more easily

train even very deep networks.

Let's see how batch normalization works.

When training a model such as logistic regression, you might remember that

normalizing the input features can speed up learning: you compute the means,

subtract off the means from your training set, and

compute the variances.

Â 0:49

And then normalize your data set according to the variances.

And we saw in an earlier video how this can turn the contours of your learning

problem from something that might be very elongated to something that is more round,

and easier for an algorithm like gradient descent to optimize.
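The input normalization described above can be sketched in a few lines of NumPy (the data here is made up purely for illustration, with two features on very different scales to produce the elongated contours mentioned):

```python
import numpy as np

# Illustrative data: 1000 examples with 2 features on very different
# scales, which is what gives gradient descent elongated cost contours.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[1.0, 20.0], size=(1000, 2))

mu = X.mean(axis=0)                   # per-feature means
sigma2 = X.var(axis=0)                # per-feature variances
X_norm = (X - mu) / np.sqrt(sigma2)   # each feature now has mean 0, variance 1
```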

So this works, in terms of normalizing the input feature values

to a neural network, or to logistic regression.

Now, how about a deeper model?

You have not just input features x, but in this layer you have activations a1,

in this layer you have activations a2, and so on.

So if you want to train the parameters, say w3, b3, then wouldn't it be nice if

you could normalize the mean and variance of a2, to make the training of

w3, b3 more efficient?

Â 2:20

So this is what batch normalization, or batch norm for short, does.

Although technically, we'll actually normalize the values not of a2 but of z2.

There is some debate in the deep learning literature about whether you

should normalize the value before the activation function, z2, or whether

you should normalize the value after applying the activation function, a2.

In practice, normalizing z2 is done much more often.

So that's the version I'll present and

what I would recommend you use as a default choice.

So here is how you would implement batch norm.

Â Given some intermediate values, In your neural net,

Â 3:09

Let's say that you have some hidden unit values z1 up to zm,

and these are really from some hidden layer,

so it would be more accurate to write this as z[l](i),

for some hidden layer l, for i equals 1 through m.

But to reduce writing, I'm going to omit this [l],

just to simplify the notation on this line.

So given these values, what you do is compute the mean as follows.

Okay, and all of this is specific to some layer l, but I'm omitting the [l].

And then you compute the variance using pretty much the formula you

would expect, and then you take each of the zi's and normalize it.

So you get zi normalized by subtracting off the mean and

dividing by the standard deviation.

For numerical stability, we usually add epsilon to the denominator,

just in case sigma squared turns out to be zero in some estimate.

And so now we've taken these values z and

normalized them to have mean 0 and unit variance.

So every component of z has mean 0 and variance 1.

But we don't want the hidden units to always have mean 0 and variance 1.

Maybe it makes sense for hidden units to have a different distribution,

so what we'll do instead is compute,

I'm going to call this, z tilde = gamma * zi norm + beta.

And here, gamma and beta are learnable parameters of your model.
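The four equations just described can be sketched as follows in NumPy. The function name and the (units, batch size) layout are my own choices for illustration, not something fixed by the lecture:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Batch-normalize pre-activations Z of shape (units, batch size).

    Statistics are computed over the mini-batch (axis=1), per hidden unit."""
    mu = Z.mean(axis=1, keepdims=True)          # mean of each unit over the batch
    sigma2 = Z.var(axis=1, keepdims=True)       # variance of each unit
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)   # mean 0, variance ~1 per unit
    Z_tilde = gamma * Z_norm + beta             # learnable scale and shift
    return Z_tilde, Z_norm

# gamma and beta have one entry per hidden unit and are learned
# by gradient descent, just like the weights w and biases b.
Z = np.random.default_rng(0).normal(size=(3, 64))
gamma = np.ones((3, 1))
beta = np.zeros((3, 1))
Z_tilde, Z_norm = batch_norm_forward(Z, gamma, beta)
```

With gamma set to ones and beta to zeros, Z_tilde is just the normalized Z_norm; training then moves gamma and beta to whatever mean and variance work best.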

Â 5:28

So if gamma were equal to the square root of sigma squared plus epsilon,

that is, this denominator term,

and if beta were equal to mu, so this value up here,

then the effect of gamma * z norm + beta is

that it would exactly invert this equation.

So if this is true,

then actually z tilde i is equal to zi.

And so by an appropriate setting of the parameters gamma and beta,

this normalization step, that is,

these four equations, is just computing essentially the identity function.

But by choosing other values of gamma and beta, this allows you to make the hidden

unit values have other means and variances as well.
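This identity property is easy to check numerically. A small sketch, with made-up values for z:

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(loc=2.0, scale=3.0, size=(4, 32))  # arbitrary hidden unit values
eps = 1e-8

mu = Z.mean(axis=1, keepdims=True)
sigma2 = Z.var(axis=1, keepdims=True)
Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)

# Setting gamma = sqrt(sigma^2 + eps) and beta = mu exactly undoes the
# normalization, so batch norm can represent the identity function.
gamma = np.sqrt(sigma2 + eps)
beta = mu
Z_tilde = gamma * Z_norm + beta   # recovers Z up to floating-point error
```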

And so the way you fit this into your neural network is,

whereas previously you were using these values z1, z2, and so

on, you would now use z tilde i instead of zi for

the later computations in your neural network.

And if you want to put back in this [l] to explicitly denote which layer it is in,

you can put it back there.

So the intuition I hope you'll take away from this is that we saw how

normalizing the input features x can help learning in a neural network.

And what batch norm does is apply that normalization process not just

to the input layer, but

to the values even deep in some hidden layer of the neural network.

So it will apply this type of normalization to normalize the mean and

variance of some of your hidden units' values, z.

But one difference between the training input and these hidden unit values is you

might not want your hidden unit values to be forced to have mean 0 and variance 1.

For example, if you have a sigmoid activation function,

you don't want your values to always be clustered around zero.

You might want them to have a larger variance, or a mean that's different

from 0, in order to better take advantage of the nonlinearity of

the sigmoid function, rather than have all your values in just this linear regime.

So that's why, with the parameters gamma and beta,

you can now make sure that your zi values have the range of values that you want.

But what batch norm really does is ensure that your hidden units have

a standardized mean and variance, where the mean and

variance are controlled by two explicit parameters, gamma and

beta, which the learning algorithm can set to whatever it wants.

So what it really does is normalize the mean and variance of these hidden

unit values, really the zi's, to have some fixed mean and variance.

And that mean and variance could be 0 and 1, or it could be some other value,

and it's controlled by these parameters gamma and beta.

So I hope that gives you a sense of the mechanics of how to implement batch norm,

at least for a single layer in the neural network.

In the next video, I'm going to show you how to fit batch norm into a neural

network, even a deep neural network, and how to make it work for

the many different layers of a neural network.

And after that, we'll get some more intuition about why batch norm could

help you train your neural network.

So in case why it works still seems a little bit mysterious, stay with me, and

I think in two videos from now we'll really make that clearer.
