0:00

So, why does batch norm work?

Here's one reason. You've seen how normalizing the input features, the X's, to mean zero and variance one can speed up learning.

So rather than having some features that range from zero to one and some that range from one to a thousand, normalizing all the input features X to take on a similar range of values can speed up learning.
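The normalization step described here can be sketched in a few lines of NumPy; the feature values below are made up for illustration, with one feature on a zero-to-one scale and one ranging up to about a thousand:

```python
import numpy as np

X = np.array([[0.2, 800.0],
              [0.9, 150.0],
              [0.5, 990.0]])       # feature 1 in [0, 1], feature 2 up to ~1000

mu = X.mean(axis=0)                # per-feature mean over the training set
sigma = X.std(axis=0)              # per-feature standard deviation
X_norm = (X - mu) / sigma          # each feature now has mean 0, variance 1

print(X_norm.mean(axis=0))         # ~ [0, 0]
print(X_norm.std(axis=0))          # ~ [1, 1]
```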

So one intuition behind why batch norm works is that it's doing a similar thing, but for the values in your hidden units, not just for your input features.

Now, this is just a partial picture of what batch norm is doing.

There are a couple of further intuitions that will help you gain a deeper understanding of what batch norm is doing.

Let's take a look at those in this video.

A second reason why batch norm works is that it makes the weights in later or deeper layers of your network, say the weight on layer 10, more robust to changes to weights in earlier layers of the neural network, say, in layer one.

To explain what I mean, let's look at a vivid example.

Let's say you're training a network, maybe a shallow network like logistic regression, or maybe a deep network, on our famous cat detection task.

But let's say that you've trained your network on a data set of all black cat images.

If you now try to apply this network to data with colored cats, where the positive examples are not just black cats like on the left but colored cats like on the right, then your classifier might not do very well.

So in pictures, if your training set looks like this, where you have positive examples here and negative examples here, but you were to try to generalize it to a data set where maybe the positive examples are here and the negative examples are here, then you might not expect a model trained on the data on the left to do very well on the data on the right.

Even though there might be one function that actually works well on both data sets, you wouldn't expect your learning algorithm to discover that green decision boundary just by looking at the data on the left.

So this idea of your data distribution changing goes by the somewhat fancy name covariate shift.

And the idea is that if you've learned some X-to-Y mapping, and the distribution of X changes, then you might need to retrain your learning algorithm.

And this is true even if the function, the ground truth function mapping from X to Y, remains unchanged, which it does in this example, because the ground truth function is whether this picture is a cat or not.

And the need to retrain your function becomes even more acute, or it becomes even worse, if the ground truth function shifts as well.

So how does this problem of covariate shift apply to a neural network?

Consider a deep network like this, and let's look at the learning process from the perspective of one particular layer, the third hidden layer.

So this network has learned the parameters W_3 and B_3.

And from the perspective of the third hidden layer, it gets some set of values from the earlier layers, and then it has to do some stuff to hopefully make the output Y-hat close to the ground truth value Y.

So let me cover up the nodes on the left for a second.

From the perspective of this third hidden layer, it gets some values; let's call them A_2_1, A_2_2, A_2_3, and A_2_4.

But these values might as well be features X1, X2, X3, X4, and the job of the third hidden layer is to take these values and find a way to map them to Y-hat.

So you can imagine doing gradient descent so that these parameters W_3, B_3, as well as maybe W_4, B_4 and even W_5, B_5, are learned so that the network does a good job mapping from the values I drew in black on the left to the output values Y-hat.

But now let's uncover the left part of the network again.

The network is also adapting the parameters W_2, B_2 and W_1, B_1, and as those parameters change, these values A_2 will also change.

So from the perspective of the third hidden layer, these hidden unit values are changing all the time, and so it's suffering from the problem of covariate shift that we talked about on the previous slide.

So what batch norm does is reduce the amount that the distribution of these hidden unit values shifts around.

If I were to plot the distribution of these hidden unit values (technically, batch norm normalizes the Z values, so these are actually Z_2_1 and Z_2_2), I'll plot just two values instead of four so we can visualize this in 2D.

What batch norm is saying is that the values of Z_2_1 and Z_2_2 can change, and indeed they will change when the neural network updates the parameters in the earlier layers.

But what batch norm ensures is that no matter how they change, the mean and variance of Z_2_1 and Z_2_2 will remain the same.

So even if the exact values of Z_2_1 and Z_2_2 change, their mean and variance will at least stay the same: mean zero and variance one.

Or not necessarily mean zero and variance one, but whatever values are governed by Beta_2 and Gamma_2, which, if the neural network chooses, can force them to be mean zero and variance one, or really any other mean and variance.

But what this does is limit the amount to which updating the parameters in the earlier layers can affect the distribution of values that the third layer now sees and therefore has to learn on.

And so batch norm reduces the problem of the input values changing; it really causes these values to become more stable, so that the later layers of the neural network have firmer ground to stand on.
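This mechanism can be sketched as follows; it is a minimal illustration assuming the standard batch norm formulas (normalize Z over the mini-batch, then scale by gamma and shift by beta), with made-up input values:

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each hidden unit over the mini-batch, then scale and shift.

    Z: (batch_size, n_units) pre-activation values for one layer.
    gamma, beta: learnable per-unit scale and shift.
    """
    mu = Z.mean(axis=0)                        # per-unit mean over the mini-batch
    var = Z.var(axis=0)                        # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)     # mean 0, variance 1
    return gamma * Z_norm + beta               # mean beta, std roughly gamma

# No matter how the raw Z values drift as earlier layers update,
# the output's mean and variance are pinned by beta and gamma.
rng = np.random.default_rng(0)
Z_shifted = 5.0 + 3.0 * rng.normal(size=(128, 2))   # a drifted distribution
out = batch_norm_forward(Z_shifted,
                         gamma=np.array([1.0, 1.0]),
                         beta=np.array([0.0, 0.0]))
print(out.mean(axis=0))   # close to beta = 0
print(out.std(axis=0))    # close to gamma = 1
```

Feeding in a distribution with a very different mean and scale still produces outputs with the mean and variance set by beta and gamma, which is exactly the stabilization described above.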

And even though the input distribution changes a bit, it changes less, and what this does is that even as the earlier layers keep learning, the amount that the later layers are forced to adapt as the earlier layers change is reduced.

Or, if you will, it weakens the coupling between what the early layers' parameters have to do and what the later layers' parameters have to do.

And so it allows each layer of the network to learn by itself, a little bit more independently of other layers, and this has the effect of speeding up learning in the whole network.

So I hope this gives some better intuition, but the takeaway is that batch norm means that, especially from the perspective of one of the later layers of the neural network, the values from the earlier layers don't get to shift around as much, because they're constrained to have the same mean and variance.

And so this makes the job of learning in the later layers easier.

It turns out batch norm has a second effect: it has a slight regularization effect.

So one non-intuitive thing about batch norm is that each mini-batch, say mini-batch X_t, has its values Z_l scaled by the mean and variance computed on just that one mini-batch.

Now, because the mean and variance are computed on just that mini-batch, as opposed to on the entire data set, that mean and variance have a little bit of noise in them, because they're computed on just your mini-batch of, say, 64, or 128, or maybe 256 or more training examples.

So because the mean and variance are a little bit noisy, being estimated from just a relatively small sample of data, the scaling process, going from Z_l to Z-tilde_l, is a little bit noisy as well, because it's computed using a slightly noisy mean and variance.
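A quick sketch of this noise, with made-up values standing in for the Z's of one hidden unit: every mini-batch yields a slightly different mean and variance, so the scaling batch norm applies jitters from batch to batch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these are the pre-activation values Z of one hidden unit
# over the entire training set.
Z_full = rng.normal(loc=2.0, scale=4.0, size=50_000)
full_mean, full_var = Z_full.mean(), Z_full.var()

# Each mini-batch of 64 examples yields a slightly different estimate,
# so the normalization applied to that mini-batch is slightly different too.
for _ in range(3):
    batch = rng.choice(Z_full, size=64)
    print(batch.mean(), batch.var())   # scattered around the full-set values
print(full_mean, full_var)
```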

So, similar to dropout, it adds some noise to each hidden layer's activations.

The way dropout adds noise is that it takes a hidden unit and multiplies it by zero with some probability, and multiplies it by one with some probability.

So dropout has multiplicative noise because it multiplies by zero or one, whereas batch norm has multiplicative noise because of the scaling by the standard deviation, as well as additive noise because it subtracts the mean.

Here the estimates of the mean and the standard deviation are noisy.
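For contrast, here is a minimal sketch of dropout's purely multiplicative noise, in the common inverted-dropout form; keep_prob and the activations are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(10, 10))        # one layer's activations for a mini-batch
keep_prob = 0.8

# Inverted dropout: each unit is multiplied by 0 (dropped) or by
# 1 / keep_prob (kept), i.e. purely multiplicative noise.
mask = (rng.random(a.shape) < keep_prob) / keep_prob
a_dropped = a * mask
```

Batch norm's noise differs in shape: a noisy divide by the estimated standard deviation (multiplicative) plus a noisy subtraction of the estimated mean (additive).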

And so, similar to dropout, batch norm therefore has a slight regularization effect, because by adding noise to the hidden units, it forces the downstream hidden units not to rely too much on any one hidden unit.

Because the noise added is quite small, this is not a huge regularization effect, and you might choose to use batch norm together with dropout if you want the more powerful regularization effect of dropout.

And maybe one other slightly non-intuitive effect is that if you use a bigger mini-batch size, say 512 instead of 64, then by using that larger mini-batch size you're reducing this noise and therefore also reducing this regularization effect.

So that's one strange property of batch norm, which is that by using a bigger mini-batch size, you reduce the regularization effect.
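A small sketch of why bigger mini-batches mean less noise: the spread of the mini-batch mean shrinks roughly like one over the square root of the batch size (the values below are invented for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one hidden unit's pre-activations over the whole training set.
Z_full = rng.normal(loc=2.0, scale=4.0, size=100_000)

def estimate_spread(batch_size, trials=500):
    # Standard deviation of the mini-batch mean across many mini-batches:
    # a proxy for how noisy batch norm's statistics are at this batch size.
    means = [rng.choice(Z_full, size=batch_size).mean() for _ in range(trials)]
    return float(np.std(means))

print(estimate_spread(64))    # noisier statistics -> more regularization
print(estimate_spread(512))   # roughly 1/sqrt(8) as much noise -> less regularization
```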

Having said this, I wouldn't really use batch norm as a regularizer; that's really not the intent of batch norm, but sometimes it has this extra, intended or unintended, effect on your learning algorithm.

But really, don't turn to batch norm for regularization.

Use it as a way to normalize your hidden unit activations and thereby speed up learning.

And I think the regularization is an almost unintended side effect.

So I hope that gives you better intuition about what batch norm is doing.

Before we wrap up the discussion on batch norm, there's one more detail I want to make sure you know, which is that batch norm handles data one mini-batch at a time; it computes means and variances on mini-batches.

So at test time, when you're trying to make predictions and evaluate the neural network, you might not have a mini-batch of examples; you might be processing one single example at a time.

So at test time you need to do something slightly different to make sure your predictions make sense.

In the next and final video on batch norm, let's talk over the details of what you need to do in order to take your neural network trained using batch norm and use it to make predictions.
