0:00

Hello, and welcome back.

This week, you'll learn about optimization algorithms

that will enable you to train your neural network much faster.

You've heard me say before that applying machine learning is a highly empirical,

highly iterative process,

in which you just have to train a lot of models to find one that works really well.

So, it really helps to be able to train models quickly.

One thing that makes training slow is that

deep learning works best in the regime of big data:

we're able to train neural networks on huge data

sets, and training on a large data set is just slow.

So, what you find is that having fast, good optimization algorithms

can really speed up the efficiency of you and your team.

So, let's get started by talking about mini-batch gradient descent.

You've learned previously that vectorization allows

you to efficiently compute on all m examples,

letting you process your whole training set without an explicit for loop.

That's why we take our training examples and stack them

into these huge matrices: capital X is x(1), x(2), x(3),

and so on up to x(m) for m training examples,

and similarly capital Y is y(1), y(2), y(3), and so on up to y(m).

So, the dimension of X was (n_x, m) and the dimension of Y was (1, m).

Vectorization allows you to process all m examples relatively quickly,

but if m is very large, then it can still be slow.

For example, what if m were 5 million, or 50 million, or even bigger?

With batch gradient descent on your whole training set,

what you have to do is,

you have to process your entire training set

before you take one little step of gradient descent.

And then you have to process your entire training set of

5 million training examples again before

you take another little step of gradient descent.

So, it turns out that you can get a faster algorithm if you let gradient descent

start to make some progress even before you finish processing

your entire giant training set of 5 million examples.

In particular, here's what you can do.

Let's say that you split up your training set into smaller,

little baby training sets, and these baby training sets are called mini-batches.

And let's say each of your baby training sets has just 1,000 examples.

So, you take x(1) through x(1000) and you call that your first little baby training set,

also called a mini-batch.

And then you take the next 1,000 examples,

x(1001) through x(2000), then the next 1,000 examples after that, and so on.

I'm going to introduce a new notation: I'm going to call

the first one X superscript with curly braces 1, written X{1},

and I'm going to call the second one

X superscript with curly braces 2, written X{2}.

Now, if you have 5 million training examples total

and each of these little mini-batches has 1,000 examples,

that means you have 5,000 of them, because 5,000 times 1,000 equals 5 million.

So, altogether you would have 5,000 of these mini-batches,

ending with X superscript curly braces 5,000, that is, X{5000}.

And then similarly, you split up your training data for Y accordingly:

y(1) through y(1000) is called Y{1},

then y(1001) through y(2000) is called Y{2},

and so on until you have Y{5000}.

Now, mini-batch number t is going to be comprised of X{t} and Y{t},

and that is 1,000 training examples with the corresponding input-output pairs.

Before moving on, just to make sure my notation is clear,

we have previously used superscript round brackets (i) to index into the training set,

so x(i) is the i-th training example.

We use superscript square brackets [l]

to index into the different layers of the neural network,

so z[l] is the z value of the l-th layer of the network,

and here we are introducing

curly brackets {t} to index into different mini-batches.

So, you have X{t}, Y{t}, and to check your understanding of these,

what are the dimensions of X{t} and Y{t}?

Well, X is (n_x, m). So,

if X{1} holds the x values for 1,000 examples,

then its dimension should be (n_x, 1000), and X{2} should also be (n_x, 1000), and so on.

So, all of the X{t} should have dimension (n_x, 1000),

and all of the Y{t} should have dimension (1, 1000).
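As a concrete sketch of this slicing and the shapes involved, here is one way the column-stacked X and Y could be split into mini-batches. The array names and sizes are illustrative (a much smaller m than the lecture's 5 million, so the example runs quickly); the slicing pattern is the point.

```python
import numpy as np

# Illustrative sizes: n_x features, m examples, mini-batches of 1,000.
n_x, m, batch_size = 3, 10_000, 1_000

rng = np.random.default_rng(0)
X = rng.standard_normal((n_x, m))    # shape (n_x, m): one column per example
Y = rng.integers(0, 2, size=(1, m))  # shape (1, m)

# Slice columns into mini-batches X{1}, X{2}, ... and Y{1}, Y{2}, ...
X_batches = [X[:, t:t + batch_size] for t in range(0, m, batch_size)]
Y_batches = [Y[:, t:t + batch_size] for t in range(0, m, batch_size)]

print(len(X_batches))      # 10 mini-batches, i.e. m / batch_size
print(X_batches[0].shape)  # (3, 1000), i.e. (n_x, 1000)
print(Y_batches[0].shape)  # (1, 1000)
```

Each (X_batches[t], Y_batches[t]) pair plays the role of (X{t+1}, Y{t+1}) from the lecture's notation.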

Â 5:29

To explain the name of this algorithm:

batch gradient descent refers to

the gradient descent algorithm we have been talking about previously,

where you process your entire training set all at the same time.

And the name comes from viewing that as

processing your entire batch of training examples all at the same time.

I know it's not a great name, but that's just what it's called.

Mini-batch gradient descent, in contrast,

refers to the algorithm, which we'll talk about on the next slide,

in which you process a single mini-batch X{t},

Y{t} at a time, rather than processing your entire training set X, Y at the same time.

So, let's see how mini-batch gradient descent works.

To run mini-batch gradient descent on your training set, you run for t equals

1 to 5,000, because we had 5,000 mini-batches of size 1,000 each.

What you're going to do inside the for loop is basically implement one step of

gradient descent using X{t}, Y{t}.

It is as if you had a training set of size 1,000 examples, and it

was as if you were implementing the algorithm you are

already familiar with, but just on this little training set of

size m = 1,000. Rather than having an explicit for loop over all 1,000 examples,

you would use vectorization to process all 1,000 examples at the same time.

Let us write this out. First,

you implement forward prop on the inputs,

so just on X{t}, and you do that by computing Z[1] = W[1]X{t} + b[1].

Previously, we would just have X there, right?

But now you are not processing the entire training set,

you are just processing the first mini-batch, so it

becomes X{t} when you're processing mini-batch

t. Then you have A[1] = g[1](Z[1]),

with a capital Z and a capital A since this is actually

a vectorized implementation, and so on, until you end up with

A[L] = g[L](Z[L]), and then this is your prediction.

And you notice that here you should use a vectorized implementation.

It's just that this vectorized implementation processes

1,000 examples at a time rather than 5 million examples.
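The forward pass just described can be sketched as a small helper. This is a simplified sketch, not the course's exact code: it assumes the parameters come as a list of (W, b) pairs and uses one shared activation g for all layers, whereas a real network would typically use a different g[l] per layer.

```python
import numpy as np

def forward_prop(Xt, params, g):
    """Vectorized forward prop on one mini-batch X{t}.

    params: list of (W, b) pairs, one per layer l = 1..L.
    g: activation function (one shared activation for brevity).
    """
    A = Xt                # A[0] = X{t}
    for W, b in params:
        Z = W @ A + b     # Z[l] = W[l] A[l-1] + b[l]
        A = g(Z)          # A[l] = g[l](Z[l])
    return A              # A[L]: predictions for the whole mini-batch

# Example call with a ReLU activation on a tiny two-layer "network":
params = [(np.array([[2.0]]), np.array([[1.0]])),
          (np.array([[1.0]]), np.array([[0.0]]))]
Xt = np.array([[1.0, 2.0]])           # a "mini-batch" of two examples
AL = forward_prop(Xt, params, lambda z: np.maximum(0.0, z))
print(AL)                             # [[3. 5.]]
```

All 1,000 columns of a real X{t} would flow through these same matrix products at once; that is the vectorization the lecture refers to.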

Next, you compute the cost function J, which I'm going to write as

1 over 1,000, since here 1,000 is the size of your little training set,

times the sum from i = 1 through 1,000 of the loss of

ŷ(i), y(i), where this notation, for clarity,

refers to examples from the mini-batch X{t}, Y{t}.

And if you're using regularization,

you can also have the regularization term:

lambda over 2 times 1,000, with the mini-batch size in the denominator,

times the sum over l of the Frobenius norm of the weight matrix W[l], squared.

Because this is really the cost on just one mini-batch,

I'm going to index this cost as J with a superscript t in curly braces, J{t}.
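That cost J{t} might be computed as in the sketch below. The function name is made up for illustration, and the small eps added inside the logs is a numerical-safety addition, not part of the lecture's formula.

```python
import numpy as np

def minibatch_cost(AL, Yt, weights, lam=0.0, eps=1e-12):
    """Cost J{t} on one mini-batch.

    Cross-entropy loss averaged over the mini-batch's examples, plus an
    optional L2 term: lam / (2 * mini-batch size) times the sum over
    layers of the squared Frobenius norms of the W[l] matrices.
    """
    mb = Yt.shape[1]  # mini-batch size (1,000 in the lecture)
    cross_entropy = -np.sum(Yt * np.log(AL + eps)
                            + (1 - Yt) * np.log(1 - AL + eps)) / mb
    l2 = (lam / (2 * mb)) * sum(np.sum(W ** 2) for W in weights)
    return cross_entropy + l2
```

With lam=0 this reduces to the plain averaged cross-entropy over the 1,000 examples of X{t}, Y{t}.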

You notice that everything we are doing is exactly the same as when

we were previously implementing gradient descent, except that instead of doing it on X, Y,

you're now doing it on X{t}, Y{t}.

Next, you implement backprop to

compute the gradients with respect to J{t},

still using only X{t}, Y{t}, and then you update the weights:

W[l] gets updated as W[l]

minus alpha times dW[l], and similarly for b[l].

This is one pass through your training set using mini-batch gradient descent.

The code I have written down here is also called doing one epoch of training, where an

epoch is a word that means a single pass through the training set.

Whereas with batch gradient descent,

a single pass through the training set allows you to take only one gradient descent step,

with mini-batch gradient descent, a single pass through the training set,

that is, one epoch, allows you to take 5,000 gradient descent steps.

Now, of course, you want to take

multiple passes through the training set, which you usually want to,

so you might want another for loop or a while loop around all of this.

So, you keep taking passes through the training set

until, hopefully, you converge, or at least approximately converge.
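Putting the pieces together, here is a minimal, self-contained sketch of mini-batch gradient descent with the outer epoch loop. To keep it short, it is reduced to a single logistic-regression layer, so the full per-layer forward prop and backprop collapse to a few lines; the function name, learning rate, and initialization are illustrative assumptions, not the course's code. The two-loop structure (epochs outside, mini-batches inside, one update per mini-batch) is the point.

```python
import numpy as np

def mini_batch_gd(X, Y, alpha=0.5, batch_size=1000, num_epochs=100, seed=0):
    """Mini-batch gradient descent on a single logistic-regression layer.

    A deep network would run full forward prop / backprop over all its
    layers inside the inner loop; the loop structure is the same.
    """
    rng = np.random.default_rng(seed)
    n_x, m = X.shape
    W = rng.standard_normal((1, n_x)) * 0.01  # small random init
    b = np.zeros((1, 1))
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    for _ in range(num_epochs):               # repeated passes (epochs)
        for t in range(0, m, batch_size):     # one update per mini-batch
            Xt = X[:, t:t + batch_size]       # X{t}
            Yt = Y[:, t:t + batch_size]       # Y{t}
            mb = Xt.shape[1]
            A = sigmoid(W @ Xt + b)           # forward prop on X{t}
            dZ = A - Yt                       # backprop for cross-entropy
            dW = (dZ @ Xt.T) / mb
            db = dZ.sum(axis=1, keepdims=True) / mb
            W -= alpha * dW                   # gradient descent update
            b -= alpha * db
    return W, b
```

On a small linearly separable toy set, a few hundred epochs of this learn a correct decision boundary; with m = 5 million and batch_size = 1,000, each epoch would take 5,000 update steps, as described above.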

When you have a large training set,

mini-batch gradient descent runs much faster than batch gradient descent, and

it's pretty much what everyone in deep learning

will use when training on a large data set.

In the next video, let's delve deeper into mini-batch gradient descent so

you can get a better understanding of what it is doing and why it works so well.
