0:00

In the previous video, you saw how you can use vectorization to compute the predictions, the lowercase a's, for an entire training set all at the same time. In this video, you'll see how you can use vectorization to also perform the gradient computations for all m training examples, again, all sort of at the same time.

And then at the end of this video, we'll put it all together and show how you can derive a very efficient implementation of logistic regression.

So, you may remember that for the gradient computation, what we did was compute dz(1) for the first example, which is a(1) minus y(1), then dz(2) equals a(2) minus y(2), and so on for all m training examples. So, what we're going to do is define a new variable, dZ, which is going to be dz(1), dz(2), down to dz(m), again, all the lowercase dz variables stacked horizontally. So, this would be a 1 by m matrix, or alternatively an m-dimensional row vector.

Now, recall from the previous slide that we'd already figured out how to compute capital A, which was a(1) through a(m), and we had defined capital Y as y(1) through y(m), also, you know, stacked horizontally. So, based on these definitions, maybe you can see for yourself that dZ can be computed as just A minus Y, because it's going to be equal to a(1) - y(1) in the first element, a(2) - y(2) in the second element, and so on. And this first element, a(1) - y(1), is exactly the definition of dz(1); the second element is exactly the definition of dz(2); and so on. So, with just one line of code, you can compute all of this at the same time.
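In NumPy, assuming A and Y are each stored as 1-by-m row vectors as defined above (the numbers here are made up for illustration), that one line might look like this:

```python
import numpy as np

# Made-up predictions and labels for m = 3 training examples.
A = np.array([[0.9, 0.2, 0.6]])  # row vector of a(1) ... a(m), shape (1, m)
Y = np.array([[1.0, 0.0, 1.0]])  # row vector of y(1) ... y(m), shape (1, m)

# One line computes dz(i) = a(i) - y(i) for every example at once.
dZ = A - Y                       # shape (1, m)
```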

Now, in the previous implementation, we'd gotten rid of one for loop already, but we still had this second for loop over the m examples. So we initialized dw to a vector of zeroes, but then we still had to loop over the m examples, where we had dw += x(1) * dz(1) for the first training example, dw += x(2) * dz(2), and so on. We do this m times and then dw /= m, and similarly for b, right? db was initialized as 0, and then db += dz(1), db += dz(2), down to, you know, dz(m), and db /= m. So that's what we had in the previous implementation. We'd already gotten rid of one for loop, so at least now dw is a vector, instead of us separately updating dw1, dw2, and so on. So, we got rid of that already, but we still had the for loop over the m examples in the training set.
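For concreteness, the remaining loop over the m examples described above might be sketched like this (the sizes and random data here are assumptions for illustration):

```python
import numpy as np

np.random.seed(0)
n, m = 2, 4                        # made-up sizes: n features, m examples
X = np.random.randn(n, m)          # each column x(i) is one training example
dZ = np.random.randn(1, m)         # row vector of dz(1) ... dz(m)

# Non-vectorized accumulation, as in the previous implementation:
dw = np.zeros((n, 1))
db = 0.0
for i in range(m):
    dw += X[:, i:i+1] * dZ[0, i]   # dw += x(i) * dz(i)
    db += dZ[0, i]                 # db += dz(i)
dw /= m
db /= m
```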

So, let's take these operations and vectorize them. Here's what we can do. What the vectorized implementation of db is doing is basically summing up all of these dz's and then dividing by m. So, db is basically 1/m times the sum from i equals 1 through m of dz(i), and all the dz's are already in that row vector. So in Python, what you do is implement, you know, 1/m times np.sum(dZ). So, you just take this variable and call the np.sum function on it, and that would give you db.

How about dw? I'll just write out the correct equation first, and then we can verify that it's the right thing to do. dw turns out to be 1/m times the matrix X times dZ transpose. And let's see why that's the case. This is equal to 1/m times the matrix X, which is x(1) through x(m) stacked up in columns like that, times dZ transpose, which is going to be dz(1) down to dz(m), like so. And so, if you figure out what this matrix times this vector works out to be, it turns out to be 1/m times x(1) dz(1) plus ... plus x(m) dz(m). And so, this is an n by 1 vector, and this is what you actually end up with for dw, because dw was taking these, you know, x(i) dz(i) and adding them up, and that's exactly what this matrix-vector multiplication is doing. And so again, with one line of code you can compute dw.
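Both derivative computations reduce to one NumPy line each. Here is a sketch, assuming X is an n-by-m matrix whose columns are the examples and dZ is the 1-by-m row vector from before (the sizes and random values are made up):

```python
import numpy as np

np.random.seed(1)
n, m = 3, 5                  # made-up sizes: n features, m examples
X = np.random.randn(n, m)    # columns are the examples x(1) ... x(m)
dZ = np.random.randn(1, m)   # row vector of dz(1) ... dz(m)

db = np.sum(dZ) / m          # scalar: (1/m) * sum over i of dz(i)
dw = np.dot(X, dZ.T) / m     # (n, 1) vector: (1/m) * X * dZ transpose
```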

So, the vectorized implementation of the derivative calculations is just this: you use this line to implement db, and you use this line to implement dw. And notice that without a for loop over the training set, you can now compute the updates you want for your parameters.

Â So now, let's put all together into how you would actually implement logistic regression.

Â So, this is our original,

Â highly inefficient non vectorize implementation.

Â So, the first thing we've done in the previous video was get rid of this volume, right?

Â So, instead of looping over dw1,

Â dw2 and so on,

Â we have replaced this with a vector value dw which is dw+= xi,

Â which is now a vector times dz(i).

Â But now, we will see that we can also get rid of not

Â just full loop of row but also get rid of this full loop.

So, here is how you do it. Using what we have from the previous slides, you would compute capital Z, Z = w transpose X + b; in code, you write Z = np.dot(w.T, X) + b, and then A = sigmoid(Z). So, you have now computed z(i) and a(i) for all the values of i. Next, as on the previous slide, you would compute dZ = A - Y, so now you've computed dz(i) for all the values of i. Then, finally, dw = 1/m times X times dZ transpose, and db = 1/m times, you know, np.sum(dZ).

So, you've just done forward propagation and backward propagation, really computing the predictions and computing the derivatives on all m training examples without using a for loop. And so the gradient descent update then would be, you know, w gets updated as w minus the learning rate times dw, which was just computed above, and b is updated as b minus the learning rate times db.

Sometimes I write := to make it clear that this is an assignment, though I guess I haven't been totally consistent with that notation. But with this, you have just implemented a single iteration of gradient descent for logistic regression.
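Putting the pieces together, one iteration of gradient descent for logistic regression, with no for loop over the examples, might be sketched as follows. The toy data and the learning rate alpha are assumptions for illustration, not from the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up data: n = 2 features, m = 4 examples; y(i) = 1 iff the first feature is positive.
X = np.array([[-2.0, -1.0, 1.0, 2.0],
              [ 0.5, -0.5, 0.5, -0.5]])
Y = np.array([[0.0, 0.0, 1.0, 1.0]])
n, m = X.shape
w = np.zeros((n, 1))
b = 0.0
alpha = 0.1                       # learning rate, an assumed hyperparameter

# Forward propagation for all m examples at once:
Z = np.dot(w.T, X) + b            # (1, m)
A = sigmoid(Z)                    # (1, m)

# Backward propagation, also vectorized:
dZ = A - Y                        # (1, m)
dw = np.dot(X, dZ.T) / m          # (n, 1)
db = np.sum(dZ) / m               # scalar

# Gradient descent update (the ":=" in the lecture is plain assignment):
w = w - alpha * dw
b = b - alpha * db
```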

Now, I know I said that we should get rid of explicit for loops whenever you can, but if you want to implement multiple iterations of gradient descent, then you still need a for loop over the number of iterations. So, if you want to have a thousand iterations of gradient descent, you would still need a for loop over the iteration number. There's an outermost for loop like that, and I don't think there's any way to get rid of it. But I do think it's incredibly cool that you can implement at least one iteration of gradient descent without needing to use a for loop.
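That one remaining outer for loop might look like this in NumPy (a sketch; the toy separable data and the hyperparameters alpha and num_iterations are assumptions, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up separable data: y(i) = 1 iff the first feature is positive.
X = np.array([[-2.0, -1.0, 1.0, 2.0],
              [ 0.5, -0.5, 0.5, -0.5]])
Y = np.array([[0.0, 0.0, 1.0, 1.0]])
n, m = X.shape
w, b = np.zeros((n, 1)), 0.0
alpha, num_iterations = 0.5, 1000    # assumed hyperparameters

for _ in range(num_iterations):      # the one for loop we can't vectorize away
    A = sigmoid(np.dot(w.T, X) + b)  # forward pass on all m examples
    dZ = A - Y
    w -= alpha * np.dot(X, dZ.T) / m
    b -= alpha * np.sum(dZ) / m
```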

So, that's it: you now have a highly vectorized and highly efficient implementation of gradient descent for logistic regression. There is just one more detail that I want to talk about in the next video, which is that in our description here, I briefly alluded to a technique called broadcasting. Broadcasting turns out to be a technique that Python and NumPy allow you to use to make certain parts of your code much more efficient as well. So, let's see some more details of broadcasting in the next video.
