
And we have on the second line the cost function, J, which is a function of your parameters w and b. And that's defined as the average, so it's 1 over m times the sum, of this loss function. The loss function measures how well your algorithm's output y-hat(i) on each of the training examples stacks up against the ground truth label y(i) on each of the training examples. The full formula is expanded out on the right. So the cost function measures how well your parameters w and b are doing on the training set. In order to learn the set of parameters w and b, it seems natural that we want to find the w and b that make the cost function J(w, b) as small as possible.
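As a concrete reference, here is a minimal NumPy sketch of this cost function. The variable names and the one-example-per-column layout of X are assumptions for the sketch, not fixed by the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    """J(w, b): the average of the logistic loss over the m training examples.

    X has shape (n_features, m), one training example per column;
    Y has shape (1, m) with ground truth labels in {0, 1}.
    """
    m = X.shape[1]
    Y_hat = sigmoid(np.dot(w.T, X) + b)                          # predictions y-hat(i)
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))  # loss per example
    return float(np.sum(losses) / m)                             # 1/m times the sum
```

With w and b both zero, every prediction is 0.5 and the cost is log 2, which is a quick sanity check on the formula.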

So here's an illustration of gradient descent. In this diagram, the horizontal axes represent your parameters, w and b. In practice, w can be much higher dimensional, but for the purposes of plotting, let's illustrate w as a single real number and b as a single real number. The cost function J(w, b) is then some surface above these horizontal axes w and b, so the height of the surface represents the value of J(w, b) at a certain point. And what we want to do is find the value of w and b that corresponds to the minimum of the cost function J.


It turns out that this cost function J is a convex function, so it's just a single big bowl. This is in contrast to functions that look like this, which are non-convex and have lots of different local optima. The fact that our cost function J(w, b) as defined here is convex is one of the huge reasons why we use this particular cost function, J, for logistic regression.

So to find a good value for the parameters, what we'll do is initialize w and b to some initial value, maybe denoted by that little red dot. For logistic regression, almost any initialization method works; usually you initialize the values to zero. Random initialization also works, but people don't usually do that for logistic regression. But because this function is convex, no matter where you initialize, you should get to the same point, or roughly the same point.

And what gradient descent does is it starts at that initial point and then takes a step in the steepest downhill direction. So after one step of gradient descent you might end up there, because it's trying to take a step downhill in the direction of steepest descent, or as quickly downhill as possible. That's one iteration of gradient descent. After two iterations of gradient descent you might step there, three iterations and so on. I guess this is now hidden by the back of the plot, until eventually, hopefully, you converge to this global optimum, or get to something close to it. So this picture illustrates the gradient descent algorithm.

Let's write out a bit more of the details. For the purpose of illustration, let's say that there's some function, J(w), that you want to minimize, and maybe that function looks like this. To make this easier to draw, I'm going to ignore b for now, just to make this a one-dimensional plot instead of a higher-dimensional plot.

So gradient descent does this: we're going to repeatedly carry out the following update. We're going to take the value of w and update it; I'm going to use colon equals to represent updating w. So set w to w minus alpha times this derivative, dJ(w)/dw. We'll repeatedly do that until the algorithm converges.

A couple of points on the notation: alpha here is the learning rate, and controls how big a step we take on each iteration of gradient descent. We'll talk later about some ways of choosing the learning rate alpha. And second, this quantity here is a derivative; it's basically the update, or the change, you want to make to the parameter w.

When we start to write code to implement gradient descent, we're going to use the convention that the variable name dw in our code will be used to represent this derivative term. So when you write code, you write something like w := w - alpha * dw. And so we use dw as the variable name to represent this derivative term.
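On a simple made-up convex function, that loop looks like this. J(w) = (w - 3)^2 is a hypothetical stand-in for illustration, not the logistic regression cost:

```python
def gradient_descent_1d(dJ_dw, w, alpha=0.1, num_iters=100):
    """Repeatedly apply the update w := w - alpha * dw."""
    for _ in range(num_iters):
        dw = dJ_dw(w)        # the derivative term, named dw as in the lecture
        w = w - alpha * dw
    return w

# Example: J(w) = (w - 3)**2 has derivative dJ/dw = 2*(w - 3), minimum at w = 3.
w_min = gradient_descent_1d(lambda w: 2 * (w - 3), w=10.0)  # converges toward 3
```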

Now let's just make sure that this gradient descent update makes sense. Let's say that w was over here, so you're at this point on the cost function J(w). Remember that the definition of a derivative is the slope of the function at that point. And the slope of the function is really the height divided by the width of a little triangle here, at the tangent to J(w) at that point. Here, the derivative is positive. w gets updated as w minus the learning rate times the derivative. The derivative is positive, so you end up subtracting from w, and so you end up taking a step to the left. And so gradient descent will make your algorithm slowly decrease the parameter if you started off with this large value of w.

As another example, if w was over here, then at this point the slope, dJ/dw, will be negative, and so the gradient descent update would subtract alpha times a negative number. You end up slowly increasing w, making w bigger and bigger with successive iterations of gradient descent. So hopefully, whether you initialize on the left or on the right, gradient descent will move you towards this global minimum here.
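You can check this sign argument numerically on the same kind of made-up convex function, say J(w) = (w - 3)^2, whose minimum is at w = 3 (again just an illustrative stand-in):

```python
alpha = 0.1
dJ_dw = lambda w: 2 * (w - 3)  # derivative of J(w) = (w - 3)**2

# To the right of the minimum the slope is positive, so the update decreases w.
w_right = 5.0
w_right_next = w_right - alpha * dJ_dw(w_right)  # 5.0 - 0.1 * 4 = 4.6

# To the left of the minimum the slope is negative, so the update increases w.
w_left = 1.0
w_left_next = w_left - alpha * dJ_dw(w_left)     # 1.0 - 0.1 * (-4) = 1.4
```

Both steps move toward the minimum at w = 3, from either side.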

If you're not familiar with derivatives or with calculus, and with what this term dJ(w)/dw means, don't worry too much about it. We'll talk some more about derivatives in the next video. If you have a deep knowledge of calculus, you might be able to have deeper intuitions about how neural networks work. But even if you're not that familiar with calculus, in the next few videos we'll give you enough intuition about derivatives and about calculus that you'll be able to effectively use neural networks. The overall intuition for now is that this term represents the slope of the function, and we want to know the slope of the function at the current setting of the parameters, so that we can take these steps of steepest descent, and so that we know what direction to step in in order to go downhill on the cost function J.


So we wrote out gradient descent for J(w), as if only w were your parameter. In logistic regression, your cost function is a function of both w and b. So in that case, the inner loop of gradient descent, that is, this thing here that you have to repeat, becomes as follows. You end up updating w as w minus the learning rate times the derivative of J(w, b) with respect to w, and you update b as b minus the learning rate times the derivative of the cost function with respect to b. These two equations at the bottom are the actual update you implement.
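Putting the two updates together for logistic regression, a sketch might look as follows. The closed-form expressions for the two derivatives are not derived in this video; they are the standard gradients of the logistic cost, stated here as given, and the data layout and zero initialization are choices made for the sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, Y, alpha=0.1, num_iters=1000):
    """Inner loop of gradient descent for logistic regression.

    X: (n_features, m), one training example per column;
    Y: (1, m), ground truth labels in {0, 1}.
    """
    n, m = X.shape
    w = np.zeros((n, 1))                     # zero initialization, as in the lecture
    b = 0.0
    for _ in range(num_iters):
        Y_hat = sigmoid(np.dot(w.T, X) + b)  # (1, m) predictions
        dw = np.dot(X, (Y_hat - Y).T) / m    # derivative of J(w, b) with respect to w
        db = np.sum(Y_hat - Y) / m           # derivative of J(w, b) with respect to b
        w = w - alpha * dw                   # the two updates at the bottom
        b = b - alpha * db
    return w, b
```

On a tiny separable toy dataset, a few hundred iterations are enough for the learned w and b to classify every example correctly.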

As an aside, I just want to mention one notational convention in calculus that is a bit confusing to some people. I don't think it's super important that you understand calculus, but in case you see this, I want to make sure that you don't think too much of it. In calculus, this term here, we actually write as follows, with that funny squiggle symbol. This symbol is actually just a lowercase d in a fancy, stylized font, and when you see this expression, all it means is the derivative of J(w, b), or really the slope of the function J(w, b): how much that function slopes in the w direction.

And the rule of the notation in calculus, which I think isn't totally logical and just makes things more complicated than they need to be, is that if J is a function of two or more variables, then instead of using lowercase d you use this funny symbol. This is called a partial derivative symbol. But don't worry about this: if J is a function of only one variable, then you use lowercase d. So the only difference between whether you use this funny partial derivative symbol or lowercase d, as we did on top, is whether J is a function of two or more variables, in which case you use the partial derivative symbol, or of only one variable, in which case you use lowercase d. It's one of those funny rules of notation in calculus that I think just make things more complicated than they need to be. But if you see this partial derivative symbol, all it means is that you're measuring the slope of the function with respect to one of the variables.
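Written out, the convention just described is:

```latex
\frac{dJ(w)}{dw}
  \quad \text{($J$ a function of one variable)},
\qquad
\frac{\partial J(w,b)}{\partial w},\ \frac{\partial J(w,b)}{\partial b}
  \quad \text{($J$ a function of two or more variables)}
```

so the two updates from before, in the formally correct notation, are w := w - α ∂J(w,b)/∂w and b := b - α ∂J(w,b)/∂b.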

And similarly, to adhere to the formally correct mathematical notation in calculus, because here J has two inputs, not just one, this thing at the bottom should be written with the partial derivative symbol. But it really means the same thing as, almost the same thing as, lowercase d. Finally, when you implement this in code, we're going to use the convention that this quantity, really the amount by which you update w, is denoted by the variable dw in your code, and this quantity, the amount by which you update b, is denoted by the variable db in your code.

All right, so that's how you can implement gradient descent. Now, if you haven't seen calculus for a few years, I know this might seem like a lot more derivatives and calculus than you're comfortable with so far. But if you're feeling that way, don't worry about it. In the next video, we'll give you better intuition about derivatives. And even without a deep mathematical understanding of calculus, with just an intuitive understanding of calculus, you will be able to make neural networks work effectively. With that, let's go on to the next video, where we'll talk a little bit more about derivatives.
