0:00

In a previous video, you saw the logistic regression model. To train the parameters w and b of the logistic regression model, you need to define a cost function. Let's take a look at the cost function you can use to train logistic regression.

To recap, this is what we had defined on the previous slide. So your output y-hat is sigmoid of w transpose x plus b, where the sigmoid of z is as defined here.
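As a minimal sketch of that definition (the helper names `sigmoid` and `predict` are my own, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    # Sigmoid squashes any real z into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    # y-hat = sigmoid(w^T x + b)
    return sigmoid(np.dot(w, x) + b)

w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])
b = 0.1
y_hat = predict(w, b, x)   # a probability strictly between 0 and 1
```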

So to learn the parameters for your model, you're given a training set of m training examples, and it seems natural to want to find parameters w and b so that, at least on the training set, the predictions you have, which we write as y-hat (i), will be close to the ground-truth labels y (i) that you got in the training set.

So to throw in a little bit more detail for the equation on top, we had said that y-hat is as defined at the top for a training example x, and of course for each training example, we're using these superscripts with round brackets, with parentheses, to index and to differentiate examples. Your prediction on training example i, which is y-hat (i), is going to be obtained by taking the sigmoid function and applying it to w transpose x (i), the input for that training example, plus b. And you can also define z (i) as follows: z (i) is equal to w transpose x (i) plus b.
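As a sketch of this indexing (the variable names and shapes here are my own assumptions: `X` stacks the m examples as columns), z (i) and y-hat (i) for every example can be computed in one vectorized step:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 3, 5                      # feature dimension and number of examples
X = rng.standard_normal((n_x, m))  # column i is training example x^(i)
w = rng.standard_normal(n_x)
b = 0.1

Z = np.dot(w, X) + b               # Z[i] = w^T x^(i) + b, shape (m,)
Y_hat = 1.0 / (1.0 + np.exp(-Z))   # y-hat^(i) = sigmoid(z^(i))
```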

So throughout this course, we're going to use this notational convention: the superscript (i) in parentheses refers to data, x or y or z or something else, associated with the i-th training example. That's what the superscript i in parentheses means.

Now, let's see what loss function or error function we can use to measure how well our algorithm is doing. One thing you could do is define the loss, when your algorithm outputs y-hat and the true label is y, to be maybe the squared error, or one half the squared error.
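In code, that candidate loss would simply be (the function name is mine):

```python
def squared_error_loss(y_hat, y):
    # One half the squared error (a possible, but not the preferred, loss)
    return 0.5 * (y_hat - y) ** 2
```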

It turns out that you could do this, but in logistic regression people don't usually do this, because when you come to learn the parameters, you find that the optimization problem, which we'll talk about later, becomes non-convex. So you end up with an optimization problem with multiple local optima, and gradient descent may not find the global optimum. If you didn't understand the last couple of comments, don't worry about it; we'll get to it in a later video.

But the intuition to take away is that this function L, called the loss function, is a function you'll need to define to measure how good your output y-hat is when the true label is y. And squared error seems like it might be a reasonable choice, except that it makes gradient descent not work well. So in logistic regression, we will actually define a different loss function that plays a similar role as squared error, but that will give us an optimization problem that is convex, and so, as we'll see in that later video, becomes much easier to optimize.

So, what we use in logistic regression is actually the following loss function, which I'm just going to write up here: L(y-hat, y) = -(y log y-hat + (1 - y) log(1 - y-hat)).
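Written as a small Python function (assuming y-hat stays strictly between 0 and 1 so the logarithms are defined; the name `loss` is mine):

```python
import numpy as np

def loss(y_hat, y):
    # L(y-hat, y) = -(y * log(y-hat) + (1 - y) * log(1 - y-hat))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```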

Here's some intuition for why this loss function makes sense. Keep in mind that if we were using squared error, then you'd want the squared error to be as small as possible, and with this logistic regression loss function, we'll also want this to be as small as possible. To understand why this makes sense, let's look at the two cases.

In the first case, let's say y is equal to one. Then the loss function L(y-hat, y) is just this first term with the negative sign, so it's negative log y-hat, if y is equal to one. Because if y equals one, then the second term, one minus y, is equal to zero. So this says: if y equals one, you want negative log y-hat to be as small as possible, which means you want log y-hat to be large, to be as big as possible, and that means you want y-hat to be large.

But because y-hat is the output of the sigmoid function, it can never be bigger than one. So this is saying that if y is equal to one, you want y-hat to be as big as possible, but it can never be bigger than one, so that's saying you want y-hat to be close to one as well.

The other case is if y equals zero. If y equals zero, then the first term in the loss function is equal to zero, because y is zero, and the second term defines the loss. So the loss becomes negative log(one minus y-hat). And so if in your learning procedure you try to make the loss function small, what this means is that you want log(one minus y-hat) to be large. And because of the negative sign there, and then through a similar piece of reasoning, you can conclude that this loss function is trying to make y-hat as small as possible. And again, because y-hat has to be between zero and one, this is saying that if y is equal to zero, then your loss function will push the parameters to make y-hat as close to zero as possible.
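To see both cases numerically, here is a quick check (reusing the loss formula from the slide; the function name is mine): when y = 1 the loss drops as y-hat moves toward 1, and when y = 0 it drops as y-hat moves toward 0.

```python
import numpy as np

def loss(y_hat, y):
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y = 1: a confident, correct prediction gives a much smaller loss
bad1, good1 = loss(0.1, 1), loss(0.9, 1)    # -log(0.1) vs -log(0.9)

# y = 0: the ordering flips
good0, bad0 = loss(0.1, 0), loss(0.9, 0)    # -log(0.9) vs -log(0.1)
```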

Now, there are a lot of functions with roughly this effect: that if y is equal to one, we try to make y-hat large, and if y is equal to zero, we try to make y-hat small. What we just gave here in green is a somewhat informal justification for this loss function. We'll provide an optional video later to give a more formal justification for why in logistic regression we like to use the loss function with this particular form.

Finally, the loss function was defined with respect to a single training example; it measures how well you're doing on a single training example. I'm now going to define something called the cost function, which measures how well you're doing on the entire training set. So the cost function J, which is applied to your parameters w and b, is going to be the average, that is, one over m of the sum, of the loss function applied to each of the training examples in turn.

While here, y-hat (i) is of course the prediction output by your logistic regression algorithm using a particular set of parameters w and b. And so just to expand this out, this is equal to negative one over m, sum from i equals one through m, of the definition of the loss function. So this is y (i) log y-hat (i) plus (1 - y (i)) log(1 - y-hat (i)). I guess I could put square brackets here, so the minus sign is outside everything else.
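Putting the pieces together, the cost J(w, b) could be computed like this (a sketch with my own names and conventions: `X` holds the examples as columns and `Y` holds the 0/1 labels):

```python
import numpy as np

def cost(w, b, X, Y):
    # J(w, b) = -(1/m) * sum_i [ y^(i) log y-hat^(i) + (1 - y^(i)) log(1 - y-hat^(i)) ]
    Z = np.dot(w, X) + b
    Y_hat = 1.0 / (1.0 + np.exp(-Z))
    return -np.mean(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))

# With w = 0 and b = 0, every y-hat is 0.5, so J = log(2) whatever the labels are
X = np.array([[1.0, -2.0, 0.5],
              [0.0,  1.0, -1.0]])
Y = np.array([1.0, 0.0, 1.0])
J = cost(np.zeros(2), 0.0, X, Y)
```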

So the terminology I'm going to use is that the loss function is applied to just a single training example, like so, and the cost function is the cost of your parameters. So in training your logistic regression model, we're going to try to find parameters w and b that minimize the overall cost function J, written at the bottom.

So, you've just seen the setup for the logistic regression algorithm, the loss function for a training example, and the overall cost function for the parameters of your algorithm. It turns out that logistic regression can be viewed as a very, very small neural network. In the next video, we'll go over that, so you can start gaining intuition about what neural networks do. So with that, let's go on to the next video, about how to view logistic regression as a very small neural network.
