So this shows that this definition of the cost is just a more compact way

of taking both of these expressions, the cases y = 1 and y = 0,

and writing them in a more convenient form with just one line.
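As a quick check of that claim, here is a minimal Python sketch (illustrative only, not code from the lecture): plugging y = 1 or y = 0 into the one-line form zeroes out one term and recovers the two separate cases.

```python
import math

def cost(h, y):
    """Per-example logistic cost in its compact one-line form."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

# y = 1: the second term vanishes, leaving -log(h)
print(cost(0.9, 1))  # small, since the prediction matches the label

# y = 0: the first term vanishes, leaving -log(1 - h)
print(cost(0.9, 0))  # large, since the prediction is confidently wrong
```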

We can therefore write the overall cost function for

logistic regression as follows.

It is 1 over m times the sum of these per-example costs.

And plugging in the definition of

the cost that we worked out earlier, we end up with this,

with the minus sign pulled outside.
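For reference, the expression on the slide is the standard logistic regression cost, which can be written out as follows (h_theta denotes the hypothesis):

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[\, y^{(i)} \log h_\theta\big(x^{(i)}\big) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta\big(x^{(i)}\big)\big) \Big]
```

Pulling the minus sign outside the sum is what lets both log terms appear with positive coefficients inside the brackets.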

So why do we choose this particular function,

when it looks like there could be other cost functions we could have chosen?

Although I won't have time to go into great detail on this in this course,

this cost function can be derived from statistics

using the principle of maximum likelihood estimation,

which is an idea in statistics for

how to efficiently find the parameters theta for different models.

It also has the nice property that it is convex.

So this is the cost function that

essentially everyone uses when fitting logistic regression models.

If you don't understand the terms I just used, if you don't know what

the principle of maximum likelihood estimation is, don't worry about it.

There is simply a deeper rationale and justification

behind this particular cost function than I have time to go into in this class.

Given this cost function, in order to fit the parameters, what we're

going to do then is try to find the parameters theta that minimize J of theta.

So if we try to minimize this, this would give us some set of parameters theta.

Finally, if we're given a new example with some set of features x, we can then take

the thetas that we fit to our training set and output our prediction as this.

And just to remind you, I'm going to interpret the output of my hypothesis as

the probability that y is equal to one,

given the input x and parameterized by theta.

So you can think of my hypothesis as estimating the probability

that y is equal to one.
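As a small sketch of that prediction step (the parameter values below are made up for illustration, not fit to any real data), the hypothesis applies the sigmoid to theta transpose x:

```python
import math

def sigmoid(z):
    """The logistic function, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """Hypothesis h_theta(x): the estimated probability that y = 1 given x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical fitted parameters and a new example; x[0] = 1 is the intercept term.
theta = [-1.0, 0.8, 0.5]
x = [1.0, 2.0, 1.0]
p = predict_proba(theta, x)  # estimated P(y = 1 | x; theta)
```

Because the sigmoid output lies strictly between 0 and 1, it can always be read as a probability.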

So all that remains to be done is figure out how to actually minimize J of theta

as a function of theta so

that we can actually fit the parameters to our training set.

The way we're going to minimize the cost function is using gradient descent.

Here's our cost function, and if we want to minimize it as a function of theta,

here's our usual template for

gradient descent, where we repeatedly update each parameter,

updating it as itself minus the learning rate alpha times this derivative term.

If you know some calculus, feel free to take this term and try to compute

the derivative yourself and see if you can simplify it to the same answer that I get.

But even if you don't know calculus, don't worry about it.
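If you do work out that derivative, it simplifies to (1/m) times the sum over the training set of (h_theta(x) - y) times x_j. Here is a minimal batch gradient descent sketch using that simplified form (a toy illustration with made-up data, not the lecture's own code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_descent(X, y, alpha=0.5, iters=5000):
    """Fit logistic regression by batch gradient descent.

    Each step performs the update
        theta_j := theta_j - alpha * (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij,
    which is what the derivative of J(theta) works out to.
    """
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(iters):
        # Predictions h_theta(x) for every training example.
        h = [sigmoid(sum(t * xj for t, xj in zip(theta, x))) for x in X]
        # Gradient of J(theta); note every parameter is updated simultaneously.
        grad = [sum((h[i] - y[i]) * X[i][j] for i in range(m)) / m
                for j in range(n)]
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy training set: x[0] = 1 is the intercept, one real feature;
# labels switch from 0 to 1 as the feature crosses roughly 2.5.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta = gradient_descent(X, y)
```

After fitting, predictions on the training examples should fall on the correct side of 0.5, which is all this sketch is meant to demonstrate.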