0:00

In this video, we'll talk about how to fit the parameters of theta for

the logistic compression.

In particular, I'd like to define the optimization objective, or

the cost function that we'll use to fit the parameters.

0:15

Here's the supervised learning problem of fitting logistic regression model.

We have a training set of m training examples and as usual,

each of our examples is represented by a that's n plus one dimensional,

0:32

and as usual we have x o equals one.

First feature or a zero feature is always equal to one.

And because this is a computation problem,

our training set has the property that every label y is either 0 or 1.

This is a hypothesis, and the parameters of a hypothesis is this theta over here.

And the question that I want to talk about is given this training set,

how do we choose, or how do we fit the parameter's theta?

Back when we were developing the linear regression model,

we used the following cost function.

I've written this slightly differently where instead of 1 over 2m,

I've taken a one-half and put it inside the summation instead.

Now I want to use an alternative way of writing out this cost function.

Which is that instead of writing out this square of return here,

let's write in here costs

of h of x, y and

I'm going to define that total cost of h of x, y to be equal to this.

Just equal to this one-half of the squared error.

So now we can see more clearly that the cost function is a sum over my training

set, which is 1 over n times the sum of my training set of this cost term here.

1:56

And to simplify this equation a little bit more,

it's going to be convenient to get rid of those superscripts.

So just define cost of h of x comma y to be equal to

one half of this squared error.

And interpretation of this cost function is that, this is the cost I want my

learning algorithm to have to pay if it outputs that value,

if its prediction is h of x, and the actual label was y.

So just cross off the superscripts, right,

and no surprise for linear regression the cost we've defined is that or

the cost of this is that is one-half times the square difference between what

I predicted and the actual value that we have, 0 for y.

Now this cost function worked fine for linear regression.

But here, we're interested in logistic regression.

If we could minimize this cost function that is plugged into J here,

that will work okay.

But it turns out that if we use this particular cost function,

this would be a non-convex function of the parameter's data.

Here's what I mean by non-convex.

Have some cross function j of theta and

for logistic regression, this function h here

3:12

has a nonlinearity that is one over one plus e to the negative theta transpose.

So this is a pretty complicated nonlinear function.

And if you take the function, plug it in here.

And then take this cost function and plug it in there and

then plot what j of theta looks like.

You find that j of theta can look like a function that's like this

3:33

with many local optima.

And the formal term for this is that this is a non-convex function.

And you can kind of tell, if you were to run gradient descent on this sort of

function It is not guaranteed to converge to the global minimum.

Whereas in contrast what we would like is to have a cost function j of theta that is

convex, that is a single bow-shaped function that looks like this so

that if you run theta in the we would be guaranteed that

4:01

would converge to the global minimum.

And the problem with using this great cost function is that because of this

very nonlinear function that appears in the middle here, J of theta ends

up being a nonconvex function if you were to define it as a square cost function.

So what we'd like to do is, instead of come up with a different cost function,

that is convex, and so that we can apply a great algorithm,

like gradient descent and be guaranteed to find the global minimum.

Here's the cost function that we're going to use for logistic regression.

We're going to say that the cost, or

the penalty that the algorithm pays, if it upwards the value of h(x),

so if this is some number like 0.7, it predicts the value h of x.

And the actual cost label turns out to be y.

The cost is going to be -log(h(x)) if y = 1 and

-log(1- h(x)) if y = 0.

This looks like a pretty complicated function, but

let's plot this function to gain some intuition about what it's doing.

Let's start off with the case of y = 1.

If y = 1, then the cost function is -log(h(x)).

And if we plot that, so let's say that the horizontal axis is h(x),

so we know that a hypothesis is going to output a value between 0 and 1.

Right, so h(x), that varies between 0 and 1.

If you plot what this cost function looks like, you find that it looks like this.

One way to see why the plot looks like this is because if you were to plot log z

5:45

with z on the horizontal axis, then that looks like that.

And it approaches minus infinity, right?

So this is what the log function looks like.

And this is 0, this is 1.

Here, z is of course playing the role of h of x.

And so -log z will look like this.

6:06

Just flipping the sign, minus log z, and we're interested only in the range

of when this function goes between zero and one, so get rid of that.

And so we're just left with, you know, this part of the curve, and

that's what this curve on the left looks like.

Now, this cost function has a few interesting and desirable properties.

First, you notice that if y is equal to 1 and h(x) is equal to 1,

in other words, if the hypothesis exactly predicts h equals 1 and

y is exactly equal to what it predicted, then the cost = 0 right?

That corresponds to the curve doesn't actually flatten out.

The curve is still going.

First, notice that if h(x) = 1, if that hypothesis

predicts that y = 1 and if indeed y = 1 then the cost = 0.

That corresponds to this point down here, right?

If h(x) = 1 and we're only considering the case of y = 1 here.

But if h(x) = 1 then the cost is down here, is equal to 0.

And that's where we'd like it to be because if we correctly predict the output

y, then the cost is 0.

But now notice also that as h(x) approaches 0, so as

the output of a hypothesis approaches 0, the cost blows up and it goes to infinity.

And what this does is this captures the intuition that if a hypothesis of 0,

that's like saying a hypothesis saying the chance of y equals 1 is equal to 0.

It's kinda like our going to our medical patients and saying

the probability that you have a malignant tumor, the probability that y=1, is zero.

So, it's like absolutely impossible that your tumor is malignant.

7:55

But if it turns out that the tumor, the patient's tumor, actually is malignant, so

if y is equal to one, even after we told them,

that the probability of it happening is zero.

So it's absolutely impossible for it to be malignant.

But if we told them this with that level of certainty and we turn out to be wrong,

then we penalize the learning algorithm by a very, very large cost.

And that's captured by having this cost go to

infinity if y equals 1 and h(x) approaches 0.

This slide consider the case of y equals 1.

Let's look at what the cost function looks like for y equals 0.

8:32

If y is equal to 0, then the cost looks like this,

it looks like this expression over here, and if you plot the function,

-log(1-z), what you get is the cost function actually looks like this.

So it goes from 0 to 1, something like that and so if you plot

the cost function for the case of y equals 0, you find that it looks like this.

And what this curve does is it now goes up and

it goes to plus infinity as h of x goes to 1 because as I was saying,

that if y turns out to be equal to 0.

But we predicted that y is equal to 1 with almost certainly, probably 1,

then we end up paying a very large cost.

And conversely,

if h of x is equal to 0 and y equals 0, then the hypothesis melted.

The protected y of z is equal to 0, and it turns out y is equal to 0,

so at this point, the cost function is going to be 0.

In this video, we will define the cost function for a single train example.

The topic of convexity analysis is now beyond the scope of this course, but

it is possible to show that with a particular choice of cost function,

this will give a convex optimization problem.

Overall cost function j of theta will be convex and local optima free.

In the next video we're gonna take these ideas of the cost function for

a single training example and develop that further, and define the cost function for

the entire training set.

And we'll also figure out a simpler way to write it than we have been using so

far, and based on that we'll work out grading descent, and

that will give us logistic compression algorithm.