0:00

We previously defined the cost function J.

In this video, I want to tell you about an algorithm called gradient descent for

minimizing the cost function J.

It turns out gradient descent is a more general algorithm, and

is used not only in linear regression.

It's actually used all over the place in machine learning.

And later in the class, we'll use gradient descent to minimize

other functions as well, not just the cost function J for the linear regression.

0:41

So here's the problem setup.

Going to assume that we have some function J(theta 0,

theta 1) maybe it's the cost function from linear regression,

maybe it's some other function we wanna minimize.

And we want to come up with an algorithm for

minimizing that as a function of J(theta 0, theta 1).

Just as an aside it turns out that gradient descent actually applies to more

general functions.

So imagine, if you have a function that's a function of J, as theta 0, theta 1,

theta 2, up to say some theta n, and you want to minimize theta 0.

You minimize over theta 0 up to theta n of this J of theta 0 up to theta n.

And it turns our gradient descent is an algorithm for

solving this more general problem.

But for the sake of brevity, for

the sake of succinctness of notation, I'm just going to pretend I have only

two parameters throughout the rest of this video.

Here's the idea for gradient descent.

What we're going to do is we're going to start off with some initial guesses for

theta 0 and theta 1.

Doesn't really matter what they are, but a common choice would be we set theta

0 to 0, and set theta 1 to 0, just initialize them to 0.

What we're going to do in gradient descent is we'll keep changing theta 0 and

theta 1 a little bit to try to reduce J(theta 0, theta 1), until hopefully,

we wind at a minimum, or maybe at a local minimum.

2:09

So let's see in pictures what gradient descent does.

Let's say you're trying to minimize this function.

So notice the axes, this is theta 0,

theta 1 on the horizontal axes and J is the vertical axis and

so the height of the surface shows J and we want to minimize this function.

So we're going to start off with theta 0, theta 1 at some point.

So imagine picking some value for theta 0, theta 1, and

that corresponds to starting at some point on the surface of this function.

So whatever value of theta 0, theta 1 gives you some point here.

I did initialize them to 0, 0 but

sometimes you initialize it to other values as well.

2:49

Now, I want you to imagine that this figure shows a hole.

Imagine this is like the landscape of some grassy park,

with two hills like so, and I want us to imagine that you are physically

standing at that point on the hill, on this little red hill in your park.

In gradient descent, what we're going to do is we're going to spin 360 degrees

around, just look all around us, and ask, if I were to take a little baby step in

some direction, and I want to go downhill as quickly as possible,

what direction do I take that little baby step in?

If I wanna go down, so

I wanna physically walk down this hill as rapidly as possible.

3:31

Turns out, that if you're standing at that point on the hill, you look all around and

you find that the best direction is to take a little step downhill

is roughly that direction.

Okay, and now you're at this new point on your hill.

You're gonna, again, look all around and

say what direction should I step in order to take a little baby step downhill?

And if you do that and take another step, you take a step in that direction.

4:14

This first time we ran gradient descent we were starting at this point over

here, right?

Started at that point over here.

Now imagine we had initialized gradient descent just a couple steps to the right.

Imagine we'd initialized gradient descent with that point on the upper right.

If you were to repeat this process, so start from that point, look all around,

take a little step in the direction of steepest descent, you would do that.

Then look around, take another step, and so on.

4:54

So if you had started this first point, you would've wound up at this local

optimum, but if you started just at a slightly different location,

you would've wound up at a very different local optimum.

And this is a property of gradient descent that we'll say

a little bit more about later.

So that's the intuition in pictures.

Let's look at the math.

This is the definition of the gradient descent algorithm.

We're going to just repeatedly do this until convergence,

we're going to update my parameter theta j by taking theta j and

subtracting from it alpha times this term over here, okay?

So let's see, there's lot of details in this equation so let me unpack some of it.

First, this notation here, :=, gonna use := to

denote assignment, so it's the assignment operator.

So briefly, if I write a := b, what this means is, it means in a computer,

this means take the value in b and use it overwrite whatever value is a.

So this means set a to be equal to the value of b, which is assignment.

And I can also do a := a + 1.

This means take a and increase its value by one.

Whereas in contrast, if I use the equal sign and

I write a equals b, then this is a truth assertion.

6:24

Okay? So if I write a equals b,

then I'll asserting that the value of a equals to the value of b, right?

So the left hand side, that's the computer operation,

where we set the value of a to a new value.

The right hand side, this is asserting, I'm just making a claim that the values of

a and b are the same, and so whereas you can write a := a + 1, that means increment

a by 1, hopefully I won't ever write a = a + 1 because that's just wrong.

a and a + 1 can never be equal to the same values.

Okay?

So this is first part of the definition.

This alpha here is a number that is called the learning rate.

7:08

And what alpha does is it basically controls

how big a step we take downhill with creating descent.

So if alpha is very large, then that corresponds to a very aggressive gradient

descent procedure where we're trying take huge steps downhill and

if alpha is very small, then we're taking little, little baby steps downhill.

And I'll come back and say more about this later, about how to set alpha and so on.

7:32

And finally, this term here, that's a derivative term.

I don't wanna talk about it right now, but I will derive this derivative term and

tell you exactly what this is later, okay?

And some of you will be more familiar with calculus than others, but

even if you aren't familiar with calculus, don't worry about it.

I'll tell you what you need to know about this term here.

7:55

Now, there's one more subtlety about gradient descent which is in

gradient descent we're going to update, you know, theta 0 and theta 1, right?

So this update takes place for j = 0 and j = 1, so

you're gonna update theta 0 and update theta 1.

And the subtlety of how you implement gradient descent is for

this expression, for this update equation,

you want to simultaneously update theta 0 and theta 1.

What I mean by that is that in this equation, we're gonna update theta 0 :=

theta 0 minus something, and update theta 1 := theta 1 minus something.

And the way to implement is you should compute the right hand side, right?

Compute that thing for theta 0 and theta 1 and then simultaneously,

at the same time, update theta 0 and theta 1, okay?

So let me say what I mean by that.

This is a correct implementation of gradient descent meaning simultaneous

update.

So I'm gonna set temp0 equals that, set temp1 equals that so

basic compute the right-hand sides, and then having computed the right-hand

sides and stored them into variables temp0 and temp1, I'm gonna update theta 0 and

theta 1 simultaneously because that's the correct implementation.

9:34

And the difference between the right hand side and

the left hand side implementations is that If you look down here,

you look at this step, if by this time you've already updated theta 0,

then you would be using the new value of theta 0 to compute this derivative term.

And so this gives you a different value of temp1, than the left-hand side, right?

Because you've now plugged in the new value of theta 0 into this equation.

And so, this on the right-hand side is not a correct implementation

of gradient descent, okay?

So I don't wanna say why you need to do the simultaneous updates.

It turns out that the way gradient descent is usually implemented,

which I'll say more about later,

it actually turns out to be more natural to implement the simultaneous updates.

And when people talk about gradient descent,

they always mean simultaneous update.

If you implement the non simultaneous update,

it turns out it will probably work anyway.

But this algorithm wasn't right.

It's not what people refer to as gradient descent, and

this is some other algorithm with different properties.

And for various reasons this can behave in slightly stranger ways, and so

what you should do is really implement the simultaneous update of gradient descent.

So, that's the outline of the gradient descent algorithm.

In the next video, we're going to go into the details of the derivative term,

which I wrote up but didn't really define.

And if you've taken a calculus class before and if you're familiar with partial

derivatives and derivatives, it turns out that's exactly what that derivative term

is, but in case you aren't familiar with calculus, don't worry about it.

The next video will give you all the intuitions and

will tell you everything you need to know to compute that derivative term, even if

you haven't seen calculus, or even if you haven't seen partial derivatives before.

And with that, with the next video, hopefully we'll

be able to give you all the intuitions you need to apply gradient descent.