0:54

So, if f corresponds to maybe some probability, and you're trying to find the most probable set of X's, then you would find the maximum of f. However, if f were some cost function, and X1 through Xn were some parameters of that cost function, then you would be trying to find the minimum.

1:35

it's hard to draw very high-dimensional functions. But what we can do is draw a function of two arguments, and then claim that all the geometry we're going to talk about, and all of the math we're going to talk about, holds for functions of more than two arguments.

So let's say we have f(X1, X2), and let's say it looks something like this: here is X1, here is X2, and here we draw f.

2:36

Now let's pretend that f is a probability function, a probability density, and we are trying to find the most probable X1 and X2. That corresponds to the X1 and X2 where f is maximized, so on this picture you can see the maximum right there; that's the maximum.

And when you see the whole picture like this,

it's actually fairly easy to figure out where the maximum is.

3:25

However, what you can do is sample the function at different pairs of Xs.

What that means is I can provide the function with an X1, an X2 and

see what the value of the function is at that X1 and X2.

And if I could do this with every X1 and

X2, it would be very easy to see what the maximum of the function was.

I would just choose the X1 and X2 that yielded the greatest value of F.

However, this is often infeasible

especially if we have a very high dimensional space of inputs.

So what does one do?

One possibility would be to choose a bunch of random pairs of X1 and X2, and see which one yielded the highest value of f.

So maybe we would choose an X1, X2 that yielded that F right there.

Maybe one that yielded that F, maybe one over there,

maybe one there, maybe one there.

And then at the end of the day, once we have chosen a finite number of random (X1, X2) pairs, we would just say, okay, which of those pairs yielded the highest value of f? And we'll say that's the maximum.
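A minimal sketch of this random-sampling approach in Python (the function, its peak location, and the sampling ranges are all illustrative choices, not from the lecture):

```python
import random

# Illustrative two-argument function: a smooth bump peaked at (1, 2).
def f(x1, x2):
    return -((x1 - 1) ** 2 + (x2 - 2) ** 2)

# Evaluate f at many random (X1, X2) pairs and keep the pair that
# yielded the greatest value of f.
samples = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(1000)]
best = max(samples, key=lambda p: f(*p))
print(best)  # usually lands somewhere near (1, 2), but only by luck
```

With a finite number of samples there is no guarantee the best pair lands anywhere near the true maximum, especially in high dimensions.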

However, that's actually not going to work very well and

you can probably see why that's true based on this drawing.

When you're just randomly sampling,

you're using no strategy to try to figure out where the maximum might be.

You're just choosing randomly, so can we do something a little bit better?

Instead, we're going to use a method called gradient ascent and

how gradient ascent works is that you pick a starting point.

And then instead of just randomly moving to any other X1,

X2, you instead move to an X1 X2 pair that is in your little neighborhood.

So, you only move to points that are nearby.

So this is very similar to what you would do if you were maybe in a foggy city, or in a city at night, where you couldn't see very well, and you were tasked with finding the highest point of the city. You obviously can't randomly jump around to take samples of different points of the city; instead, you can only move locally.

All right, so now you're constrained to only move locally, to only be able to move to locations very nearby to where you currently are. What would you do? Well, if I were trying to find the highest point in the city, what I would do is always walk in the steepest direction.

6:36

And then, I would keep going and keep going and keep going until I got to the top. And how would I know when I got to the top?

Well, every direction would be equally steep, and the steepness would be zero,

because it would be completely flat in every direction.

So hopefully, that makes sense intuitively, and

in fact that is all that gradient ascent does.

7:21

Well, it has to do with what kind of calculation we do

to find the steepest direction.

So let's introduce the notion of the gradient, so

that's our symbol for the gradient.

A little upside-down triangle. And so if we were to take the gradient of f, that is just equal to

7:48

df/dx1, df/dx2, and so forth, all the way to df/dxn.

So, the gradient is a vector each of whose entries

is the partial derivative of f with respect to one of f's arguments.

So, if f has n arguments, or inputs, then the gradient is an n-dimensional vector.
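As a sketch, the gradient can be approximated numerically with central differences, one partial derivative per argument (the helper name and the example function are illustrative):

```python
# Numerical gradient: one partial derivative of f per argument,
# so for an n-argument f this returns an n-dimensional vector.
def gradient(f, x, h=1e-6):
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

# Example: f(x) = -(x1^2 + x2^2) has gradient (-2*x1, -2*x2).
g = gradient(lambda x: -(x[0] ** 2 + x[1] ** 2), [1.0, 2.0])
print(g)  # approximately [-2.0, -4.0]
```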

10:49

What about the next step?

Well, we calculate the gradient at (X1, X2)1, and then our next step leads us to where we were previously, plus alpha times the gradient: (X1, X2)2 = (X1, X2)1 + alpha ∇f.

11:16

Then you calculate the new gradient; if you're doing this right, the gradient might change as you move to different locations. And so this is the algorithm: at each point, you take a step of size alpha in the direction of the gradient, which is the steepest direction. And you can keep on doing this over and over and over again. And then, when do you stop?

12:07

Well, the magnitude of the gradient is equal to the steepness of the slope, so if the gradient is very large, then you have a very steep slope.

And if the gradient is 0, then you have a 0 slope and what does a 0 slope mean?

Well, that means that you've reached a maximum, so

that's how gradient ascent works.

12:39

You pick a starting point and then you follow the gradient to the top.
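That loop can be sketched in a few lines of Python (the step size alpha, the stopping tolerance, and the example function are illustrative choices, not values from the lecture):

```python
# Gradient ascent: repeatedly step in the direction of the gradient,
# scaled by alpha, until the gradient is (numerically) zero.
def gradient_ascent(grad_f, x, alpha=0.1, tol=1e-8, max_steps=10_000):
    for _ in range(max_steps):
        g = grad_f(x)
        if sum(gi * gi for gi in g) < tol:  # flat in every direction
            break
        x = [xi + alpha * gi for xi, gi in zip(x, g)]
    return x

# f(X1, X2) = -((X1 - 1)^2 + (X2 - 2)^2) is maximized at (1, 2).
grad_f = lambda x: [-2 * (x[0] - 1), -2 * (x[1] - 2)]
result = gradient_ascent(grad_f, [0.0, 0.0])
print(result)  # approaches [1.0, 2.0]
```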

This can have a few problems, unfortunately. So in this picture we have drawn, for example, which peak you reach will depend on your starting location: if we start here, we will probably reach that peak. However, if we start somewhere over here, we will reach this peak, which is only a local maximum.

So, if your goal is to find the global maximum of your function using

gradient ascent, it's usually a good idea to run the algorithm a few

times from random starting locations.

And each time you run the algorithm, you will get some final set of inputs, X1, X2, all the way to Xn, that yielded a local maximum, or that caused the gradient ascent algorithm to stop running.

And so, each time you run the algorithm you'll have a set of these X1 through Xn's. And then, all you do is pick the set of X1 through Xn that yielded the highest value of f. So that's it; that's the gradient ascent optimization method.
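A sketch of this random-restart idea, again with an illustrative function (a single-peak example, so here every start happens to reach the same maximum; with a multi-peak f the runs could end at different local maxima):

```python
import random

# One run: a fixed number of gradient-ascent steps from a start point.
def ascend(grad_f, x, alpha=0.1, steps=500):
    for _ in range(steps):
        x = [xi + alpha * gi for xi, gi in zip(x, grad_f(x))]
    return x

f = lambda x: -((x[0] - 1) ** 2 + (x[1] - 2) ** 2)
grad_f = lambda x: [-2 * (x[0] - 1), -2 * (x[1] - 2)]

# Run the algorithm from several random starting locations...
starts = [[random.uniform(-5, 5), random.uniform(-5, 5)] for _ in range(5)]
results = [ascend(grad_f, s) for s in starts]
# ...and pick the final point that yielded the highest value of f.
best = max(results, key=f)
```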

If you're trying to do gradient descent, so maybe trying to find the minimum of

a cost function for example, then you're just walking down the hill.

So you do the same thing as in gradient ascent, but you move in the direction opposite to the gradient, and again, you keep taking steps until the gradient reaches 0.

And again, we've gone through our examples in two dimensions,

so with two arguments for our function f, but

the same general idea holds for higher dimensions.
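Gradient descent is the same sketch with the sign of the step flipped (the cost function and alpha below are illustrative choices):

```python
# Gradient descent: step against the gradient to walk downhill
# on a cost function.
def gradient_descent(grad_cost, x, alpha=0.1, steps=1000):
    for _ in range(steps):
        x = [xi - alpha * gi for xi, gi in zip(x, grad_cost(x))]
    return x

# cost(X1, X2) = (X1 - 3)^2 + X2^2 is minimized at (3, 0).
grad_cost = lambda x: [2 * (x[0] - 3), 2 * x[1]]
minimum = gradient_descent(grad_cost, [0.0, 0.0])
print(minimum)  # approaches [3.0, 0.0]
```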

14:38

And when the gradient is equal to zero that means we are at a maximum or

a minimum and that's it.

This is just one of the many optimization efforts there are for

optimizing a function of many arguments.

And in many cases it works pretty well: if you're trying to find the ideal network weights for a feed-forward neural network that you're using for classification, for example, gradient ascent does a pretty good job.

However, keep in mind that it is not the be all end all solution.

There are still many varieties of functions which are not maximized particularly well with gradient ascent.

15:21

However, one good check to see if gradient ascent is working is to pick

a bunch of random starting locations.

And if they all converge to the same maximum,

then that suggests that f is actually fairly well behaved and

may have a global maximum that is relatively easy to find.

All right, and that's it for gradient ascent and descent.

Thanks for watching.