Well, now let's proceed with the loss.

Once we have defined the loss,

we might want to minimize it.

One of the simplest and most widespread methods of

minimizing the loss is gradient descent.

That is, we differentiate our loss with respect to the parameters w

and change our parameters in the direction of the minus gradient with some step size alpha.

Why minus gradient?

Well, because the gradient is defined as the direction of the fastest increase of the function.

The opposite direction, that is, the minus

gradient, shows the direction of the fastest decrease of the function.

And this allows us to change the parameters w so

that our loss function decreases as fast as possible.
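As a concrete illustration, here is a minimal sketch of gradient descent on a simple quadratic loss; the loss, step size, and starting point are illustrative choices, not part of the lecture.

```python
import numpy as np

def gradient_descent(grad_fn, w0, alpha=0.1, n_steps=100):
    """Repeatedly step in the direction of the minus gradient."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        w = w - alpha * grad_fn(w)  # w <- w - alpha * dL/dw
    return w

# Example: L(w) = ||w||^2 has gradient 2w and its minimum at w = 0.
w_final = gradient_descent(lambda w: 2 * w, w0=[3.0, -2.0])
```

Each step shrinks the parameters toward the minimizer, since moving against the gradient always decreases this loss for a small enough alpha.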

However, to update the parameters w with gradient descent,

we again have to differentiate the whole loss.

And we know that we actually cannot even compute the loss.

We only have its sample-based estimate.

Well, in fact, what we can do is

approximate the true gradient with its stochastic estimate.

That is, we approximate the full sum with one component of that sum.

And this leads us to stochastic gradient descent.

In SGD, stochastic gradient descent,

we approximate the full gradient with its

estimate at a particular state and action s and a.

This state and action, just as

we have discussed previously, are sampled from

rho pi or rho of the behavior policy, depending on whether we learn on- or off-policy.

To reiterate, sampling a state and action from rho

is in practice done by picking a state and action

from the agent's experience of interaction with the environment.

In practice, however, one replaces the one-sample estimate

of the gradient with a more stable estimate that uses a batch of samples,

not only a single one.
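The contrast between the full gradient and its minibatch estimate can be sketched as follows; the dataset, the least-squares loss standing in for the RL loss, and the batch size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data standing in for the RL setting:
# 1000 samples, 3 features, targets from a known weight vector.
X = rng.normal(size=(1000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

def full_gradient(w):
    """Gradient of the mean squared error over the whole dataset."""
    return 2 * X.T @ (X @ w - y) / len(X)

def minibatch_gradient(w, batch_size=32):
    """Noisy but unbiased estimate of the same gradient from a few samples."""
    idx = rng.integers(0, len(X), size=batch_size)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size

# SGD: step along the minus minibatch gradient instead of the full one.
w = np.zeros(3)
for _ in range(2000):
    w -= 0.05 * minibatch_gradient(w)
```

Each minibatch gradient is noisy, but on average it points in the same direction as the full gradient, so the iterates still converge.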

So far, we have talked about the gradient but have not shown how to compute it.

In fact, to apply SGD,

we only need to know how to compute the gradient of L(s,

a), which is the squared difference between the goal and our current estimate of it.

Why are we talking about this at all?

Isn't the squared difference easy to differentiate? Well, it is.

What is trickier here is the dependence of the goal on the parameters w. These goals,

as I have mentioned, are simple numbers in the case of Monte Carlo targets.

But they are functions of the parameters w in the case of temporal difference targets.
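The distinction can be sketched like this; the feature vector, parameters, and all numeric values are illustrative assumptions.

```python
import numpy as np

w = np.array([0.5, -0.3])           # current parameters

# Monte Carlo goal: an observed return, a plain number independent of w.
g_mc = 4.2

# Temporal difference goal: immediate reward plus gamma times the current
# bootstrapped estimate at the next state-action pair, so it changes
# whenever w changes.
phi_next = np.array([1.0, 2.0])     # features of the next state-action pair
g_td = 1.0 + 0.99 * (w @ phi_next)
```

Differentiating g_mc with respect to w gives zero automatically; differentiating g_td does not, which is exactly where the complication arises.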

And we should differentiate them with respect to the parameters w. This is the

mathematically correct way to do the learning, but

it interferes with the natural understanding of the task.

If we differentiate the goals with respect to the parameters,

we will, in a sense, reverse the natural flow of time.

That is, we will not only make the value estimate of the current state and

action look more similar to our target, the cumulative reward.

But we will also make the subsequent estimate of

the return depend on the previous rewards.

That is not what we want in general.

Thus, we introduce so-called semi-gradient methods, which treat the

goal as fixed, and thus, for any particular type of goal,

the gradient of the goal with respect to the parameters is equal to zero.
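A minimal sketch of such a semi-gradient update with linear function approximation, q(s, a; w) = w · phi(s, a); the feature vectors, step size, and discount factor are illustrative assumptions.

```python
import numpy as np

def semi_gradient_update(w, phi_sa, r, phi_next, alpha=0.1, gamma=0.99):
    """One semi-gradient step: move q(s, a; w) = w . phi(s, a) toward
    the TD target, treating the target as a constant."""
    target = r + gamma * (w @ phi_next)  # goal g(s, a); NOT differentiated
    td_error = target - w @ phi_sa       # target minus current estimate
    # With the target held fixed, the gradient of 0.5 * td_error**2
    # w.r.t. w is -td_error * phi_sa, so we step along +td_error * phi_sa.
    return w + alpha * td_error * phi_sa

w0 = np.zeros(2)
w1 = semi_gradient_update(w0, phi_sa=np.array([1.0, 0.0]),
                          r=1.0, phi_next=np.array([0.0, 1.0]))
```

Note that `target` is computed from `w` but is then treated as a plain number: no gradient flows through it, which is exactly what distinguishes this from a true gradient step.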

This semi-gradient approach is very similar to what goes on in usual

supervised learning, where the goals are almost always fixed numbers.

Bootstrapping methods such as

temporal difference learning are not in fact instances of true gradient descent.

They include only part of the gradient.

Accordingly, we call them semi-gradient methods.

But on the other hand,

they simplify the math a lot and are shown to work well in many practical tasks.

Let me now summarize the properties of semi-gradient methods.

The essence of the semi-gradient update is that it treats the goals g(s,

a) as fixed numbers.

Like a gradient update, this kind of

update changes the parameters in a way that moves estimates closer to targets.

But unlike a gradient update,

it completely ignores the effect of the update on the target.

And because of this, the semi-gradient is not a proper gradient.

It doesn't possess the convergence properties of stochastic gradient descent,

but it converges reliably in most cases.

It is also more computationally efficient and faster than true SGD.

Having said all this,

semi-gradient updates are a meaningful thing to do

because this type of parameter update corresponds

to the asymmetric structure of the task: time always goes only forward.

Let us now return to the target definitions.

In fact, the targets are deeply connected with the algorithm names.

Moreover, these targets define what estimate we learn and thus

are very important to understand.

In SARSA, our target is the current reward,

that is, the immediate r(s,

a), plus gamma times our estimate

of the action value function in the next state and the next action.

This next action is simply sampled from our policy pi.
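Computing this SARSA target can be sketched as follows; the function name, the Q-values, and the policy probabilities are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sarsa_target(r, q_next, pi_next, gamma=0.99):
    """r(s, a) + gamma * q(s', a'), with the next action a' sampled
    from the current policy pi over the actions in the next state."""
    a_next = rng.choice(len(q_next), p=pi_next)
    return r + gamma * q_next[a_next]

# Example: a policy that deterministically picks action 1 in the next state.
g = sarsa_target(r=1.0,
                 q_next=np.array([1.0, 2.0, 3.0]),
                 pi_next=np.array([0.0, 1.0, 0.0]))
```

Because a' comes from the policy being learned, SARSA is an on-policy method: the target reflects the action the agent would actually take next.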