Speaking of the policy-based algorithms we've just learned,

there's more than one way you can actually tune them to be more efficient or run

smoother by using the intuitive approach, introducing some heuristics.

You've probably learned about some of them already,

but let's go over them again just to make sure that you've got them.

First, if you're using an actor-critic method,

let's say advantage actor-critic,

you have to balance the importance of the two losses it has.

The first loss is the policy-based loss, the policy gradient,

the second one is that you have

to train your critic to minimize the temporal difference loss.

The idea here is that, in the majority of cases,

you can assume the value-based loss,

the temporal difference loss, to be the less important one.

This is because if you have a perfect critic,

but a terrible actor,

all you have is a critic which estimates how well a random agent performs.

But if you have a good actor and some random critic,

you still have an algorithm which is at least as good as REINFORCE.

The idea is that you can express this intuition by

reducing the comparative weight of the value-based loss.

You can just multiply it by some number less than one.
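
To make this concrete, here is a minimal sketch in plain Python of how the two losses might be combined. The function name `actor_critic_loss` and the default `value_coef=0.5` are assumptions made for illustration only; the lecture just says "some number less than one". In a real deep learning framework the advantages would be treated as constants when differentiating the policy term.

```python
import math

def actor_critic_loss(log_probs, advantages, values, td_targets, value_coef=0.5):
    """Combine the policy-gradient loss with a down-weighted critic (TD) loss.

    value_coef is a hypothetical default; any value below one expresses
    the idea that the critic loss matters less than the actor loss.
    """
    n = len(log_probs)
    # Policy-gradient surrogate: we want to maximise log pi(a|s) * advantage,
    # so we minimise its negative.
    policy_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    # Critic loss: mean squared temporal-difference error.
    value_loss = sum((t - v) ** 2 for t, v in zip(td_targets, values)) / n
    # Reduce the comparative weight of the value-based loss.
    total = policy_loss + value_coef * value_loss
    return total, policy_loss, value_loss

# Toy batch of two transitions (made-up numbers).
log_probs = [math.log(0.5), math.log(0.25)]
advantages = [1.0, -1.0]
values = [0.0, 1.0]
td_targets = [1.0, 1.0]
total, policy_loss, value_loss = actor_critic_loss(
    log_probs, advantages, values, td_targets
)
```

The coefficient is one more thing to tune, but it only rescales the critic term and leaves the policy gradient itself untouched.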

Another important point is that whenever you try

to apply policy-based methods in practice,

you might end up in a situation where, because of some particular quirk,

say, a gradient explosion if you are using neural networks,

you end up with a policy that completely

abandons one action in at least a subset of situations.

This is basically a vicious circle because in this case,

you'll probably have your algorithm only train on

the actions it has just produced, because most of these methods are on-policy,

and in this case, you won't be able to learn to pick this action ever again.

So, if you have abandoned an action,

you're no longer receiving samples consisting of this action in some particular state.

You are no longer able to

discover that this action might be optimal sometimes.

Of course, if you're dead sure that this action is useless, it's okay to drop it,

but in other cases, you have to

tell your algorithm that it should not completely give up on actions.

As we already did in the cross-entropy method section in the very first week,

there is a way to do so with neural networks by introducing

a loss that kind of regularizes the policy.

In this case, you can use, for example, the negative entropy.

What you want to do is encourage your agent

to increase the entropy of its policy, with, of course,

some very small coefficient.

And if you remember how entropy works,

this basically results in your agent

preferring not to give a probability of zero to anything.

Of course, this requires you to tune another parameter,

but it is safe to assume that if you have

a sufficiently small but non-zero coefficient multiplying the entropy,

your agent will probably unlearn

this harmful policy of never taking

an action after some large but finite number of iterations.

This is a weak guarantee,

but you're probably not going to get anything better with approximate methods.
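
As a rough illustration of the entropy term, here is a small sketch in plain Python. The names `entropy` and `regularized_policy_loss` and the default `entropy_coef=0.01` are assumptions for this example; the lecture only asks for "some very small coefficient".

```python
import math

def entropy(probs):
    """Shannon entropy H(pi) = -sum_a pi(a) * log pi(a), taking 0 * log 0 as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def regularized_policy_loss(policy_loss, probs, entropy_coef=0.01):
    """Add the negative entropy to the loss with a small (hypothetical) weight."""
    # Subtracting the entropy means that minimising the loss pushes entropy up,
    # which discourages assigning zero probability to any action.
    return policy_loss - entropy_coef * entropy(probs)

uniform = [0.25, 0.25, 0.25, 0.25]    # never gives up on an action
collapsed = [1.0, 0.0, 0.0, 0.0]      # has abandoned three actions

loss_uniform = regularized_policy_loss(0.0, uniform)
loss_collapsed = regularized_policy_loss(0.0, collapsed)
```

With the same policy-gradient term, the uniform policy gets the lower loss, so the regularizer pulls a collapsed policy back toward trying every action.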

Another thing, as we have already discussed in the [inaudible] section,

you can take advantage of the fact that in the modern world,

almost anything including a smartphone probably has more than one CPU core in it.

The idea here is that if you have parallel sessions,

you can parallelize the sampling procedure.

You can basically train your algorithm on

sessions that are obtained by running several environments on those parallel cores.

Or you can go even further by training in parallel on each core and, basically,

periodically synchronizing weights, as it was done in A3C.
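
The simpler of the two options, parallel sampling, can be sketched with the standard library. Everything here is a stand-in: `play_session` is a hypothetical toy rollout, not a real environment, and the worker count is arbitrary. Threads are used to keep the example self-contained; in CPython, `ProcessPoolExecutor` offers the same `map` interface and would be the way to actually occupy several CPU cores.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def play_session(seed, n_steps=10):
    """Hypothetical rollout: returns a list of (state, action, reward) tuples."""
    rng = random.Random(seed)  # each worker gets its own random generator
    session = []
    state = 0
    for _ in range(n_steps):
        action = rng.choice([0, 1])          # stand-in for sampling from the policy
        reward = 1.0 if action == 1 else 0.0
        session.append((state, action, reward))
        state += 1
    return session

def sample_sessions_in_parallel(n_workers=4):
    """Run one toy session per worker and collect the results in order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(play_session, range(n_workers)))

sessions = sample_sessions_in_parallel()
# Flatten all sessions into one training batch of transitions.
batch = [transition for session in sessions for transition in session]
```

The A3C-style variant, where each worker also computes updates and the weights are synchronized, is not shown here.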

Finally, just a very tiny, teeny,

technical quirk concerning neural networks only,

or, well, neural networks for the most part.

Using policy gradient, you are required to construct a formula which uses

the logarithm of the probability of taking

action a in state s, multiplied by your advantage,

or the reward, depending on what algorithm you use,

and in a deep learning framework,

you have to do so a bit more carefully than you otherwise would.

The issue here is that if you simply take the probability,

and then take the logarithm of this probability,

some frameworks will compute it in a numerically fragile way.

Basically, if you use, say, 32-bit floating-point precision,

you may end up with a probability which rounds down to almost zero,

and the logarithm of almost zero is almost negative infinity.

You can mitigate this with the log-softmax formula.

The idea here is that if you explicitly write down

the formula for the logarithm of your softmax nonlinearity,

you will end up with a formula which is much simpler and more stable

than if you just compute the softmax and the logarithm one after the other.
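
Here is a small sketch of the difference in plain Python. It uses ordinary 64-bit floats and a deliberately extreme pair of logits so that the underflow shows up without any framework; the function names are chosen for this example only. The log-softmax identity used is log softmax(x)_i = x_i - max(x) - log(sum_j exp(x_j - max(x))).

```python
import math

def softmax(logits):
    """Plain softmax, with the usual max-subtraction to avoid overflow."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(logits):
    """Log of softmax written out explicitly, never forming tiny probabilities."""
    m = max(logits)
    log_norm = math.log(sum(math.exp(x - m) for x in logits))
    return [x - m - log_norm for x in logits]

logits = [0.0, -1000.0]

probs = softmax(logits)
# probs[1] underflows to exactly 0.0, so math.log(probs[1]) would raise
# ValueError here (frameworks typically return -inf instead).

log_probs = log_softmax(logits)
# The explicit formula keeps the log-probability finite.
```

Deep learning frameworks generally ship a ready-made log-softmax operation, which is usually preferable to writing your own.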

Now, this is what you are going to do in the practice session,

so don't worry if you have not grasped this concept entirely on the first attempt.