0:02

So, how to compare to policies with each other.

One natural measure of policy performance is how

much of the reward could the policy get out of the environment.

In fact, we will say that policy pi is greater or equal than pi prime,

if, and only if,

the value function of policy pi is

greater or equal to the value function of policy pi prime,

for each and every state.

This comparison criteria, naturally yield the definition of the optimal policy.

We'll say that the policy pi star is the optimal policy if,

and only if, it is greater or equal to any other possible policy.

But what if for some environment the optimal policy does not exist?

What if for some states the one policy is better,

but for the states,

the other policy is on top?

Actually, for an Markov decision process this cannot be the case.

There is a beautiful theorem stating that in any finite Markov decision process,

there exists at least one optimal policy.

In addition, at least one of optimal policies should be deterministic.

The definition of optimal policy also uses

a definition of optimal value and optimal action value functions.

In fact, on the optimal value function,

denoted v star of s,

is defined as a value of the best possible policy.

That is the value of the v star,

is equal to the value of the best possible policy that starts making it's

decisions from state s. The same definition is true for the action value function.

However, the interpretation of

optimal action value function is a bit different from the optimal value function.

The optimal action value corresponds to

the expected return that is gained by committing action a instead s,

and then, and only then,

switching to the optimal policy for the rest of an episode.

Optimal value and action value functions,

also admit a recursive computation form.

This form is known as Bellman optimality equation.

In fact, the Bellman optimality equation is

a slight modification of Bellman expectation equation.

And now we are going to discover the differences between them.

On the slide, there are two copies of backup diagram

corresponding to Bellman expectation equation for value function.

Now we are going to modify

the right one to make it match the Bellman optimality equation.

So, when do we need to change in

the right diagram to represent the optimal value function computation?

Well, but then we have already computed the optimal values for leaves of the tree,

that is, we know v star of s prime for each the next state as prime.

Now recall that x on the bottom level,

correspond to environment probabilities over which we have no control.

That is we have to propagate the value from the bottom of

the tree to action node in the same way we have done it before.

In other words, we have to account for the average environment influence on the agent.

However, over action selection on the top level of the diagram,

we do to have control,

so we can compare the values in

the action nodes and propagate back to the root on the highest value.

Paraphrasing, we don't need to compute an expectation on the top level of the tree.

We should just put it up the highest value among the action root of the sub-trees.

This procedure of taking max at the top of the back up diagram,

exactly corresponds to the Bellman optimality equation.

There exists another point of view on the Bellman optimality equation.

Now that the outer max operator in the last formula,

has replaced the outer expectation.

I computed the report specificity in the Bellman expectation equation.

However, we can unify both approaches,

if we treat the max operator as an expectation.

That is an expectation with respect to

the policy which is greedy with respect to the inner expectation.

I mean, the expectation over its distribution that is equal to one for it's

action corresponding to the highest inner expectation and zero for all other actions.

But I also note that the inner expectation,

that is everything that stays to the right of

the max operator is nothing but an optimal action value function.

So, to wrap up, the optimality equation

is in fact expectation equation whereas policy pi,

replaced with the greedy policy,

greedy with respect to the optimal action value function.

The same logic applies to the Bellman optimality equation for action value function.

The top level acts of the back up diagram for

q function consists of environment reactions,

thus we have no control over top level acts,

because we have no control over environment probabilities.

So, we have to compute the expectation on the top level of the tree.

However, we do have control over action probabilities on the bottom level.

And just like for the value function,

here we also replace the expectation with a max operator.

And again, this max operator could be seen as an expectation of a policy,

which is greedy with respect to the optimal action values of the least.

Well, as you might have noticed, during this week,

there were a lot about expressing their solution

on the value computation problem in the recursive session.

That was basically, the dynamic programming in action.

However, there is a lot more to be discussed.

So, we know what are Bellman expectation and optimality equations.

And that knowledge is fundamental for the rest of

the course and reinforcement learning in general.

In the next videos, we are going to use these equations to

actually find an optimal policy. So, stay tuned.