In the previous video,

we introduced the value function and the optimal value function,

and discussed how the latter can be found by

numerically solving the Bellman optimality equation.

The optimal policy can be computed from

the optimal value function once the latter is computed.

But it turns out that such an approach is not the only one possible,

and there is a more convenient approach

that suits the goals of Reinforcement Learning better.

This approach is based on the so-called action-value function.

Let's see how the action-value function is different from the value function.

Let's first recall that the value function at a given time

depends only on the state S and the policy pi.

It does not explicitly contain information about an action that the agent takes now.

The action-value function Q is a function of

both the current state ST and the action AT taken in this state.

It computes the same expected cumulative reward as the value function does,

but conditions it on both ST and AT.

Similar to the value function,

the action-value function Q, or simply the Q-function, depends on the policy pi.

The two arguments S and A of the Q-function are the current state and the current action.

This means that in the Q-function,

the first action is fixed and given by the value of A,

while all later actions are determined by policy pi.
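In symbols, this definition might be written as follows (a sketch assuming a discount factor gamma and rewards r_t; the slide's exact notation may differ):

```latex
Q^{\pi}(s_t, a_t) = \mathbb{E}\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k} \,\middle|\, s_t,\; a_t \right]
```

Here the first action a_t is fixed, and all later actions are drawn from the policy pi.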

We can use this definition in order to obtain the Bellman equation for

the Q-function in a similar way to how

we obtain the Bellman equation for the value function.

What we have to do is again to split the sum into the first term and the rest.

The first term is a reward from taking the action A now.

Because the action A is fixed as it's an argument of the Q-function,

the reward is fixed as well.

Now, the second term is the same sum,

but starting at the next time step.

As we just said,

this term does not depend on A,

it depends only on policy pi.

So, it's simply the value function for

the same problem but starting at the next time step.

And this gives us a Bellman equation for the Q-function which is shown here.

It expresses the Q-function at time T in terms of a reward at

the same time T plus a discounted expected

value of the value function at the next moment.
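In the same notation as before, this Bellman equation for the Q-function might be sketched as (the slide's exact form may differ):

```latex
Q^{\pi}(s_t, a_t) = \mathbb{E}\left[\, r_t \mid s_t, a_t \right] + \gamma\, \mathbb{E}\left[\, V^{\pi}(s_{t+1}) \mid s_t, a_t \right]
```

Note that it indeed involves two unknown functions, Q and V, which is the inconvenience discussed next.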

Such equation would not be very convenient to work with

as it now involves two unknown functions instead of just one.

But a more convenient equation can be obtained if we look at the optimal Q-function.

Now, the optimal Q-function is defined in the same way as the optimal value function.

Namely, the optimal action-value function Q star is

the maximum of the Q-function over all possible policies pi.

So, the optimal Q-function says something like that,

take action A now and then follow the optimal policy pi star later.

Now, let's compare it with the definition of the optimal value function,

which essentially says, take

the optimal action as determined by the optimal policy both now and later.

But this means that the optimal value function V star is simply the maximum of

the optimal action-value function Q star over all possible actions A taken now.

In other words, it's a maximum of Q star with respect to its second argument.
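Written out, this relation might be sketched as:

```latex
V^{\star}(s_t) = \max_{a} Q^{\star}(s_t, a)
```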

Now, we can use these definitions in order to obtain

a Bellman equation that involves the optimal action-value function Q star.

First, we take the maximum over all policies pi in the Bellman equation for

the Q-function and use the definition of Q star.

This produces the first equation on the bottom of the slide.

This equation relates Q star and V star.

Now, we use the definition of V star as a maximum of Q star over all actions A,

and this produces the last equation that involves only the Q star function.

This equation is called the Bellman optimality equation for the action-value function Q.
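Combining both steps, the Bellman optimality equation for the Q-function might be sketched as (notation assumed, as before):

```latex
Q^{\star}(s_t, a_t) = \mathbb{E}\left[\, r_t + \gamma \max_{a_{t+1}} Q^{\star}(s_{t+1}, a_{t+1}) \,\middle|\, s_t, a_t \right]
```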

This is the equation that we'll be working

with next week when we look into Q-learning.

For now, let me make a few comments on how the state value function

or the action-value function can be estimated from experience,

following the machine learning paradigm.

An agent can follow policy pi and maintain the average cumulative reward for each state it has visited.

If the number of visits to each state is large enough,

then the average will converge to the value function for the state.

We can also compute such averages separately for each action taken in each state.

And in this case,

the averages will converge to the values of the action-value function.

In Reinforcement Learning, estimations of these kinds are called Monte Carlo methods.
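As a sketch of this Monte Carlo idea, here is a toy example: the two-state MDP, the uniformly random policy, and all constants are hypothetical, chosen only for illustration. The estimate averages discounted returns for each state-action pair over many episodes.

```python
import random
from collections import defaultdict

def step(state, action, rng):
    """Toy dynamics: reward 1 if the action matches the state, else 0;
    the next state is chosen uniformly at random."""
    reward = 1.0 if action == state else 0.0
    next_state = rng.choice([0, 1])
    return next_state, reward

def mc_q_estimate(num_episodes=5000, horizon=10, gamma=0.9, seed=0):
    rng = random.Random(seed)
    returns = defaultdict(list)  # (state, action) -> sampled returns
    for _ in range(num_episodes):
        # Generate one episode under a uniformly random policy pi.
        trajectory = []
        s = rng.choice([0, 1])
        for _ in range(horizon):
            a = rng.choice([0, 1])
            s, r = step(s, a, rng)[0], step(s, a, rng)[1]  # placeholder
        # (see corrected loop below)
    return returns

# Cleaner version of the episode loop, used below:
def mc_q_estimate(num_episodes=5000, horizon=10, gamma=0.9, seed=0):
    rng = random.Random(seed)
    returns = defaultdict(list)  # (state, action) -> sampled returns
    for _ in range(num_episodes):
        trajectory = []
        s = rng.choice([0, 1])
        for _ in range(horizon):
            a = rng.choice([0, 1])           # uniformly random policy pi
            s_next, r = step(s, a, rng)
            trajectory.append((s, a, r))
            s = s_next
        # Every-visit Monte Carlo: discounted return from each time step.
        g = 0.0
        for (s_t, a_t, r_t) in reversed(trajectory):
            g = r_t + gamma * g
            returns[(s_t, a_t)].append(g)
    # The averages approximate the action-value function Q^pi.
    return {sa: sum(gs) / len(gs) for sa, gs in returns.items()}

q = mc_q_estimate()
```

With enough episodes, the average for a pair like (0, 0) sits roughly one unit above (0, 1), reflecting the immediate reward difference in this toy model.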

They are simple enough as long as the number of states and actions is low.

But if this number is large or if states or actions are continuous,

it may not be practical to keep separate averages for each state-action combination.

In such cases, we should rely on

some function approximation methods for the action-value function.

We will get back to these topics later on in this course.

But for now, I wanted to show you

one of the most classical examples of Reinforcement Learning,

the Pole Balancing Problem.

The task for this problem is to apply horizontal forces to

a cart that can move left or right on a track of some length,

so as to keep a pole hinged to the cart from falling from its initial vertical position.

Failure happens when the pole falls past a given angle from the vertical line.

Now, this can be formulated as an episodic task over a specified time period.

For each time step at which failure did not occur,

the agent gets a reward of plus one.
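This reward structure can be sketched in code. The snippet below is a toy model, not the real cart-pole physics: the pole angle drifts randomly and is nudged by the applied force, the episode ends when the angle passes a limit, and all constants are assumptions made for illustration.

```python
import random

ANGLE_LIMIT = 0.2  # angle past vertical that counts as failure (assumed)

def run_episode(policy, max_steps=200, seed=0):
    """Run one episode of the toy pole-balancing task and return the
    total reward: +1 for each step at which failure did not occur."""
    rng = random.Random(seed)
    angle = 0.0
    total_reward = 0.0
    for _ in range(max_steps):
        force = policy(angle)              # +1: push right, -1: push left
        angle += -0.05 * force + rng.gauss(0.0, 0.02)
        if abs(angle) > ANGLE_LIMIT:       # failure: the pole fell
            break
        total_reward += 1.0                # reward of plus one per step
    return total_reward

# A simple corrective policy: push toward the side the pole leans to,
# which in this toy model nudges the angle back toward vertical.
def balancing(angle):
    return 1.0 if angle > 0 else -1.0
```

Running `run_episode(balancing)` keeps the pole up far longer than, say, a constant push in one direction, so it collects correspondingly more reward.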