0:00

Now, let's recap what we said in the previous course

about two classical methods to solve the Bellman optimality equation.

We can apply these methods for time homogeneous market decision processes.

As we saw in the last video, for such process,

then the value function becomes independent of time.

So, in this case,

the Bellman optimality equation becomes a non-linear equation for a single function,

that can be solved using simple numerical methods.

One such method is called value iteration.

As long as the state-action space is discrete and small,

value iteration provides a simple and quick solution to the problem.

It works as follows,

we start with a initialization of

the value function at some initial values for all states.

Usually, initialization at zero for all states works just fine.

Then we keep iterating the calculation of

the value function using the Bellman equation itself.

Where for each iteration,

we use the result of the previous iteration to

compute the right-hand side of this equation.

Now, there are a few ways to update the value function in such value iteration.

One way would be to complete computing the value function for

all states in our state space and then update develop function in all states at once.

This is called synchronous updates and

the other way would be to update the value function on the fly.

Lets say computed in the current iteration and this would be called asynchronous updates.

Now, for either of those two ways of iterating can

be proven that the algorithm converges to the optimal value function,

that we will again restart.

And after restart is found,

the optimal policy can be found using the same formula as before.

So the basic algorithm is very simple and as we said,

it works well as long as your state-action space is

discrete and has a small number of states.

Now there is another Classical method that we also outlined in

the previous course which is called the policy iteration.

In this method, we start with some initial policy pie 0,

which is usually chosen randomly.

And after that, we repeat

the following loop with

two calculations are made at each step of this loop until convergence.

The first calculation is called a policy validation.

In this calculation, we compute the value function for a given policy for this iteration.

We can do this using the Bellman equation for V

and not the Bellman optimality equation for this star.

And because the Bellman equation for V just

the linear equation this step is straight forward.

The second calculation on the same step of the loop is called policy update.

This is done again by applying

the arg max operator to the right hand side of the Bellman equation.

Now, please note that these calculations have to be done

repeatedly within policy iteration method

and they have quite a different computational burden.

The policy updates there is a straight forward and

can be done using standard optimization software.

But the policy evaluation step involves solving the Bellman equation.

So, in case when dimensionality of the state space is large,

doing it repeatedly in the process of policy evaluation can be costly.

On the other hand,

many practical problems of optimal control in

both large discrete state-action spaces or even continuous state-action spaces.

And in these settings,

methods of dynamic programming and algorithms

like value iteration or policy iteration do not work anymore.

And reinforcement learning comes to the rescue here.

We will see how it works in the next week of

this course after we introduce our first mark of

decision process model in the next lesson.