We just looked at how dynamic programming can be used to iteratively evaluate a policy. We hinted that this was the first step towards the control task, where the goal is to improve a policy. In this video, we'll finally explain how this works. By the end of this video, you will be able to understand the policy improvement theorem and how it can be used to construct improved policies, and use the value function for a policy to produce a better policy.

Previously, we showed that given v star, we can find the optimal policy by choosing the greedy action. The greedy action maximizes the Bellman optimality equation in each state. Imagine that instead of the optimal value function, we select actions greedily with respect to the value function v pi of an arbitrary policy pi. What can we say about this new greedy policy? The first thing to note is what happens if greedification does not change pi: in that case, pi was already greedy with respect to its own value function. This is just another way of saying that v pi obeys the Bellman optimality equation, in which case pi is already optimal. In fact, the new policy obtained in this way must be a strict improvement on pi, unless pi was already optimal. This is a consequence of a general result called the policy improvement theorem.

Recall the definition of q pi. It tells you the value of taking action a in a state and then following policy pi. Imagine we take one action according to pi prime, and then follow policy pi. If this action has higher value than the action under pi, then pi prime must be better. The policy improvement theorem formalizes this idea. Policy pi prime is at least as good as pi if, in each state, the value of the action selected by pi prime is greater than or equal to the value of the action selected by pi. Policy pi prime is strictly better if the value is strictly greater in at least one state.

Let's see how this works on the four-by-four gridworld we used previously. Here's the final value function we found. Remember that this is the value function for the uniform random policy. Now, what might the greedy policy pi prime look like? In each state, we need to select the action that leads to the next state with the highest value: in this case, the value that is least negative. Here's pi prime. This is quite different from the uniform random policy we started with. Note that the values shown here do not correspond to the values for pi prime. The new policy is guaranteed to be an improvement on the uniform random policy we started with, according to the policy improvement theorem. In fact, if you look more closely at the new policy, you can see that it is optimal: in every state, the chosen actions lie on a shortest path to the terminal state. Remember, the value function we started with was not the optimal value function, and yet the greedy policy with respect to v pi is optimal. More generally, the policy improvement theorem only guarantees that the new policy is an improvement on the original. We cannot always expect to find the optimal policy so easily.

That's it for this video. You should now understand that the policy improvement theorem tells us that the greedified policy is a strict improvement, unless the original policy was already optimal. You should also now know how to use the value function for a given policy to produce a strictly better policy.
Next time, we will discuss how to use this result to create an iterative dynamic programming algorithm to find the optimal policy. See you then.
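Written out formally, with q pi defined as above, the policy improvement theorem discussed in this video says:

\[
q_\pi(s, \pi'(s)) \;\ge\; q_\pi(s, \pi(s)) \;\; \text{for all } s
\quad \Longrightarrow \quad
v_{\pi'}(s) \;\ge\; v_\pi(s) \;\; \text{for all } s,
\]

with strict improvement, \( v_{\pi'}(s) > v_\pi(s) \) in at least one state, whenever the inequality on the left is strict in at least one state.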
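To make the greedification step concrete, here is a minimal Python sketch for the four-by-four gridworld. This is illustrative code, not the course's implementation: the state numbering, helper names, and the in-place evaluation sweep are choices made for this sketch. It evaluates the uniform random policy with iterative policy evaluation, then returns the policy that is greedy with respect to those values.

```python
# Illustrative sketch of greedification on the four-by-four gridworld:
# states are numbered 0..15, states 0 and 15 are terminal, every move
# gives reward -1, and moves off the grid leave the state unchanged.

N = 4
TERMINALS = {0, 15}
ACTIONS = {'up': -N, 'down': N, 'left': -1, 'right': 1}

def next_state(s, a):
    """Deterministic transition; bumping into a wall keeps the agent in place."""
    row, col = divmod(s, N)
    if a == 'up' and row == 0:
        return s
    if a == 'down' and row == N - 1:
        return s
    if a == 'left' and col == 0:
        return s
    if a == 'right' and col == N - 1:
        return s
    return s + ACTIONS[a]

def evaluate_random_policy(gamma=1.0, theta=1e-4):
    """Iterative policy evaluation for the uniform random policy (previous video)."""
    v = [0.0] * (N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINALS:
                continue  # terminal states keep value 0
            new_v = sum(0.25 * (-1 + gamma * v[next_state(s, a)]) for a in ACTIONS)
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v

def greedify(v, gamma=1.0):
    """Return the policy that is greedy with respect to the value function v."""
    policy = {}
    for s in range(N * N):
        if s in TERMINALS:
            continue
        # q(s, a): one-step reward plus the discounted value of the next state
        q = {a: -1 + gamma * v[next_state(s, a)] for a in ACTIONS}
        policy[s] = max(q, key=q.get)  # ties broken by dictionary order
    return policy

v_random = evaluate_random_policy()
print(greedify(v_random))  # greedy actions head toward the terminal corners
```

Printing the greedy policy shows actions pointing along shortest paths to the terminal corners, matching the observation in the video that greedifying the random policy's values already yields an optimal policy in this gridworld.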