Now, this leads us to another family of reinforcement learning algorithms, called actor-critic algorithms. The idea behind actor-critic is that it combines pretty much everything you've learned so far in this course. It has a value-based part, where it tries to approximate the value function, but it also uses a policy function, which can be approximated the same way. The idea is that by combining the value function and the policy function, you can obtain better learning performance and some of the properties that, say, the REINFORCE algorithm lacks. Now let's pull back the veil of mystery a little bit, and next we'll describe these algorithms in more detail. This time we're going to focus on the advantage actor-critic. Some of you who have developed a sense of intuition have probably noticed that it has this advantage term here, and as your intuition suggests, this advantage follows the idea that we are going to learn the difference between your current Q-function and the average performance in this state. So again, this algorithm learns both your policy and the value function, and it actually learns the value function to improve the performance of policy training. To understand how this is possible, I want you to answer a question for me. Say you've just sampled a state, action, reward, next-state tuple (s, a, r, s') from the environment, and you've also learned the V-function, so for every state you're able to get the expected discounted return. The idea here is that you want to use this information to compute the advantage, so you want to either produce the advantage itself or get some kind of unbiased estimate of it. How would you do that? Well, it turns out that if you remember the properties of the Q- and V-functions, it's kind of easy; it just requires three lines of math. You start by writing out the advantage as the difference between your action-value function, the Q-function, and the state-value function V.
Now, this V-function is just the expectation of the Q-function over actions. To make this formula easier to compute, you have to remember that the Q-function can be rewritten as the reward plus gamma, the discount, times the value function of the next state. This of course requires an expectation over all possible next states, but you can take just one sample from the environment, just like you did in Q-learning. Now the advantage becomes the difference between the Q-function and the V-function, which, as this line suggests, is just the reward plus gamma times the value of s', minus the value of s. This is how you can use just the V-function to estimate your advantage, and presumably learn better. And this allows us to make a very simple substitution in the formula which actually brings us a lot of improvement: we just take the Q-function and replace it with Q minus V, also known as the advantage function. So the formula changed ever so slightly, but this allows us to use the idea that we want to encourage the difference between how the agent performed now and how it usually performs. Even in a situation where the agent got a poor reward, if that reward is way higher than what it usually gets in that state, you would encourage this improvement rather than discourage it, in comparison to other situations. The only question remaining is how to get this V-function. If we have this V-function everything is great, but now we have to somehow estimate this function for your particular environment. Think Atari for now: let's say you're playing Breakout and you want to estimate the V-function. How do you do that? You could approximate it, or use some other tricks, but those tricks are usually specific to an environment. So what you do is train a neural network that has two outputs. First, it has to learn a policy, because otherwise there's no point in doing anything else.
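The three lines of math above can be written out explicitly, using the standard definitions of Q and V (gamma is the discount factor, and the last line is the single-sample estimate you get from one environment transition):

```latex
\begin{align}
A(s,a) &= Q(s,a) - V(s) \\
Q(s,a) &= \mathbb{E}_{s'}\!\left[\, r + \gamma\, V(s') \,\right] \\
A(s,a) &\approx r + \gamma\, V(s') - V(s)
\end{align}
```

Substituting this into the policy gradient in place of the Q-function gives the update direction $\nabla_\theta J \approx \mathbb{E}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s)\,\bigl(r + \gamma V(s') - V(s)\bigr) \right]$.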
The second part is that it estimates the value function. Speaking in deep learning language, the policy head is a layer that has as many units as you have actions, and it uses a softmax to return a valid probability distribution. The value head is just a single unit, a dense layer with one neuron and no nonlinearity, just like the output of the Q-learning networks, DQN for example. You then have to perform two kinds of updates. First, you have to update the policy: in this case you believe that your V is good enough, and you use your V-function to provide this better estimate for the policy gradient. You then ascend this policy gradient over your policy parameters. The second important task is that you have to refine your value function. This is done in a similar way as in deep Q-learning, deep SARSA, or any other approximate value-based algorithm: you just compute the mean squared error over all the (s, a, r, s') tuples that you get, and this way you push V towards the expectation of the discounted return. You can of course add some of the DQN enhancements here, but those are usually slightly harder to apply and don't bring as much benefit as they do in the DQN case. What you do is simply learn those two functions interchangeably: you compute the gradient of this objective J and ascend it, then you compute the gradient of this mean squared error and descend it over the parameters of your value function, the critic. So again, the important part is that you keep refining your V-function as you go, using an algorithm very similar to how you trained DQN before: you simply take your (s, a, r, s') tuples, compute the temporal difference error, in this case as a mean squared error, and minimize it by backpropagation through the neural network.
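As a minimal sketch of this two-headed architecture, here is a toy numpy version with untrained random weights (all sizes and names are illustrative, not from the course's actual code): a shared body, a softmax policy head with one unit per action, a single-neuron linear value head, and the two loss terms described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_hidden, n_actions, gamma = 4, 16, 2, 0.99

# Shared body plus two heads (weights are random here; a real agent trains them).
W_body = rng.normal(scale=0.1, size=(n_obs, n_hidden))
W_pi   = rng.normal(scale=0.1, size=(n_hidden, n_actions))  # policy head
W_v    = rng.normal(scale=0.1, size=(n_hidden, 1))          # value head: 1 neuron

def forward(s):
    h = np.tanh(s @ W_body)                  # shared hidden layer
    logits = h @ W_pi
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                     # softmax -> valid action distribution
    v = float(h @ W_v)                       # V(s): single linear unit
    return probs, v

# One (s, a, r, s') sample and the two loss terms.
s, s_next = rng.normal(size=n_obs), rng.normal(size=n_obs)
a, r = 1, 1.0
probs, v = forward(s)
_, v_next = forward(s_next)

# The advantage is treated as a constant in the actor loss
# (no gradient flows through V when updating the policy head).
advantage   = r + gamma * v_next - v          # single-sample advantage estimate
actor_loss  = -np.log(probs[a]) * advantage   # descend this = ascend policy gradient
critic_loss = (r + gamma * v_next - v) ** 2   # TD mean squared error for V
```

In a real implementation you would compute gradients of `actor_loss` and `critic_loss` with an autodiff framework and apply the two updates interchangeably, as the lecture describes.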
The deal here is that if you take a lot of samples, you'll converge to the mathematical expectation, and this way you'll get approximately the true V-function. It's also important to note that in this case you're not as reliant on the value-based part of your network as you were in DQN, because even though your value function is not very accurate at the beginning, you can still subtract it. Remember, you can subtract anything which does not depend on your action, and the V-function definitely does not, by its definition and by the design of your neural network. In this case, even a poorly trained value function will bring some improvement to how your agent trains. As I've already mentioned, this family is called the actor-critic algorithms. In this case you have two heads: the actor and the critic. The first head, the blue one, is the actor: it picks the actions, it models the probability of picking action a in state s, and this is your policy. The second head is the critic. The idea of the critic is that it estimates how good your particular state is; basically, it's used to help train your network. The idea is that if you train your actor and critic heads interchangeably, you'll not only obtain this value-based head as a side quest, which allows you to measure the value, you'll also obtain an algorithm that improves the convergence of your policy-based head, the actor. We'll see how these two methods, REINFORCE and actor-critic, compare on practical problems later on in this lecture.
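The claim that you can subtract any action-independent baseline without biasing the policy gradient is easy to check numerically. For a softmax policy, the gradient of log pi(a) with respect to the logits is onehot(a) minus pi, and its expectation under a ~ pi is exactly zero, so a constant baseline contributes nothing on average. A small sketch (the policy and baseline values here are arbitrary, chosen just for the demo):

```python
import numpy as np

rng = np.random.default_rng(42)

probs = np.array([0.2, 0.5, 0.3])  # some fixed softmax policy pi(a|s)
baseline = 7.0                     # any value that does not depend on the action

# Sample many actions from the policy and average
# grad_logits log pi(a) * baseline = (onehot(a) - pi) * baseline.
n = 200_000
actions = rng.choice(3, size=n, p=probs)
grads = (np.eye(3)[actions] - probs) * baseline

mean_contrib = grads.mean(axis=0)  # ~0 in every component: no bias introduced
```

The sample mean only hovers around zero, of course; what the baseline actually changes is the variance of the estimate, which is exactly why subtracting V(s) helps even when V is poorly trained.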