Now, let's see how all those upsides and downsides translate into an actual learning algorithm. There is a super popular implementation of advantage actor-critic called the asynchronous advantage actor-critic; we'll cover the asynchronous part in a moment. Since it's an actor-critic method, it has a few limitations that prevent it from using some of the tricks we've studied so far. For example, it's basically restricted from using experience replay. Since actor-critic is on-policy, you have to train on the actions taken under its current policy. If you feed it actions sampled from an experience replay, say, actions actually played an hour ago, you will probably end up feeding it actions that are strategically different, not following the same ideas it has learned over this hour. For example, if an hour ago your agent was not able to, say, float up in the Seaquest Atari game, you will not see your algorithm improving on this particular aspect until those old actions get thrown out of the experience replay.

Now, since experience replay is off the table, we have to somehow mitigate the problem of having sessions that are not independent and identically distributed. There is an alternative strategy, which basically says that if you have enough parallel sessions, you can pretend they are i.i.d. The idea here is that you spawn, say, 10 clones of your agent, which all share the same set of weights but play in independent replicas of your environment. And of course, they have to take actions by sampling from the policy independently. If you play like this for, say, 1000 steps, your independent replicas will probably end up in completely different states of the environment. This happens because at every step you have to pick an action, and you pick actions for all replicas independently, so you're very likely to diverge in at least some actions if you play for long enough. Some agents will probably terminate early because they've taken inferior actions and ended with a failure, like losing an Atari game; they start over and are back at the very beginning. Other, luckier agents will be in the middle or later stages of the game. And since games usually have many different states you can end up in, the replicas will probably follow different trajectories as well. So you'll have a, well, not perfectly independent, but at least more or less identically distributed sample from the environment, and this is how on-policy methods deal with the i.i.d. problem.

To train faster, this algorithm also uses an idea from the field of distributed parallel computing. Since those independent replicas of the environment are not required to interact that often, they are assigned to different processes. This basically means that you take your multi-CPU or multi-GPU machine, which has something like 30 CPU cores, and you run a parallel instance of an agent playing a game on each core. You can then update the weights of each agent on its particular core and synchronize them periodically to prevent them from diverging too far. So another peculiarity of this A3C algorithm is that it uses tools from parallel computing to speed up the training process.
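To make the parallel-sessions idea concrete, here is a minimal sketch in Python. It is not from the lecture: it assumes the Gymnasium API, uses CartPole-v1 as a stand-in environment, and a toy linear softmax policy, all purely for illustration. The point it shows is that every replica shares one set of weights, samples its actions independently, and resets on its own when it fails, so after a while the replicas spread across different states of the environment and the combined batch of transitions is roughly identically distributed.

```python
# Minimal sketch (assumptions: Gymnasium API, CartPole-v1, toy linear policy).
import numpy as np
import gymnasium as gym

n_envs = 10
envs = [gym.make("CartPole-v1") for _ in range(n_envs)]
obs = np.stack([env.reset(seed=i)[0] for i, env in enumerate(envs)])

n_actions = envs[0].action_space.n
weights = np.zeros((obs.shape[1], n_actions))  # one shared set of policy weights


def sample_actions(states, weights):
    """Softmax over a linear scoring function, sampled independently per replica."""
    logits = states @ weights
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return np.array([np.random.choice(n_actions, p=p) for p in probs])


for step in range(1000):
    actions = sample_actions(obs, weights)
    next_obs = []
    for env, s, a in zip(envs, obs, actions):
        s_next, r, terminated, truncated, _ = env.step(int(a))
        # <an on-policy actor-critic update would use (s, a, r, s_next) here>
        if terminated or truncated:
            s_next, _ = env.reset()  # unlucky replicas start over from the beginning
        next_obs.append(s_next)
    obs = np.stack(next_obs)
```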
Basically, you can assume that if you're training on a modern server, you have, say, 30 or even more CPU cores available to train on in parallel. What you can do then is take your 30, or however many you have, parallel playing sessions, each consisting of an agent, its weights, and an environment, and assign each of those sessions to a different core. Then you simply train on all those cores in parallel: you run a plain advantage actor-critic, A2C, on each particular core, and you periodically synchronize the weights to make sure they don't diverge too far. So this is what A3C does: it has a lot of independent processes, it lets them train a little bit, and then, at some fixed, or well, some limited time intervals, it takes the weights from all the cores and synchronizes them to make sure they are still doing the same thing, as sketched below.

Now, the final part, one that is yet unexplored within the scope of our course, is that A3C is very famous for a particular configuration called A3C plus LSTM. As you might have guessed, this basically means that the agent here uses some recurrent memory. The general idea behind recurrent memory is that it allows your agent to memorize some of its observations, so that it can improve its policy based not only on what it can see right now, but also on what it has seen previously. This turns out to be significant for some games, and we'll cover it in more detail in the final section of this week. So, this is the asynchronous advantage actor-critic.

Now, in the original article where this method was proposed, the authors compared it to other value-based methods. All those methods were trained in the same parallel asynchronous way: basically, you have parallel workers that do either actor-critic updates, or Q-learning updates, or whatever, and they synchronize weights from time to time. What you can see is that in many environments, the asynchronous advantage actor-critic tends to both converge faster in the initial phase and sometimes reach better final performance. So it not only does stuff faster, it also does stuff better in the long run. Note that all those implementations share one common benefit: they train much faster than usual, provided that you have, say, tens or maybe hundreds of cores. Even on a modern server this is not a far-fetched assumption; sometimes you have, say, dual high-end processors with 64 cores. This allows you to do super fast training, but it doesn't apply to all possible settings. In Atari games it's very cheap to simulate all those environments, but imagine you're doing something more ambitious, say, trying to train a physical robot. You have this piece of hardware which runs around a special training area, and you want to apply this parallel asynchronous advantage actor-critic approach to it. The issue is that if you do so, you'll have to run all the parallel sessions, and each of them requires its own robot, so it will be much more expensive. But in Atari, since simulation is very cheap, this method is basically almost the state of the art so far. It is the state of the art if you exclude methods that are super complicated and adapted to particular, super special cases.
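As a rough illustration of the "train locally on each core, then synchronize the weights" loop described above, here is a hedged sketch in Python. It uses a toy numpy parameter vector and a placeholder local_update function in place of a real actor-critic learner (both are assumptions for illustration), and it implements the simpler periodic weight-averaging flavor of synchronization rather than the fully asynchronous shared-parameter updates of the original A3C paper.

```python
# Sketch of "train on each core, synchronize periodically".
# Assumptions: toy numpy "weights", a fake local_update standing in for A2C.
import numpy as np
from multiprocessing import Pool

N_WORKERS = 4      # ideally one worker per CPU core
SYNC_EVERY = 50    # local update steps between synchronizations
N_ROUNDS = 20      # number of synchronization rounds


def local_update(args):
    """Pretend-train one worker's copy of the weights for SYNC_EVERY steps."""
    weights, seed = args
    rng = np.random.default_rng(seed)
    for _ in range(SYNC_EVERY):
        # Stand-in for an A2C gradient step computed from this worker's own
        # playing session; real code would interact with its environment here.
        weights = weights + 0.01 * rng.normal(size=weights.shape)
    return weights


if __name__ == "__main__":
    shared_weights = np.zeros(8)  # toy parameter vector
    with Pool(N_WORKERS) as pool:
        for round_id in range(N_ROUNDS):
            # Every worker starts the round from the same synchronized weights.
            jobs = [(shared_weights.copy(), round_id * N_WORKERS + i)
                    for i in range(N_WORKERS)]
            local_copies = pool.map(local_update, jobs)
            # Periodic synchronization: average the copies so that the
            # workers do not diverge too far from each other.
            shared_weights = np.mean(local_copies, axis=0)
    print("synchronized weights:", shared_weights)
```

Averaging the copies after every round is the crudest way to keep the workers from drifting apart; the original A3C instead has each worker apply its gradients directly to one shared parameter vector without waiting for the others, which is where the "asynchronous" in the name comes from.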