[MUSIC] So now let's speak about how we can solve some of the instability problems discussed in the previous video. In particular, DQN includes tweaks against three problems. The first is sequential, correlated data, which may hurt convergence and performance. The second is instability of the data distribution due to policy changes, because the policy may diverge and oscillate. And the third is the general problem of unstable gradients, which arises in part because of the differentiable Q-function and the unknown scale of rewards.

So how does DQN deal with these problems? Sequential correlated data is handled with the so-called experience replay. Instability of the data distribution due to policy changes is overcome by using target networks. Finally, unstable gradients are largely eliminated with reward clipping. These three techniques are rather influential, especially the first two.

Now let me go into detail about the first of these techniques, experience replay, and give you an intuition of why it helps against correlated samples. Let's first look at the Q-learning update. Please note that to make such an update, we need only the tuple S, A, R, S', that is, the state, action, reward, and next state; nothing more, nothing less. If the essence of the problem comes from using consecutive samples for model updates, let's de-correlate the sequence by training the model on a shuffled collection of such data. That is, we can collect tuples S, A, R, S', then, for training, sample some number of these tuples from the collection and perform updates according to the sampled mini-batch. That is the complete idea of experience replay.

To illustrate, the algorithm is as follows. We first store some tuples S, A, R, S' in a pool. Then we sample some number of tuples from that pool, say, at random, and update the model of the Q-function using this mini-batch. Then we interact with the environment using an epsilon-greedy policy with respect to the current estimates of the Q-function.
By doing so, we obtain some new tuples S, A, R, S', and we add these tuples to the pool. And we repeat this in a loop until we decide to terminate the learning process.

Very simple, but why does it help? Well, if the pool is sufficiently large, then we effectively de-correlate the data by taking, in each update, different pieces of possibly different trajectories. What can possibly go wrong here? Well, such experience replay is possible only for off-policy learning. That is because on-policy methods imply that only new, fresh data coming from the policy with the latest parameters can be used for learning, and our current parameters are different from those used to generate the old samples.

Experience replay is a very powerful technique. It is used almost everywhere it can be used because of its properties: it smooths out learning and prevents oscillations or divergence of the parameters. Experience replay not only helps against correlations, but also increases sample efficiency and reduces the variance of updates. It is also very easy to use for training on distributed architectures, that is, on a cluster of machines. Another interesting observation is that experience replay can be viewed as an analog of a sample-based model of the world. This view partially explains the effectiveness of the technique.

However, experience replay is not completely free of disadvantages. It is very memory intensive; in the original paper, for example, the authors stored around 1 million interactions, and that is a lot. Also, uniform random sampling from the pool is not the most efficient sampling method. We might want to select recent experience more frequently than old experience, or we might want to use the samples from which our current policy would learn the most. These and other ideas were developed only afterwards by different researchers.

So far, we have discussed the problem of policy oscillations, and have noted that experience replay partially solves this problem.
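The replay loop described above can be sketched in a few lines of Python. This is a minimal illustration under my own naming (the class `ReplayBuffer` and its methods are not from the original paper), not the original implementation.

```python
import random
from collections import deque

class ReplayBuffer:
    """A fixed-size pool of (s, a, r, s') transitions (minimal sketch)."""

    def __init__(self, capacity):
        # deque with maxlen evicts the oldest transitions automatically
        self.pool = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.pool.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling de-correlates consecutive transitions
        return random.sample(self.pool, batch_size)

    def __len__(self):
        return len(self.pool)

# Usage: interact with the environment, store transitions, then
# train the Q-network on a randomly sampled mini-batch.
buffer = ReplayBuffer(capacity=1_000_000)
buffer.add(s=0, a=1, r=0.5, s_next=1)
buffer.add(s=1, a=0, r=-0.5, s_next=2)
batch = buffer.sample(2)  # mini-batch of de-correlated transitions
```

In a full agent, the `batch` would be fed to the Q-network update in place of the most recent transition, which is exactly what breaks the temporal correlation.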
Nonetheless, the instability problem is not eliminated completely by experience replay. Targets still depend on the parameters, and an error in any target immediately propagates to other estimates. This dependence of targets on parameters can easily break the learning by introducing oscillations and positive feedback cycles. For example, we might want to strongly update the parameters responsible for a low Q-value estimate that corresponds to a high target value. But by doing so, and because the approximation parameters are shared, we might also increase the value of the target. This, in turn, may increase the gradients meant to eliminate the growing gap between estimate and target. But in a positive feedback loop, these large gradients will only make things worse.

So how can we break the tie between targets and network parameters? The idea against such positive feedback loops proposed in the DQN paper was both simple and effective: just separate the parameters used to compute the targets from the parameters of the currently learning Q-function approximation. To highlight the fact that the parameters of the target and approximation networks are different, we will refer to the former as the target network, and to the latter as the Q-network. The target network has precisely the same architecture as the Q-network but a different set of parameters, denoted by W with a minus superscript on the slide.

However, we cannot be satisfied with a mere distinction between parameters. That is because if the target parameters, W minus on the slide, are not properly updated, the agent will obviously learn wrong action-value estimates. Thus, at the core of the target network trick lies another crucial concept concerning how these parameters are updated. In fact, they can be updated in either of two ways. The first way, which was actually used in the paper, is to update these parameters in the so-called hard way.
That is, once in a while, say, every 10,000 time steps, assign the parameters of the current Q-network to the parameters of the target network. You can think about this type of update as creating snapshots of the Q-network and refreshing these snapshots from time to time. Another way to update the target network is to update its weights at every time step but with a very small update rate. In the simplest form, this idea corresponds to the parameters of the target network being an exponential moving average of the Q-network parameters. This type of update allows the agent to learn a little bit faster but is not completely free of policy oscillation problems.

Well, now you know two of the most important tricks to improve the stability of neural Q-learning. Now let's discuss the last trick. The last trick introduced in the DQN paper was reward clipping. This trick was designed against the problem of unstable gradients. This problem is in part inherent to reinforcement learning because of the ever-changing action-value function. In addition, in a new environment, we don't know the scale of rewards beforehand, and this scale may vary significantly across states and actions, contributing to gradient instability. The proposed trick was to clip the rewards to the range from -1 to 1, which made the Q-values less peaky and stabilized the gradients a bit. However, this trick also introduced a rather crucial drawback: when clipping rewards to the range of -1 to 1, an agent loses the ability to differentiate between a good reward, say a reward of 1, and a very good reward, say a reward of 100. It is also worth mentioning that this trick, unlike the previous two, wasn't adopted by future researchers because of its drawbacks. Nevertheless, sometimes it may be helpful.

So now you know all the details of the DQN algorithm. To conclude, let us now see the results of applying this algorithm to the famous Atari game called Space Invaders. Aren't they impressive? [MUSIC]
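The two target-network update schemes and the reward clipping discussed above can be sketched as follows. This is a simplified illustration with parameter vectors as plain Python lists; the function names are my own, not from the DQN paper.

```python
def hard_update(target_params, q_params):
    """Hard update: copy the Q-network weights into the target network.
    In DQN this is done once every N steps (e.g. every 10,000)."""
    return list(q_params)

def soft_update(target_params, q_params, tau=0.001):
    """Soft update: the target weights track an exponential moving
    average of the Q-network weights, applied at every step."""
    return [(1.0 - tau) * t + tau * w
            for t, w in zip(target_params, q_params)]

def clip_reward(r):
    """Clip the reward to [-1, 1], as in the original DQN paper."""
    return max(-1.0, min(1.0, r))

w_q = [1.0, 2.0]       # current Q-network parameters
w_target = [0.0, 0.0]  # target-network parameters (W minus)

w_target = soft_update(w_target, w_q, tau=0.5)  # moves halfway: [0.5, 1.0]
w_target = hard_update(w_target, w_q)           # snapshot copy: [1.0, 2.0]
r_clipped = clip_reward(100.0)                  # a reward of 100 becomes 1.0
```

Note how clipping makes the reward of 100 indistinguishable from a reward of 1, which is exactly the drawback mentioned above.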