In this reinforcement learning tutorial, we will train a model to learn to play Pong. The goal is to train an agent, shown in green, to win a game of Pong against an opponent, shown in orange, as demonstrated in this video. An action consists of moving the paddle up or down, eventually generating a reward: +1 if the trained agent wins a game, or -1 if the agent misses the ball. Before the result of the game is known, the model gets a fake label via a stochastic process, similar to tossing a coin: an action is sampled from the probability output by the neural network and treated as if it were the correct label. The optimal set of actions is the one that maximizes the sum of rewards over the game. An important event is when the agent wins or loses a game, at which point the algorithm modulates the rewards by discounting them back over the actions of that game.

The underlying model is a two-layer, fully connected neural network that receives an input image frame, samples an action from the output unit (the probability of moving the paddle up or down), and updates its weights with gradients. When the outcome of a game is known, a reward is available: +1 if we win and -1 if we lose. This reward modulates the policy gradients, which encourage the decisions made in winning games via backpropagation. As a result, these actions have an increased probability of being chosen in future games. This algorithm is called policy gradients, and it has been shown to perform better than DQN, even by DQN's authors, as it allows end-to-end training of the deep neural network.

The errors represent the gradients of each action in the game, pointing in the direction that increases its expected reward. Since the loss values of the network are weighted by their cumulative rewards, we expect the mean of the distribution to be influenced by the few gradients with positive rewards after parameter updates. Indeed, regions around the green points seem to have small gradients, indicating the preference of the network to keep those weight values. In contrast, other regions often have large gradients, meaning that the backpropagation algorithm updates their values in search of better regions for maximizing the overall reward.

As part of this tutorial, you will be guided through implementing a policy gradient algorithm in Neon to train an agent to play Pong. You will learn how to use the Autodiff interface to perform automatic differentiation and obtain gradients given an op-tree, and how to implement a neural network that adjusts its parameters and policy with gradients to get better rewards over time, directly from experience. This model samples the expected action, moving the paddle up or down, from a distribution when no gradients are available and uncertainty is high.

We start by importing all of the needed ingredients. We have included functions implementing the forward and backpropagation steps shown here. The forward step generates a policy given a visual representation of the environment stored in the variable x, while the backpropagation function updates the layers of the network, modulating the loss function values with discounted rewards. The sample-action function chooses the action of moving up with a probability proportional to the predicted probability. This is part of the magic behind reinforcement learning: we are able to optimize objective functions under very uncertain conditions using this stochastic process. Once you run the program, you will see the algorithm in action.
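To make the forward pass, action sampling, and reward discounting described above concrete, here is a minimal NumPy sketch of those steps. The names used here (policy_forward, sample_action, discount_rewards, policy_backward, the hidden size H, and the discount factor gamma) are illustrative assumptions rather than the exact identifiers in the tutorial code, and the sketch uses plain NumPy instead of Neon's backend and Autodiff interface.

import numpy as np

# Illustrative hyperparameters (assumptions, not the tutorial's exact values)
H = 200          # number of hidden units
D = 80 * 80      # input dimension: a preprocessed 80x80 frame, flattened
gamma = 0.99     # discount factor for rewards

rng = np.random.default_rng(0)
model = {
    "W1": rng.standard_normal((H, D)) / np.sqrt(D),  # input -> hidden weights
    "W2": rng.standard_normal(H) / np.sqrt(H),       # hidden -> output weights
}

def policy_forward(x):
    """Two-layer fully connected policy: frame in, probability of moving up out."""
    h = np.maximum(0.0, model["W1"] @ x)      # ReLU hidden activations
    logit = model["W2"] @ h
    p_up = 1.0 / (1.0 + np.exp(-logit))       # sigmoid output unit
    return p_up, h

def sample_action(p_up):
    """Stochastic choice: move up with probability p_up, otherwise move down."""
    return "up" if rng.random() < p_up else "down"

def discount_rewards(rewards):
    """Spread each game's final +1/-1 reward back over the actions that led to it."""
    discounted = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:                   # a point was scored; reset the running sum
            running = 0.0
        running = running * gamma + rewards[t]
        discounted[t] = running
    return discounted

def policy_backward(xs, hs, dlogps, discounted_r):
    """Gradients for a batch of steps.

    xs, hs       : per-step inputs and hidden activations
    dlogps       : per-step (fake_label - p_up), the gradient of the log
                   probability of the sampled action w.r.t. the output logit
    discounted_r : per-step discounted rewards used to modulate the loss
    """
    epx = np.vstack(xs)                       # (T, D)
    eph = np.vstack(hs)                       # (T, H)
    epdlogp = np.asarray(dlogps) * discounted_r
    dW2 = eph.T @ epdlogp                     # (H,)
    dh = np.outer(epdlogp, model["W2"])       # (T, H)
    dh[eph <= 0] = 0.0                        # backprop through the ReLU
    dW1 = dh.T @ epx                          # (H, D)
    return {"W1": dW1, "W2": dW2}

In the parameter update, the gradient of the log probability of each sampled ("fake label") action is multiplied by its discounted reward, so decisions made in winning games are reinforced and decisions made in losing games are discouraged. The tutorial's Neon version expresses the same computation as an op-tree and uses the Autodiff interface to obtain the gradients.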
It should be noted that because this algorithm operates under extremely high uncertainty, the model will take many hours of training before it produces meaningful results.