In this video, we get to see episodic Sarsa in action on a continuous state domain. We'll visualize the learned values and gain some intuition about the solutions learned by Sarsa. By the end of this video, you will gain experience analyzing the performance of an approximate TD control method. Let's look at an example of episodic Sarsa on a classic control domain called Mountain Car. Mountain Car is an episodic task where the objective is to drive an underpowered car up the side of a mountain. Gravity is stronger than the car's engine, so the agent cannot directly drive up the mountain. The only way to escape is to first drive backwards up the left slope. Then driving down the hill gives the car enough momentum to drive up the right slope and out. The episode begins with the car in a random location near the bottom of the valley. It ends when the car reaches the flag at the top of the hill. To encourage the agent to finish the episode as quickly as possible, the reward is minus one on each timestep, and no discounting is used. The agent can observe the current position and velocity of the car. This is a two-dimensional continuous-valued state. The agent has a choice between three actions: accelerate forward, accelerate backward, or coast. For this example, we jointly tile code the position and velocity to produce features. We use eight tiles, which means each tiling is an eight-by-eight grid over the two-dimensional input space. We use eight tilings, which means we have eight overlapping grids. We treat actions independently by using a stacked feature representation. We initialize the weights to zero. This initialization is actually optimistic, because the reward is minus one per step, so the values under any policy will be less than zero. In this problem, these optimistic initial values cause extensive and systematic exploration. Because of this, we can act greedily without any additional random exploration.
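The setup above can be sketched in code. This is a minimal illustration, not the course's actual implementation: the uniform tiling offsets, the function names, and the state bounds (taken from the classic Mountain Car formulation) are assumptions. Real tile coders often use asymmetric offsets and hashing.

```python
import numpy as np

NUM_TILINGS = 8     # eight overlapping grids
TILES_PER_DIM = 8   # each tiling is an 8x8 grid
NUM_ACTIONS = 3     # accelerate backward, coast, accelerate forward

# Classic Mountain Car state bounds (position, velocity) -- an assumption here
POS_RANGE = (-1.2, 0.5)
VEL_RANGE = (-0.07, 0.07)

TILES_PER_TILING = TILES_PER_DIM ** 2
NUM_FEATURES = NUM_TILINGS * TILES_PER_TILING

def active_tiles(position, velocity):
    """Return one active tile index per tiling for a (position, velocity) state."""
    pos_scale = TILES_PER_DIM / (POS_RANGE[1] - POS_RANGE[0])
    vel_scale = TILES_PER_DIM / (VEL_RANGE[1] - VEL_RANGE[0])
    indices = []
    for t in range(NUM_TILINGS):
        offset = t / NUM_TILINGS  # shift each tiling by a fraction of a tile
        p = int(min((position - POS_RANGE[0]) * pos_scale + offset, TILES_PER_DIM - 1))
        v = int(min((velocity - VEL_RANGE[0]) * vel_scale + offset, TILES_PER_DIM - 1))
        indices.append(t * TILES_PER_TILING + p * TILES_PER_DIM + v)
    return indices

# Stacked representation: one copy of the weights per action, all zeros.
# Zero is optimistic because every true value is negative (-1 per step).
w = np.zeros((NUM_ACTIONS, NUM_FEATURES))

def q_value(tiles, action):
    # Binary features: the action value is the sum of the active weights
    return w[action, tiles].sum()

def greedy_action(tiles):
    # With optimistic initialization, acting greedily still explores
    return int(np.argmax([q_value(tiles, a) for a in range(NUM_ACTIONS)]))
```

With eight tilings, exactly eight features are active in any state, and each action gets its own stacked copy of the weight vector.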
Let's see what the value function looks like after we run the agent for a very long time. Ideally, we would like to plot the value function for every state. However, there is an uncountably infinite number of states, so we'll have to sample a set of states. We'll use the maximum action-value in each sampled state. Since the reward is minus one for each step, negating this value gives the agent's estimate of the number of steps it will take to escape from each state under its greedy policy. Here's the agent's estimate of the expected number of steps after 9,000 episodes. This is a really long time for this problem, so we're pretty sure the value estimates are about as good as they're going to get. But they're not perfect, due to function approximation. The green line represents the goal position, which does not depend on the velocity. Near the goal, if the velocity is large enough, the agent can drive directly out. However, if the agent is near the goal but the velocity is too low, it will take many steps to escape. This is because the agent must first go back down the hill and up the left side again to gain enough momentum to escape. The peak corresponds to the start states, where it takes around 120 steps to reach the flag. The green trajectory reveals the path taken through the state space by the learned policy. The value function looks pretty interesting, but let's look at some learning curves to get better insight into the speed of learning. Let's try Sarsa with three different values of Alpha, so three variants of the algorithm. Here we are interested in the steps to goal, which in this case corresponds to the return per episode. We expect the number of steps per episode to decrease with learning. Lower on the y-axis corresponds to better policies: policies that reach the goal in the fewest steps on average. As always, we average the performance over many independent runs. Here are the results.
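Sampling the value function for this kind of plot can be sketched as below. The `q_func(position, velocity)` interface is hypothetical, standing in for the tile-coded Sarsa agent; the Mountain Car state bounds are the classic ones and are assumed here.

```python
import numpy as np

def cost_to_go(q_func, pos_range=(-1.2, 0.5), vel_range=(-0.07, 0.07), resolution=40):
    """Sample -max_a q(s, a) on a grid of states for visualization.

    q_func(position, velocity) should return an array of action values.
    Negating the max converts values (negative, -1 per step) into an
    estimate of steps-to-go, which is what the surface plot shows.
    """
    positions = np.linspace(pos_range[0], pos_range[1], resolution)
    velocities = np.linspace(vel_range[0], vel_range[1], resolution)
    surface = np.empty((resolution, resolution))
    for i, p in enumerate(positions):
        for j, v in enumerate(velocities):
            surface[i, j] = -np.max(q_func(p, v))
    return positions, velocities, surface
```

The returned grid can be handed directly to a 3D surface plotter; the peak of the surface corresponds to the states with the largest estimated steps-to-go.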
By 500 episodes, the number of steps per episode has roughly stabilized. All the learning curves exhibit the familiar exponential profile. The smaller step-size parameter value of 0.1 results in slower learning, while an Alpha of 0.5 allows the agent to learn more quickly and find a better policy over 500 episodes. Notice we divided each Alpha value by eight. If we are not using a vector of step sizes, we often scale the step-size parameter by the norm of the feature vector. Here we use eight tilings in our tile coder, which means the number of ones in the feature vector is always eight. As a simple exercise, prove to yourself that eight corresponds to the L1 norm of the feature vector. Okay, that's enough for today. In this video, we evaluated episodic Sarsa with linear function approximation in Mountain Car. See you next time.
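The step-size scaling can be made concrete with a single semi-gradient Sarsa update. This is a sketch under the assumptions used above: binary tile-coded features with exactly eight active tiles, a stacked weight array `w` of shape `(num_actions, num_features)`, and undiscounted episodic returns.

```python
import numpy as np

NUM_TILINGS = 8

def sarsa_update(w, alpha, reward, tiles, action, next_tiles, next_action, done, gamma=1.0):
    """One episodic semi-gradient Sarsa update with tile-coded linear q.

    The effective step size is alpha / NUM_TILINGS: with binary features
    and exactly NUM_TILINGS active tiles, the L1 norm of the feature
    vector is NUM_TILINGS, so dividing by it keeps the size of the value
    update roughly independent of the number of tilings.
    """
    q = w[action, tiles].sum()
    if done:
        target = reward  # no bootstrapping from the terminal state
    else:
        target = reward + gamma * w[next_action, next_tiles].sum()
    # The gradient of a linear q w.r.t. w is the (binary) feature vector,
    # so only the active weights for the taken action move.
    w[action, tiles] += (alpha / NUM_TILINGS) * (target - q)
    return w
```

For example, a terminal transition with reward minus one from a zero-initialized state moves the state's value from 0 to -alpha, spread evenly across the eight active weights.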