Here are the results of using TD Learning to learn values for our problem of the

rat in the barn using a random policy. Each of these plots shows the value for

the states A, B and C, as represented by the weights wA, wB and wC.

The jagged lines show the values as a function of trial number.

And you can see that the values for each of these locations A, B, and C jump around a bit, but the running average, as represented by the dark line, converges to the right answer. So 1.75, 2.5, and 1 were the values that we calculated for A, B, and C on the previous slide.

So indeed, the temporal difference learning rule appears to be learning the correct values for each of these locations.
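The simulation behind these plots can be sketched in a few lines of code. This is a minimal TD(0) sketch, not the lecture's exact setup: it assumes the random policy at A goes left to B or right to C with equal probability, and that leaving B yields a reward of 2.5 while leaving C yields a reward of 1, so that the values work out to the 1.75, 2.5, and 1 computed on the previous slide.

```python
import random

# TD(0) value learning for the rat-in-the-barn example (a sketch).
# Assumed layout: from A the random policy goes left to B or right
# to C with equal probability; leaving B yields reward 2.5 and
# leaving C yields reward 1. These rewards are assumptions chosen
# to match the values 1.75, 2.5, and 1 from the lecture.

eps = 0.5                                   # learning rate, as in the lecture
w = {"A": 0.0, "B": 0.0, "C": 0.0}          # value estimates (weights)
running = {"A": 0.0, "B": 0.0, "C": 0.0}    # running averages (the dark lines)

random.seed(0)
for trial in range(1, 5001):
    # one episode: A -> (B or C) -> end
    s = "B" if random.random() < 0.5 else "C"
    # transition A -> s: no immediate reward, bootstrap from w[s]
    w["A"] += eps * (0.0 + w[s] - w["A"])
    # transition s -> terminal: immediate reward, no successor value
    r = 2.5 if s == "B" else 1.0
    w[s] += eps * (r - w[s])
    # incremental running average of each estimate across trials
    for k in w:
        running[k] += (w[k] - running[k]) / trial

print({k: round(v, 2) for k, v in running.items()})
```

With the high learning rate of 0.5 the raw estimate for A keeps bouncing between values near 2.5 and near 1, but the running averages settle close to 1.75, 2.5, and 1.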

Now why are these values jumping around so much? Well, it's because we've set the learning rate epsilon to a high value of 0.5. That speeds up the learning process, but it will also cause your value estimates to jump around a bit.

Now why did we go through the trouble of

finding the value of each of the states? Well, here is the answer, as observed by

our astute friend, the friendly rat. Once you know the values for the states,

you solve the action selection problem. Here's why.

If you're given the choice between two different actions that lead to two

different states, then all you have to do is pick the action that leads you to the

higher valued state in the next time step.

Let's see if this works in our example. Well, as you might have guessed, it does.

Let's consider the action that we should take in location A. We have two possible actions: go left or go right. All we have to do now is look at the values associated with the next states reached by each of these respective actions.

So if we take a left, we end up in state B, and we can look up the value, which is the expected reward we get in state B; as we computed on the previous slide, it is 2.5. Now similarly, if we take the action right, then we end up in state C, which, as we computed earlier, has a value of 1, so that's the expected reward that we might get if we move to state C.

So given these two possible states that we could move to by taking the action left or the action right, the obvious choice here is to choose the action going left, which will take us to state B, the state with the higher expected reward or value.
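In code, this one-step greedy rule is just an argmax over successor-state values. Here is a small sketch, where the state names and values come from the example and the transition table is written out explicitly:

```python
# Greedy action selection using learned state values (a sketch).
# values holds the numbers computed in the lecture; next_state maps
# (state, action) -> successor state.
values = {"B": 2.5, "C": 1.0}
next_state = {("A", "left"): "B", ("A", "right"): "C"}

def greedy_action(state, actions=("left", "right")):
    # pick the action whose successor state has the highest value
    return max(actions, key=lambda a: values[next_state[(state, a)]])

print(greedy_action("A"))  # -> left, since V(B) = 2.5 > V(C) = 1.0
```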

The important point here is that we're using values as surrogate immediate

rewards. So what do we mean by that?

Well, consider the fact that in locations B and C we do not get any immediate rewards, but we can compute the value, which is the expected reward at B and at C, and use that value as a surrogate for the immediate reward. And so we can use the value to guide our selection of action at location A.

This leads us to the important result

that a locally optimal choice here leads to a globally optimal policy, as long as

we have a Markov environment. And by Markov we mean that the next state

only depends on the current state and the current action.

This important result, which we can rigorously prove, is closely related to

the concept of dynamic programming first proposed by Richard Bellman.