0:17

Well, let's try to formalize this problem. Here's the diagram for the reinforcement learning framework. As you recall, we had an agent interacting with an environment: at any point in time t, the agent measures a state u_t of the environment and also potentially receives a reward r_t from the environment. The agent can then execute an action a_t, which in turn changes the environment in a specific way, after which the agent measures a new state u_{t+1} and potentially receives a reward r_{t+1}.

Now, the problem faced by the agent is figuring out which action to take in any particular state. This can be formalized as the problem of learning a state-to-action mapping, or a policy, as it is called in reinforcement learning circles. The policy is typically denoted by the function pi, so pi is a function that maps states to actions. Now, what kind of a policy do we want?

We would ideally like a policy that maximizes the expected total future reward. Here's the mathematical expression for that, and you'll notice it's exactly the same expression we had when we were discussing temporal difference learning: the expectation of the sum of all the rewards we're going to get, from time step t all the way up to the end of the trial at time step capital T.

Let's try to understand the problem in the context of an example. Here's our friendly rat, and let's suppose that the rat is in a barn, given by this maze here, where different locations in the barn contain different amounts of food reward. So let's suppose that this location here contains a food reward of five, this location contains a food reward of two, and these two locations contain no food reward at all. Now, we could be mean to our rat and make these two numbers negative, which would mean that these two locations contain predators, such as cats, but let's be nice to our rat: no predators in this barn. We have marked three locations in this barn, A, B, and C, and the rat has to decide whether to go to the left or to the right at each of these locations.

So the reinforcement learning problem for the rat is to choose an action at each of these locations so as to maximize the expected future reward. Now, in the context of the formal reinforcement learning framework, can you tell me what the states and actions are in this problem? That's right: the states are the locations, and the actions are the two actions, go left or go right.

Here's another question that I would like to ask you. If the rat chooses to go left or right uniformly at random, in other words, if it's executing a random policy, then can you tell me what the expected reward would be for each state? The expected reward for each state is also called the value of that state. So can you tell me what the value of each of the states A, B, and C would be if the rat were executing a random policy of going uniformly at random left or right at each of these locations?

3:35

Well, here's the answer. Let's first look at the value of B. If you're in location B and you go left or right uniformly at random, then half the time you end up in the location with the 0 and half the time in the location with the 5, and therefore the expected reward is 2.5. Similarly, if you're in location C, then you have an expected reward, or value, of 1. And interestingly, if you are in location A, you can use the values you just computed for B and C, because those are the next states following A if you go left or if you go right. So if you go left or right uniformly at random, then you have the expression one half the value of B plus one half the value of C. Adding these together gives you the expected reward, or the value, for location A.

Well, that was easy, but how can you learn these values in an online manner? Remember that the rat is experiencing these locations sequentially, and therefore the rat has to learn these values from experience as it is going through each of these locations.
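As a quick sanity check, the hand computation above can be written out directly (a small Python sketch; the reward numbers are the ones from the barn example):

```python
# Expected reward (value) of each location under the uniformly random policy.
# At B the rat reaches the 0 or the 5 site, each with probability one half;
# at C it reaches the 2 or the 0 site; at A it moves on to B or to C.
v_B = 0.5 * 0 + 0.5 * 5      # value of B
v_C = 0.5 * 2 + 0.5 * 0      # value of C
v_A = 0.5 * v_B + 0.5 * v_C  # value of A, built from the values of B and C
print(v_A, v_B, v_C)         # 1.75 2.5 1.0
```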

How can we do that? The answer, as you might have guessed, is

to use our old friend, Temporal Difference Learning or TD Learning.

So let's represent the value v(u) of each state u by a weight w(u). Now we can update w(u) using the temporal difference learning rule. Epsilon, as you recall, is the learning rate, and here is the prediction error from the temporal difference learning rule. You can see that the prediction error term contains both the reward that you get at the state u and the prediction of the expected reward, the value, for the state u prime. So u prime here is the next state that you reach after taking an action at the location u. And v(u), of course, is the prediction of expected reward: it's the value of the state u.

Now let's look at what this u prime is. If you are in a state u and you take an action a, then you're going to end up in a state u prime. So, specifically, in our example here: if we are in the state A and we take the action go left, then u prime, the next state, is going to be the state B.
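The TD rule just described can be sketched in code for the barn example. This is a hypothetical Python sketch, not code from the lecture: the dictionary layout and the parameter values are my own choices, and terminal food sites are represented directly by their reward values. With a small learning rate, the running averages of the weights settle near the hand-computed values of 1.75, 2.5, and 1.

```python
import random

# Barn maze: A branches to B (left) or C (right); B and C branch to
# terminal food sites carrying rewards 0/5 and 2/0 respectively.
NEXT = {"A": ("B", "C"), "B": (0.0, 5.0), "C": (2.0, 0.0)}

def td_values(trials=20000, eps=0.1, seed=0):
    rng = random.Random(seed)
    w = {"A": 0.0, "B": 0.0, "C": 0.0}     # one weight w(u) per state
    avg = {"A": 0.0, "B": 0.0, "C": 0.0}   # running average of each weight
    for n in range(1, trials + 1):
        u = "A"
        while u in w:
            nxt = NEXT[u][rng.randrange(2)]   # random policy: left or right
            if isinstance(nxt, str):          # next stop is another state u'
                delta = w[nxt] - w[u]         # prediction error: v(u') - v(u)
            else:                             # next stop is a food site
                delta = nxt - w[u]            # prediction error: r - v(u)
            w[u] += eps * delta               # temporal difference update
            u = nxt
        for s in avg:                         # track the running average
            avg[s] += (w[s] - avg[s]) / n
    return avg
```

Averaging over trials plays the role of the dark lines in the plots shown next: the raw weights jitter, but their averages converge.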

6:03

Here are the results of using TD learning to learn the values for our problem of the rat in the barn under a random policy. Each of these plots shows the value of one of the states A, B, and C, as represented by the weights wA, wB, and wC. The jagged lines show the values as a function of trial number. You can see that the value for each of the locations A, B, and C jumps around a bit, but the running average, as represented by the dark line, converges to the right answer: 1.75, 2.5, and 1 were the values that we calculated for A, B, and C on the previous slide. So indeed, the temporal difference learning rule appears to be learning the correct values for each of these locations.

Now, why are these values jumping around so much? Well, it's because we've set the learning rate epsilon to a high value of 0.5. That speeds up the learning process, but it also causes your estimates of the values to jump around a bit.

Now, why did we go through the trouble of finding the value of each of the states? Well, here is the answer, as observed by our astute friend, the friendly rat: once you know the values of the states, you have solved the action selection problem. Here's why.

If you're given the choice between two different actions that lead to two

different states, then all you have to do is pick the action that leads you to the

higher valued state in the next time step.
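In code, this greedy choice over next-state values is a one-liner (a hypothetical sketch, using the values computed earlier for B and C; the action names are illustrative):

```python
values = {"B": 2.5, "C": 1.0}               # values of the possible next states
next_state = {"left": "B", "right": "C"}    # actions available at location A

# Pick the action whose next state has the higher value.
best_action = max(next_state, key=lambda a: values[next_state[a]])
print(best_action)   # "left"
```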

Let's see if this works in our example. Well, as you might have guessed, it does. Let's consider the action we should take in location A. We have two possible actions, go left or go right, and all we have to do now is look at the values associated with the next states if we take each of these respective actions. If we take a left, we end up in the state B, and we can look at its value, which is the expected reward we get in the state B; as we computed on the previous slide, it is 2.5. Similarly, if we take the action right, then we end up in the state C, which, as we computed earlier, has a value of 1, so that's the expected reward we might get if we move to the state C. Given these two possible states that we could move to by taking the action left or the action right, the obvious choice here is to choose the action going left, which takes us to the state B, the state with the higher expected reward or value.

The important point here is that we're using values as surrogate immediate rewards. What do we mean by that? Well, consider the fact that in locations B and C we do not get any immediate reward, but we can compute the value, which is the expected reward at B and at C, and we can use that value as a surrogate for the immediate reward. So we can use the values to guide our selection of an action at location A. This leads us to the important result that a locally optimal choice leads to a globally optimal policy, as long as we have a Markov environment. By Markov we mean that the next state depends only on the current state and the current action. This important result, which can be rigorously proved, is closely related to the concept of dynamic programming, first proposed by Richard Bellman.

9:42

Okay, let's put it all together and see if we can come up with an algorithm for learning optimal policies in Markov environments. The algorithm that we're going to look at is called actor-critic learning, and it's called that because there are two components. The first is the actor: the actor component selects actions and maintains the policy. The critic component maintains the value of each state.

Let's first look at the critic component and see how it learns. The critic component can also be viewed as performing policy evaluation, because it's evaluating the current policy by finding the value of each state under that policy. Now, how do we find the value of a state u? Well, we can first represent it using a weight w(u), as we did before, and then apply the temporal difference learning rule, as we did earlier in this lecture, to find the value of the state u. So here is the prediction error term, as before. The prediction error term is then used to update the weight w(u) for each state u, and that allows the critic component to compute the value of each state.

Let's now look at the actor component. The actor component, as you recall, selects actions and maintains the policy. So how does it select an action? It selects actions probabilistically, using this function, also known as the softmax function. The softmax function is basically an exponential of a Q function, where the Q function is the value of a state-action pair. We'll come to that in just a minute, but think of it as a function very similar to the value of a state, except that now we're computing the value of a state-action pair. So the softmax function takes the value of any given state-action pair and runs it through this exponential. Beta here is some fixed parameter, and when you divide by the sum of all the exponentials, you're normalizing so that the probabilities of the actions for any given state sum to 1. Given this set of probabilities for the actions in any given state, we can now select an action according to these probabilities, and that gives us the action that we execute.
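The softmax rule described above can be sketched as follows (hypothetical Python; the Q values and action names are illustrative). Note how the parameter beta controls how strongly the higher-valued action is favored, which is the point taken up next:

```python
import math

def softmax_policy(q, beta):
    """Turn Q values (dict: action -> value) into action probabilities."""
    exps = {a: math.exp(beta * v) for a, v in q.items()}
    z = sum(exps.values())                # normalizer: sum of the exponentials
    return {a: e / z for a, e in exps.items()}

q = {"left": 2.5, "right": 1.0}
print(softmax_policy(q, beta=0.1))   # small beta: near-uniform (explore)
print(softmax_policy(q, beta=5.0))   # large beta: near-deterministic (exploit)
```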

12:17

You might be wondering why we have to use a probabilistic method for selecting actions. Well, it lets us address what is known as the exploration versus exploitation dilemma in reinforcement learning. What is that? Well, consider the fact that early on, at the very beginning of learning, the Q values might not be very accurate, because you have not experienced the environment very much. What you would like to do at that stage is explore the environment, and having such a softmax selection method lets you explore the environment if the beta value is small. Then, as you learn better and better values of Q through exploration, you can increase the value of beta, which will tend to pick the actions that have the higher Q values. So you move more and more toward a deterministic action selection process that favors the actions with the higher Q values.

13:17

Once you've selected an action a at the state u, you can use this learning rule to update the Q values for all the actions a prime. This learning rule is quite similar to the learning rule for w, and it also uses the reward prediction error, except that now we multiply the reward prediction error by this term here, which uses the delta function: as you recall, the delta function is equal to 1 when a prime is equal to a, and 0 when a prime is not equal to a. The effect of this term is to multiply the reward prediction error by a positive number when a prime is equal to a, and by a negative number when a prime is not equal to a. The overall effect, then, is that the Q value for a is increased if the action a leads to a greater-than-expected reward, and decreased if the action leads to a less-than-expected reward.

The actor-critic learning algorithm now proceeds by repeating steps one and two above.
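Putting the two steps together, here is a minimal end-to-end sketch of actor-critic learning on the barn example. This is hypothetical Python, not the lecture's code: the delta-function term is implemented as an indicator minus the action probability, consistent with the positive/negative multiplier described above, and the parameter values and data layout are my own choices.

```python
import math, random

# Barn maze: A branches to B (left) or C (right); B and C branch to food
# sites. Rewards: left of B -> 0, right of B -> 5, left of C -> 2, right -> 0.
NEXT = {"A": {"L": "B", "R": "C"},
        "B": {"L": 0.0, "R": 5.0},
        "C": {"L": 2.0, "R": 0.0}}

def action_probs(Q, u, beta):
    """Softmax over the Q values of the two actions available at state u."""
    exps = {a: math.exp(beta * Q[(u, a)]) for a in ("L", "R")}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def actor_critic(trials=5000, eps=0.1, beta=1.0, seed=0):
    rng = random.Random(seed)
    w = {s: 0.0 for s in NEXT}                           # critic: state values
    Q = {(s, a): 0.0 for s in NEXT for a in ("L", "R")}  # actor: state-action values
    for _ in range(trials):
        u = "A"
        while u in NEXT:
            probs = action_probs(Q, u, beta)
            a = "L" if rng.random() < probs["L"] else "R"
            nxt = NEXT[u][a]
            if isinstance(nxt, str):          # moved to another state
                r, v_next = 0.0, w[nxt]
            else:                             # reached a terminal food site
                r, v_next = nxt, 0.0
            delta = r + v_next - w[u]         # reward prediction error
            w[u] += eps * delta               # step 1: critic update
            for a2 in ("L", "R"):             # step 2: actor update
                indicator = 1.0 if a2 == a else 0.0
                Q[(u, a2)] += eps * (indicator - probs[a2]) * delta
            u = nxt
    return w, Q
```

With these settings, the learned Q values should come to favor left at A, right at B, and left at C, the optimal policy identified in the lecture's simulation.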

Under reasonable assumptions, it can be shown to converge to the optimal policy. Now let's see if it finds the optimal policy in the example of our friendly rat in the barn. Yes, as it turns out, it does indeed find the optimal policy for our barn example. Here is the probability of going left at each location in the barn. As you can see, the probability of going left is initially 0.5 for each of the locations A, B, and C, as is the probability of going right. But after several trials, the algorithm assigns a probability close to 1 for going left at location A, which means that the favored action at location A is going left. At location B, the algorithm assigns a probability of going left that is quite close to zero, if not equal to zero, which means that going right is highly favored, so that is the preferred action at location B. And if you look at location C, you can see that there is a gradual convergence toward assigning a high probability, a probability close to one, for going left at C. Now, can you tell me why it takes the algorithm a longer time to find that the best action at location C is left?

15:48

Well, that's right: it's because of the exploration-exploitation trade-off that we observed on the previous slide. The algorithm tends to go left more often at location A, so it only rarely goes to the right, and therefore it does not get to experience the state, the location C, very often. That is why it takes a longer time to learn the value associated with C, and therefore a longer time to realize that the best action to take at location C is, in fact, going left.

16:24

Now, researchers such as Andrew Barto have suggested a mapping between the components of the actor-critic model and the components of an important structure in the brain known as the basal ganglia. The dashed box on the left represents the components of the basal ganglia, and the dashed box on the right represents the components of the actor-critic model. Note that in this case we're using a hidden layer to implement a multilayered network for mapping state estimates to values, and state estimates to actions. We can see that there's a rough similarity between the components on the left side and the right side. For example, the DA here on the left side stands for the dopamine signal, and we already saw in the previous lecture that these dopamine signals are very similar to the temporal difference prediction error signal in the temporal difference learning model. Now, this is ultimately a very abstract and high-level model of basal ganglia function, but perhaps it could serve as a starting point for more detailed models.

17:32

For more details on these actor-critic-type models of the basal ganglia, please see the supplementary materials on the course website. I would like to end the lecture by noting that reinforcement learning has been applied to many real-world problems, and as a grand finale for this lecture, I would like to show a video of autonomous helicopter flight based on work by Andrew Ng, Pieter Abbeel, and others at Stanford University. In this case, the reinforcement learning algorithm learned a policy based on a dynamics model for the helicopter and a reward function learned from human demonstrations. If you're interested in more details, I would encourage you to go to the website shown at the bottom of the slide. Okay, so are you ready to see what reinforcement learning can do? Here we go.

[NOISE] Wow, that was an exciting way to end the lecture. That, in fact, ends the last lecture of this course. Thank you all again for joining us on this first online journey through the land of Computational Neuroscience. Until we meet again, happy adventures and goodbye.