0:00

Welcome back. In the previous lecture, we learned about

supervised learning, but humans and animals in general do not get exact

supervisory signals when they're learning to find food in a maze or learning to ride

a bike or play the piano. We learn by trial and error and we might

get rewards or punishments along the way. For example, we might find food at the end

of the maze, or get praise and criticism from the

piano teacher, or we might even get an amazing reward at the end of a course: a

certificate of accomplishment. This leads us to the last type of

learning that we'll consider in this course, reinforcement learning.

0:36

In reinforcement learning we have an agent such as a rat interacting with an

environment, such as this barn. The agent at any point in time t can be

in a state denoted by ut, where ut is a vector that could denote, for example,

the location of the rat in the barn. And the agent may get a reward at any

point in time t and this reward is denoted by rt.

And rt is a scalar value that can be

positive or negative. The reward might denote, for example, the amount of food

that the rat gets at a particular location in the barn, or it could represent a

particularly nasty encounter with a cat in some location in the barn.

1:20

Now the problem facing the agent, or the rat in this case,

is selecting the best actions that will maximize the total expected future reward,

and this is the problem of reinforcement learning.

Perhaps the earliest results in reinforcement learning were obtained by

Pavlov in his experiments with dogs. These are the classical conditioning or

Pavlovian conditioning experiments. So what did Pavlov do?

Well he rang a bell and followed the ringing of the bell with some food reward

for the dog. And he repeated this association of bell

followed by food reward many, many times. Here's what he observed.

He observed that every time he rang the bell the dog began to salivate as

depicted by this animation here. So what do you conclude from this?

You can conclude that the conditioned stimulus, which in this case is the bell,

predicts the future reward, which is the food.

2:24

So the problem faced by Pavlov's dog's brain is this.

How do we predict rewards that are delivered some time after a stimulus such

as the bell is presented? Well, let's see if we can formalize this

particular problem. So what we are given then are many, many

trials, each of length let's say capital T time steps.

And let's denote the time within one particular trial using this little t.

And let's denote the stimulus that we might get at any particular time step as

ut, so for example ut might be the ringing of a bell or not.

And the reward, rt is what the animal might get at each time step t.

So this could mean that the animal gets a food reward at some particular time step

t or maybe it doesn't get any reward at all so rt can be 0 for some time steps t.

And here's what we would like. So we would like a neuron whose output,

vt, predicts the expected total future reward.

And so what we would like is for the output vt to be approximately equal to

the average over all the trials of the summation of all the rewards from

timestep t onwards until the end of the trial denoted by capital T.
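In symbols, the desired output just described can be sketched as follows (using the lecture's notation, with angle brackets denoting the average over trials):

```latex
v_t \;\approx\; \left\langle \sum_{\tau = t}^{T} r_\tau \right\rangle
```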

3:52

Here's how you can get a neuron to predict the expected total future reward.

We can use a set of synaptic weights, w, and we can predict based on all past

stimuli, ut. So here is a network that can perform

this operation. We have to use what is called a tapped

delay line to feed in all the past inputs into this network.

And here is the output of the network. It's simply a weighted summation of all

of the weights with the past inputs. And you'll notice that this is nothing

but the equation for a discrete linear filter, so linear filtering strikes

again. And here is our standard trick for

learning the weights: we can minimize an error function.

And here's the error function: it's just the squared difference between the total

future reward and the prediction of the total future reward.
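The filter output and the error function just described can be written as follows (a sketch in the lecture's notation, with w_tau the synaptic weight for delay tau):

```latex
v_t = \sum_{\tau = 0}^{t} w_\tau\, u_{t-\tau}
\qquad
E = \sum_{t} \left( \left\langle \sum_{\tau = t}^{T} r_\tau \right\rangle - v_t \right)^{2}
```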

So how do we minimize the error function? Can we, for example, use gradient descent

and the delta rule as in the previous lecture?

5:23

Well the key idea goes back to Richard Bellman and his optimization method known

as dynamic programming. And the idea is to rewrite the error

function recursively to get rid of the future terms that are not available at

this time. So how does this apply to our problem?

Well here's the problematic summation of future rewards.

And we can rewrite that as rt plus the sum of all the future rewards.

And here is the key jump. We replace the summation of all the

future rewards with the prediction by our network of the expected future reward.

Now we have an error function where all the quantities are

available to us between time step t and t plus 1, and so

we can minimize this error function using our old friend gradient descent.

When we do that we get what's called the temporal difference rule or TD learning

rule, and this was originally proposed by Sutton and Barto in the 1980s.

And here's what the learning rule looks like.

So the weights for each delay, tau, are updated according to these three terms.

So there's the learning rate as before and then there's a prediction error term

given by delta and there's also the input.

Now why is this learning rule called temporal difference learning?

Well, as you can see in this term, we have a temporal difference between the

prediction at time t + 1 and the prediction at time t.
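Putting the steps above together, the recursive rewrite and the resulting TD learning rule can be sketched as (epsilon is the learning rate):

```latex
\sum_{\tau = t}^{T} r_\tau \;=\; r_t + \sum_{\tau = t+1}^{T} r_\tau \;\approx\; r_t + v_{t+1}
\qquad
\delta_t = r_t + v_{t+1} - v_t
\qquad
w_\tau \leftarrow w_\tau + \epsilon\, \delta_t\, u_{t-\tau}
```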

7:00

Well, if you're skeptical about the temporal difference learning rule,

I wouldn't blame you. It's not obvious that replacing the sum of

future rewards with a prediction can actually work in practice.

Well hopefully this example will convince you.

Suppose we take the example of Pavlov's dog. And suppose that the bell, the stimulus,

is given at time step 100 and the reward, the food, is given at time step 200 within

any given trial. Now let's look at the situation before

and after learning. So in the case of the stimulus and the

reward there is no difference before and after learning.

Because the stimulus and the reward are presented at time step 100 and around

time step 200 in both of these cases. But look at what happens to the

prediction. The prediction of the network is all

zeroes initially, but after learning,

there is a prediction of two starting at

And why is it two? Well two is the total reward that is

delivered around time step 200. And so you can see that the network has

learned to correctly predict the total reward it expects starting from the time

of the stimulus. It's also interesting to note what

happens to the delta, the prediction error.

So you can see before learning, the delta is high around the time of the reward and

so that's because the network is predicting all zeroes, whereas the reward

is delivered at time step 200. And so the prediction error is now going

to be essentially just the reward but look at what happens after learning.

So after learning around the time of the reward, there is a delta of 0.

So there is no error in prediction but the prediction error has shifted now to

the time of the stimulus. And that's because what this reflects is

just the value of the prediction, v of 100, minus the value of the

prediction at the previous time step, which is v of 99. So this reflects the

prediction error given by v 100 minus v 99.
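This behavior can be reproduced in a small simulation. Here is a minimal sketch of the TD rule for the lecture's example, with the stimulus at time step 100 and a reward of 2 at time step 200; the trial length, learning rate, and number of trials are illustrative assumptions, and the prediction is recomputed once per trial for simplicity:

```python
import numpy as np

# Minimal sketch of TD learning for the Pavlov example in this lecture:
# stimulus (bell) at time step 100, food reward of 2 at time step 200.
# The trial length T, learning rate epsilon, and number of trials are
# illustrative assumptions, not values from the lecture.

T = 250
u = np.zeros(T); u[100] = 1.0   # stimulus u_t: the bell rings at t = 100
r = np.zeros(T); r[200] = 2.0   # reward r_t: food of size 2 at t = 200
w = np.zeros(T)                 # one weight per delay tau (tapped delay line)
epsilon = 0.2                   # learning rate

for trial in range(1000):
    # prediction v_t = sum over tau of w[tau] * u[t - tau] (discrete linear
    # filter); recomputed once per trial to keep the sketch simple
    v = np.convolve(w, u)[:T]
    for t in range(T - 1):
        delta = r[t] + v[t + 1] - v[t]   # TD prediction error
        # delta rule: each weight w[tau] is nudged by delta times its input u[t - tau]
        w[:t + 1] += epsilon * delta * u[t::-1]

v = np.convolve(w, u)[:T]
print(round(float(v[150]), 2), round(float(v[50]), 2))
```

After enough trials, v is approximately 2 from the time of the stimulus through the time of the reward and zero elsewhere, and the prediction error delta has moved back from the time of the reward to the time of the stimulus, as in the plots discussed here.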

Now this plot shows how delta changes as a function of trials. So at the very

start, in trial number 1, we have a bump

around the time of the reward, around 200,

and that's identical to this plot here. But as the network is exposed to several

trials, this bump moves backwards in time until it reaches a value of 2 at the time

of the stimulus. And that is exactly the situation we have

here. So that is where the network has

learned to predict a value of 2 for the total reward it expects from the time of

the stimulus at time step 100. Now here are some intriguing results from

Wolfram Schultz and colleagues. They recorded from the ventral tegmental

area of the midbrain of a monkey. And the neurons in the ventral tegmental

area or VTA are dopaminergic, which means that they transmit the neurotransmitter

dopamine to different parts of the brain. Now dopamine has been implicated in

reward based learning and it is also involved in various addictive behaviors

such as addiction to drugs like cocaine. In the experiments, the monkey was

presented with a stimulus, for example a sound and then the monkey had to press a

key. A short while later the monkey was

rewarded and here's what the neurons in the ventral tegmental area did in this

experimental paradigm. Before training, the neurons in the

ventral tegmental area had a very high firing rate around the time of the

reward. But after training, the neurons no longer

responded near the time of the reward. They started responding around the time

of the stimulus. Now what does this remind you of?

11:03

That's right, these two plots look very similar to the plots for delta, the

prediction error, from the previous slide. So what this suggests is that the neurons

in the ventral tegmental area may be encoding reward prediction error.

And that would explain why you have a big response before training around the time

of the reward. Whereas after training, the response is

very small because the reward prediction error now is very small since the animal

has learned to predict the reward. And then the reward prediction error is

larger around the time of the stimulus because now that response encodes the

prediction error vt minus vt minus 1, which is similar to the error that we saw

in the previous slide, v100 minus v99. Now here's an interesting question.

What do you think will happen if we don't give the monkey any reward at the time

that it expects to get the reward? Well the monkey is probably going to

think it's a cruel joke, but what do you think is going to happen to the

responses of neurons in the ventral tegmental area.

12:10

Well, that's right. You would expect to see a negative error because the prediction

was not fulfilled, and that's indeed what Wolfram Schultz and colleagues observed

in the ventral tegmental area neurons. So when there was reward, you have this

response, which is similar to what we had in the previous slide.

But when the reward is omitted, there's no reward, then you see a dip in the

firing rate of the neurons. And that corresponds to a negative error

in the temporal difference learning model of the dopaminergic cells NVTA.

Now that you know how the brain might learn to predict rewards, you might be

asking the question: how does the brain learn to select actions that maximize

future rewards. This will be the topic of our next

lecture. Until then.

Zài jiàn, and goodbye.