This week, we are going to talk about

the connection between reinforcement learning and supervised learning.

We will also discuss the difficulties which may arise when applying supervised learning

to reinforcement learning tasks.

Some problems that arise in

reinforcement learning are well known in the machine learning community.

These problems are not specific to reinforcement learning,

but pose a significant difficulty in practice.

So, let's discuss these problems one by one.

Probably the most obvious problem is the so-called curse of dimensionality.

That is, the number of states and actions grows

exponentially with the number of variables used to describe them.

In fact, even in relatively small problems,

the number of unique combinations of state and action

may be combinatorially large.

And for each of these combinations,

we need to learn a Q function.

For example, in Atari games,

there are generally eighteen actions available,

and the screen is of size 210 by 160.

This doesn't seem scary,

but think of all possible state and action combinations.

Their number is much,

much larger than the number of atoms in the universe.
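As a back-of-the-envelope check of that claim, here is a short Python sketch; the 128-color palette per pixel and the 10^80 atom count are rough illustrative assumptions, not exact figures.

```python
import math

# Rough count of distinct Atari screens: 210 x 160 pixels, assuming
# 128 displayable colors per pixel (the Atari 2600 palette size).
pixels = 210 * 160
colors = 128
log10_screens = pixels * math.log10(colors)  # log10 of 128 ** 33600

# Commonly cited estimate: about 10 ** 80 atoms in the observable universe.
log10_atoms = 80

print(f"log10(number of screens) ~ {log10_screens:.0f}")
assert log10_screens > log10_atoms  # vastly more screens than atoms
```

Even before accounting for actions or screen histories, the raw screen count alone exceeds the atom count by tens of thousands of orders of magnitude.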

And of course, for dealing with such state and action spaces,

we should use methods which are sufficiently sample efficient.

That is, they must be able to generalize well to unseen pairs of state and action,

and use all the available information in the data gathered so far.

And of course, we need these methods to be accurate.

That is, to make meaningful decisions.

And meaningful decisions in turn require the methods to be very flexible.

However, any model, even if it is proven to be

a universal approximator, is limited in

its modeling power when you limit the number of its parameters.

Thus, even universal approximators can be limited in practice.

Flexibility of the model raises another crucial question

about what a model is capable of.

There is the problem of overfitting

and underfitting the data,

also known as the bias-variance tradeoff.

Mishandling this problem can spoil the training,

and make the model close to absolutely useless.

These problems are worth keeping in mind,

but they're not specific to reinforcement learning,

and thus we will not go into detail about them.

Instead, we are going to talk about difficulties

of supervised learning applications in reinforcement learning.

We are going to touch several different problems.

First, we will discuss the specifics of the training data,

then we will turn our attention to the issue of non-stationary data.

And finally, we will briefly discuss one of

the most complex and open theoretical problems in reinforcement learning research,

which is called the deadly triad.

Although we separate the problems from each other,

and try to discuss them in isolation,

in practice they often come all at once.

This happens because the problems are deeply interconnected,

and rarely come apart from each other.

So, why do we say that training data in reinforcement learning is unusual?

Well, there are many reasons for this.

The first reason is that training data in reinforcement learning is

by no means independent and identically distributed.

Why is it so, and why is it important?

Consider for example, the Space Invaders Atari game.

While an agent has to prevent an invasion of

aliens coming from the top to the bottom of the screen,

the agent can hide behind one of three rocks,

but the aliens can break the rocks.

Now, pretend that the agent has managed to find a quiet place just

under a rock where it is invulnerable to the aliens' bullets.

And it just spends, let's say,

ten thousand time steps there,

until the aliens eventually break that rock.

Guess what will happen next?

The agent forgets almost everything it was supposed to learn about the environment,

because of online incremental updates.

Now, it knows only how to hide behind the rock.

It knows that this is a good thing to do.

Nothing more, nothing less.

And after the rock is broken,

the agent must start learning almost from scratch.

This sequential, correlated training data can be,

and sometimes is, a problem in practice.

This correlation between samples in the training data

may be harmful not only because of forgetting useful features

and already accumulated knowledge.

Basically, many methods in supervised learning rely

heavily on the training data being i.i.d.

When this is not the case,

learning can be either less efficient,

or break down completely.

To gain an intuition for why learning can be slowed down by strong correlation,

think about which case carries more information:

In case of two independent samples,

or in case of two perfectly correlated samples?

Obviously, the former is more useful for learning from.

Another interesting fact is that SGD loses its convergence properties.

in the case of data that is not independently and identically distributed.

So, we are no longer guaranteed to converge to the optimum.
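To make the intuition about correlated samples concrete, here is a small sketch using the standard effective-sample-size approximation for an AR(1)-correlated stream; the function name and the choice of rho = 0.9 are illustrative, not part of the lecture.

```python
def effective_sample_size(n: int, rho: float) -> float:
    """For samples with AR(1) autocorrelation rho, n correlated samples
    estimate a mean about as well as n * (1 - rho) / (1 + rho) i.i.d.
    samples would (a standard large-n approximation)."""
    return n * (1 - rho) / (1 + rho)

print(effective_sample_size(10_000, 0.0))  # i.i.d.: all 10000 samples count
print(effective_sample_size(10_000, 0.9))  # strong correlation: ~526 effective samples
```

Ten thousand strongly correlated transitions carry roughly as much information about an expectation as a few hundred independent ones, which is one reason correlated streams slow learning down.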

The second reason why we say that the training data is unusual,

is the dependence of the data on the current policy.

In reinforcement learning, we not only observe the actions chosen by the policy,

but we also observe states and rewards,

that are influenced by the policy.

Moreover, when we update the parameters of the policy,

we change not only the distribution of actions,

but also the states and rewards,

on which we will further update our agent.

These reasons seem rather obvious,

and to some extent they are.

However, in practice there are environments where an agent sometimes learns a behavior

that changes the incoming data stream in a way that leaves

the agent almost incapable of unlearning that behavior.

When this happens,

the learning process starts almost from scratch.

Actually, a policy changes the data stream not only due to policy updates,

but also due to exploration.

The agent constantly experiments with the data stream,

and this also contributes to learning instability.

The third reason why the data in reinforcement learning is a

bit unusual is non-smoothness.

What I mean by this is the behavior of the action-value function

when we wiggle the state S or the action A a little bit.

Even if two states are close,

for example, in the L2 norm,

it doesn't mean that the values of those states,

either state values or action values, are close.

In fact, that is often not the case.

Consider driving a car on a highway.

While the car is arbitrarily close to another car,

everything works great, no collision happens.

And the value of that state doesn't experience any negative effects.

But as soon as the car touches another car,

an accident happens, and the consequences of even such a light touch may be arbitrarily bad.

The same is true for states successive in time:

they may, but also may not, have close values.

Moreover, the same is true for action.

Even if the action is continuous,

a slight change in the action may change the value of

the state-action function by an arbitrary amount.

All this leads to unstable gradients and data inefficiency.

Besides that, if you use temporal difference targets,

you additionally become susceptible to error propagation.

That is, if you make an error in estimating the value of a particular state and action,

this error will propagate to the estimates for states preceding the current state in time.

That is so because for these preceding states,

we include the erroneous estimate in the target.

And our learning procedure is designed to make

our estimate in the preceding state closer to our target.

We will see some tricks against this error propagation in future videos.
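A tiny sketch of this error propagation with one-step TD targets: a three-state deterministic chain where all rewards are zero, so every true value is zero, and we seed an error into the estimate of the last state. The chain length, step size, and sweep count are arbitrary illustrative choices.

```python
# Chain s0 -> s1 -> s2, all rewards 0, gamma = 1: true values are all 0.
gamma, alpha = 1.0, 0.5
V = [0.0, 0.0, 1.0]  # seeded estimation error in the last state

for _ in range(3):                        # a few sweeps of TD(0) updates
    for s in (1, 0):                      # predecessors of the erroneous state
        target = 0.0 + gamma * V[s + 1]   # one-step target bootstraps on V[s + 1]
        V[s] += alpha * (target - V[s])

print(V)  # the error in V[2] has leaked backward into V[1] and V[0]
```

After only three sweeps, V[1] and V[0] have been pulled well away from their true value of zero, purely because each one-step target includes the erroneous estimate of the next state.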

Here are a handful of examples.

The first example is the Mountain Car environment.

In this environment, one controls a car and needs to reach

the goal, which is on the high hill on the right.

The problem is that the car is not powerful enough to reach the goal directly;

instead it should oscillate in a forward-backward fashion to reach the goal.

On the right, there is a plot of minus the value of the state,

which is described by two floating-point numbers,

that is, position and velocity.

Notice that for almost every position there is

an abrupt change in the value at some velocity.

This means that a relatively high velocity may have

a low value, because the car with this velocity can come only very close to the goal,

but not reach it.

And a very small increment in the velocity can cause the car to reach its goal,

which in turn corresponds to a high value,

because it receives a higher reward at that moment.

The second example is about the helicopter.

Imagine an agent learning how to control

a helicopter flying between the trees in a forest.

In almost all positions, some trees may be very close to the helicopter;

it is a forest, after all.

But if the helicopter changes its position a bit, it

could touch the closest trunk, the closest tree.

And that crash will cause an instant negative reward,

which in turn will make the value function abrupt in the vicinity of the tree trunks.

So, we have discussed why the training data in reinforcement learning is unusual.

But there are also some interesting problems caused by the learning task itself.

One of the most important specificities of the learning task

in reinforcement learning is non-stationarity.

As you might remember, almost all algorithms in

reinforcement learning can be viewed as generalized policy iteration.

That means that we constantly update our estimates,

which are defined with respect to some policy which generates the data.

However, when we, following generalized policy iteration, change the policy,

we also invalidate the estimates of the Q-values,

and this can cause problems,

because the change makes all of our targets,

all of the goals, that is, G of S and A, become invalid.

This is true not only for temporal difference targets but also for Monte Carlo targets.

The old returns no longer apply because

the changed policy yields different cumulative rewards,

and this is a real problem in practice.

This non-stationarity can in turn

induce problems such as oscillating behavior.

For example, a little change in Q-values can cause drastic changes in the greedy policy.

This in turn can cause drastic changes in training data,

which makes the estimates of the Q function very imprecise.

And that is so just because the old values were learned from completely different data.

This imprecision causes large gradients and large updates in Q-values,

but large updates in Q-values cause large changes in the policy.

I hope you have noticed the feedback loop which may break learning completely.
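The first link of that feedback loop, a small Q-value change flipping the greedy action, can be sketched in a couple of lines; the numbers are arbitrary illustrative values.

```python
q = [1.000, 1.001]               # nearly tied action values in some state
greedy_before = q.index(max(q))  # greedy action is 1

q[0] += 0.01                     # one small update to a Q-value
greedy_after = q.index(max(q))   # greedy action flips to 0

print(greedy_before, greedy_after)  # 1 0
```

A greedy or epsilon-greedy agent would now collect quite different data than before this tiny update, which is exactly how the loop restarts.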

To make things a little bit more complex,

note that the environment in which the agent makes its decisions can be nonstationary too:

like in the real world, state transitions and

even rewards can change as time passes.

Learning in a nonstationary environment is probably

one of the most complex tasks in reinforcement learning.

The last, but surely not the least, concern I want to touch on,

is the so-called deadly triad.

Its name partially explains the essence of the problem.

There are three ingredients which if put together may lead to model divergence.

That is, divergence of the parameters to infinity.

So what are the deadly ingredients?

The first is off-policy learning,

that is, learning the value of one policy,

while following another policy.

The second is bootstrapping.

That is, the process of updating a guess towards a guess.

Dynamic programming and learning on

temporal difference targets are both examples of bootstrapping.

The third ingredient is function approximation.

That is, using a model with a number of parameters smaller than the number of states.

And one more time,

if we add all these ingredients together,

we obtain an algorithm which may diverge.

What do I mean by may diverge?

In fact, in the scientific literature there is a set of toy counterexamples which

show how, in a particular Markov decision process,

a particular algorithm diverges no matter what.

By the way, there is a counterexample for Q-learning too.

However, if we remove any of the three ingredients no divergence is possible.

Think about it for a moment.

There exist examples of almost unconditional divergence.

This divergence is not connected with

whether an algorithm samples or performs full-width backups.

It is not connected with the degree of greediness of the policy,

or whether the task is a control problem or a prediction problem.

It is not even connected to the complexity of the model,

because even simple linear models are shown to be susceptible to divergence.
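A minimal sketch of such a linear divergence, based on the classic two-state "w to 2w" counterexample discussed in the literature (features 1 and 2, so the approximate values are w and 2w; off-policy TD(0) updates only ever see the first state's transition, with reward 0):

```python
gamma, alpha = 0.99, 0.1
w = 1.0                 # single weight; V(s1) = w, V(s2) = 2 * w
history = [w]
for _ in range(100):
    # TD(0) update on the s1 -> s2 transition: target = 0 + gamma * 2w,
    # prediction = w, feature of s1 = 1.
    w += alpha * (gamma * 2 * w - w) * 1.0
    history.append(w)

print(history[-1])  # the weight grows without bound
```

Each update multiplies w by 1 + alpha * (2 * gamma - 1) > 1, so with off-policy sampling, bootstrapping, and linear function approximation all present, the single parameter diverges geometrically; remove any one ingredient and the growth stops.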

These counterexamples that can be found in

the scientific literature are somewhat artificial, and they are a little bit involved.

So, we are not going to spend more time on them.

This divergence to infinity almost never happens in practice and there are

also a bunch of practical tricks which, we believe, may stabilize things.

But note that the community of

reinforcement learning researchers still

lacks a deeper understanding of this divergence problem:

its origin, its theoretical principles, and how to develop methods against it.

So, be aware of this problem's existence.

Although we are not going to deal with this problem in detail,

you are highly encouraged to look through

the suggested reading to obtain more information about the deadly triad.

So, all these problems do not seem like ones which are possible to solve.

Well, they are really hard, but not fatal.

We can overcome most of them with practical tips and tricks,

and that will be the next topic to discuss. Stay tuned.