In the last video,

we said that many optimal control problems in robotics,

supply chains, utilities, and so on,

involve either large discrete state-action spaces or continuous state-action spaces.

In these settings, dynamic programming does not work

anymore and is replaced by reinforcement learning.

Let's now talk a bit more about what reinforcement learning actually does.

Let's start with a brief recap of basic formulas for value iteration.

Now, remember, we discussed how these formulas work

in our very first video of this course.
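
For reference, the value iteration update can be written as below. The notation is an assumption, since the slide is not reproduced in this transcript: R(s, a) is the expected reward, gamma is the discount factor, and V_k is the value function estimate at iteration k.

```latex
V_{k+1}(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V_k(s') \Big]
```

In this form, the second term on the right-hand side is the discounted sum over next states s'.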

So let's meditate for a sec or

two on the second term on the right-hand side of value iteration.

Okay. What did we notice?

This term includes two things that

become highly problematic in many real-world applications.

First, it involves the transition probability P(s' | s, a)

of the next state s', conditional on the current state s and action a.

It means that we need to know the dynamics to solve

the dynamic programming problem for optimal control.

In other words, a researcher who wants to

use dynamic programming has to have a "model of the world".

Dynamic programming is a model based approach.

Second, it involves summation over all possible next states s'.

For continuous state spaces or for high dimensional discrete state spaces,

this sum cannot be computed exactly.

We would need some approximation for the value function to compute the sum involved.

Also, for continuous and multi-dimensional action spaces,

the arg max operation involved in this term becomes a

full-blown multi-dimensional optimization that can further slow things down.
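
To make these two issues concrete, here is a minimal sketch of tabular value iteration in Python. The toy two-state world, the array layout of `P` and `R`, and all function names are illustrative assumptions, not taken from the course. Note that the backup line needs the full transition array `P` and sums over every next state.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """P: (S, A, S) array with P[s, a, s2] = P(s2 | s, a); R: (S, A) expected rewards."""
    V = np.zeros(P.shape[0])
    for _ in range(max_iter):
        # Bellman backup: requires the model P and a sum over all next states
        Q = R + gamma * (P @ V)          # (S, A, S) @ (S,) -> (S, A)
        V_new = Q.max(axis=1)            # max over actions
        done = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if done:
            break
    Q = R + gamma * (P @ V)
    return V, Q.argmax(axis=1)           # value function and greedy policy

# Hypothetical toy world: action 0 stays, action 1 switches state;
# staying in state 1 pays a reward of 1, everything else pays 0.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 1] = P[1, 1, 0] = 1.0
R = np.array([[0.0, 0.0], [1.0, 0.0]])

V, policy = value_iteration(P, R)
```

With two states the sum is trivial, but the array `P` has S x A x S entries, which is what becomes infeasible for large or continuous state spaces.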

Now, enter reinforcement learning.

Reinforcement learning methods were developed

as an alternative approach to problems of optimal control

that addresses the same problems of dynamic programming that we just identified.

First, it does not require

any pre-specified "model of the world," but instead relies on samples.

Second, it uses function approximations to handle

various computational issues with the dynamic programming approach.

Also, because neural networks are able to

provide very flexible and scalable function approximations,

they can be successfully applied in reinforcement learning.

Now, let's talk about each one of these points separately.

First, let's talk about a "model of the world".

As we just said,

the dynamic programming approach requires a "model of the

world" because it's needed to compute the Bellman iteration step.

Now, if we do not know the dynamics,

which is almost always the case in real life,

we have to build the model for the dynamics and then estimate it from the available data.

This on its own might be a very challenging task

especially in high dimensional state-action spaces.

And by the way, state-action spaces for

financial applications are very often high dimensional.

When you construct such a model,

you almost surely will have some model misspecifications

just because all models are wrong, even though some models are less wrong than others.

These modeling errors can affect your decision making,

but you will not even know in which way unless you're

able to compare it with what you get using some other, hopefully model-independent, ways.

But maybe you can do without an explicit "model of the world".

In the end, our objective in optimal control is not

to explain the world but rather to learn an optimal policy,

that is, to act optimally.

As was elegantly put by Vladimir Vapnik,

the creator of support-vector machines,

"Do not use more concepts than you need to explain the observed facts."

If our task is to learn how to act,

can we just focus on this task?

And indeed, reinforcement learning does exactly this.

Instead of using a "model of the world",

it uses data directly in the form of samples or sample trajectories.

Therefore, it can be viewed as

a data-driven, model-free dynamic programming that operates on samples of raw data.
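
As an illustration of learning from samples alone, here is a minimal sketch of tabular Q-learning, one standard model-free algorithm; the lecture does not name a specific method at this point. The toy two-state data generator and all names are hypothetical. The learner only ever sees (state, action, reward, next state) tuples and never reads transition probabilities.

```python
import numpy as np

def q_learning(transitions, n_states, n_actions, gamma=0.9, alpha=0.1):
    """Learn Q from (s, a, r, s_next) samples only; no transition model is used."""
    Q = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        target = r + gamma * Q[s_next].max()     # sample-based Bellman target
        Q[s, a] += alpha * (target - Q[s, a])    # move the estimate toward the target
    return Q

def sample_transitions(n_samples, seed=0):
    """Hypothetical data source: a toy 2-state world standing in for logged data."""
    rng = np.random.default_rng(seed)
    s, data = 0, []
    for _ in range(n_samples):
        a = int(rng.integers(2))                 # random exploratory action
        s_next = s if a == 0 else 1 - s          # action 0 stays, action 1 switches
        r = 1.0 if (s == 1 and a == 0) else 0.0  # reward for staying in state 1
        data.append((s, a, r, s_next))
        s = s_next
    return data

Q = q_learning(sample_transitions(20_000), n_states=2, n_actions=2)
greedy_policy = Q.argmax(axis=1)
```

The simulator here exists only to produce data; in practice the tuples would come from observed trajectories.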

Some reinforcement learning algorithms actually

build their own internal "models of the world",

and this is called "model-learning" within reinforcement learning.

If the model is pre-specified and then used with reinforcement learning algorithms,

this is called model-based reinforcement learning.

In particular, dynamic programming can be viewed as model-based reinforcement learning.

Now, let's talk about the second talking point of

the major differences between dynamic programming and

reinforcement learning approaches to optimal control.

When you deal with a discrete state space,

you can simply enumerate all states and view

all functions defined on your states as look-up tables.

This is called a tabular representation of functions.

If this is your case, both the value iteration and policy iteration algorithms

are simple, as they involve finite sums with a small number of terms.

But if you have many discrete states or, worse yet,

you have continuous states and maybe even continuous actions,

then this cannot be done anymore.

Instead, you have to rely on some function approximations for the value function

or for the policy function or for both, depending on what algorithm you use.

So what reinforcement learning does is

parameterize such functions by

some parametric family of functions and then estimate their parameters.

These functions can be either linear or nonlinear in tunable parameters.
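
A minimal sketch of the linear case, assuming a scalar continuous state and a polynomial basis. The basis, the synthetic targets, and the function names are all illustrative assumptions. The value function is written as V_theta(s) = phi(s) . theta, and theta is estimated by least squares.

```python
import numpy as np

def features(s):
    """Polynomial basis phi(s) = [1, s, s^2] for a scalar continuous state s."""
    s = np.asarray(s, dtype=float)
    return np.stack([np.ones_like(s), s, s**2], axis=-1)

def fit_linear_value(states, targets):
    """Least-squares estimate of theta in V_theta(s) = phi(s) @ theta."""
    theta, *_ = np.linalg.lstsq(features(states), targets, rcond=None)
    return theta

states = np.linspace(0.0, 1.0, 50)
# Hypothetical stand-in for sampled value targets (e.g. observed returns)
targets = 1.0 + 2.0 * states - states**2
theta = fit_linear_value(states, targets)

# The fitted function can be evaluated at any state, not only the sampled ones
v_at_half = features(0.5) @ theta
```

Three parameters now represent the function over the whole interval, instead of one table entry per state.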

If you choose to work with nonlinear functions,

neural networks provide very flexible function approximations,

much along the lines of all other machine learning algorithms.

The main difference is what function is approximated

and what method is used to estimate its parameters.

Reinforcement learning uses the Bellman equation,

not for the value function but for

a closely related function called the Q-function in order to find the best parameters.
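
For orientation, the Bellman optimality equation for the Q-function can be written as follows, in assumed notation: r is the reward, gamma is the discount factor, and s' is the next state.

```latex
Q^{*}(s, a) = \mathbb{E}\Big[ r + \gamma \max_{a'} Q^{*}(s', a') \,\Big|\, s, a \Big]
```

One common way to use it with a parametric family Q_theta, given here as a sketch rather than the specific method of the course, is to minimize the squared Bellman residual over observed samples (s_i, a_i, r_i, s'_i), with the target held fixed during each fitting step:

```latex
\min_{\theta} \sum_{i} \Big( r_i + \gamma \max_{a'} Q_{\theta_{\text{old}}}(s'_i, a') - Q_{\theta}(s_i, a_i) \Big)^{2}
```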

In particular, deep reinforcement learning was developed by employing

deep neural networks as function approximators within the Bellman equation.

We will talk in much more depth about these topics in our next course,

which is just one click away from this course,

which is about to end.

But before wrapping up, I would like to briefly mention another very exciting topic,

which is inverse reinforcement learning.

In inverse reinforcement learning,

we do not know the rewards obtained by the agent.

The objective in this setting is the following.

First, we want to find the reward function from observed data.

Second, we also want to find the optimal policy.

This is obviously a pretty ill-posed problem,

as many and, in fact,

an infinite number of possible reward and policy functions will be

consistent with the data if we don't know the rewards received.
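
A short worked example of this non-uniqueness, assuming an infinite-horizon discounted setting with 0 <= gamma < 1 (the notation is an assumption). Rescaling and shifting the reward leaves the ranking of actions unchanged:

```latex
\tilde{R}(s, a) = c \, R(s, a) + b, \quad c > 0
\;\;\Longrightarrow\;\;
\tilde{Q}^{\pi}(s, a) = c \, Q^{\pi}(s, a) + \frac{b}{1 - \gamma}
\;\;\Longrightarrow\;\;
\arg\max_{a} \tilde{Q}^{*}(s, a) = \arg\max_{a} Q^{*}(s, a)
```

So every such reward explains the same optimal behavior. In the degenerate case R(s, a) = 0 for all s and a, every policy is optimal, so the observed behavior is trivially consistent with it.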

However, it turns out that the problem can be made

well-posed if some additional constraints are added to the problem.

Researchers have developed a number of

feasible approaches to inverse reinforcement learning.

Some of them are quite computationally demanding.

For example, you can treat the rewards as unobserved hidden variables in the spirit

of the EM algorithm and use the value iteration method as a component of this algorithm.

This idea is simple but quickly becomes very computationally intensive.

There are also more efficient methods for

IRL that do not require solving the Bellman equation multiple times.

We'll talk more about both reinforcement learning and

inverse reinforcement learning and their implications in finance in our next course.

On this note, let's call it an end both for this lesson and for this course.

If you remember the walking robots from our introductory video,

these robots were taught to walk using state-of-the-art

deep reinforcement learning with a machine learning library called trainer.

Here, we can see a link to their GitHub page that also has the code.

Take another look at the robots, this time with an appreciation of

how much science went into creating such an animation, before moving on,

and good luck with your final course project.

Thank you for your time in taking this course, and I hope to

see you in our next course on reinforcement learning.
