Now, let's talk about Markov Decision Processes,

the Bellman equation, and their relation to Reinforcement Learning.

This is obviously a huge topic, and in the time we have left in this course,

we will only be able to get a glimpse of the ideas involved here,

but in our next course on Reinforcement Learning,

we will go into much more detail on what I will be presenting to you now.

So, this video is both a crash intro to Markov Decision Processes and

Reinforcement Learning and simultaneously an

introduction to topics that we will be studying in our next course.

So let's start. Let's draw again a diagram describing a Markov Decision Process.

I made two changes here in comparison to a diagram that we saw in a previous video.

First, I changed the name of the state vector from Y_t to S_t,

to be more consistent with common notation in the Reinforcement Learning literature.

Second, I added vertical arrows showing rewards obtained by the agent upon taking actions.

Now, let's give a bit more formal explanation to what is involved in this picture.

First, we have states S_t.

These states belong to some state space S,

that can be either discrete or continuous.

Second, we have actions A_t that belong to some action space A,

which can also be either discrete or continuous depending on the problem.

Third, we have transition probabilities to

a new state S_{t+1} given the current state S_t and the action A_t taken.

Fourth, the reward function produces

a reward given the current state S_t and the action A_t taken,

and finally, Gamma is a parameter between zero and one that is called a discount factor.
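To make these ingredients concrete, here is a minimal sketch of how they could be collected in code for a discrete problem (the class name MDP, the field names, and the use of NumPy arrays are my own illustrative choices, not notation from this course):

import numpy as np
from dataclasses import dataclass

@dataclass
class MDP:
    n_states: int    # size of the discrete state space S
    n_actions: int   # size of the discrete action space A
    P: np.ndarray    # P[s, a, s_next] = probability of moving to s_next from s under action a
    R: np.ndarray    # R[s, a] = reward for taking action a in state s
    gamma: float     # discount factor between zero and one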

If you have experience with quantitative finance,

this notion should sound very familiar to you.

In finance, discount factors are used to express the fact that

money now is always worth more than the same amount of money later.

In other words, it reflects the time value of money.

In the same way, in Markov Decision Processes,

the discount factor reflects the time value of rewards.

This means that getting a larger reward now and a smaller reward later is

preferred to getting a smaller reward now and a larger reward later.

The discount factor just controls by how

much the first scenario is preferable to the second one.
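As a small worked example (the numbers are my own, purely for illustration): with a discount factor of 0.9, receiving 10 now and 5 at the next step is worth 10 + 0.9 * 5 = 14.5, while receiving 5 now and 10 at the next step is worth 5 + 0.9 * 10 = 14, so the first scenario is preferred, and the gap between the two shrinks as the discount factor approaches one.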

Finally, the quantity of ultimate interest for us is the cumulative reward.

It's defined as a sum of discounted future rewards, such that the reward from

each subsequent step gets an additional power of the discount factor in the overall sum.

The sum extends over all time steps into the future.

If we consider an infinite horizon in

Reinforcement Learning, the number of terms in the sum is infinite.

In this case the presence of the discount factor

becomes critical to ensure that the total sum is finite.
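Spelled out, with rewards r_t, r_{t+1}, r_{t+2}, and so on, the cumulative reward from time t is r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..., and with a discount factor below one and bounded rewards this infinite sum stays finite. A minimal sketch of the computation for a finite list of rewards (the function name discounted_return is my own):

def discounted_return(rewards, gamma):
    # rewards: [r_t, r_{t+1}, r_{t+2}, ...] over a finite horizon
    # returns r_t + gamma * r_{t+1} + gamma**2 * r_{t+2} + ...
    return sum(gamma ** k * r for k, r in enumerate(rewards))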

All right, we defined the cumulative reward,

but this reward refers to the future and depends on future states and actions.

So, it is a random quantity and we cannot directly maximize this quantity,

but we can maximize the expected total reward,

where the expectation is taken with respect to

all future realizations of states and actions.

This is the standard approach in Reinforcement Learning and we will follow it next,

but let's pause here for a moment to grasp what it means.

As the total future reward is random,

we should describe it mathematically by some distribution.

Now, by focusing only on the expected total return,

we're in fact saying that we only care about the mean of this distribution.

Now, let's assume that we are talking about

the problem of multi-period portfolio optimization,

which is the main business of asset managers.

In this case, your actions will be changes to your portfolio positions.

If you think you can do well by building

investment portfolios while only looking at expected returns,

well, you probably have to think again,

because that's likely to be a recipe for disaster.

No financial professionals build portfolios by only looking at expected returns.

You also look at the risk of such portfolios,

which can be defined, for example,

by the variance of the distribution of portfolio returns.

There indeed exist extensions of the basic formulation of Reinforcement Learning that

actually let you accommodate some measures of risk of the total return distribution.

This is called Risk Sensitive Reinforcement Learning.

While this topic is very relevant for finance,

we will only talk about risk in

Reinforcement Learning in our next course in this specialization.

For now let's continue with

the Standard Reinforcement Learning approach based on the expected total return,

which is also sometimes referred to as Risk-Neutral Reinforcement Learning.

Let's just remember to get back to the problem of

risk in Reinforcement Learning in finance later.

We pursue our objective in Reinforcement Learning by

specifying what is called the policy function, Pi.

The policy function simply prescribes

what action the agent should take in each possible state of the environment.

So the policy maps the state space onto the action space;

this is the policy function.

Now, the policy can be a deterministic function of the state,

or it can be a probability distribution over actions, conditioned on the current state.

The latter case is called a stochastic policy.

Again, we will talk more about stochastic policies in our next course,

but here let's continue with deterministic policies Pi.
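As a minimal sketch of a deterministic policy for a toy discrete problem (the state and action names below are made up for illustration only):

# A deterministic policy: exactly one action prescribed for each state
policy = {"state_A": "action_1", "state_B": "action_2", "state_C": "action_1"}

def pi(state):
    # maps the state space onto the action space
    return policy[state]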

The way to quantify any given policy is by defining the value function,

which is a function of the current state S and of the policy Pi.

Therefore, it's commonly denoted as a function V of S with

a superscript Pi to denote its dependence on the policy.

The value function V^Pi(S) is simply

the expected total future reward obtained by starting from state S and following policy Pi.
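In symbols, one common way to write this definition (using the reward function R, the states S_t, and the discount factor introduced earlier, with actions chosen by the policy) is:

V^{\pi}(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} R(S_t, A_t) \;\middle|\; S_0 = s,\ A_t = \pi(S_t) \right]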

The value function satisfies the so-called Bellman equation,

which in many ways can be seen as the main equation of Reinforcement Learning and

therefore fully deserves to be placed in the frame on this slide.

The Bellman equation is a recursive equation that expresses

the value function in terms of itself, albeit with a different argument

that corresponds to the state of the system at the next time step.

It says that the value function now is equal to the reward that we obtain now,

plus its own discounted expected value from the next time step.

If you look carefully at the second term here,

it describes exactly this discounted expectation.

It's expressed as a sum of probabilities of transitions to

all possible future states multiplied by the value functions in these states.
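In symbols, one standard way to write the Bellman equation just described (with P denoting the transition probabilities and a deterministic policy Pi) is:

V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s' \mid s, \pi(s)) \, V^{\pi}(s')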

So if, for example, your state space S is discrete with a small number of possible states,

which we can denote |S| (read "mod S"), then

the Bellman equation becomes equivalent to a system of

|S| linear equations that can be solved using linear algebra,

but in the general case,

the Bellman equation can only be solved numerically.
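As a minimal sketch of this discrete case under a fixed deterministic policy (the function name policy_evaluation and the matrix notation are my own choices; this just illustrates the linear-algebra solution mentioned above):

import numpy as np

def policy_evaluation(P_pi, R_pi, gamma):
    # P_pi[s, s_next]: transition probability from s to s_next under the fixed policy
    # R_pi[s]:         reward received in state s under the fixed policy
    # Bellman equation in matrix form: V = R_pi + gamma * P_pi @ V,
    # i.e. (I - gamma * P_pi) V = R_pi, a system of |S| linear equations.
    n = P_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)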

We will talk about this in the next video.