Now, let's talk about how we can actually achieve

the goal of maximizing the expected total reward for Markov Decision Processes.

We said that the goal of Reinforcement Learning is to maximize

the expected total reward from all actions performed in the future.

Because this problem has to be solved now,

while actions will be performed in the future,

we have to define what is called a policy.

The policy is a function that takes

the current state S t and translates it into an action A t. In other words,

this function maps the state space onto the action space of the Markov Decision Process.

If a system is currently in the state described by a vector S t,

then the next action A t is given by a policy function pi with S t as its argument.

If the policy function is a conventional function of its argument S t,

then the output A t

will be a single number.

For example, if the policy function is linear, like pi of S equals one-half of S,

then for each possible value of S,

we will have one action to take.
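
To make this concrete, here is a minimal sketch in Python of the linear deterministic policy pi of S equals one-half of S (the function names are hypothetical):

```python
# A minimal sketch of a deterministic policy (hypothetical names):
# the linear policy pi(S) = S / 2 mentioned above.

def pi_deterministic(s: float) -> float:
    """Deterministic policy: maps each state s to exactly one action."""
    return 0.5 * s

# For a given state, the chosen action is always the same:
a_t = pi_deterministic(4.0)  # always 2.0
```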

This specification of a policy is called a deterministic policy,

but it turns out that this is not the only way we

can define a policy for a Markov Decision Process.

We can also consider stochastic policies.

In this case, the policy is represented by

a probability distribution rather than a function.

Let's take a look at the differences between these two specifications in a bit more detail.

First, let's talk about deterministic policies.

In this case, the action A t to take is given by

the value of a policy function pi applied to the current

state S t. If the reinforcement learning agent finds

itself in the same state S t of the system more than once,

each time it will act in exactly the same way.

Now, how it will act depends only on the current state S t,

but not on the previous history of states.

This assumption is made to ensure consistency

with the Markov property of the system dynamics,

where probabilities to see specific future states depend only on the current state,

but not on any previous states.
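
For reference, the Markov property can be written as the statement that the distribution of the next state depends only on the current state:

$$ P(S_{t+1} = s' \mid S_t, S_{t-1}, \dots, S_0) = P(S_{t+1} = s' \mid S_t) $$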

It can actually be proven that

an optimal deterministic policy pi always exists for a Markov Decision Process,

so our task is simply to identify it among all possible deterministic policies.

So, as long as that's the case,

it might look like deterministic policies are all

we ever need to solve Markov Decision Processes.

But, it turns out that the second class of policies,

namely stochastic policies, is also often useful for reinforcement learning.

For a stochastic policy, pi becomes a probability distribution over

possible actions A t. This distribution may depend

on the current value of S t as a parameter.

So, if we use a stochastic policy instead of a deterministic policy,

then when an agent visits the same state again,

it might take a different action from the one it took the last time in that state.
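
As a minimal sketch of such a stochastic policy, assuming a small discrete action set and a hypothetical state-dependent distribution:

```python
import random

# A minimal sketch of a stochastic policy (hypothetical example):
# pi(a | s) is a state-dependent distribution over a discrete action set.

ACTIONS = [-1, 0, 1]  # e.g., sell, hold, buy

def pi_stochastic(s: float) -> int:
    """Sample an action from a state-dependent distribution pi(a | s)."""
    # Hypothetical choice of distribution: the last action gets more
    # weight when the state value s is large.
    weights = [1.0, 1.0, 1.0 + max(s, 0.0)]
    return random.choices(ACTIONS, weights=weights, k=1)[0]

# Visiting the same state twice may now yield different actions:
a1 = pi_stochastic(2.0)
a2 = pi_stochastic(2.0)  # not necessarily equal to a1
```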

Now, why would we want to consider such stochastic policies?

Well, if we know the transition probabilities in the Markov Decision Process,

then we can consider

only deterministic policies and find the optimal deterministic policy among them.

For this, we just need to solve the Bellman equation that we introduced in

our previous course and which we will review again in our next video.
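
For reference, in standard notation with rewards R(s, a), discount factor gamma, and transition probabilities P(s' | s, a), the Bellman optimality equation for the value function has the form

$$ V^*(s) = \max_{a} \Big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \Big] $$

and the optimal deterministic policy takes the action that attains this maximum in each state.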

But if we do not know the transition probabilities,

we have to either estimate them from data and use them again to solve

the Bellman equation, or

rely on samples, following the reinforcement learning approach.
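
For the first option, a minimal counting sketch (with hypothetical names) could estimate the transition probabilities from observed transitions:

```python
from collections import defaultdict

# A minimal sketch (hypothetical names): estimate transition probabilities
# P(s' | s, a) from observed transitions by simple counting.

def estimate_transitions(transitions):
    """transitions: iterable of (s, a, s_next) tuples."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    # Normalize the counts into empirical probabilities P(s' | s, a).
    return {
        sa: {s_next: c / sum(nexts.values()) for s_next, c in nexts.items()}
        for sa, nexts in counts.items()
    }
```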

In the latter case, randomization of possible actions

following some stochastic policy may provide some room for exploration,

and hence a better estimation of the model.

I will come back to the notion of exploration in a second,

but first let me note that there is a second case

when using a stochastic policy may be desirable.

This case arises when,

instead of a fully observable Markov environment,

we deal with a partially observed environment.

This case is called a Partially Observable Markov Decision Process or POMDP, for short.

In this course, we will not deal with POMDPs,

but it's good to be aware of their existence,

especially given the fact that

many problems in finance might be good use cases for such a setting.

However, this case is much more mathematically complex than an MDP setting.

Therefore, we will study the latter case first.

Okay. So, now, let's return to what is exploration and why we need it.

The notion of exploration appears in Reinforcement Learning in the context of

the so-called Exploration Exploitation Dilemma that

conceptualizes different possible scenarios or strategies in taking actions.

Such dilemma is specific to Reinforcement Learning and

does not appear in Supervised or Unsupervised Learning.

Again, the reason is that in Supervised or Unsupervised Learning,

there is no choice among multiple possible actions.

The action is always fixed there,

but in Reinforcement Learning,

at each step we need to take one possible action among many.

Our ultimate goal is to maximize the total reward.

But at any given point in learning,

if the environment is unknown,

we may not know all states that might be encountered later.

Alternatively, an agent may need to try different actions in states it already saw,

possibly because taking different actions may change

the environment in a way that would produce higher rewards later.

So, we might need to try different actions and probe different states.

This is done via exploration.

We search for good actions and states, for example,

by taking purely random actions from time to time.

But if we do something like that too often,

we might end up with a low final cumulative reward because, during

its exploration by trial and error,

our agent visited too many states with low rewards.

An alternative approach would be to rely on exploitation instead.

Exploitation occurs when our agent, in states that it visits,

just repeats the same actions that provided good rewards in previous visits to these states.

But this obviously carries a risk of picking actions that are merely good,

but not the best, simply because there may be some combination of

actions and states that has even better rewards.

But the agent has no chance to know about them if it only

does exploitation of known actions and states.

This is why it's called the Exploration Exploitation Dilemma.

At each time-step, the agent should decide

whether it should explore or exploit in this state,

but it can't do both things at once.

Reinforcement Learning should ideally combine both exploration and exploitation,

for example, by switching between them at different time-steps.
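
One common way to make this switch is an epsilon-greedy rule; here is a minimal sketch, assuming tabular Q-value estimates stored in a hypothetical dictionary Q:

```python
import random

# A minimal epsilon-greedy sketch (hypothetical names): with probability
# epsilon the agent explores (random action), otherwise it exploits
# (the action with the highest current value estimate).

def epsilon_greedy(Q: dict, s, actions: list, epsilon: float = 0.1):
    """Pick an action for state s from a table of Q-values Q[(s, a)]."""
    if random.random() < epsilon:
        return random.choice(actions)  # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # exploit
```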

How exactly this can be done depends on the particular case, and there is

no universal answer to how

the Exploration Exploitation Dilemma should

be solved in a general Reinforcement Learning setting.

But it's important to note that

this dilemma is only relevant for online Reinforcement Learning,

when an agent interacts with the environment in real time.

On the other hand,

in batch-mode Reinforcement Learning,

we already have some data collected from actions of another agent.

This means that this other agent has already resolved in

some way the Exploration Exploitation Dilemma for this problem.

It might not be the best way or it may even be a bad way.

For example, the data might have been collected using a purely random policy.

But, in any case,

such a fixed data set is all that

an agent has in the setting of batch Reinforcement Learning.

It doesn't have access to a real-time environment,

and therefore it can't even contemplate any exploration.

Instead, it has to rely on a fixed data set and learn

optimal policies from this data set only.

We will be dealing a lot with this setting going forward.
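
As a minimal sketch of this batch setting, assume the fixed data set is a list of hypothetical (s, a, r, s_next) transitions; the agent can then, for example, run Q-learning-style updates over that data alone:

```python
# A minimal sketch of batch Q-learning over a fixed data set of transitions
# (s, a, r, s_next); names and parameter values are hypothetical.

def batch_q_learning(data, actions, gamma=0.99, alpha=0.1, n_sweeps=100):
    """Learn Q-values from a fixed batch of transitions, with no exploration."""
    Q = {}
    for _ in range(n_sweeps):
        for s, a, r, s_next in data:
            # Best value currently estimated to be achievable from s_next.
            v_next = max(Q.get((s_next, b), 0.0) for b in actions)
            old = Q.get((s, a), 0.0)
            Q[(s, a)] = old + alpha * (r + gamma * v_next - old)
    return Q
```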