All right. Let's start with where we stopped in

the last course and have a quick recap of Markov Decision Processes,

Bellman equations and their relation to reinforcement learning.

After we go over these topics to refresh our memories in this lesson, in the next lesson,

we will spend some time converting one of the most famous classical financial problems

into a Markov Decision Process problem that we

will use to test different reinforcement learning algorithms.

So, to recap, reinforcement learning deals with

an agent that interacts with the environment in the setting of

a sequential decision making by choosing

optimal actions among many possible actions at each step of such process.

In our first course, we referred to such tasks of machine learning as action tasks.

The agent perceives the environment by having

information about the state St of the environment.

The environment may have some complex dynamics, therefore

reinforcement learning tasks should involve some planning and forecasting of the future.

More than that, the actions At that the agent should pick

at each step to optimize its longer-term goals,

may themselves impact the state of the environment.

And this creates a feedback loop in which

the current agent's action AT may change the next state of

the system, which in its turn may have an impact on

what action the agent will need to pick at the next time step.

The presence of such feedback loop is unique to reinforcement learning.

No feedback loops ever appear in supervised or unsupervised learning.

And this is because there is no question of optimizing actions in these settings,

as the action in these tasks is always the same.

For example, in unsupervised learning,

our task may be to cluster data.

And clearly, in this case,

the data does not care about how we or an agent look at it.

So, there is no feedback loop.

We also talked about two possible settings for reinforcement learning.

Online reinforcement learning proceeds in real time.

In this setting, an agent directly interacts with

the environment and chooses its actions at every time step,

once it gets information about the new state of the environment.

A vacuum cleaning robot would be a good example of online reinforcement learning.

Another possible setting is called batch mode or off-line reinforcement learning.

In this case, the agent does not have an on-demand access to the environment.

Instead, it only has access to some data that stores

a history of interaction of some other agent or a human with this environment.

This data should contain records of states of the environment,

actions taken, and the rewards received for each time step in the history.

Now, when it comes to various types of environments,

we talked about two possible approaches.

If the environment is completely observable,

its dynamics can be modeled as a Markov Process.

Markov processes are characterized by a short memory.

The future in these models depends not on the whole history,

but only on the current state.

The second possibility is a partially observable environment

where some variables that are important for the dynamics are not observable.

As we discussed in the last course such situations

can be modeled using dynamic latent variable models.

For example, hidden Markov models.

In this course, we will be primarily concerned with fully observable systems.

So, we will stick to Markov processes for a while.

Now, as we outlined in the last course,

the proper mathematical formalism that incorporates an agent's actions into

some Markov Dynamics for the environment is called

Markov Decision Processes or MDPs for short.

Let's go over this framework once again.

Here, you see a diagram describing a Markov Decision Process.

The blue circles show the evolving state of the system St at discrete time steps.

These states are connected by arrows

that represent causality relations.

We have only one arrow that enters each blue circle

from a previous blue circle, which emphasizes the Markov property of the dynamics,

which means that each next state depends only on

the previous state but not on the whole history of previous states.

The green circles denote actions AT taken by the agent.

The upward pointing arrows denote rewards

RT received by the agent upon taking actions AT.

Now, in mathematical terms,

a Markov Decision Process is characterized by the following elements.

First, we have a space of states S,

so that each observed state St belongs to this space.

The space S can be discrete or continuous.

Second, there are actions AT that belong to a space of actions called A.

Next, an MDP needs to know transition probabilities P that define the probabilities of

the next state St+1 given

a previous state St and an action At taken in this state.

Further, we need a reward function R that gives

a reward received in a given state upon taking a given action.

So, it maps the cross product of the spaces S and A onto a real number.

And finally, MDP needs to specify a discount factor gamma,

which is a number between zero and one.

We need the discount factor gamma,

to compute the total cumulative reward given by the sum of all single-step rewards,

where each next term gets an extra power of gamma in the sum.

This means that the discount factor for an MDP plays a similar role to

a discount factor in finance as it reflects the time value of rewards.

This means that getting a larger reward now and a smaller reward later is

preferred to getting a smaller reward now and a larger reward later.

The discount factor just controls by how

much the first scenario is preferable to the second one.

Now, the goal in a Markov Decision Process problem or in reinforcement learning,

is to maximize the expected total cumulative reward.

And this is achieved by a proper choice of a decision policy

that should prescribe how the agent should act in each possible state of the world.

But note that this task should be solved now, as we need to know the value function now.

We can only know the current state of the system but not its future.

This means that we have to decide now on how we are going to act in the future in

all possible future scenarios for the environment, so that

on average the expected cumulative reward would be maximized.

But please note that we said "on average."

Our decision policy may be good on average while having a high risk

of occasionally producing a big failure, that is, a very low value function.

This is why the standard approach of reinforcement learning that focuses on

the expected cumulative reward is sometimes called risk-neutral reinforcement learning.

It is risk-neutral because it does not look at the risk of a given policy.

Other versions of reinforcement learning called risk-sensitive reinforcement learning

look at some higher moments of the resulting distribution of cumulative rewards,

rather than looking only at

its mean value, as is done in conventional risk-neutral reinforcement learning.

And this might be helpful in certain situations.

So, I encourage you to take a mental note of the availability of such approaches.

But for the rest of this course,

we will be dealing with the standard formulation

of reinforcement learning that focuses on maximizing

the mean cumulative return or, in other words,

looks for action policies that are good on average.