In a previous lesson,

we spoke about Markov Decision Processes and about how

reinforcement learning can be used to find optimal policies for MDPs.

We also looked at the pole balancing problem as one of

the most classical test cases for reinforcement learning algorithms.

Such test cases are very useful as they not only let us

see how a particular reinforcement learning algorithm works in

a setting where we know what to expect, but also

let us compare different algorithms in terms of their performance.

Though the physics of the cart-pole system deals with

continuous variables such as positions and velocities,

these variables can be discretized.

Upon discretization, the system is mapped onto

a Markov Decision Process problem with a discrete state and action space.
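
As a rough illustration of this discretization step, the sketch below bins the four continuous cart-pole variables and maps each observation to a single discrete state id. The ranges, bin counts, and the names `BIN_EDGES`, `discretize`, and `state_index` are assumptions made for this example, not values from the lecture.

```python
import numpy as np

# Hypothetical bin edges for the four cart-pole variables. The ranges and
# bin counts are illustrative assumptions, not values from the lecture.
BIN_EDGES = [
    np.linspace(-2.4, 2.4, 5)[1:-1],    # cart position: 3 edges, 4 bins
    np.linspace(-3.0, 3.0, 5)[1:-1],    # cart velocity: 3 edges, 4 bins
    np.linspace(-0.21, 0.21, 7)[1:-1],  # pole angle in radians: 5 edges, 6 bins
    np.linspace(-3.5, 3.5, 7)[1:-1],    # pole angular velocity: 5 edges, 6 bins
]


def discretize(observation):
    """Map a continuous 4-dimensional observation to a tuple of bin indices."""
    return tuple(
        int(np.digitize(value, edges))
        for value, edges in zip(observation, BIN_EDGES)
    )


def state_index(bins):
    """Flatten a tuple of bin indices into a single discrete state id."""
    sizes = [len(edges) + 1 for edges in BIN_EDGES]
    return int(np.ravel_multi_index(bins, sizes))


if __name__ == "__main__":
    bins = discretize((0.5, -2.0, 0.1, 4.0))
    print(bins, state_index(bins))
```

With these particular bins the discretized problem has 4 x 4 x 6 x 6 = 576 states, so a tabular method only needs a table with 576 rows and one column per action.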

All methods that we outlined before plus

many other methods of reinforcement learning for MDPs can be tried in such a setting.

Therefore, the test case is often used

to illustrate and benchmark different reinforcement learning algorithms.

But because in this course we deal with finance,

it would be good to have a setting for finance

that would be as simple as the pole balancing problem.

Such a "simple, but not simpler" setting could be used for financial applications of

reinforcement learning as a testing laboratory for exploration

and benchmarking of different RL algorithms.

In this lesson, we will construct

such a testing environment for reinforcement learning in finance.

As we will see later,

it's very flexible and extendable.

In particular, it will let us benchmark

both discrete action and continuous action reinforcement learning algorithms.

On the finance side,

it offers a look into the problems of hedging,

trading, and pricing in financial markets.

These are the main elements of

many financial tasks, but in a controllable and well-understood environment.

And last but not least,

it has a chance of being extended, with

appropriate generalizations, into practically useful methods.

To this end, we suggest using the famous Black-Scholes-Merton, or BSM, model,

a cornerstone of modern quantitative finance.

I mentioned this model to you in our first course on

supervised learning when we talked about Merton's model of corporate defaults.

So, let's start with a quick recap of the Merton model for defaults.

The Merton model offers a simplified view of a firm that has

a single asset called the firm value that is modeled

as a geometric Brownian motion with a drift.

If the firm value is below the level of debt

at the debt maturity time T,

the equity holder defaults on payments to the bondholder that finances the firm.

In this case, the firm's assets go to the bondholder, and the stock becomes worthless.

Otherwise, the stock value at time T is equal

to the difference between the firm value at time T and

the debt level K. We discussed

implications of the Merton model for predicting corporate defaults,

and said that the particular expression obtained in this model for the default probability

can be viewed as a structural, model-based default model

that corresponds to very specific assumptions about the dynamics of the system.
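
To make this recap concrete, here is a minimal sketch of the default probability when the firm value follows a geometric Brownian motion, checked against a Monte Carlo simulation. It computes the probability under the drift `mu` passed in. The function names and all parameter values are assumptions chosen for illustration, and the notation may differ from the one used in the earlier course.

```python
import math

import numpy as np


def merton_default_probability(v0, debt, mu, sigma, horizon):
    """P(V_T < K) when V follows a geometric Brownian motion with drift mu.

    ln V_T is normal with mean ln(v0) + (mu - sigma^2 / 2) * T
    and variance sigma^2 * T.
    """
    z = (math.log(debt / v0) - (mu - 0.5 * sigma**2) * horizon) / (
        sigma * math.sqrt(horizon)
    )
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF


def simulate_terminal_firm_value(v0, mu, sigma, horizon, n_paths, seed=0):
    """Draw samples of V_T from the exact lognormal distribution."""
    rng = np.random.default_rng(seed)
    shocks = rng.standard_normal(n_paths)
    return v0 * np.exp(
        (mu - 0.5 * sigma**2) * horizon + sigma * math.sqrt(horizon) * shocks
    )


def equity_payoff(v_terminal, debt):
    """Stock value at T: V_T - K if the firm survives, zero on default."""
    return np.maximum(v_terminal - debt, 0.0)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only for illustration.
    v0, debt, mu, sigma, horizon = 100.0, 80.0, 0.05, 0.25, 1.0
    v_t = simulate_terminal_firm_value(v0, mu, sigma, horizon, n_paths=200_000)
    print("closed form:", merton_default_probability(v0, debt, mu, sigma, horizon))
    print("Monte Carlo:", float(np.mean(v_t < debt)))
    print("mean equity payoff:", float(np.mean(equity_payoff(v_t, debt))))
```

The two probability estimates should agree up to Monte Carlo noise, which is one quick way to check that the simulated dynamics match the closed-form assumption.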

Now, the Merton model for corporate default

gives the default probability,

and also computes how much both the debt of

the firm and the stock of the firm are worth now.

But what if the stockholder offered to let you pay her

now in order to be able to buy her future stock from her at a later time T,

right after she gets it,

for some price K that is fixed now?

Such a financial contract is called an option, or a financial derivative.

Both names explain the nature of this instrument.

First, it is an option and not an obligation,

which means that you will not have to pay a price K at time T to get

the stock if you see by that time that this stock does not perform well.

It's also called a derivative because its value is derived from the value of the stock.

What I just described is the simplest type of

stock option, called the European call option on a stock.

To reiterate, such a financial contract works as follows:

a buyer of such an option gets the right to get the stock

at a later time T for a pre-specified amount K,

also known as the option strike.

The strike K here is set contractually at the start of the option.

Now, the value of the option at the maturity time T is given by the formula shown here.

It is the maximum of two numbers: the difference between

the final stock price S_T and the strike K, and zero.
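
Since the slide itself is not part of this transcript, the payoff just described can be written out as follows. The symbol C_T for the option value at maturity is our own notation here and may differ from the one on the slide.

```latex
C_T = \max\left(S_T - K,\; 0\right)
```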

The meaning of this is simple.

If the final stock price S_T is above the strike K,

then it makes perfect sense for a buyer of such an option to get the stock for

price K, because the stock can be immediately resold for the market price.

The profit to the option buyer in this case would be the difference S_T minus K. But

if the terminal stock price is below K,

then exercising such an option would not make sense for a buyer.

So, the option payoff is zero in such a case.
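
The two cases above can be sketched in a few lines of code. The function names and the numbers below are hypothetical and chosen only for illustration.

```python
def should_exercise(terminal_price, strike):
    """The buyer exercises only if the stock is worth more than the strike."""
    return terminal_price > strike


def call_payoff(terminal_price, strike):
    """Payoff of a European call at maturity: S_T - K if exercised, else zero."""
    if should_exercise(terminal_price, strike):
        return terminal_price - strike
    return 0.0


if __name__ == "__main__":
    strike = 100.0  # hypothetical strike K
    for terminal_price in (80.0, 100.0, 120.0):
        print(terminal_price, call_payoff(terminal_price, strike))
```

For a strike of 100, a terminal price of 120 gives a payoff of 20, while a terminal price of 80 gives zero because the option is simply left unexercised.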

This option payoff at the maturity time T is shown on the graph here.

The payoff is zero for any terminal value S_T below K,

and grows linearly for values of S_T above K. This

is how much the option is worth at the final time T. But what about its price now,

when we don't know the final stock price?

The celebrated Black-Scholes-Merton model, also referred to as the Black-Scholes model,

was developed to answer exactly this question.

It was developed around the same time as

the Merton corporate default model, in 1973, by Fischer Black, Myron Scholes, and Robert Merton,

and was recognized with the Nobel Prize in Economics in 1997, awarded to Scholes and Merton

two years after Fischer Black died of cancer in 1995.

The model is commonly referred to as the Black-Scholes model,

and sometimes as the Black-Scholes-Merton model, or BSM model.

We will be using both names interchangeably.

So, what is the BSM model and what does it do?

Let's talk about this model in our next video.