Now that we have found the coefficients Phi NT of the optimal action AT-star at time T,

we turn to the problem of finding coefficients omega NT for the optimal Q-function.

To this end, let's take another look at

the Bellman optimality equation that we derived earlier.

To find coefficients omega NT,

we consider this equation for

the optimal action AT-star rather than for a general value of the action AT.

First, let's note that we can interpret this equation as a regression that ties

the optimal Q-function QT-star to the sum of the reward RT

and the discounted next-step optimal Q-function evaluated at the optimal action,

as shown in the second equation here.

We can check that these two formulations are

equivalent if we take expectations of both sides of the second equation.
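
Written out, the two formulations just described can be sketched as follows. The slide itself is not part of the transcript, so the notation is an assumption: X_t is the state, gamma the discount factor, and epsilon_t a zero-mean regression noise.

```latex
% First form: the Bellman optimality equation as a conditional expectation
\[
Q_t^\star(x, a) = \mathbb{E}_t\!\left[ R_t(X_t, a_t, X_{t+1})
  + \gamma \max_{a_{t+1}} Q_{t+1}^\star(X_{t+1}, a_{t+1})
  \,\middle|\, X_t = x,\ a_t = a \right]
\]

% Second form: the same relation written as a regression with zero-mean noise
\[
R_t(X_t, a_t, X_{t+1}) + \gamma \max_{a_{t+1}} Q_{t+1}^\star(X_{t+1}, a_{t+1})
  = Q_t^\star(X_t, a_t) + \varepsilon_t,
\qquad \mathbb{E}_t[\varepsilon_t] = 0
\]
```

Taking the conditional expectation of the second form removes the noise term and gives back the first form.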

Now, to use it as a regression to find parameters omega NT,

we need to know instantaneous rewards RT that enter this expression.

This is easy to do, as we know the theoretical expression for the reward.

It is given by the increment of the portfolio value,

which is equal to gamma times Pi T plus one minus Pi T,

minus a risk-premium part equal to lambda times the variance of Pi T. As we saw before,

the reward RT can also be written more explicitly as

a quadratic function of the time T action AT.

Therefore, as we go backwards in time in the process of backward recursion,

once we compute the optimal action AT-star,

we plug this into the formula for the rewards to compute the optimal reward.

That is, the reward evaluated at the optimal action.
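
The reward computation at the optimal action can be sketched in a few lines. This is a minimal illustration, not the course's reference code: the function name `optimal_rewards` is hypothetical, the portfolio recursion Pi_t = gamma * (Pi_{t+1} - a_t * dS_t) is assumed from the earlier setup, and the variance of Pi_t is estimated across Monte Carlo paths, which is also an assumption.

```python
import numpy as np

def optimal_rewards(pi_next, a_star, delta_s, gamma, lam):
    """Rewards at the optimal action for one time step (hypothetical helper).

    pi_next : portfolio values Pi_{t+1}, one per Monte Carlo path
    a_star  : optimal actions a_t*, one per path
    delta_s : stock increments dS_t, one per path (assumed already computed)
    gamma   : one-step discount factor
    lam     : risk aversion parameter lambda
    """
    # Assumed portfolio recursion: Pi_t = gamma * (Pi_{t+1} - a_t * dS_t)
    pi_t = gamma * (pi_next - a_star * delta_s)
    # Risk part: lambda times the variance of Pi_t, estimated across paths
    risk = lam * np.var(pi_t)
    # Reward: increment of the portfolio value minus the risk part
    return gamma * pi_next - pi_t - risk
```

Under the assumed recursion, the first two terms collapse to gamma * a_t* * dS_t, which makes the quadratic dependence of the reward on the action explicit.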

So, now, we know the dependent variable for our regression,

which is equal to the sum of the optimal rewards

and the discounted next-step optimal Q-function.

The predictor in this expression,

which stands on the right-hand side, is the time-T optimal Q-function.

Now, by substituting here the expansion of Q-star in basis functions,

coefficients omega NT become regression parameters in this regression.

To find them, we proceed as we always do with regression and look for

coefficients omega NT that minimize the squared-loss function FT of omega shown here.

Again, because this is a quadratic function of coefficients omega NT,

this minimization can be performed semi-analytically.

We introduce another time-dependent pair: a matrix CT with elements C NM

and a vector DT with elements DN.

Matrix CT has dimension M by M,

and vector DT has length M.

The solution of the regression problem for coefficients omega NT

can then be given in a vector form as the inverse

of the matrix CT multiplied by the vector DT.
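
The slide with these formulas is not in the transcript, so the following is a reconstruction from the spoken description, under assumed notation: k indexes the N_MC Monte Carlo paths and Phi_n are the M basis functions.

```latex
% Squared loss over Monte Carlo paths
\[
F_t(\omega) = \sum_{k=1}^{N_{MC}} \Big( R_t\big(X_t^k, a_t^\star, X_{t+1}^k\big)
  + \gamma \max_{a_{t+1}} Q_{t+1}^\star\big(X_{t+1}^k, a_{t+1}\big)
  - \sum_{n=1}^{M} \omega_{nt}\, \Phi_n\big(X_t^k\big) \Big)^2
\]

% Setting the derivative with respect to each omega_{nt} to zero gives
\[
C_{nm}^{(t)} = \sum_{k=1}^{N_{MC}} \Phi_n\big(X_t^k\big)\, \Phi_m\big(X_t^k\big),
\qquad
D_n^{(t)} = \sum_{k=1}^{N_{MC}} \Phi_n\big(X_t^k\big)
  \Big( R_t\big(X_t^k, a_t^\star, X_{t+1}^k\big)
  + \gamma \max_{a_{t+1}} Q_{t+1}^\star\big(X_{t+1}^k, a_{t+1}\big) \Big)
\]

% Solution in vector form
\[
\omega_t^\star = \mathbf{C}_t^{-1} \mathbf{D}_t
\]
```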

Again, because we are dealing here with a matrix inversion,

in practice it might be a good idea to regularize matrix CT by adding

an identity matrix multiplied by

a small regularization parameter epsilon before inverting the matrix.

The solution is then given by the inverse of this regularized matrix multiplied by DT.
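
A minimal NumPy sketch of this regularized solve is shown below. The function name `fit_q_coefficients`, the argument layout, and the default value of epsilon are assumptions made for illustration.

```python
import numpy as np

def fit_q_coefficients(basis, target, eps=1e-3):
    """Solve (C_t + eps * I) omega = D_t for the Q-function coefficients.

    basis  : array of shape (n_paths, M), basis functions evaluated at X_t
    target : array of shape (n_paths,), optimal reward plus discounted
             next-step optimal Q-function on each path
    eps    : small regularization parameter (illustrative default)
    """
    c_mat = basis.T @ basis      # matrix C_t, shape (M, M)
    d_vec = basis.T @ target     # vector D_t, length M
    m = c_mat.shape[0]
    # Solving the linear system avoids forming the explicit inverse
    return np.linalg.solve(c_mat + eps * np.eye(m), d_vec)
```

Calling `np.linalg.solve` instead of inverting the matrix gives the same coefficients and is usually the numerically safer choice.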

And this completes all the calculations we have to do at

each time step T. We have exactly two such calculations: one

linear regression to find the optimal action, and another

linear regression to find the optimal Q-function at the optimal action.

Therefore, for T time steps,

we have a total of two times T linear regressions

to compute in order to price and hedge an option.

Let's now summarize the whole procedure.

In the backward recursion,

we start with time capital T minus one and go back all the way to time equals zero.

And at each time step T,

we do the following.

First, we compute the matrix AT and the vector BT.

Then, we compute coefficients Phi NT that determine the optimal action AT-star.

Then, we use this value to compute

instantaneous rewards that correspond to this optimal action.

After that, we compute the matrix CT and the vector DT.

This produces the coefficients omega NT and

hence the optimal Q-function evaluated at the optimal action.

The computed optimal Q-function for this time step is then used as

the next step optimal Q-function for

the next time moment T minus one of the backward recursion.

When we run this backward recursion all the way to time T equal zero,

the last optimal action will give us the optimal hedge of the option now,

and the optimal price will be given by the negative

of the optimal Q-function with the optimal action as the second argument.

So, we have a complete solution to the whole problem.
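
The backward recursion summarized above can be sketched end to end as follows. Several pieces are assumptions rather than the lecture's exact choices: paths are simulated from a lognormal model, the basis is a cubic polynomial in a rescaled state variable, the terminal Q-function is minus the payoff minus lambda times its variance, variances are estimated across paths, and the matrix A_t and vector B_t use a pure variance-minimizing hedge, which omits any drift-dependent term the full B_t may contain. The function name `qlbs_dp_put`, the 24 time steps, and the 10,000 paths are illustrative.

```python
import numpy as np

def qlbs_dp_put(s0=100.0, strike=100.0, mu=0.05, sigma=0.15, r=0.03,
                maturity=1.0, n_steps=24, n_paths=10_000, lam=0.001,
                degree=3, eps=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    dt = maturity / n_steps
    gamma = np.exp(-r * dt)

    # Sample paths of the stock (simulated here; any set of sample paths would do)
    z = rng.standard_normal((n_paths, n_steps))
    steps = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    log_s = np.hstack([np.full((n_paths, 1), np.log(s0)),
                       np.log(s0) + np.cumsum(steps, axis=1)])
    s = np.exp(log_s)

    # Drift-adjusted state variable X_t, rescaled globally for the polynomial basis
    x = log_s - (mu - 0.5 * sigma**2) * dt * np.arange(n_steps + 1)
    u = (x - x.mean()) / x.std()

    d_s = s[:, 1:] - np.exp(r * dt) * s[:, :-1]    # assumed increment dS_t
    d_s_hat = d_s - d_s.mean(axis=0)               # de-meaned across paths

    a = np.zeros((n_paths, n_steps + 1))           # optimal actions
    pi = np.zeros((n_paths, n_steps + 1))          # replicating portfolio
    q = np.zeros((n_paths, n_steps + 1))           # optimal Q-function
    pi[:, -1] = np.maximum(strike - s[:, -1], 0.0)
    q[:, -1] = -pi[:, -1] - lam * np.var(pi[:, -1])
    reg = eps * np.eye(degree + 1)

    for t in range(n_steps - 1, -1, -1):
        phi = np.vander(u[:, t], degree + 1, increasing=True)  # (n_paths, M)
        # 1. Matrix A_t and vector B_t (pure risk hedge, an assumption)
        pi_hat = pi[:, t + 1] - pi[:, t + 1].mean()
        a_mat = phi.T @ (phi * d_s_hat[:, [t]] ** 2)
        b_vec = phi.T @ (pi_hat * d_s_hat[:, t])
        # 2. Coefficients of the optimal action, then the action itself
        a[:, t] = phi @ np.linalg.solve(a_mat + reg, b_vec)
        # 3. Portfolio value and rewards at the optimal action
        pi[:, t] = gamma * (pi[:, t + 1] - a[:, t] * d_s[:, t])
        reward = gamma * a[:, t] * d_s[:, t] - lam * np.var(pi[:, t])
        # 4. Matrix C_t and vector D_t
        c_mat = phi.T @ phi
        d_vec = phi.T @ (reward + gamma * q[:, t + 1])
        # 5. Coefficients omega and the optimal Q-function at the optimal action
        q[:, t] = phi @ np.linalg.solve(c_mat + reg, d_vec)

    # Hedge now is the time-zero action; price is minus the time-zero Q-function
    return a[:, 0].mean(), -q[:, 0].mean()

hedge, price = qlbs_dp_put()
print(f"hedge at t=0: {hedge:.3f}, price: {price:.3f}")
```

Each pass through the loop performs the two regressions per time step described above. The numbers it prints depend on the assumed basis and hedge, so they need not match the figures quoted in the lecture.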

We can use this solution when the dynamics are known.

In fact, this Monte Carlo dynamic programming solution does not require full knowledge

of the model dynamics, as long as sample paths of the underlying stock are available.

The only additional thing needed to implement

the dynamic programming Monte Carlo based backward recursion

that we presented here is the knowledge of

risk aversion parameter lambda and the value of the stock drift Mu.

In the next week,

we will see how the backward recursion can be implemented using

reinforcement learning methods that do not assume that we know lambda.

Let me now conclude with an illustration of performance

of the dynamic programming approach to option pricing.

Let's see what is shown here.

This example applies the dynamic programming approach to price a put option with

the initial stock price of $100 and the strike of $100,

which is known as an at-the-money, or ATM, option.

The risk aversion rate lambda is set here to 0.001.

The stock volatility is 15 percent and option maturity T is one year.

The stock drift is taken here to be five percent.

And the risk-free rate is three percent.

Now, the top left figure shows 10 randomly chosen

paths of the Monte Carlo simulation for the stock price.

The top right figure shows corresponding values of the state variable XT.

As expected, the evolution of the state variable XT doesn't show a drift.

You can also note that, for the particular paths shown on the left for the stock price,

the drift of ST, even though it is there, is not quite visible,

but this is just a visual effect.

Paths of ST do have a drift, but paths of XT do not.
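
The difference between the two sets of paths can be checked numerically. The sketch below assumes lognormal stock dynamics and the state variable X_t = log S_t - (mu - sigma^2 / 2) * t, which is not restated in this part of the lecture. The function name `simulate_paths`, the 24 time steps, and the 20,000 paths are illustrative choices.

```python
import numpy as np

def simulate_paths(s0, mu, sigma, maturity, n_steps, n_paths, seed=0):
    """Simulate stock paths S_t and the drift-adjusted state variable X_t."""
    rng = np.random.default_rng(seed)
    dt = maturity / n_steps
    z = rng.standard_normal((n_paths, n_steps))
    # Log-price increments of a lognormal process
    steps = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    log_s = np.log(s0) + np.hstack([np.zeros((n_paths, 1)),
                                    np.cumsum(steps, axis=1)])
    t_grid = dt * np.arange(n_steps + 1)
    # Assumed definition: subtract the deterministic drift of log S_t
    x = log_s - (mu - 0.5 * sigma**2) * t_grid
    return np.exp(log_s), x

s, x = simulate_paths(s0=100.0, mu=0.05, sigma=0.15,
                      maturity=1.0, n_steps=24, n_paths=20_000)
# Average change over the full horizon: nonzero for log S_t, close to zero for X_t
print("mean change in log S:", np.log(s[:, -1]).mean() - np.log(100.0))
print("mean change in X:    ", x[:, -1].mean() - x[:, 0].mean())
```

With only 10 paths on a plot, a drift of a few percent per year is easily hidden by 15 percent volatility, which is why averaging over many paths makes it visible.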

The left figure in the middle row shows optimal actions obtained on

these paths, and the figure on the right

shows the corresponding values of the replication portfolio.

Finally, the left figure in the bottom row

shows rewards obtained from taking optimal actions,

and the right figure shows the optimal Q-function.

As you can see,

it converges at time zero to the value 4.9,

while the Black-Scholes price is 4.53.
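
The Black-Scholes benchmark quoted here can be reproduced with the standard closed-form put formula and the parameters stated above. The function names are illustrative.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_put(s0, strike, r, sigma, maturity):
    """Black-Scholes price of a European put on a non-dividend-paying stock."""
    d1 = (log(s0 / strike) + (r + 0.5 * sigma**2) * maturity) / (sigma * sqrt(maturity))
    d2 = d1 - sigma * sqrt(maturity)
    return strike * exp(-r * maturity) * norm_cdf(-d2) - s0 * norm_cdf(-d1)

# ATM put: S0 = 100, K = 100, r = 3%, sigma = 15%, T = 1 year
print(round(bs_put(100.0, 100.0, 0.03, 0.15, 1.0), 2))  # 4.53
```

Note that the stock drift of five percent does not enter this formula; only the risk-free rate and the volatility do.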

If you want to check convergence to the Black-Scholes option price,

you can do it by making the risk aversion rate smaller

and the hedging frequency larger.

This actually will be a part of your homework for this week

along with the actual implementation of this dynamic programming scheme,

and on this note I would like to wrap up this week.

But, obviously, not before we make a quick stop here for

some questions. See you next week.