So, in the last video we talked about the setting of

batch reinforcement learning and what sort of data it works with.

Now, let's talk about how reinforcement learning works.

When the model is unknown,

we still deal with the same Bellman optimality equation that we had before.

To remind you of this equation, it says that,

"The time T optimal Q function is equal to the expected optimal rewards,

plus the discounted expected value of the next step optimal Q function."
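
For reference, the equation being described can be written down as follows (a sketch in one common notation, assuming a reward R_t, a discount factor gamma, and the optimal action-value function written as Q-star; the exact symbols on the slide may differ):

$$
Q_t^{\star}(X_t, a_t) \;=\; \mathbb{E}\!\left[\, R_t \;+\; \gamma \max_{a_{t+1}} Q_{t+1}^{\star}\!\left(X_{t+1}, a_{t+1}\right) \,\middle|\, \mathcal{F}_t \right]
$$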

Now, let's pay attention to

the expectation sign entering the right-hand side of this equation.

This is a one-step expectation,

involving next-step quantities, and it is conditional on

the information F_t available at time t. If we know the model,

this conditional one-step expectation can be calculated exactly, at least in theory.

For example, for a discrete-state model and a given

current state, to compute the right-hand side of the Bellman optimality equation,

we would simply sum up over

all possible next states with

the corresponding transition probabilities as weights in this sum.
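
As an illustration, here is a minimal sketch of that weighted sum for one state-action pair. The transition probabilities, rewards, and next-step Q-values used below are hypothetical inputs, not values from the lecture:

```python
import numpy as np

def bellman_rhs_discrete(p_next, r_next, q_next, gamma):
    """Model-based Bellman right-hand side for a discrete-state model.

    p_next : transition probabilities to each possible next state
    r_next : reward received for each possible next state
    q_next : 2D array, q_next[s, a] = optimal Q-value at the next time step
    gamma  : discount factor
    """
    # Maximizing over next actions gives the optimal next-step value per next state
    v_next = q_next.max(axis=1)
    # Sum over all possible next states, with transition probabilities as weights
    return np.sum(p_next * (r_next + gamma * v_next))

# Hypothetical example: 3 possible next states, 2 actions
p_next = np.array([0.2, 0.5, 0.3])
r_next = np.array([1.0, 0.0, -0.5])
q_next = np.array([[0.4, 0.6], [0.1, 0.2], [0.0, 0.3]])
print(bellman_rhs_discrete(p_next, r_next, q_next, gamma=0.9))
```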

On the other hand,

if we deal with a continuous state model,

then the calculation of expectations should involve integrals instead of sums.

But again, if the model is known,

such integrals can be computed in a straightforward way, at least in theory.
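
For the continuous-state case, a minimal sketch of the same computation replaces the weighted sum with numerical integration over a known transition density. The Gaussian density and the reward and value functions below are assumptions for illustration only:

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def bellman_rhs_continuous(density, reward, v_next, gamma, lower, upper):
    """Model-based Bellman right-hand side for a continuous next state.

    density : transition density p(x') of the next state, given current state and action
    reward  : reward as a function of the next state x'
    v_next  : optimal next-step value max_a Q*_{t+1}(x', a) as a function of x'
    gamma   : discount factor
    lower, upper : integration limits covering the support of the density
    """
    integrand = lambda x: density(x) * (reward(x) + gamma * v_next(x))
    value, _ = integrate.quad(integrand, lower, upper)
    return value

# Hypothetical example with a standard normal transition density
print(bellman_rhs_continuous(norm(loc=0.0, scale=1.0).pdf,
                             reward=lambda x: x,
                             v_next=lambda x: 0.5 * x**2,
                             gamma=0.9, lower=-8.0, upper=8.0))
```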

But what about the case when we do not know the model?

Well, in this case we can rely on

the most common trick of statistics and machine learning,

which amounts to replacing theoretical expectations with empirical means.

So, if we have samples that can be expressed as values entering the sum here,

then an estimate of the optimal Q function can be

obtained as an empirical mean of such values.
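
A minimal sketch of this replacement is shown below, assuming we already have samples of the one-step reward and of the optimal next-step value for a given state-action pair (the sampled numbers are synthetic, for illustration only):

```python
import numpy as np

def bellman_rhs_empirical(r_samples, v_next_samples, gamma):
    """Monte Carlo estimate of the Bellman right-hand side.

    r_samples      : sampled one-step rewards for a given state-action pair
    v_next_samples : sampled optimal next-step values max_a Q*_{t+1}(x', a)
    gamma          : discount factor
    """
    targets = r_samples + gamma * v_next_samples
    # The theoretical expectation is replaced by the empirical mean of sampled targets
    return targets.mean()

# Hypothetical samples
rng = np.random.default_rng(0)
r_samples = rng.normal(1.0, 0.5, 1000)
v_next_samples = rng.normal(2.0, 1.0, 1000)
print(bellman_rhs_empirical(r_samples, v_next_samples, gamma=0.9))
```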

Please note that this is in fact what Monte Carlo simulation does.

In our previous approach based on dynamic programming,

we first simulated forward paths of

the underlying stock S_t from known dynamics and, after that,

treated this generated sample as data to numerically

compute or estimate the option price and hedge.

Now, instead of generated data for the history

of stock prices, in the setting of reinforcement learning

we have actual price histories as a part of the information set

F_t that we discussed in the previous video.

So, if you observe stock prices and the rewards,

you might be able to rely on empirical means to estimate

the right-hand side of the Bellman optimality equation and hence to estimate

the optimal Q function at time t. But now, what about

situations where we do not have a fixed historical data set to compute empirical means?

For example, in online reinforcement learning,

an agent interacts with the environment in real time,

so that the amount of available data increases with time.

Now, the question is,

can we adjust a sample-based estimation of expectations to handle such settings?

And the answer to this question is yes,

this is indeed possible and this is achieved

using so-called stochastic approximation methods.

The most well-known stochastic approximation method is the Robbins-Monro algorithm.

The Robbins-Monro algorithm estimates the mean without directly summing the samples,

but instead by adding data points one by one.

Each time a data point is added,

the algorithm iterates to the next step.

So, we can introduce the index k as

both the iteration number and the index of a data point in the data set.

If we have a total of capital K points in the data set,

the algorithm can continue for all values of k between one and capital K. In each step,

it iteratively updates the running estimate of the mean.

We denote it here as hat X of k, as shown in this equation.
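
Below is a minimal sketch of this running-mean update, assuming the standard Robbins-Monro learning rate of one over k; the exact form of the equation on the slide may differ:

```python
import numpy as np

def robbins_monro_mean(data):
    """Estimate the mean of a data stream with the Robbins-Monro running update.

    Instead of summing all samples at once, each new point x_k updates the
    running estimate: x_hat_k = (1 - a_k) * x_hat_{k-1} + a_k * x_k,
    with learning rate a_k = 1/k (a standard Robbins-Monro schedule).
    """
    x_hat = 0.0
    for k, x_k in enumerate(data, start=1):
        a_k = 1.0 / k
        x_hat = (1.0 - a_k) * x_hat + a_k * x_k
    return x_hat

# Hypothetical stream of K data points; with a_k = 1/k the running estimate
# reproduces the plain sample mean
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)
print(robbins_monro_mean(data), data.mean())
```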

Now, let's see what is what in this equation.