Because the dynamics in the presence of market impact are non-linear,

we use an iterative method to solve this system.

The method performs an iterative linearization of

the dynamics, and then updates the Q-function and

F-function, which are written as

quadratic expansions around a reference value of the state variable.
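To make the iteration concrete, here is a minimal sketch on a toy one-period problem. The objective, the tanh impact function, and all variable names are illustrative assumptions, not the model from this lecture: we repeatedly linearize the nonlinear impact around a reference action, maximize the resulting quadratic objective in closed form, and update the reference point.

```python
import numpy as np

# Toy one-period objective with a nonlinear market impact (illustrative only):
#   R(a) = a * (p - mu_imp * tanh(a)) - lam * a**2
p, mu_imp, lam = 1.0, 0.5, 1.0

a_ref = 0.0  # reference action around which we linearize the impact
for _ in range(100):
    f = np.tanh(a_ref)   # impact at the reference point
    fp = 1.0 - f ** 2    # its derivative (d/da tanh(a) = 1 - tanh(a)**2)
    # With tanh(a) ~ f + fp * (a - a_ref), R(a) becomes quadratic in a:
    #   R_lin(a) = a * (p - mu_imp * f + mu_imp * fp * a_ref)
    #              - (mu_imp * fp + lam) * a**2
    # and is maximized in closed form:
    a_ref = (p - mu_imp * f + mu_imp * fp * a_ref) / (2.0 * (mu_imp * fp + lam))

# At convergence, a_ref satisfies the first-order condition of the true R(a).
```

The fixed point of this loop is a stationary point of the original non-linear objective, because the linearized first-order condition evaluated at the reference point coincides with the exact one.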

But as we said before,

this setting with observed states, actions, and rewards is not the only possible setting.

Imagine, for example, that you have

historical market data and historical portfolio data made of

portfolio positions and trades (that is,

actions), but you do not know the rewards received.

Such a problem may arise even if you have portfolio trading data,

but this data was obtained using trading strategies that do not use

the Markowitz-type mean-variance optimization that we assumed when we set up the problem.

In such a case, a trader may not even know what risk-aversion lambda

the data correspond to.

But even if a trader does not think in terms of maximization of

risk-adjusted returns that involves a risk-aversion parameter lambda,

the trader's actions can still be consistent with some value of lambda.

However, if this value (or values) of lambda is

unknown, it also means that we do not know the rewards either,

because rewards need to be computed from states and

actions by relations that depend on the parameters lambda and mu.

So, what can we do in this case?

Well, if rewards are not observed,

then instead of reinforcement learning we can use inverse reinforcement learning.

Inverse reinforcement learning, or IRL, deals with

problems where we only observe states and actions, but not rewards.

The problem of IRL is to

find the actual reward function and the optimal policy from data.

In general, this is a more complex problem than

direct reinforcement learning, because now we have to

find two functions rather than just one function from the data.

However, if we deal with a parametric model, then

both the policy function and the reward function are

functions of the same set of parameters, which

includes lambda, the impact parameter mu, and other model parameters.

In this sense, finding

the reward function and the action policy function

becomes the same problem, as both are expressed in terms of the same set of parameters.

In this setting, IRL becomes almost as

easy, or as hard, as direct reinforcement learning.

In particular, if we work with stochastic policies, as we

did in our course on reinforcement learning,

then the resulting policy would be a probability distribution.

Once we have it as a function of the original model parameters, we can estimate

these parameters simply by using maximum likelihood on observed trajectories.

Once these parameters are found, we can compute the rewards, as they depend only on states,

actions, and the model parameters.

Therefore, by doing maximum likelihood on

observed trajectories with a parametric model, we also recover the reward function.
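As a minimal illustration of this maximum-likelihood step, consider a hypothetical Gaussian policy whose mean is linear in the state; the parameter vector theta below stands in for model parameters such as lambda and mu, and the data are synthetic. This is only a sketch under those assumptions, not the model of the lecture.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic observed data: states x_t and actions a_t, but no rewards.
# Hypothetical parametric policy: a_t ~ N(theta0 + theta1 * x_t, sigma**2).
true_theta = np.array([0.5, -1.2])
sigma = 0.1
x = rng.normal(size=500)
a = true_theta[0] + true_theta[1] * x + sigma * rng.normal(size=500)

def neg_log_likelihood(theta):
    # Gaussian log-likelihood of observed actions, up to an additive constant
    resid = a - (theta[0] + theta[1] * x)
    return 0.5 * np.sum(resid ** 2) / sigma ** 2

result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
theta_hat = result.x  # estimate of the policy / reward parameters

# With theta_hat in hand, rewards can be recomputed from states and actions.
```

Because the policy and the reward share the same parameters here, fitting the policy by maximum likelihood is, at the same time, an estimate of the reward function.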

Finally, one more interesting problem formulation

is obtained when neither rewards nor actions are observed.

Two main questions here are: first,

where can such settings arise,

and second, how should we proceed in them?

Let's first discuss why such problem formulation can be interesting.

I can think of at least two problem formulations where it can be of interest.

The first one arises in intraday trading.

If you work for a large dealer whose trades can substantially move

the market via market impact,

you may want to know the strategies of your competitors.

You can see market prices, but you cannot directly observe the actions of your competitors.

However, if you have an estimate of the portfolio of your competitor,

and an estimate of the planning horizon of

the competitor, then you can still do inference of

your competitor's action policy if you

treat their actions as unobservable, or hidden, variables.

In this case, you can use algorithms that work with hidden variables to make inferences.

One such algorithm is the EM algorithm, which

we discussed several times in this specialization.
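Here is a minimal sketch of how EM can recover a parameter when actions are hidden. The setup is entirely hypothetical: a competitor's unobserved trades take values in {-1, +1} and move the price by mu times the trade plus noise, and we observe only the price moves. The E-step computes the posterior over the hidden action; the M-step re-estimates the impact parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data-generating process: hidden trades a_t in {-1, +1} move
# the price by mu * a_t plus Gaussian noise; only price moves dp_t are seen.
true_mu, sigma, n = 1.0, 0.5, 2000
a_hidden = rng.choice([-1.0, 1.0], size=n)
dp = true_mu * a_hidden + sigma * rng.normal(size=n)

mu = 0.1  # initial guess of the impact parameter
for _ in range(50):
    # E-step: posterior probability that the hidden action was +1,
    # which for this symmetric two-component model is a sigmoid.
    w = 1.0 / (1.0 + np.exp(-2.0 * mu * dp / sigma ** 2))
    # M-step: re-estimate mu from the soft action assignments.
    mu = np.mean((2.0 * w - 1.0) * dp)

mu_hat = mu  # EM estimate of the hidden-action impact parameter
```

The same alternation of inference over hidden actions and parameter updates carries over, with more bookkeeping, to inferring a competitor's action policy from observed prices.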

Another setting where we only observe states but not actions arises when

we consider market dynamics using the approach of inverse reinforcement learning.

This, in fact, will be the topic that we will discuss in our next video.