So now let's take a slightly more formal look at this decision process, shall we?

It is a model widely used in reinforcement learning,

it's called the Markov Decision Process.

It follows the same structure as we had in the previous chapter with

the kind of intuitive definition of a decision process,

but it's slightly more restricted and has math all around it.

So, again we have an agent and an environment,

and this time the environment has a state, denoted by S here,

and the state is what the agent can observe from the environment.

So, the agent will be able to pick an action and send it back into the environment.

The action here is denoted by A.

Those capital S and capital A are just sets of

all possible states and all possible actions because mathematicians love sets.

Now, this third arrow,

the vertical one, is how we formalize the feedback.

There is some kind of reward denoted by R. Again,

this is just a real number,

and the larger the reward gets,

the more the agent should be proud of himself

and the more you want to reinforce his behavior.
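This loop — agent observes a state, sends back an action, gets a reward — can be sketched in a few lines of code. Everything below (the `Environment` and `Agent` classes, the integer states, the walk-to-state-3 task) is a made-up toy for illustration, not any real RL library:

```python
import random

class Environment:
    """A toy environment: states are the integers 0..3; state 3 is the goal."""
    def __init__(self):
        self.state = 0  # s: the part of the environment the agent observes

    def step(self, action):
        # The agent's action (a = -1 or +1) nudges the state left or right.
        self.state = max(0, min(3, self.state + action))
        # r: just a real number; the larger it is, the more we reinforce.
        reward = 1.0 if self.state == 3 else 0.0
        return self.state, reward

class Agent:
    def act(self, state):
        # Pick an action from the set A of all possible actions.
        return random.choice([-1, +1])

env, agent = Environment(), Agent()
state = env.state
for _ in range(10):
    action = agent.act(state)         # agent picks an action and sends it back
    state, reward = env.step(action)  # environment answers with next state and reward
```

The three arrows from the diagram map directly onto the three values passed around here: state out, action in, reward back.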

Now this process is called a Markov Decision Process for a reason.

There's a thing called the Markov assumption,

which holds for such a process.

Intuitively, it means that the state, the S,

is sufficient to describe

the environment, and there is nothing else affecting how the environment behaves.

In terms of math, it means that whenever you want to predict

the probability of the next state and the reward your agent is going to get for his action,

you only need the current state of the environment and the agent's action to do so,

and no other input will be helpful.
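One way to picture the Markov assumption in code is a transition table keyed only by the pair (state, action) — the history of earlier states simply never appears as a key. The states, actions, rewards and probabilities below are invented for illustration:

```python
import random

# P[(state, action)] -> list of (next_state, reward, probability).
# By the Markov assumption the key is only (s, a): the full history of
# earlier states and actions would add nothing to the prediction.
P = {
    ("sunny", "walk"): [("sunny", 1.0, 0.8), ("rainy", -1.0, 0.2)],
    ("rainy", "walk"): [("rainy", -1.0, 0.6), ("sunny", 1.0, 0.4)],
}

def step(state, action):
    outcomes = P[(state, action)]
    states_rewards = [(s, r) for s, r, _ in outcomes]
    probs = [p for _, _, p in outcomes]
    # Sample the next state and reward from the conditional distribution.
    return random.choices(states_rewards, weights=probs)[0]

next_state, reward = step("sunny", "walk")
```

Note that `step` receives no record of where the process has been before — that is exactly what "no other input will be helpful" means.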

You probably noticed that from here on

things get slightly unrealistic with this Markov assumption.

This actually means that if you want

to show your users the absolutely best banners, what you need is,

you need a state that encompasses everything about the user that defines how he

behaves, and this may or may not

include the quantum states of all the particles in his brain.

So, this is, of course, impossible,

but just don't get too focused on it.

This is just a mathematical model and in practice you can,

of course, simplify it a little bit,

because models don't have to be accurate.

In fact, they're never accurate, they're just sometimes useful.

In this case, you can make do with some kind of higher-level features that you use

for your decision-making process

and just pretend that everything else is just noise,

which is what mathematicians usually do.

Now, as usual, we want to optimize our reward, our feedback,

but the difference here is that, unlike in our intuitive definition,

this time the environment can give you intermediate rewards after every time step.

Think of it this way:

you have a little robot and you want it to walk forward.

You can of course simply give it one reward per entire session:

whenever it falls, just measure how long it was

able to walk before it fell and reward it for this value.

But intuitively, you can try to give it some small bits of feedback whenever

it moves itself forward slightly over the duration of one turn.

Now, for the purposes of the simple algorithm we are going to see now,

this is no different from what we had before,

because we want to optimize

not just the individual rewards,

we want to optimize the sum of rewards per session.

So, we don't want to go as fast as possible at each single moment;

rather, we want to go as fast as possible over the duration of the entire episode.
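In code, the quantity we optimize is just the per-episode sum. The robot is hypothetical and the reward numbers below are made up, but they show why a cautious walker can beat a sprinter who falls over:

```python
def episode_return(rewards):
    # The quantity we optimize: the sum of all rewards in the session,
    # not any individual per-step reward.
    return sum(rewards)

# A robot that inches forward for a small reward each turn...
steady = [1, 1, 1, 1, 1, 1, 1, 1]   # eight small steps -> return 8
# ...versus one that sprints for two turns and then falls hard.
sprint = [5, 5, -20]                # fast start, bad fall -> return -10

episode_return(steady)  # 8
episode_return(sprint)  # -10
```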

This is also quite useful when you, for example,

train your agent to win a board game. Because in chess,

you can try to optimize the immediate reward.

You can try to,

say, take as many pawns as you can,

but this might result in you losing the game quickly,

because the move with the best immediate reward is not always the best move.

In fact, it's often the worst move you can take.

Now, what you want to do with this process

is you want to define an agent,

or rather train an agent, so that he picks

actions in a way that gets the highest overall reward.

This is formalized as a policy. Basically,

you can think of a policy, for now, as a probability

distribution that takes a state and assigns probabilities to all the possible actions.

Now, in this case,

you can just use whatever machine learning model, or even a table, to build the distribution.

This is for now outside our scope,

but we'll get into the implementation details later this week.
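For instance, a policy can literally be a table that maps each state to a distribution over all possible actions, from which the agent samples. The states, actions and probabilities below are invented for illustration:

```python
import random

# A tabular policy: for each state, probabilities over all possible actions.
policy = {
    "near_wall":  {"left": 0.7, "right": 0.2, "forward": 0.1},
    "open_space": {"left": 0.1, "right": 0.1, "forward": 0.8},
}

def sample_action(state):
    # Draw one action according to the policy's probabilities for this state.
    actions = list(policy[state])
    probs = list(policy[state].values())
    return random.choices(actions, weights=probs)[0]

action = sample_action("open_space")
```

A neural network would play exactly the same role: state in, action probabilities out.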

Again, we have a policy and we want to optimize the expected reward under the policy.

If you break down all the maths explicitly,

then you'll get the following weird formula,

which basically says that you have to,

well, sample the first state,

then take the first action based on this first state using your agent's policy,

then observe the second state and get your reward.

Then, take the second action,

third state, third action,

fourth state and so on,

until you reach the end of the episode,

then just add up all the rewards.
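That sampling procedure — first state, action from the policy, next state and reward, and so on until the episode ends, then add up the rewards — is a rollout, and averaging many rollouts estimates the expectation in the formula. Everything below (the toy `env_step`, the random policy) is an illustrative sketch, not a real implementation:

```python
import random

def rollout(env_step, initial_state, policy, max_steps=100):
    """Sample s1, a1, s2, r2, a2, s3, r3, ... and add up the rewards."""
    state, total = initial_state, 0.0
    for _ in range(max_steps):
        action = policy(state)                    # a_t drawn from the policy
        state, reward, done = env_step(state, action)
        total += reward                           # accumulate the session's reward
        if done:
            break
    return total

def expected_return(env_step, initial_state, policy, n_episodes=1000):
    # Monte Carlo estimate of the expectation: average the summed reward
    # over many sampled episodes.
    return sum(rollout(env_step, initial_state, policy)
               for _ in range(n_episodes)) / n_episodes

# Toy environment: walk right from state 0; reaching state 3 ends the
# episode with reward 1, every other step gives reward 0.
def env_step(state, action):
    nxt = max(0, min(3, state + action))
    return nxt, (1.0 if nxt == 3 else 0.0), nxt == 3

estimate = expected_return(env_step, 0, lambda s: random.choice([-1, 1]))
```

Better policies would push this estimate up — which is exactly the optimization the formula describes.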

In my humble opinion, the formula below looks slightly

uglier than the informal definition at the top.

So, it's only important that you grasp the concept.

You don't have to memorize this, of course.