Now, let's find out how we can solve this process.

Of course, reinforcement learning wouldn't fill

an entire course if there were just one way to solve all those processes.

But the general case of a solution should look like this.

You have your policy,

your probability distribution or the assignment of actions to states,

and you initialize it at random or via some kind

of prior domain knowledge you currently use for your particular problem.

Then you iteratively improve your policy.

You do so by, well, getting data:

playing a few games or sessions,

treating patients, or showing banners,

and then adjusting your policy in a way that makes it more efficient,

as far as you can tell given those sessions.

Then you repeat this process as your policy gradually gets better.

Now, one algorithm which is a very direct

instantiation of this pipeline is the cross-entropy method.

Now, just as a quick disclaimer,

the cross-entropy method is not strictly a reinforcement learning algorithm.

It's a general procedure for stochastic optimization.

But for the purpose of our course,

the cross-entropy method does the following things.

It starts with, say, a random policy.

So, all actions are equally likely to be taken.

And then it plays a few games,

a few sessions, with those random actions.

Let's use the bicycle example.

So, you're trying to ride a bicycle.

You're equally likely to push the pedals or

steer the wheel in any direction, and then you do so for, say, 100 times.

Well, initially, some of those attempts will be more lucky than others.

Sometimes you will fall right away,

sometimes you'll be able to ride for a few meters.

Now, the intuition here is that you have

those more lucky sessions where you were able to fare better than falling right away.

And you want to take the actions you took in those sessions more frequently.

So basically, you notice that if you

steer in the direction opposite to the direction of where you're falling,

then you ride for a longer duration.

Basically, you now want to increase the probability of doing so,

so that you'll be more likely to survive longer,

to ride for a longer duration without falling.

And then you repeat the whole process again.

Now you're more proficient at riding bicycles,

so you're less likely to do stupid things right away.

You play another 100 sessions.

You pick those of them which are more efficient.

Previously, that would mean not falling right away.

Now, it would mean not falling within the first meter,

or maybe riding for, say, 10 meters in a straight line.

And then you adjust your policy again and again using this simple iterative algorithm

until it reaches the optimal strategy and you are able to ride a bike like a pro.

Now formally, what it does is,

well, it plays N sessions,

picks M best of them,

denoted as elite sessions,

and then adjusts the policy to increase the probability of actions in elite sessions.
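These steps can be sketched as a short generic loop. Note that `play_session` and `update_policy` here are hypothetical placeholders for the environment interaction and the policy update, introduced just for illustration:

```python
import numpy as np

def crossentropy_method(play_session, update_policy, policy,
                        n_sessions=100, percentile=70, n_iterations=10):
    """Generic cross-entropy method loop.

    play_session(policy) -> (states, actions, total_reward)
    update_policy(policy, elite_sessions) -> new policy
    """
    for _ in range(n_iterations):
        # Play N sessions with the current policy.
        sessions = [play_session(policy) for _ in range(n_sessions)]
        rewards = [r for _, _, r in sessions]
        # Keep the M best ("elite") sessions: those whose total reward
        # is at or above a chosen percentile of all rewards.
        threshold = np.percentile(rewards, percentile)
        elites = [s for s in sessions if s[2] >= threshold]
        # Adjust the policy to make the elite actions more probable.
        policy = update_policy(policy, elites)
    return policy
```

The percentile plays the role of picking the M best out of N: with 100 sessions and the 70th percentile, roughly the top 30 sessions are treated as elite.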

Now, in the simplest case,

your policy is just a table in which you remember all the probabilities.

So you have 10 possible situations in which you can ride your bicycle.

Maybe you're falling to the left, falling to the right,

riding upright, going fast,

going slow, and so on.

And for each type of situation, you have probability of steering left or steering right,

or maybe accelerating and decelerating.

What you want to do is

initialize those probabilities to something like

a uniform distribution, which is useful in this case

because it ensures that you're likely to try all the actions. Then you play:

you use this table and you pick actions by sampling from those probability distributions.
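As a sketch, the table and the sampling step might look like this; the 10 situations and 4 controls are just the lecture's bicycle example, not a real environment:

```python
import numpy as np

n_states, n_actions = 10, 4  # e.g. 10 bicycle situations, 4 controls

# Uniform initialization: every action equally likely in every state,
# so early sessions are guaranteed to explore all the actions.
policy = np.full((n_states, n_actions), 1.0 / n_actions)

def sample_action(policy, state, rng=np.random.default_rng()):
    # Pick an action by sampling from this state's row of probabilities.
    return int(rng.choice(policy.shape[1], p=policy[state]))
```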

And so you play:

you try riding a bicycle for, say, 100 consecutive attempts.

Then you pick the best of those attempts:

out of 100, say, the top 10 or 25.

You call them the elite sessions,

and then you construct a new matrix

which follows the probability distribution of actions

in the elite sessions. So, it looks like this.

More formally speaking, for each state,

you remember how many times you appeared in

this state, and then how many times you took each action,

say how many times you steered left in this state, in the elite sessions.

You don't care about the unlucky sessions;

you only care about how often you turned left here in the elite sessions.

And if in elite sessions this action was the most likely one,

then it's likely that by doing this action, you'll get higher reward.

So you assign a higher probability to steering left in this situation.

So again, the intuitive way of thinking about it is that, for each state,

you count how many times you visited the state,

you count how many times you took action a in this state, and

you divide the number of times you took the action by the number of

times you appeared in this state to get the new probability distribution.
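That counting rule can be written down directly. Here `elite_states` and `elite_actions` are assumed to be the concatenated state and action lists from the elite sessions (my naming, for illustration):

```python
import numpy as np

def update_policy(elite_states, elite_actions, n_states, n_actions):
    """New tabular policy from elite sessions:
    policy[s, a] = (times action a was taken in s) / (times s was visited).
    States never visited in the elite sessions fall back to uniform."""
    counts = np.zeros((n_states, n_actions))
    for s, a in zip(elite_states, elite_actions):
        counts[s, a] += 1
    visits = counts.sum(axis=1, keepdims=True)
    return np.where(visits > 0, counts / np.maximum(visits, 1), 1.0 / n_actions)
```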

And then you repeat the whole process.

You now use your new probability distribution to play

another 100 games, or ride another 100 times, and this time hopefully, you'll fare better.

Then you get even better by updating the policy again.

Now, this particular algorithm is very efficient,

but it has a few drawbacks.

The first one, the simple one,

is that if you're only riding for 100 sessions,

it may happen that there are some rare states that you only

visit once or twice during those 100 sessions.

And in this case,

you are likely to get a very weird probability distribution.

So it will assign something like 100 percent probability to

one action, because you only had one chance to pick an action in that state.

Now, one possible solution is to basically

increase the number of attempts from 100 to 10,000,

but falling off a bicycle 10,000 times hurts.

So we want something better. And here you can,

for example, use a smoothing rule:

you can apply Laplace smoothing when computing the new policy.

What actually happens here is that you add

some small positive number to all the probabilities, before you normalize.

And this results in you never getting the probability of zero,

because you always add some small number.

So, even though you might have reached some particular state only once,

you'll still be able to take another action next

time, because you won't converge to 100 percent for this action from just one observation.
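A minimal sketch of that smoothing step, applied to the raw action counts before normalizing. The transcript only says "some small positive number"; the constant `alpha` and its value here are my additions (this additive trick is commonly called Laplace smoothing):

```python
import numpy as np

def smoothed_policy(counts, alpha=0.5):
    """Add a small positive alpha to every state-action count before
    normalizing, so no action's probability ever collapses to zero."""
    smoothed = counts + alpha
    return smoothed / smoothed.sum(axis=1, keepdims=True)
```

Even a state visited once, where a single action was taken, keeps a nonzero chance of trying the other actions next time.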

It gets much more complicated when you try to apply this

naive cross-entropy method algorithm to stochastic processes.

Let's say you have an environment which has some sort of randomness in it.

So, you are in a casino, you have two actions.

You can walk away, leave the casino for good.

Or you could get to the nearest slot machine,

insert a dollar, then pull the lever and in most cases you simply lose a dollar.

But in some cases you win say, three dollars.

Now, the problem here is that if you repeat this thing a hundred times,

not only will you probably lose around 100 dollars,

but you'll also get both situations where you were lucky because

you chose the right option of walking away from the casino,

and situations where you got lucky by pulling the lever and then

just getting the lucky outcome that had nothing to do with the algorithm.

Now, in this situation,

the problem is that if you select 10 or 25 best sessions,

they are very likely to be biased towards the cases where

you pulled the lever and just got the lucky outcome with plus three dollars.

Now, the problem here is,

as you have already guessed, that the agent will develop a gambling addiction.

It will think that whenever it pulls the lever,

it gets plus three dollars, and it will

keep pulling the lever until it loses all its money.

Now, this is problematic.

So, as a kind of informal quiz,

I want you to see if maybe you can find some way to adapt the cross-entropy method to

this stochastic situation in order not to get fooled by these lucky events.

So, we have just devised a method that can solve,

theoretically, any problem within some reasonable range.

It can optimize a strategy regardless of whether

this strategy is a strategy of showing banners to users, or

a strategy of training a robot so that it walks forward without

falling, or translating from one language to another: anything.

But the cross-entropy method seems to be a rather general-purpose algorithm.

It can apply to pretty much anything: from robots,

to advertisement optimization,

to recommendations, to machine translation, to finance. Anything.

The only problem with such general-purpose algorithms usually seems to be that they are

universally worse than the special-purpose methods designed for the problems they solve.

Now, the only issue with the cross-entropy method that we have left untackled

is how to apply it efficiently to problems at a greater scale.

Now, for the bicycle-riding problem, with say 10 states and maybe 4 actions,

it's probably sufficient to solve it.

But then, almost any problem-solving algorithm can do that.

Let's see how the cross-entropy method actually applies to something like, well,

steering an autonomous self-driving car,

or playing a game from raw pixel feedback.

The problem here is that it's no longer possible to store

a table with action probabilities for all the possible states,

because you don't have 10 states;

you have, say,

a bazillion states, or even a continuous space of states,

which is technically impossible to record.

If you consider camera feeds from your robot or from your game,

you'll probably find out that the number of possible camera images

is as large as the color depth,

say 256, to the power of the number of pixels.

So even for a 100 by 100 pixel image,

which is super small, you'll have 256 to the power of 10,000, which is insane.

You won't be able to record this on any hard drive ever.

Or maybe you will in a few years.
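That size estimate is easy to check with exact integer arithmetic:

```python
# Number of distinct 100x100 images with 256 intensity levels per pixel:
# 256 ** (100 * 100) = 256 ** 10000, a number with over 24,000 decimal
# digits -- so a separate table row per state is hopeless.
n_images = 256 ** (100 * 100)
n_digits = len(str(n_images))
```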