Of course, this is not the case for humans.

Humans can easily generalize, and

this is a property you want from your reinforcement learning algorithms as well.

This is where we approach the so-called approximate reinforcement learning.

This time we don't use the tabular policy.

We don't just record all the probabilities explicitly.

Instead, we use some kind of machine learning model, or

well, any model you want, to represent this probability distribution over actions given the state.

For example, it could be a neural network.

Basically, a neural network like those previously used for

classification: it takes a state, an image, and maybe applies some convolutional

layers, if you're familiar with image recognition with neural networks.

Then it would end with a softmax output layer with as many units as you have

actions.
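As a hypothetical sketch of such a policy model (the architecture, layer sizes, and function names here are illustrative assumptions, not from the lecture), a tiny network that maps a state vector to a softmax distribution over actions might look like this:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class PolicyNetwork:
    """Toy two-layer network: state vector -> probability of each action."""
    def __init__(self, state_dim, n_actions, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, size=(state_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, size=(hidden, n_actions))
        self.b2 = np.zeros(n_actions)

    def predict_proba(self, states):
        h = np.maximum(0, states @ self.W1 + self.b1)  # ReLU hidden layer
        return softmax(h @ self.W2 + self.b2)          # softmax over actions

net = PolicyNetwork(state_dim=4, n_actions=3)
states = np.random.default_rng(1).normal(size=(5, 4))
probs = net.predict_proba(states)
print(probs.shape)  # (5, 3)
# each row of probs sums to 1, so each row is a valid action distribution
```

For a real image input you would replace the dense layers with convolutional ones, but the interface is the same: state in, action probabilities out.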

You could also use any other method.

It could be logistic regression, any other linear model, or a random forest.

Anything that can model this probability distribution for you.

And since you can no longer set the probabilities or

update them explicitly after each iteration,

you'll have to replace this phase with training your neural network.

Again, you play M games in this Doom environment.

Then you select the best of them; you call those the elite games.

And then, instead of recomputing the whole table, you simply perform

several iterations of gradient descent, or build several new trees.
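The elite-selection step above can be sketched as follows (the helper name, percentile threshold, and toy data are assumptions for illustration):

```python
import numpy as np

def select_elites(states_batch, actions_batch, rewards_batch, percentile=70):
    """Keep the (state, action) pairs from sessions whose total reward
    reaches the given percentile of this batch's rewards."""
    threshold = np.percentile(rewards_batch, percentile)
    elite_states, elite_actions = [], []
    for states, actions, reward in zip(states_batch, actions_batch, rewards_batch):
        if reward >= threshold:
            elite_states.extend(states)
            elite_actions.extend(actions)
    return elite_states, elite_actions

# toy batch of 4 sessions with made-up states, actions, and total rewards
states_batch  = [[[1], [2]], [[3]], [[4], [5]], [[6]]]
actions_batch = [[0, 1],     [1],   [0, 0],     [1]]
rewards_batch = [1.0,        5.0,   2.0,        6.0]

s, a = select_elites(states_batch, actions_batch, rewards_batch, percentile=50)
print(a)  # [1, 1]  (actions from the two best sessions)
```

The surviving pairs are then used as a supervised training set for the next model update.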

Well, let's consider the neural network option for now.

In the case of a neural network,

what you do is you initialize this neural network with random weights.

You probably remember some clever ways you can do so from the deep learning course.

And you take this network and use it to pick actions,

to actually play 100 games in this Doom environment we had in the previous slide.

Then you again take the best sessions, several of the best, and

you would denote them as the elite sessions.

And you train your neural network to increase the probability of actions in

the elite sessions.

You do so the way you usually train neural networks to classify.

You have states which are the inputs of your neural network, and

you have the actions in the session that serve the purpose of kind of the correct

answer, the target y, in the supervised notation.

Now you take some kind of stochastic optimization algorithm, say

stochastic gradient descent, or RMSProp, or anything like that.

And you perform one or several iterations of network updates given those

elite sessions, then you repeat the process.
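Putting the whole loop together, here is a minimal sketch of this crossentropy-method style training. Everything concrete in it is an assumption for illustration: a made-up toy environment stands in for Doom, a linear softmax policy stands in for the full network, and the update is plain gradient ascent on the log-probability of elite actions (the same cross-entropy objective as supervised classification):

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS = 4, 2
W = rng.normal(0, 0.1, size=(STATE_DIM, N_ACTIONS))  # linear softmax policy

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def play_session(W, t_max=10):
    """Toy stand-in for the game: reward is 1 per step when the action
    matches the sign of the first state component, else 0."""
    states, actions, total_reward = [], [], 0.0
    for _ in range(t_max):
        s = rng.normal(size=STATE_DIM)
        a = int(rng.choice(N_ACTIONS, p=softmax(s @ W)))
        total_reward += 1.0 if (s[0] > 0) == (a == 1) else 0.0
        states.append(s)
        actions.append(a)
    return states, actions, total_reward

def train_step(W, elite_states, elite_actions, lr=0.5):
    """One gradient step that increases the log-probability of the
    elite actions given their states (cross-entropy objective)."""
    X = np.array(elite_states)
    y = np.array(elite_actions)
    P = softmax(X @ W)
    T = np.zeros_like(P)
    T[np.arange(len(y)), y] = 1.0      # one-hot "correct answers"
    return W + lr * X.T @ (T - P) / len(y)

for _ in range(30):
    sessions = [play_session(W) for _ in range(50)]      # play M games
    rewards = [r for _, _, r in sessions]
    threshold = np.percentile(rewards, 70)               # elite cutoff
    elite_s = [s for st, ac, r in sessions if r >= threshold for s in st]
    elite_a = [a for st, ac, r in sessions if r >= threshold for a in ac]
    W = train_step(W, elite_s, elite_a)                  # update, then repeat

final_reward = float(np.mean([play_session(W)[2] for _ in range(100)]))
print(final_reward)  # should clearly beat the random-policy average of 5.0
```

The structure mirrors the lecture: play sessions with the current policy, keep the elite ones, fit the policy to the elite (state, action) pairs, and repeat. Swapping the linear model for a neural network only changes `train_step`, not the loop.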