It's a bit harder than the previous one, isn't it?

So the idea here is that,

it's really easy for you to understand what to do,

but Q-learning doesn't explicitly learn what to do.

It instead tries to learn what kind of value,

what Q-function you'll get if you take this action,

and that's kind of hard, especially if you consider this applied to everyday problems.

Let's say that you have a very simple problem of

whether or not you go for coffee,

so you can make yourself a coffee in another room.

You can either go there and drink a coffee and then proceed,

or you can stay here and avoid drinking coffee.

Now, what you actually do is,

you either feel like you want to do this,

or you feel like you don't, and this is very simple.

What Q-learning has to do

is try to learn the value of your entire life from this moment to the end,

and it tries to add up all the rewards you're going to get today, the next day,

the day after that, with some gamma-related discount coefficients,

and this is kind of impractical,

especially because it requires you to predict what's going to happen in the future.
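The gamma-weighted sum of future rewards described above can be sketched in a couple of lines (a minimal illustration; the daily reward values here are made up):

```python
# Discounted return: G = r_0 + gamma * r_1 + gamma^2 * r_2 + ...
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    # accumulate from the last day backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g

daily_rewards = [1.0, 0.5, 2.0]  # made-up rewards for today, tomorrow, the day after
print(discounted_return(daily_rewards))  # equals 1.0 + 0.9 * 0.5 + 0.9**2 * 2.0
```

The impractical part is not this sum itself, but that the future rewards it needs are unknown at decision time.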

So when I say difficult,

I actually mean that it's not only difficult for you to add up rewards with gammas,

it's also difficult for a neural network to approximate.

Or, for that matter, any other algorithm.

You have your DQN or something related to DQN,

trying to learn the game of Breakout,

or deciding whether it wants to drink a cup of coffee.

You actually have a squared error minimization problem under the hood.

So what it tries to do is minimize the squared error between

the predicted Q-function and the temporal-difference-corrected Q-function,

which is on the right here.

So basically, it's reward plus gamma times whatever.

And if you remember last week,

we actually considered this last term,

reward plus gamma whatever, to be constant,

not dependent on the parameters of the Q-function.
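This objective can be sketched without any deep learning machinery (a tabular toy example; the states, reward, and step size are made up). The key point from last week survives here: the target, reward plus gamma times the best next Q-value, is computed once and then treated as a fixed number during the update:

```python
gamma, alpha = 0.99, 0.1
n_states, n_actions = 4, 2
Q = [[0.0] * n_actions for _ in range(n_states)]  # tabular stand-in for a Q-network

def td_update(s, a, r, s_next):
    # TD target: reward plus gamma times "whatever" (the best next Q-value),
    # held constant during the update, as in the squared-error objective
    target = r + gamma * max(Q[s_next])
    squared_error = (Q[s][a] - target) ** 2   # the quantity DQN-style methods minimize
    Q[s][a] += alpha * (target - Q[s][a])     # step toward the target
    return squared_error

err = td_update(s=0, a=1, r=1.0, s_next=2)
print(err, Q[0][1])  # -> 1.0 0.1
```

In a DQN the table is replaced by network parameters and the step by a gradient step, but the squared error between prediction and constant target is the same.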

When it comes to real-world applications,

your neural network's size is usually insufficient,

because otherwise it would train for ages.

What this actually means is that your neural network will never be able to

approximate all Q-values in a way that has no error,

so it will have some approximation error there,

and what it actually tries to do

is make an approximation which

minimizes the loss function, or the mean squared error in this case.

So now, imagine that you have two possible Q functions.

You have two possible outcomes of your learning,

and you are considering them on two different states,

S zero and S one.

Now in those two states,

you have two actions, A zero and A one.

Let's imagine that you only have two actions,

and that on all other states,

your neural networks are identical.

This is just for simplification.

The first column here

is the kind of true Q-values,

the Q-values that your agent is actually going to

get if it takes this particular action,

and then follows its policy, or the optimal one.

Now, you have the first two rows corresponding to S zero.

In this case, the first action brings you a Q-value,

a true Q-value, of one.

The second one brings you two,

and you have the second state, S one,

and in this case, the first action brings you three,

and the second one brings you 100.

This is not very practical,

but it kind of serves the point of explaining the idea,

so you have two networks, two possible approximations.

The first approximation is exact on the first state,

S zero, so it captures the one and two Q-values exactly.

But on the second state,

it gets the first action right,

but it fails to grasp the actual Q value of the second action.

Then, you have the second possible option of what you can learn.

In this case, the second state is approximated ideally, but the first state,

S zero, has its Q-values off,

so it has an error of plus one and minus one there.

The question is, which of those two Q functions would you prefer?

Which of them would get a better average reward per session,

or which of them will take the optimal action more frequently?
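To make the question concrete, one can put the example into numbers (a small sketch with NumPy; the value 50 in approximation A is a made-up stand-in for "fails to grasp the 100"):

```python
import numpy as np

# True Q-values from the example: rows = states (S0, S1), columns = actions (A0, A1)
q_true = np.array([[1.0, 2.0],
                   [3.0, 100.0]])

# Approximation A: exact on S0, misses the big Q-value at (S1, A1)
q_a = np.array([[1.0, 2.0],
                [3.0, 50.0]])

# Approximation B: exact on S1, off by +1 / -1 on S0
q_b = np.array([[2.0, 1.0],
                [3.0, 100.0]])

for name, q in [("A", q_a), ("B", q_b)]:
    mse = np.mean((q - q_true) ** 2)                           # what Q-learning minimizes
    greedy_ok = np.all(q.argmax(axis=1) == q_true.argmax(axis=1))  # does greedy act optimally?
    print(name, "MSE =", mse, "greedy actions optimal:", greedy_ok)
```

The numbers show the tension the question points at: the approximation with the much larger squared error is the one whose greedy policy picks the optimal action in both states.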