0:00

So, you have just learned about this cool architecture, the deep Q-network, also known as DQN.

By now, you are probably convinced that it's capable of doing all those cool things, like finding an optimal policy in an Atari game or another video game you feed it.

In particular, the deep Q-network is just Q-learning that uses a neural network to approximate the Q-function and features a couple of dirty hacks, like experience replay and target networks.

Dirty hacks, but hacks nevertheless.

The network itself is basically just a convolutional network, which takes your image pixels, in particular the four last frames, and feeds them into a set of convolutional layers, the body of the model in the blue square.

It then uses those features from the body, from the blue square, to compute the Q-values with yet another dense layer, without a nonlinearity, of course.
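As a rough sketch, the forward pass might look like the following. This is a simplified stand-in with made-up layer sizes: a single dense hidden layer plays the role of the convolutional body, whereas the real DQN applies several convolutional layers to the stacked frames. Only the overall shape of the computation is the point here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simplified stand-in for the DQN forward pass: the last 4 frames are
# stacked and flattened, a hidden layer plays the role of the
# convolutional "body", and a final dense layer with no nonlinearity
# outputs one Q-value per action.
n_pixels = 4 * 84 * 84          # 4 stacked 84x84 grayscale frames
n_hidden, n_actions = 64, 3     # made-up sizes for the sketch

W_body = rng.normal(0, 0.01, (n_pixels, n_hidden))
W_head = rng.normal(0, 0.01, (n_hidden, n_actions))

def q_values(frames):
    x = frames.reshape(-1)                # flatten the stacked frames
    features = np.maximum(0, x @ W_body)  # "body": features with ReLU
    return features @ W_head              # "head": linear, no nonlinearity

frames = rng.random((4, 84, 84))
print(q_values(frames).shape)   # one Q-value per action -> (3,)
```

Note that the output layer is linear: Q-values are unbounded regression targets, so no squashing nonlinearity is applied at the end.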

Now, this particular network is trained by minimizing the squared temporal difference error, as you have already been told: the squared error between its current Q-values and the refined Q-values, which are considered constant.

This target is the reward plus gamma times the maximum of, basically, the next state's Q-values.

You do the minimization by following some kind of minibatch gradient descent, be it the usual gradient descent, stochastic gradient descent, Adam, or RMSProp, or any other numeric algorithm that improves convergence in this minibatch context.

Now, this is how it works, and next we're going to study a few peculiar things about how it doesn't work sometimes.

To begin with, we have the right-hand side of the loss function: the reward plus gamma times the maximum of Q-values.

The first thing you've probably learned by now is that, to actually perform the minimization, you consider this term constant.

So, you do not propagate gradients through it; you just instruct your framework to treat it as a constant value for this particular iteration.
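As an illustration, here is a minimal numpy sketch with made-up numbers (not any particular framework's API): treating the target as a constant means the gradient of the squared error is taken only with respect to the current Q-value.

```python
import numpy as np

gamma = 0.99
q_sa = 3.0                            # current estimate Q(s, a)
reward = 1.0
q_next = np.array([2.0, 5.0, 4.0])    # Q(s', a') for the three actions

# Target: reward + gamma * max_a' Q(s', a').  It is treated as a
# constant for this iteration -- no gradient flows through q_next.
target = reward + gamma * q_next.max()   # 1 + 0.99 * 5 = 5.95

loss = (q_sa - target) ** 2
# d(loss)/d(q_sa), with `target` held fixed:
grad = 2.0 * (q_sa - target)

lr = 0.1
q_sa -= lr * grad                # one gradient descent step toward the target
print(round(q_sa, 3))            # -> 3.59, moved toward the target of 5.95
```

In an autograd framework, "holding the target fixed" is exactly what a stop-gradient or detach operation does; here it falls out for free because the gradient is written by hand.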

This has probably been explained to you already.

But this particular formula, the reward plus gamma times the maximum expression, has a number of other problems that we have not yet covered.

For instance, consider the following situation: your network is in a particular state, here s', and you have three possible actions.

These are a0, a1, and a2, respectively.

In fact, each action can be stochastic due to a number of causes, so each action will yield you action values that vary over time.

For starters, your network will have some approximation error, and due to gradient descent, you will have some perturbations in those Q(s', a0), Q(s', a1), Q(s', a2) values, just because the network trains and, between iterations, its values change slightly.

Another issue here is that the outcomes are often stochastic, and therefore Q(s', a) and so on may vary as well.

Now, the issue with those stochastic parts is that once we actually apply the formula we use for Q-learning, we get some unintended effects.

So, let's say we have three Q-values; think of this as playing Breakout, and the actual values of all possible actions are equal to, say, 10, with some standard deviation of, say, one.

So, these are three bell-curve-like, Gaussian-like distributions; they may not be exactly Gaussian, but let's say they are Gaussian just for the sake of a neat display here.

So, you have those three things, and what you want to do is compute the value of the state, the maximum of the action values.

Now, since those true action values are in fact all equal to 10, the state value is equal to 10 as well.

But what you'll find out is that, if you actually only access one sample from those distributions, if you only get one outcome per action, you'll see that you don't compute the maximum of expectations.

You compute the maximum over samples, and you basically take the expectation over those situations and train on it.

The question to you is: you have those two quantities, the maximum of expectations, which is obviously equal to 10, as all the expectations are equal to 10, and you have just the maximum over samples, and I want you to take the expectation of this maximum.

Will it be equal to 10, or will it be smaller or larger?

What kind of value would you expect? Well, right.

Basic statistics tells us that it should be higher in expectation.

If you consider those three actions here, the maximum over expected action values is this blue curve on the left in the picture.

Then, if you take the maximum over samples and compute the distribution of those maxima, you'll get the green curve, which is not equal to it.

It's quite optimistic.

This happens because, if you draw three samples from one distribution, their maximum is probably going to be to the right; it's going to be larger, because it's a maximum.

You can, of course, find a more formal derivation in any statistics book; we'll link to a Wikipedia explanation in the reading section.

The general idea is that, if you use this maximization over samples, you'll get something which is larger than what you actually want.
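You can check this with a quick Monte Carlo simulation, a small sketch using the numbers from the example above: three actions whose true values are all 10, with Gaussian noise of standard deviation 1.

```python
import numpy as np

rng = np.random.default_rng(42)

# Three actions, each with true Q-value 10 and Gaussian noise (std 1).
samples = rng.normal(loc=10.0, scale=1.0, size=(100_000, 3))

max_of_expectations = 10.0                       # max over the true values
expectation_of_max = samples.max(axis=1).mean()  # what Q-learning estimates

print(expectation_of_max)   # about 10.85 -- biased upward, never 10
```

Even with all true values equal, taking the max over noisy samples adds roughly 0.85 of spurious value here, which is exactly the overestimation bias being discussed.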

Therefore, if you have some particular states in which your value estimate fluctuates, for example because the network is still trying to learn it, then your network will be over-optimistic: it will over-appreciate being in this state, although this only happened due to the statistical error.

So, this is the problem which causes the actual DQN to get so optimistic that it actually explodes: the Q-values become larger and larger over time on some games, and sometimes they never come back down.

So, they stay optimistic all the time.