[MUSIC]

So let's return to our problem of estimating the gradient of

the objective with respect to the parameters phi.

In the previous video, we discussed that if we use something called [INAUDIBLE],

we can build a stochastic approximation of this gradient.

But the variance of this stochastic approximation will be really high.

Therefore, it will be really inefficient to use this approximation to train

the [INAUDIBLE].

So let's look at a really nice, simple, and

brilliant idea for making this approximation much better.

So, first of all, let's recall that

ti is a sample from the distribution q of ti given xi and phi.

Let's make a change of variables.

So let's say, instead of sampling ti, we'll sample some new variable

epsilon i from the standard normal, and then we'll make ti from this epsilon i

by multiplying it element-wise by some

standard deviation, si, and by adding the mean, mi.

So this way, the distribution of this expression,

epsilon i times si plus mi, is

just q: it's the same as the distribution of ti.

So instead of sampling ti from this distribution q,

we can sample epsilon i and then apply this deterministic function g,

which multiplies by si and

adds mi, to get a sample from the actual distribution of ti.

So we're doing a change of variables.

Instead of sampling ti, we're sampling epsilon i and

then converting it to a sample of ti.
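
As a minimal sketch of this change of variables, the snippet below draws epsilon from a standard normal and turns it into a sample of t by multiplying by s and adding m. The scalar values of m and s and the helper name `reparameterized_sample` are hypothetical, chosen only for illustration; in the lecture's model they would come from the network with parameters phi.

```python
import random
import statistics


def reparameterized_sample(m, s, rng):
    """Draw t ~ N(m, s^2) as t = eps * s + m with eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)  # noise that does not depend on m or s
    return eps * s + m         # deterministic transform g


rng = random.Random(0)
m, s = 2.0, 0.5  # hypothetical mean and standard deviation

samples = [reparameterized_sample(m, s, rng) for _ in range(100_000)]
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)

# The empirical mean and standard deviation should be close to m and s.
print(sample_mean, sample_std)
```

The empirical mean and standard deviation of the transformed samples should land close to m and s, which is the sense in which the two sampling procedures give the same distribution.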

And now we can change our objective. We can look at the objective and,

instead of computing the expected value

with respect to the distribution q,

we can now compute the expected value with respect to the distribution of epsilon i.

And then instead of ti, use this function of epsilon i everywhere.

And this is an exact expression, we didn't lose anything,

we just changed the variables.

So instead of considering distribution on ti, we're considering distribution

on epsilon i and then converting these epsilon i samples to samples from ti.
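
Written out, the change of variables could look as follows. The notation is an assumption based on the spoken description: g is the deterministic function, the product is element-wise, and only the expected-value term of the objective is shown.

```latex
\mathbb{E}_{q(t_i \mid x_i, \phi)} \log p(x_i \mid t_i, w)
  = \mathbb{E}_{\varepsilon_i \sim \mathcal{N}(0, I)}
    \log p\bigl(x_i \mid g(\varepsilon_i, x_i, \phi), w\bigr),
\qquad
g(\varepsilon_i, x_i, \phi) = \varepsilon_i \odot s_i + m_i .
```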

And now this g, this function that converts

epsilon i to ti, depends on xi and on phi.

To convert your epsilon i, it passes your image xi through

a convolutional neural network with parameters phi,

computes si and mi, and then multiplies epsilon i by si and adds mi.

This is a deterministic function [INAUDIBLE].

And now we can push the gradient sign inside the expected value,

past the probability density of epsilon i, because this density doesn't depend on phi.

It doesn't depend on the parameters we are differentiating with respect to.

And this means that now we have an expected value of some expression

without ever introducing some artificial

distributions like in the previous video.

We obtain the expected value naturally.

And now this expected value is with respect to the distribution of epsilon i,

which is just a standard normal without any parameters.

So we can approximate this thing with a sample from the standard normal.

And so ultimately we have rewritten

the gradient of our objective with respect to phi

as a sum over objects of the expected value, with respect

to the standard normal, of the gradient of some function.

And this is just the standard gradient of your whole neural network,

which defines the whole operation [INAUDIBLE].
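
In symbols, the rewritten gradient and its one-sample approximation might be written like this. The notation is assumed from the spoken description, and the hat marks a single draw from the standard normal.

```latex
\nabla_\phi \sum_i \mathbb{E}_{\varepsilon_i \sim \mathcal{N}(0, I)}
  \log p\bigl(x_i \mid g(\varepsilon_i, x_i, \phi), w\bigr)
= \sum_i \mathbb{E}_{\varepsilon_i \sim \mathcal{N}(0, I)}
  \nabla_\phi \log p\bigl(x_i \mid g(\varepsilon_i, x_i, \phi), w\bigr)
\approx \sum_i \nabla_\phi
  \log p\bigl(x_i \mid g(\hat{\varepsilon}_i, x_i, \phi), w\bigr),
\qquad \hat{\varepsilon}_i \sim \mathcal{N}(0, I).
```

The gradient can move inside the expectation because the standard normal density of epsilon i has no dependence on phi.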

And now you can redraw this picture as follows.

You have an input image x.

You pass it through a convolutional neural network with parameters phi.

You compute the variational parameters m and s,

then you sample one vector epsilon from the standard normal distribution.

And then you use all these three values, m, s, and

epsilon, to deterministically compute ti.

And then you put this ti inside this second convolutional neural network.

So when you define your model like this, you have

only one place where you have stochastic units:

this epsilon i from the standard

normal distribution.

And this way, you can differentiate your whole neural structure

with respect to phi and w without trouble.

So you can just use TensorFlow, and

it will find the gradients with respect to all the parameters for you,

because now you don't have to differentiate through sampling.

Sampling is kind of an outside procedure: it just gives

inputs to deterministic functions.

And this is basically an implementation of the theory we have just discussed, this

reparameterization.
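
To illustrate why sampling becomes an outside procedure, here is a small sketch in plain Python rather than TensorFlow. The scalar `encoder` and `decoder` functions, the unit-variance Gaussian likelihood, and all parameter values are hypothetical stand-ins for the two networks. Once epsilon is fixed, the whole pipeline is an ordinary deterministic function of phi, so a finite-difference derivative agrees with the chain-rule derivative.

```python
import math
import random


def encoder(x, phi):
    """Toy stand-in for the first network: maps x to m and s."""
    m = phi[0] * x + phi[1]
    s = math.exp(phi[2] * x + phi[3])  # exp keeps s positive
    return m, s


def decoder(t, w):
    """Toy stand-in for the second network: maps t to a mean for x."""
    return w[0] * t + w[1]


def log_likelihood(x, eps, phi, w):
    """log p(x | t, w) for a unit-variance Gaussian, with t = eps * s + m."""
    m, s = encoder(x, phi)
    t = eps * s + m  # the only place where the noise enters
    mu = decoder(t, w)
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)


rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)  # sampled once, outside the network
x, phi, w = 1.5, [0.3, -0.1, 0.2, -0.5], [0.8, 0.1]

# With eps fixed, repeated evaluation gives the same value.
value_a = log_likelihood(x, eps, phi, w)
value_b = log_likelihood(x, eps, phi, w)

# Central finite difference with respect to phi[0].
h = 1e-5
phi_plus = [phi[0] + h] + phi[1:]
phi_minus = [phi[0] - h] + phi[1:]
fd_grad = (log_likelihood(x, eps, phi_plus, w)
           - log_likelihood(x, eps, phi_minus, w)) / (2 * h)

# Chain rule by hand: d/dphi[0] = (x - mu) * w[0] * x.
m, s = encoder(x, phi)
mu = decoder(eps * s + m, w)
analytic_grad = (x - mu) * w[0] * x
```

An automatic differentiation library does the chain-rule step for every parameter at once; the hand-derived gradient here only covers `phi[0]` to keep the check short.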

And now we're going to approximate our gradients by just sampling one point,

and then using the gradient of the log of this complex function.

And this complex function, log of p of xi given g and

w is just this full neural network with both encoder and decoder.
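
The one-sample estimate can be sketched on a toy problem where the exact answer is known. The function `f` below is a hypothetical stand-in for the log-likelihood as a function of t, and m and s are treated directly as the parameters instead of being produced by a network. Averaging many one-sample estimates is done here only to check that they center on the true gradient; training would use one sample per object per step.

```python
import random


def f(t):
    """Hypothetical stand-in for log p(x | t, w) as a function of t."""
    return -(t - 3.0) ** 2


def df_dt(t):
    """Derivative of f with respect to t."""
    return -2.0 * (t - 3.0)


def single_sample_gradient(m, s, rng):
    """One-sample estimate of d/dm and d/ds of E[f(eps * s + m)]."""
    eps = rng.gauss(0.0, 1.0)
    t = eps * s + m
    # Chain rule through t = eps * s + m: dt/dm = 1 and dt/ds = eps.
    return df_dt(t) * 1.0, df_dt(t) * eps


rng = random.Random(0)
m, s = 1.0, 0.5
n = 200_000

sum_m = 0.0
sum_s = 0.0
for _ in range(n):
    g_m, g_s = single_sample_gradient(m, s, rng)
    sum_m += g_m
    sum_s += g_s
avg_grad_m = sum_m / n
avg_grad_s = sum_s / n

# Exact values for this toy f, since E[f] = -((m - 3)^2 + s^2).
true_grad_m = -2.0 * (m - 3.0)
true_grad_s = -2.0 * s
print(avg_grad_m, true_grad_m, avg_grad_s, true_grad_s)
```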

So to summarize, we have just built a model that allows

you to fit a probability distribution, like p of x,

to data with a complicated structure, for example, to images.

And it uses a model of an infinite mixture of Gaussians.

But to define the parameters of these Gaussians, it uses a convolutional

neural network with parameters that are trained with variational inference.

And for learning, we can't use the usual expectation maximization because we

have to approximate.

And we also can't use variational expectation maximization because it also

[INAUDIBLE].

So we derived a kind of stochastic version of variational inference

that is applicable, first of all, to large data sets,

because we can use mini-batches.

And second of all, it's applicable to this model: you couldn't

have used the usual variational inference for this complicated model,

because it has neural networks inside and every integral is intractable.

And the resulting model is called a variational autoencoder.

It's like the plain, usual autoencoder, but it has noise inside and

uses [INAUDIBLE] regularization to make sure that the noise stays,

that the [INAUDIBLE] chooses the right amount of noise to use.

And it can be used, for example, to generate nice images, to handle missing data,

to find [INAUDIBLE] in the data, and stuff like that.

[MUSIC]