0:03

So let's see how we can improve the idea of variational inference,

such that it will be applicable to our latent variable model.

So again, the idea of variational inference is to

maximize a lower bound on the thing we actually want to maximize,

with respect to a constraint

that says that the variational distribution Q for each object should be factorized:

a product of one-dimensional distributions.

And let's emphasize the fact that each object has

its own individual variational distribution Q_i,

and these distributions are not connected in any way.

So, one idea we can use here is as follows.

If saying that the variational distribution Q for each object is factorized is not enough,

let's approximate it even further,

and let's say that it's a Gaussian.

So not only factorized, but a factorized Gaussian.

This way everything should be easier, right?

So, every object has its own latent variable T_i,

and this latent variable T_i will have a variational distribution Q_i,

which is a Gaussian with some parameters m_i and s_i,

which are parameters of our model that we want to train.

Then we will maximize our lower bound with respect to these parameters.

So, it's a nice idea,

but the problem here is that we just added a lot of parameters for each training object.

For example, if your latent variable T_i is 50-dimensional,

so it's a vector with 50 numbers,

then you just added 50 numbers for the vector m_i for each object,

and 50 numbers for the vector s_i for each object.

So 100 numbers, 100 parameters for each training object.

And if you have a million training objects,

then it's not a very good idea to add something like 100 million parameters to your model,

just because of some approximation, right?

It will probably overfit,

and it will probably be really hard to train because

of this really high number of parameters.

And also it's not obvious how to find these parameters, m and s,

for new objects to do inference,

to make predictions or to generate data,

because for new objects,

you have to solve

some optimization problem again to find these parameters, and that can be slow.
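To make the bookkeeping concrete, here is a quick sketch of the parameter count under the numbers used above (a 50-dimensional latent variable and a million training objects; both figures are just the lecture's running example):

```python
# Per-object variational parameters for a factorized Gaussian q_i:
# one mean vector m_i and one (log-)std vector s_i per training object.
latent_dim = 50          # dimensionality of each latent variable T_i
n_objects = 1_000_000    # number of training objects

params_per_object = latent_dim + latent_dim  # m_i plus s_i
total_params = params_per_object * n_objects

print(params_per_object)  # 100
print(total_params)       # 100000000
```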

Okay, so we said that

approximating the variational distribution with a factorized one is not enough.

Approximating the factors of the variational distribution with Gaussians is nice,

but we have too many parameters for each object,

because these Gaussians are not connected to each other:

they have separate parameters.

So let's try to connect these variational distributions Q_i of individual objects.

One way we could do that is to say that they are all the same,

so all the Q_i's are equal to each other.

We could do that, but it would be too restrictive,

and we would not be able to train anything meaningful.

Another approach is to say that all the Q_i's are the same distribution,

but it depends on X_i and some weights.

So let's say that each Q_i is a normal distribution,

which has parameters that somehow depend on X_i.

It turns out that now each Q_i is actually different,

but they all share the same parameterization.

So they all share the same form.

And now, even for new objects,

we can easily find the variational approximation Q.

We can pass the new object through the function m

and through the function s,

and find the parameters of its Gaussian.

And this way, we now need to maximize our lower bound

with respect to our original parameters W, and this parameter Phi,

which defines the parametric way in which we

convert X_i into the parameters of the distribution.

And how can we define this function m of X_i,

with parameters Phi?

Well, as we have already discussed,

convolutional neural networks are a really powerful tool for working with images, right?

So let's use them here too.

So now we will have a convolutional neural network with

parameters Phi that looks at your original input image,

for example of a cat,

and then transforms it into the parameters of your variational distribution.

And this way, we have defined how we can

approximate the variational distribution Q in this form, right?
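As a minimal sketch of this amortized encoder, here a single linear layer stands in for the CNN with parameters Phi; the layer weights, dimensions, and the exp used to keep s positive are all illustrative assumptions, not the lecture's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the encoder with parameters phi: a single linear layer
# (instead of a CNN) mapping an image x to the Gaussian parameters
# (m(x), s(x)) of the variational distribution q(t | x).
input_dim, latent_dim = 784, 50
phi_W = rng.normal(0, 0.01, size=(2 * latent_dim, input_dim))
phi_b = np.zeros(2 * latent_dim)

def encode(x):
    h = phi_W @ x + phi_b
    m = h[:latent_dim]           # mean vector of q
    s = np.exp(h[latent_dim:])   # std vector of q; exp keeps it positive
    return m, s

x = rng.normal(size=input_dim)   # a fake flattened image
m, s = encode(x)
print(m.shape, s.shape)          # (50,) (50,)
```

The point of the shared weights phi_W is exactly the amortization described above: the same parameters produce a different Gaussian for every input.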

Okay, so let's look closer at the objective we are trying to maximize.

Recall that the lower bound is,

by definition, equal to the sum,

over the objects in the data set, of the expected value of

some logarithm, taken with respect to the variational distribution Q_i, right?

And recall that in

the plain expectation maximization algorithm it

was really hard to approximate this expected value by sampling,

because the Q in

this expected value used to be

the true posterior distribution of the latent variable T_i.

And this true posterior is complicated,

and we only know it up to a normalization constant.

So we have to use Markov chain Monte Carlo to sample from it, which is slow.

But now we approximate Q with a Gaussian,

with known parameters which we know how to obtain.

So for any object,

we can pass it through our convolutional neural network with parameters Phi,

obtaining the parameters m and s,

and then we can easily sample from this Gaussian,

from this Q, to approximate our expected value.

So now again, instead of computing this intractable expected value exactly,

we can easily approximate it with sampling, because sampling is now cheap:

it's just sampling from Gaussians.

And if we recall how the model is defined,

P of X_i given T is actually defined by another convolutional neural network.
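The cheap sampling step can be sketched as a plain Monte Carlo estimate; the integrand `f` below is just a stand-in for the log-joint inside the expectation, chosen so the exact answer is known and the estimate can be checked:

```python
import numpy as np

rng = np.random.default_rng(1)

# Monte Carlo estimate of E_q[f(t)] where q = N(m, diag(s^2)).
# Sampling from a diagonal Gaussian is cheap, unlike running MCMC
# on the true posterior.
m = np.array([1.0, -2.0])
s = np.array([0.5, 0.1])

samples = m + s * rng.normal(size=(100_000, 2))
estimate = (samples ** 2).sum(axis=1).mean()   # f(t) = ||t||^2 as a stand-in

# Exact value for this f: E[t_j^2] = m_j^2 + s_j^2, summed over dimensions.
exact = (m ** 2 + s ** 2).sum()
print(estimate, exact)  # the two should be close
```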

So the overall workflow will be as follows.

We start with a training image X.

We pass it through the first neural network, with parameters Phi,

and we get the parameters m and s of the variational distribution Q_i.

We sample one data point from this distribution,

which is something random:

it can be different depending on the random seed.

And then we pass this just-sampled vector

of latent variables T_i into the second part of our neural network,

into the convolutional neural network with parameters W. And this CNN,

this second part, outputs a distribution over images,

and we will try to make this whole structure return

images that are as close to the input images as possible.

So this thing looks really close to something called autoencoders in neural networks,

which are just neural networks that try

to output something as close as possible to the input.

And this model is called a variational autoencoder,

because in contrast to the usual autoencoders,

it has some sampling inside and it has some variational approximations.

And the first part of this network is called the encoder,

because it encodes the image into a latent code, or rather into a distribution over latent codes.

And the second part is called the decoder,

because it decodes the latent code into an image.
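The whole encode-sample-decode workflow can be sketched end to end; as before, toy linear maps stand in for the encoder (parameters Phi) and decoder (parameters W), and all dimensions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# End-to-end sketch of the VAE forward pass.
input_dim, latent_dim = 16, 4
enc_W = rng.normal(0, 0.1, size=(2 * latent_dim, input_dim))  # encoder, "phi"
dec_W = rng.normal(0, 0.1, size=(input_dim, latent_dim))      # decoder, "W"

def forward(x, rng):
    h = enc_W @ x
    m, log_s = h[:latent_dim], h[latent_dim:]
    s = np.exp(log_s)
    t = m + s * rng.normal(size=latent_dim)  # sample latent code t ~ q(t|x)
    x_hat = dec_W @ t                        # decoder mean mu(t)
    return x_hat

x = rng.normal(size=input_dim)               # a fake input image
x_hat = forward(x, rng)                      # its stochastic reconstruction
print(x_hat.shape)  # (16,)
```

Training would push `x_hat` toward `x`, which is exactly the reconstruction behaviour described above.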

Let's look at what would happen if we

forgot about the variance in the variational distribution Q.

So let's say that we set s to be always zero, okay?

So for any m of X, s of X is 0.

Then the variational distribution Q_i is actually deterministic:

it always outputs the mean value, m of X_i.

And in this case,

we are actually directly passing

this m of X into the second part of the network, into the decoder.

So this way we recover the usual autoencoder,

with no stochastic elements inside.

So this variance in the variational distribution Q is actually

what makes this model different from the usual autoencoder.

Okay, so let's look a little bit closer at the objective we're trying to maximize.

This lower bound, the variational lower bound,

can be decomposed into a sum of two terms,

because the logarithm of a product is the sum of logarithms, right?

And the second term in this equation equals

minus the Kullback-Leibler divergence between

the variational distribution Q and the prior distribution P of T_i,

just by definition. KL divergence is something we discussed in week two,

and also week three, and it's

something which measures some kind of difference between distributions.

So when we maximize this minus KL, we are actually trying to minimize the KL,

so we are trying to push the variational distribution

Q_i as close to the prior as possible.

And the prior is just the standard normal,

as we decided, okay?
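For a factorized Gaussian Q_i and a standard normal prior, this KL term has a well-known closed form, KL = 1/2 * sum_j (m_j^2 + s_j^2 - 1 - log s_j^2). A small sketch checking the formula against a Monte Carlo estimate of E_q[log q(t) - log p(t)]:

```python
import numpy as np

rng = np.random.default_rng(3)

# Closed-form KL( N(m, diag(s^2)) || N(0, I) ), summed over dimensions.
def kl_to_standard_normal(m, s):
    return 0.5 * np.sum(m**2 + s**2 - 1.0 - np.log(s**2))

m = np.array([0.5, -1.0])
s = np.array([0.8, 1.2])
closed = kl_to_standard_normal(m, s)

# Monte Carlo check: KL = E_q[log q(t) - log p(t)].
t = m + s * rng.normal(size=(200_000, 2))
log_q = -0.5 * np.sum(((t - m) / s) ** 2 + np.log(2 * np.pi * s**2), axis=1)
log_p = -0.5 * np.sum(t**2 + np.log(2 * np.pi), axis=1)
mc = np.mean(log_q - log_p)

print(closed, mc)  # the two estimates should roughly agree
```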

That was the second term; the first term can be interpreted as follows.

If for simplicity we set all the output variances to be 1,

then this log-likelihood of X_i given T_i is

just minus the squared Euclidean distance between X_i and the predicted mu of T_i,

up to an additive constant.

So this term is actually a reconstruction loss.

It tries to push X_i as close to the reconstruction as possible.

And mu of T_i is just the mean output of our neural network.

So if we consider our whole variational autoencoder,

it takes an image X_i as input,

and then its output is mu of T_i plus some noise.

And if the noise is constant,

then by training this model

we are just trying to make X_i as close to mu of T_i as

possible, which is basically the objective of the usual autoencoder.
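The claim that the unit-variance log-likelihood is just a squared distance plus a constant is easy to verify numerically; the vectors below are arbitrary illustrative values:

```python
import numpy as np

# With unit output variance, log p(x | t) = log N(x; mu(t), I)
#   = -0.5 * ||x - mu||^2 - (d/2) * log(2*pi),
# i.e. minus the squared Euclidean distance up to an additive constant.
def log_likelihood(x, mu):
    d = x.size
    return -0.5 * np.sum((x - mu) ** 2) - 0.5 * d * np.log(2 * np.pi)

x = np.array([1.0, 2.0, 3.0])
mu = np.array([1.1, 1.9, 3.2])

recon = -0.5 * np.sum((x - mu) ** 2)          # reconstruction term
const = -0.5 * x.size * np.log(2 * np.pi)     # mu-independent constant
match = np.isclose(log_likelihood(x, mu), recon + const)
print(match)  # True
```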

And note that we are also computing the expected value of

this reconstruction loss with respect to

Q_i, and Q_i is trying

to approximate the posterior distribution of the latent variables.

So we are saying that for the latent variables T_i that are likely to have caused X_i,

according to our approximation Q_i,

we want the reconstruction loss to be low.

So for these particular sensible T_i's, for this particular X_i,

we want the reconstruction to be accurate.

And this is kind of the same,

well, not the same, but really close to the usual autoencoder.

But the second part is what makes the difference.

This Kullback-Leibler divergence is something that

pushes Q_i to be non-deterministic, to be stochastic.

Recall the idea that if we

set the variance of Q_i to zero, we get the usual autoencoder, right?

But why, while training the model,

will it not choose to do that?

Because if you reduce the amount of noise inside, it will be easier to train,

so why will it choose not to inject noise into itself?

Well, because of this regularization.

This KL divergence

will not allow Q_i to be deterministic, because if

the variance of Q_i is zero, then this KL term is

just infinity, and we will not choose this kind of point in the parameter space.

This regularization forces the overall structure to have some noise inside.

And also notice that because of this KL divergence,

because we are forcing our Q_i to be close to the standard Gaussian,

we can now detect outliers. If we have

a usual image from the training data set, or something close to the training data set,

then if we pass this image through our encoder,

it will output a distribution,

Q_i, which is close to the standard Gaussian.

Because we trained it this way:

during training we tried to force

all those distributions to lie close to the standard Gaussian.

But for a new image which the network never saw,

for example of some suspicious behavior or something else,

the convolutional neural network of the encoder never saw these kinds of images, right?

So it can output a distribution over T_i as far away from the Gaussian as it wants,

because it wasn't trained to make that one close to the Gaussian.

And so by looking at the distance between

the variational distribution Q_i and the standard Gaussian,

you can understand how anomalous this point is, and you can detect outliers.
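This distance can be used directly as an outlier score; the sketch below reuses the closed-form Gaussian KL, and the two encoder outputs are hypothetical values chosen to mimic a familiar and an unfamiliar input:

```python
import numpy as np

# Outlier score: KL divergence from the encoder's output q_i to the
# standard normal prior. In-distribution inputs should score low,
# unfamiliar inputs higher.
def outlier_score(m, s):
    return 0.5 * np.sum(m**2 + s**2 - 1.0 - np.log(s**2))

# Hypothetical encoder output for a typical training image: near N(0, I).
score_usual = outlier_score(np.array([0.1, -0.05]), np.array([0.9, 1.1]))
# Hypothetical encoder output for an image it never saw: far from N(0, I).
score_odd = outlier_score(np.array([4.0, -3.0]), np.array([0.1, 5.0]))

print(score_usual < score_odd)  # True
```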

And also note that it's kind of easy to generate new points,

namely to hallucinate new data, in this kind of model.

Because your model is defined this way,

as an integral with respect to P of T,

you can generate a new point,

a new image, in two steps.

First of all, sample T_i from the prior,

from the standard normal, and then just

pass this sample from the standard Gaussian through

your decoder network to decode your latent code into an image,

and you will get some new samples, for example a fake cat picture or something.
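The two-step generation procedure can be sketched with the same toy decoder as before (a linear map standing in for the CNN with parameters W, with made-up dimensions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two-step generation: sample t from the standard normal prior,
# then decode it into an image.
latent_dim, image_dim = 4, 16
dec_W = rng.normal(0, 0.1, size=(image_dim, latent_dim))  # toy decoder "W"

def generate(n):
    t = rng.normal(size=(n, latent_dim))  # step 1: sample t ~ p(t) = N(0, I)
    return t @ dec_W.T                    # step 2: decode to mean images mu(t)

fakes = generate(3)
print(fakes.shape)  # (3, 16)
```

Note that no encoder is needed at generation time; only the decoder and the prior are used.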
