Hi, welcome to week three. This time we will see an algorithm called variational inference. This is an algorithm for computing the posterior probability approximately. But first of all, let's see why we even care about computing an approximate posterior. So here we see Bayes' formula, which helps us compute the posterior over the latent variables given the data. We will denote the posterior probability as p*(z). When the prior is conjugate to the likelihood, it is really easy to compute the posterior. However, for most other cases, it is really hard. One important case is variational autoencoders, which we will see in week five. In variational autoencoders, we model the likelihood with neural networks. So it would be a normal distribution over the data whose mean is some neural network mu(z) and whose variance is some other neural network sigma^2(z). In this case there is no conjugacy, and we can't compute the posterior using Bayes' formula. But do we actually need the exact posterior? For example, here is some distribution that doesn't seem to belong to any known family of distributions. However, we could approximate it with a Gaussian, and for most practical purposes it would be a really good approximation. For example, it would match the mean, the variance, and roughly the shape. And so throughout this week, we'll see a method that helps us find the best approximation of the full posterior. It works as follows: first of all, we select some family of distributions Q. We'll call this the variational family. For example, this could be the family of normal distributions with arbitrary mean and a diagonal covariance matrix. What we do next is try to approximate the full posterior p*(z) with some variational distribution q(z), and we find the best-matching distribution using the KL divergence.
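As a minimal sketch of the idea (my own illustration, not code from the lecture; the target density and grid ranges are hypothetical), we can take Q to be the family of one-dimensional Gaussians N(mu, sigma^2) and find the member closest in KL divergence to a non-Gaussian target by brute-force search over (mu, sigma) on a grid:

```python
import numpy as np

# Discretize z so we can treat densities as vectors and integrals as sums.
z = np.linspace(-6, 6, 2001)
dz = z[1] - z[0]

# Hypothetical target: a skewed two-component mixture, standing in for
# an intractable posterior p*(z).
p = (0.7 * np.exp(-0.5 * (z - 0.0) ** 2)
     + 0.3 * np.exp(-0.5 * ((z - 2.0) / 1.5) ** 2))
p /= p.sum() * dz  # normalize on the grid

def kl(q, p):
    """KL(q || p) approximated on the grid; skip points where q is ~0."""
    mask = q > 1e-12
    return np.sum(q[mask] * np.log(q[mask] / p[mask])) * dz

# Grid search over the variational family Q = {N(mu, sigma^2)}.
best = None
for mu in np.linspace(-1, 3, 81):
    for sigma in np.linspace(0.5, 3, 51):
        q = np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        d = kl(q, p)
        if best is None or d < best[0]:
            best = (d, mu, sigma)

print("best KL %.4f at mu=%.2f sigma=%.2f" % best)
```

Real variational inference replaces this grid search with calculus or gradient-based optimization, but the objective being minimized is exactly the same.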
So we try to minimize the KL divergence between q and p* over the family of distributions Q. Depending on which Q we select, we can obtain different results. If Q is too small, then the true posterior will not lie in it, and we'll end up with some distribution that does not match the full posterior. The distance between the full posterior and the distribution we found would be exactly the KL divergence. If we select a larger Q, then the approximate distribution could match the posterior exactly. However, for larger Qs, variational inference is harder to compute. For example, if we select Q as the family of all possible distributions, the only way to compute the posterior would be, for example, Bayes' formula, and we've already seen that it is hard. There is one problem with this approach. As we'll see later, we'll have to evaluate p*(z) at some points. However, we can't evaluate it even at one point, because that requires computing the evidence, p(x), which is sometimes really hard. However, there is a nice property of the KL divergence that we'll see now. So here is our optimization objective: the KL divergence between our variational distribution and the normalized posterior. We'll write the posterior as p*(z) = p̂(z)/Z, where p̂ is the unnormalized posterior and the normalization constant Z equals the evidence p(x). The KL divergence is by definition the integral of q(z) times the logarithm of the ratio between the first distribution and the second. Note that we can take Z out of this ratio, and we get the following formula: two terms, where the first is the KL divergence between the variational distribution and the unnormalized p̂, and the second is some integral. In that second term we can take the logarithm of Z out of the integral, and what is left is the integral of q(z), which equals 1. So finally we have a KL divergence plus some constant, log Z. And since we are minimizing this objective with respect to q, we can drop the constant, since it does not depend on the variational distribution.
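This identity is easy to check numerically. Here is a minimal sketch (my own illustration, not the lecturer's code) on a discretized one-dimensional example: the KL to the normalized posterior p̂(z)/Z and the "KL" to the unnormalized p̂(z) differ only by the constant log Z, so they have the same minimizer in q:

```python
import numpy as np

# Discretize z so integrals become sums over the grid.
z = np.linspace(-5, 5, 1001)
dz = z[1] - z[0]

p_hat = np.exp(-0.5 * (z - 1.0) ** 2)   # unnormalized posterior p̂(z)
Z = p_hat.sum() * dz                    # evidence / normalization constant
p_star = p_hat / Z                      # normalized posterior p*(z)

# Some variational distribution q(z), here a standard Gaussian.
q = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)

def integral(f):
    return f.sum() * dz

kl_normalized = integral(q * np.log(q / p_star))    # KL(q || p*)
kl_unnormalized = integral(q * np.log(q / p_hat))   # same, but against p̂

# The two objectives differ by exactly log Z, a constant in q.
print(kl_normalized, kl_unnormalized + np.log(Z))
```

Since log Z does not depend on q, any optimizer that minimizes the unnormalized objective finds the same q as one minimizing the true KL, which is why we never need to compute the evidence.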
And so here is our final objective. In the next video, we'll see a method called mean-field approximation.