Let's discuss how we can apply Markov chain Monte Carlo to train the Latent Dirichlet Allocation model, or LDA for short. We already discussed the LDA model in the previous week, on variational inference, and derived some complicated formulas to do the training. But now, let's see how Markov chain Monte Carlo can be applied to the same problem, and what the pros and cons of this approach are. Recall that LDA is a model for finding topics in a corpus of documents, and the idea of LDA is that each document has a distribution over topics. For example, one document's distribution can be 80% about cats and 20% about dogs. And then each topic is a distribution over words. For example, the topic about cats can, in principle, generate any word, but it is much more likely to generate words like "cat" or "meow" than others. So, when we look at a particular document, for example "cat meowed on the dog", we can imagine that this document was generated as follows. First of all, we decided which distribution over topics we would use. Here, it's 80% for cats and 20% for dogs. Then, for each word, we must decide on a topic. So for our first word, we may flip a biased coin, and with probability 80% it will be cats. Say for our first word we happened to draw the topic cats. Now we have to generate the first word itself: we know the topic, so we have to generate a word from it. We could generate any word from our vocabulary, but we happened to draw the word "cat" as the first one. Then we repeat: we again sample a topic for the second word, and let's say it again happened to be cats, and this time we generated the word "meow". And for our last word, let's say we happened to draw the topic dogs and generate the word "dog". Notice that for this model, the order of words doesn't matter. It's a bag-of-words approach.
But anyway, such models are more than flexible enough to discover meaningful topics in the documents. To put it a little more mathematically, each document d has a distribution over topics, Theta_d, which is 80% and 20% in this case, and each word has a topic, which we don't know; it's a latent variable. For words from 1 to n, Z_i is the topic of the i-th word in document d, and these topics are generated from the distribution Theta_d. Each word itself is then generated from the corresponding topic. So if the first topic is about cats, we generate word one from the distribution over words for the topic cats. This distribution is called Phi, so Phi_cats is the distribution over words for the topic cats. Thus we generate the words themselves, which we observe and use to train our model, that is, to find all the necessary latent variables and parameters. In the variational inference week, we discussed how to use the expectation maximization algorithm to train this LDA model. On the E-step, we tried to find the posterior on the latent variables Z and Theta, but it was too hard to do analytically, so we decided to approximate: we searched for an approximation of the posterior in the family of factorized distributions, which is the idea of variational inference. Then on the M-step, we maximized the expected value of the logarithm of the joint probability with respect to this approximate posterior distribution. And we spent quite some time deriving the necessary analytical formulas for the E and M steps of this EM algorithm for LDA. To summarize the model we have here for LDA: we have known data W, the words in our documents; we have unknown parameters Phi, the distribution over words for each topic; and we have unknown latent variables Z and Theta.
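The generative story above can be sketched in a few lines of code. This is a minimal toy illustration, not the lecture's actual data: the vocabulary, the two topics, and all the probability values are made up for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative names and numbers, not from the lecture):
vocab = ["cat", "meow", "dog", "bark", "the"]
n_topics = 2  # topic 0 = "cats", topic 1 = "dogs"

# phi[k] is the distribution over words for topic k
phi = np.array([
    [0.45, 0.40, 0.05, 0.02, 0.08],  # "cats" topic favors cat/meow
    [0.05, 0.02, 0.50, 0.35, 0.08],  # "dogs" topic favors dog/bark
])

# theta_d: the document's distribution over topics (80% cats, 20% dogs)
theta_d = np.array([0.8, 0.2])

def generate_document(n_words):
    """Generate one document word by word: sample a topic z from
    theta_d, then sample the word from phi[z]."""
    words, topics = [], []
    for _ in range(n_words):
        z = rng.choice(n_topics, p=theta_d)    # topic for this word
        w = rng.choice(len(vocab), p=phi[z])   # word from that topic
        topics.append(int(z))
        words.append(vocab[w])
    return words, topics

words, topics = generate_document(5)
print(words, topics)
```

Note that the order of the generated words carries no information, which is exactly the bag-of-words property mentioned above.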
So, for each word, Z gives us the topic of this word, and for each document, Theta gives us the distribution over topics for this document. It makes sense that we decided Phi would be a parameter, and Z and Theta would be latent variables, because the number of variables Z and Theta depends on the amount of training data we have: for each new document, we have to add one new Theta and a set of new Zs. So it makes sense to treat them as latent variables, so that our parameterization does not grow with the training data size. But Phi has a fixed size, the number of topics times the number of words in the dictionary, so we treated it as a parameter in our variational inference approach to LDA. But let's see how we can deal with this model if we go full Bayesian, that is, if we treat Phi as a latent variable itself. Now we don't have any parameters at all; we don't want to estimate any point values, but we want to marginalize out all the latent variables and make predictions. We will apply Markov chain Monte Carlo to this full Bayesian model of LDA. Although we could have applied Markov chain Monte Carlo inside the EM algorithm, let's just use this full Bayesian model as an illustration. If we want to apply Markov chain Monte Carlo, we just need to sample all latent variables from the posterior distribution, for example by using Gibbs sampling. With Gibbs sampling, we start with some initialization of all our latent variables, and then we sample the first coordinate of the latent variable Phi on the first iteration from its conditional distribution, conditioned on the initialized values of everything else. Then we do the same thing for the second coordinate, conditioning on the dimension we just generated on the previous sub-step, and also on the rest of the initialization. We repeat this for all coordinates to obtain the full Phi, and we do the same thing for Theta and Z.
So for Theta, we generate its i-th coordinate by considering the conditional distribution of Theta_i given all the Phis we have already generated, all the Thetas we have already generated, and the initial values for the rest of the Thetas and for the Zs. And the same thing for Z. This is the usual Gibbs sampling, and it's really easy: you can implement it in something like five lines of Python, nothing complicated here. We run it for a few hundred iterations, wait until convergence, and then obtain a sample from the desired posterior distribution. Once we have at least one sample from this posterior, we can use it as an estimate of our parameters. So one sample from the posterior on Phi, for example, can be used as the trained value of Phi. Or you can average a few Phis if you want, or you can make predictions with an average over a few Phis. So you can use the samples to do almost anything with your model. You can do that, and it's really easy. But the problem is that it will not be very efficient: compared to variational inference, this approach will not be very fast. Let's see what we can do to make it faster in this particular case. The idea here is to look closely at the model and try to marginalize something out. Basically, Gibbs sampling is trying to integrate something: we are substituting an expected value, which is an integral, with Monte Carlo samples. And if we can integrate out some parts analytically, and then use Monte Carlo only for the rest, then the whole scheme will probably be more efficient. Let's see if we can integrate out some parts of the model analytically. In this model, we have a prior distribution on Theta, and Theta is just the probabilities of topics for each document. For example, Theta for the first document may be 80% for cats and 20% for dogs. And the prior distribution on Theta is a Dirichlet distribution, which is natural.
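The coordinate-by-coordinate Gibbs scheme described above can be illustrated on a toy problem. The LDA conditionals themselves depend on the counts in the model, so as a stand-in this sketch runs Gibbs sampling on a correlated 2D Gaussian, where both conditional distributions are known in closed form; the correlation value and iteration counts are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.9                  # correlation of the toy 2D Gaussian target
n_iters, burn_in = 5000, 500

x, y = 0.0, 0.0            # initialization of the "latent variables"
samples = []
for it in range(n_iters):
    # Sample each coordinate from its conditional distribution, given
    # the current value of the other (for a Gaussian target these
    # conditionals are Gaussian too).
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    if it >= burn_in:      # wait until convergence, then collect
        samples.append((x, y))

samples = np.array(samples)
print(samples.mean(axis=0))  # should be close to the true mean (0, 0)
```

The structure is exactly the loop from the lecture: initialize, resample one coordinate at a time from its full conditional, discard a burn-in period, and then use the remaining samples (or their average) as estimates.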
I mean, it's the standard prior on this kind of latent variable, which is a vector of probabilities for a categorical variable. Then we have a conditional distribution of Z given Theta, which is just Theta itself: the topic of some word will be, for example, cats with probability Theta_cats. Let's see what we can do with this model. First of all, notice that these two distributions are conjugate, which basically means that we can find the posterior on Theta analytically. And it's really easy: you can look up the formulas on Wikipedia, or you can just derive them yourself; nothing complicated here, because of the conjugate distributions. Then we may try to marginalize out Theta, that is, to find the marginal distribution of Z. At first this looks complicated, because it has an integral inside, but it turns out that since we have the posterior on Theta, we can represent it differently. The formula comes just from the definition of conditional probability: if you multiply both sides by the probability of Theta given Z, you get the usual definition of conditional probability. And now, on the right-hand side, you know all the terms. You know the likelihood P(Z | Theta), which is just Theta itself; you know the prior on Theta, which is just a Dirichlet distribution; and you know the posterior, because these distributions are conjugate and you can compute it. These things are easy. Now, we can do the same thing with Phi and W. Notice that the prior distribution on Phi is conjugate to the conditional distribution of W, and thus you can easily obtain the posterior distribution on Phi. In the same way, you can marginalize out Phi: you can find P(W | Z) by using this formula without any integration, because you know the posterior analytically. And finally, you can multiply P(Z) by P(W | Z) to obtain an unnormalized version of the posterior on Z given your data. And now all that is left is, for example, to sample from this posterior distribution.
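The conjugacy step is worth seeing numerically. With a Dirichlet prior on Theta and categorical topic assignments Z, the posterior on Theta is again Dirichlet, with the word-per-topic counts added to the prior pseudo-counts. The prior value and the toy assignments below are made up for illustration.

```python
import numpy as np

# Dirichlet prior over theta_d (topic proportions), 2 topics: cats, dogs.
alpha = np.array([1.0, 1.0])          # symmetric prior (illustrative value)

# Suppose the topic assignments z for one document's words are known:
z = np.array([0, 0, 1, 0, 0])         # 4 words from "cats", 1 from "dogs"
counts = np.bincount(z, minlength=2)  # n_k = number of words with topic k

# Conjugacy: Dirichlet prior + categorical likelihood => Dirichlet posterior
posterior_alpha = alpha + counts      # Dirichlet(alpha_k + n_k)
posterior_mean = posterior_alpha / posterior_alpha.sum()
print(posterior_alpha, posterior_mean)
```

So the posterior on Theta is available in closed form just by counting, which is exactly what lets us marginalize Theta (and, by the same argument, Phi) out of the model analytically.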
Notice that we don't actually know the posterior exactly, because we don't know the normalization constant and it's really expensive to compute. But it doesn't matter for Gibbs sampling: we can sample Z anyway. And this gives us a very efficient scheme, which is called collapsed Gibbs sampling, because we collapsed some of the latent variables, marginalizing them out analytically. This approach allows you to sample only the Zs, and thus gives a more efficient procedure overall. And you may ask: if I have samples only from the posterior on Z, how can I find the values of Phi, for example, which is something I care about? Well, because you know all the posterior distributions, you can obtain these kinds of things easily from the samples from the posterior on Z. So the probability of Phi given W, for example, which you may be interested in, can be computed by marginalizing out Z from the joint posterior of Phi and Z given W. And this is an expected value with respect to the posterior on Z given W, so you can obtain it from samples: you can sample a few Zs from this posterior distribution and then average to get an estimate of the posterior on Phi, and thus the actual parameters Phi, which correspond to the probability of each word for each topic. If you want to know more details about this kind of collapsed Gibbs sampling for LDA, look into the original paper that introduced it; it's really cool. Check out the additional reading material if you're interested. But the bottom line here is that if you run collapsed Gibbs sampling for LDA, it will be even a little bit more efficient than variational inference. On the plot, you can see basically time, or the number of operations performed, on the X-axis, and some measure of error on the Y-axis; the lower, the better.
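A minimal collapsed Gibbs sampler for LDA can be sketched as follows. With Theta and Phi integrated out, the conditional for one assignment is proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta), where the n's are counts that exclude the current word. The corpus, hyperparameters, and iteration count below are toy choices, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy corpus of word ids; V = vocabulary size, K = number of topics.
docs = [[0, 1, 0, 2], [2, 3, 3, 2], [0, 1, 2, 3]]
V, K = 4, 2
alpha, beta = 0.5, 0.5  # Dirichlet hyperparameters (illustrative)

# Count tables: n_dk[d,k] words in doc d with topic k, n_kw[k,w] times
# word w got topic k, n_k[k] total words with topic k.
n_dk = np.zeros((len(docs), K))
n_kw = np.zeros((K, V))
n_k = np.zeros(K)

# Random initialization of topic assignments z.
z = []
for d, doc in enumerate(docs):
    zd = []
    for w in doc:
        k = int(rng.integers(K))
        zd.append(k)
        n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    z.append(zd)

# Collapsed Gibbs: Theta and Phi are marginalized out; only z is resampled.
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove word i from the counts.
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # p(z_i = k | z_-i, w) up to the normalization constant.
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = int(rng.choice(K, p=p / p.sum()))
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

# Recover an estimate of Phi (topic-word distributions) from the counts.
phi_hat = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
print(phi_hat)
```

Note how the normalization constant of the posterior on Z never appears: we only ever normalize the K-dimensional vector p, and Phi is recovered afterwards from the counts, just as described above.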
And you can see that Gibbs sampling is performing a little bit better than variational inference, and also better than something called EP, or Expectation Propagation, which we will not talk about in this course. The bottom line is that we were able to obtain a really efficient scheme for solving the LDA model without all the trouble of variational inference. We didn't have to derive a large number of complicated formulas; we just applied the usual scheme of Gibbs sampling. It wasn't very efficient at first, but then we marginalized out some parts, which was also really easy, and we finally got an efficient scheme for LDA. One more thing here: generally, Markov chain Monte Carlo is not well suited for mini-batches, and so for large data, because it has to process the whole dataset on each iteration. But sometimes it works: some recent works use Markov chain Monte Carlo with mini-batches, and sometimes this fails miserably, but in this case the authors just used mini-batches and it works quite nicely.