Now let's look at the popular algorithm for hard clustering called K-means and see if we can connect it to the general form of expectation maximization. It would be really cool to derive it as a special case of EM and fit it into the general framework. So, hard clustering: we have a data set of points, and we want to assign each data point to one of the clusters. Here we can't say that a point belongs to several clusters simultaneously; each data point should be assigned to one and just one cluster, okay? The K-means algorithm suggests solving this hard clustering problem in the following way. First of all, let's initialize the parameters randomly, and here the parameters are just the means, the locations of the clusters. We don't have the weights pi and we don't have the shapes Sigma for the clusters, just the locations. Then, in iterations, we repeat two sub-steps until convergence. The first sub-step is, for each data point, to find the closest cluster and assign the data point to that cluster. So each data point is assigned to the cluster closest to it, where by closest I mean according to Euclidean distance. On the second sub-step, K-means suggests updating the parameters as follows: to update, for example, the first cluster centroid Mu_1, we find all the data points which were assigned to the first cluster on the first sub-step and then take their average, which becomes the updated center of this cluster. And then we repeat these two sub-steps until convergence. If you look carefully at this algorithm, it looks really similar to EM, right? We also have a random initialization, and then in iterations we repeat two steps which look really close to the ones we had in EM: first we compute some property for each data point, and then we update the parameters using these properties.
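The two sub-steps described above can be sketched in a few lines of code. This is a minimal illustration, not the lecture's exact notation; the function name, the NumPy usage, and the choice to initialize centroids by sampling k data points are my own assumptions:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means sketch: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k data points as the initial centroids Mu.
    mu = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Sub-step 1: assign each point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)  # (n, k)
        c = dists.argmin(axis=1)
        # Sub-step 2: move each centroid to the mean of the points assigned to it
        # (an empty cluster keeps its old centroid).
        new_mu = np.array([X[c == j].mean(axis=0) if np.any(c == j) else mu[j]
                           for j in range(k)])
        if np.allclose(new_mu, mu):
            break  # converged: assignments and centroids no longer change
        mu = new_mu
    # Recompute assignments with the final centroids.
    c = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1).argmin(axis=1)
    return mu, c
```

Running it on two well-separated blobs recovers one centroid per blob.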
Let's see if we can somehow restrict our Gaussian Mixture Model and the EM algorithm applied to it so as to obtain exactly this K-means algorithm, to prove that K-means is just a special case of EM applied to a Gaussian Mixture Model, or maybe to some other model. First of all, we have to get rid of the additional parameters. So let's say that the covariance matrices, the shapes, are all identity matrices, which means that each Gaussian is just a circle with a fixed radius, and that the prior weights pi are all uniform, so they equal one divided by the number of clusters. This way only Mu remains as the parameter of our restricted Gaussian Mixture Model, and the density of each data point given that we know its cluster now looks like this: some normalization constant times the exponent of minus 0.5 times the squared Euclidean distance between x_i and Mu_c, where Mu_c is the center of Gaussian number c. You can prove that this is exactly the formula for a multidimensional, multivariate Gaussian whose covariance matrix Sigma equals the identity matrix. Now, let's look at the E and M steps for this restricted Gaussian Mixture Model. On the E-step, the expectation maximization algorithm suggests finding the minimum of the KL divergence between the variational distribution q and the posterior distribution p of t given the data and the parameters. And if we do that, we will find not hard assignments but soft assignments, right? Just as we obtained in the Gaussian Mixture Model: for each data point, this q says with which probability it belongs to one cluster or another, one Gaussian or another. And this is not what we want. We want to derive K-means from this model, so we want this step to produce hard clustering. How can we do that? Well, let's find not the overall optimal q, but the best q among some family of simple qs.
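As a small sketch of what the unrestricted (soft) E-step computes in this restricted model: since the Gaussian normalizer and the uniform prior 1/K are the same for every cluster, they cancel when we normalize, leaving a softmax over minus half the squared distances. The helper name and NumPy usage are my own assumptions:

```python
import numpy as np

def soft_responsibilities(X, mu):
    """Soft E-step of the restricted GMM (identity covariances, uniform priors).

    q(t_i = c) is proportional to exp(-0.5 * ||x_i - mu_c||^2); the Gaussian
    normalization constant and the uniform prior pi_c = 1/K cancel when we
    normalize over c, so only the squared distances matter.
    """
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (n, k)
    logits = -0.5 * sq_dists
    logits -= logits.max(axis=1, keepdims=True)  # subtract max for stability
    q = np.exp(logits)
    return q / q.sum(axis=1, keepdims=True)
```

A point sitting on a centroid gets nearly all its probability mass there, while a point exactly between two centroids is split 50/50, which is precisely the soft behavior we want to get rid of.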
So let's search for q among all delta functions, where a delta function is a probability distribution that takes one value with probability one and all other values with probability zero; it's completely certain about what the outcome will be. Now, imagine that the posterior distribution looks like this: our data point belongs to cluster number two with probability 0.6 and to cluster number one with probability 0.4. What do you think the delta-function approximation will be? That is, what will be the closest q to it, according to KL divergence, among the family of delta functions? Well, it will be a distribution like this: it puts all the probability mass on the class that had the highest probability according to our posterior. So by restricting the set of possible qs on the E-step, we are actually approximating the exact posterior distribution with delta functions. And this way the optimal q on the E-step will look like this, right? It's some delta function: it has probability one when t_i equals some predefined value c_i, and zero otherwise. And what is c_i, this predefined value? Well, it's the most probable configuration according to our posterior distribution. So we have to find the c_i which maximizes the posterior probability of the latent variable t_i given the data and the parameters theta. Recall that the posterior distribution itself is proportional to the joint distribution p(x_i, t_i | theta); there is also some normalization constant, but we don't care about it, because we want to maximize this thing with respect to t_i, that is, with respect to the value c of t_i, and the normalization doesn't change anything. And if you recall, this joint distribution equals a normalization constant times p of x_i given t_i, which is the exponent of minus 0.5 times the squared Euclidean distance, times the prior, and we agreed that the prior is just uniform, so it doesn't depend on c and we also don't care about it when we maximize.
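The delta-function approximation can be verified in one line of reasoning: if q puts all its mass on class c, then KL(q || p) = -log p(c), so minimizing the KL divergence over delta functions means maximizing the posterior probability of the chosen class. A tiny sketch (the function name is hypothetical, assuming NumPy):

```python
import numpy as np

def best_delta(posterior):
    """Closest delta function to a discrete posterior under KL(q || p).

    For q concentrated on a single class c, KL(q || p) = -log p(c), which is
    minimized by picking the class with the highest posterior probability.
    """
    c = int(np.argmax(posterior))
    q = np.zeros_like(posterior)
    q[c] = 1.0  # all probability mass on the most probable class
    return q
```

For the posterior [0.4, 0.6] from the example, all the mass goes to cluster number two.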
So finally, we want to maximize this expression with respect to the value of t_i, and we have the normalization 1/Z and the prior pi_c, both of which don't depend on c. So we can throw them away and nothing will change. And maximizing the exponent of minus something is the same as minimizing that something, so we can just minimize 0.5 times the squared Euclidean distance. Finally, we get an expression like this: our optimal q from the E-step is just a delta function which takes the value of the arg minimum of the Euclidean distance, so the closest cluster center, with probability one, and it has probability zero to be anything else. Okay? So this is exactly what we have in K-means, which means that if you take this restricted Gaussian Mixture Model, apply EM to it, and restrict the possible values of the variational distribution q on the E-step to be delta functions, you get exactly the first sub-step of the K-means iteration. So now, let's derive the formula for the M-step on the blackboard and see how the formula for the second sub-step of K-means connects to the M-step of expectation maximization applied to this restricted Gaussian Mixture Model with this particular choice of q, a delta function.
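Putting the whole chain together, the hard E-step of the restricted model reduces to a nearest-centroid assignment, which is exactly K-means' first sub-step. A minimal sketch of that conclusion (helper name and NumPy usage are my own assumptions):

```python
import numpy as np

def hard_e_step(X, mu):
    """Hard E-step of the restricted GMM: c_i = argmin_c ||x_i - mu_c||^2.

    With identity covariances and uniform priors, maximizing the posterior
    p(t_i = c | x_i, theta) over c is the same as minimizing the squared
    Euclidean distance to the centroid Mu_c: the 1/Z and pi_c terms drop out,
    and maximizing exp(-0.5 * d^2) is minimizing d^2.
    """
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)  # (n, k)
    return sq_dists.argmin(axis=1)
```

Each point simply lands in the cluster whose center is nearest to it, with probability one.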