0:00

[SOUND] In the previous module, we covered the expectation maximization algorithm for the Gaussian mixture model.

In this module, we'll talk about the general form of expectation maximization, which will allow you to train almost any latent variable model you can think of.

So it's going to be intense, but it's going to be worth it in the end.

But to start, we will need a few mathematical tools to work with the inequalities we're going to derive.

So the first tool is connected to concavity and Jensen's inequality.

What is a concave function?

Well, a concave function f is a function with the following property: for any two points a and b, and for any point in between them, the value of the function at this point, the blue point, is greater than or equal to the value of the segment that connects f(a) and f(b), the orange point.

Or to put it more formally, for any two points a and b and for any weight alpha from 0 to 1, if you mix these two points with weight alpha, you will get a point in the segment between them.

And f of this mixture has to be greater than or equal to the mixture of f's, so the blue point has to be greater than or equal to the orange point.

Or in yet other words, for any two points a and b, the whole segment that connects f(a) and f(b) should lie below the graph of f.

So if you have a concave function, like the logarithm for example, then you can prove the same kind of property but for more points.

So for example, for any three points, and for any three weights alpha which are non-negative and sum up to 1, we can prove that f of the mixture of these three points with weights alpha is greater than or equal to the mixture of f's.

And you can generalize this thing.

So first of all, you can treat these alphas as probabilities, because why not: they are non-negative and sum up to 1.

Let's say that there is a random variable t which takes three values, a1, a2 and a3, with probabilities alpha.

And now we can rewrite this inequality: f of the expected value of t is greater than or equal to the expected value of f of t. This is just the same thing, by the definition of the expected value.

So the average with weights being the probabilities. And this is something called Jensen's inequality, which, if you summarize everything, states that for any concave function f and for any probability distribution p of t, f of the expected value of t is greater than or equal to the expected value of f of t.

And this holds true whether you have three points, or your t takes arbitrarily many values, even infinitely many.

In that case, the average turns into an integral, but the inequality still holds.

So, f of the expected value is greater than or equal to the expected value of f, for any concave function f.
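This inequality is easy to check numerically. Here is a minimal sketch (the specific points and weights are made up for illustration) that verifies f(E[t]) >= E[f(t)] for the concave function f = log:

```python
import math
import random

# Check Jensen's inequality f(E[t]) >= E[f(t)] for the concave f = log.
# The points and weights below are arbitrary, just for illustration.
random.seed(0)
points = [1.0, 2.0, 5.0]                  # values a1, a2, a3 of t
raw = [random.random() for _ in points]
alphas = [w / sum(raw) for w in raw]      # non-negative, sum up to 1

f_of_mean = math.log(sum(a * x for a, x in zip(alphas, points)))
mean_of_f = sum(a * math.log(x) for a, x in zip(alphas, points))

print(f_of_mean >= mean_of_f)  # True
```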

And the final mathematical concept we will need is something called Kullback-Leibler divergence, which is a way to measure the difference between two probability distributions.

So say you have two Gaussians. One of them is located at 0, and the other one is located at 1. And both have variance 1, okay?

So how do we measure how different these two Gaussians are?

Well, a natural thing to do is to measure the distance between their parameters, and it's 1 here, because the variances are the same and the locations differ by 1.

Well, it may be a reasonable measure of distance here, why not?

But let's consider another two Gaussians which have the same locations, 0 and 1, but variance 100. Both of them have variance 100.

In this case, the difference between the parameters is the same: it's 1. But intuitively, it seems like these two Gaussians, the green and the red one, are much closer to each other than the first two.

So can we build a better way to measure the difference between Gaussians, or between any probability distributions?

Well, Kullback-Leibler divergence is something which tries to solve this problem.

For example, for these particular distributions, the KL divergence between the first two, the blue and the orange Gaussian, will be 0.5, and the KL divergence between the green and the red one will be 0.005.

So it kind of reflects our intuition that the second pair of Gaussians are much closer to each other.
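For univariate Gaussians the KL divergence has a well-known closed form, log(sigma2/sigma1) + (sigma1^2 + (mu1 - mu2)^2) / (2 sigma2^2) - 1/2, and a short sketch with it reproduces both numbers from the example:

```python
import math

def kl_gaussians(mu1, var1, mu2, var2):
    # Closed-form KL( N(mu1, var1) || N(mu2, var2) ) for 1-D Gaussians.
    return (0.5 * math.log(var2 / var1)
            + (var1 + (mu1 - mu2) ** 2) / (2 * var2)
            - 0.5)

print(kl_gaussians(0, 1, 1, 1))      # 0.5, the narrow pair
print(kl_gaussians(0, 100, 1, 100))  # ~0.005, the wide pair
```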

So let's look at the definition of the Kullback-Leibler divergence. It may look a little bit scary, but the idea is really simple.

So what do we have here? It's an integral of q(x) times the logarithm of something. Well, an integral against q(x) is just an expected value of that function. So here we have the expected value of the logarithm of the ratio, right?

And the logarithm of the ratio measures how different these two distributions are at the current point x, on the log scale.

So what we have here is basically how different these two distributions are at any point of the x-axis, on the log scale. And then we're taking the expected value of this quantity, so we're kind of averaging across the whole space, across the whole line of real numbers.

And we're getting something like the mean difference between these two distributions.
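This definition can also be evaluated directly. A rough sketch that approximates the integral of q(x) log(q(x) / p(x)) on a grid, for the two unit-variance Gaussians from the example:

```python
import math

def gauss_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Riemann-sum approximation of KL(q || p) = integral of q(x) log(q(x)/p(x)) dx,
# with q = N(0, 1) and p = N(1, 1), on a grid from -10 to 10.
dx = 0.001
kl = sum(gauss_pdf(x, 0, 1) * math.log(gauss_pdf(x, 0, 1) / gauss_pdf(x, 1, 1)) * dx
         for x in (-10 + i * dx for i in range(20000)))
print(round(kl, 3))  # close to the 0.5 quoted above
```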

So that's the definition of the Kullback-Leibler divergence. It has a few properties which we'll use in the following videos.

So, first of all, it's non-symmetric, right? If you swap p and q, you'll get a different expression. And this is one of the reasons why it's not a proper distance between distributions in the strict mathematical sense. But anyway, it's useful for measuring some kind of distance or difference between distributions.

Another property is that for any two distributions that coincide, that are the same, the divergence between them is 0.

And this is really easy to see, because if you substitute p with q in the expression above, you will have the expected value of the logarithm of q divided by q. And q divided by q is 1 at each point, and the logarithm of that is 0. So we have an expected value of 0, which is also 0.

So the Kullback-Leibler divergence between a distribution and itself is 0.

And finally, the KL divergence is non-negative for any two distributions. And that's kind of easy to prove, because we can look at minus the KL divergence, which equals the expected value of the logarithm of the ratio.

We can put the minus inside the expected value, because the expected value is linear.

And then we can use the properties of the logarithm: minus the logarithm of something is the logarithm of its inverse.

And now we have the expected value of a logarithm. And as you may know, the logarithm is a concave function. It's kind of easy to prove; you can look it up on Wikipedia or so, or just believe me.

But anyway, this means that we can apply Jensen's inequality here. And Jensen's inequality says that the expected value of the logarithm is less than or equal to the logarithm of the expected value.

And finally, the expected value of p divided by q is just equal to the integral of q times p divided by q, where the q cancels. And the integral of p is always 1 for any distribution p, because that's a property of a distribution.

And the logarithm of 1 is 0, so we have just proved that minus the KL divergence is always non-positive, which means that the KL divergence itself is non-negative.
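These two properties are easy to confirm numerically as well; here is a small sketch over random discrete distributions:

```python
import math
import random

# Sanity checks: KL(q || q) == 0 and KL(q || p) >= 0
# for random discrete distributions on 5 points.
random.seed(1)

def random_dist(n):
    w = [random.random() + 1e-9 for _ in range(n)]
    return [x / sum(w) for x in w]

def kl_discrete(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

for _ in range(1000):
    q, p = random_dist(5), random_dist(5)
    assert kl_discrete(q, q) == 0.0   # a distribution against itself
    assert kl_discrete(q, p) >= 0.0   # non-negativity, via Jensen's inequality
print("all checks passed")
```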

To summarize, KL divergence is a way to compare two distributions with each other. It's non-symmetric, it equals 0 when we compare a distribution to itself, and it's always non-negative.

[SOUND]

[MUSIC]

Â