[SOUND] So in this video, we are going to discuss,

not in full detail, but in a somewhat hand-wavy way, how to apply the

expectation maximization algorithm to probabilistic principal component analysis.

So let's look at the E-step first.

On the E-step we have to set q to be the posterior

distribution on our latent variable t_i.

So p of t_i given x_i and theta.

And in general, it's proportional to the joint distribution.

So to p of x_i given t_i and theta, times p of t_i, divided by some normalization constant.

And in general this normalization constant is hard to

compute because it's an integral with respect to t_i.

But in this case it will not cause us any trouble, because

we are able to compute it analytically, because everything is normal.

Everything is conjugate, so this posterior will also be normal, and

you can look up the formulas for its parameters on Wikipedia.

So E-step is easy.

We can easily find everything analytically.
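
A minimal sketch of this E-step in Python, assuming a one-dimensional latent t_i with prior N(0, 1) and isotropic noise, so that x_i given t_i is normal with mean w * t_i + b and covariance sigma2 times the identity. The names `e_step`, `w`, `b`, and `sigma2` are illustrative and not from the lecture, whose model may use a more general covariance.

```python
def e_step(xs, w, b, sigma2):
    """Posterior q(t_i) = N(mean_i, var) for each point, 1-D latent case.

    Assumed model (illustrative): t_i ~ N(0, 1),
    x_i | t_i ~ N(w * t_i + b, sigma2 * I).
    """
    # Posterior precision is 1 + w.w / sigma2, so the variance is
    # sigma2 / (w.w + sigma2); it is the same for every data point.
    m_scalar = sum(wj * wj for wj in w) + sigma2
    var = sigma2 / m_scalar
    # Posterior mean: w.(x_i - b) / (w.w + sigma2).
    means = [
        sum(wj * (xj - bj) for wj, xj, bj in zip(w, x, b)) / m_scalar
        for x in xs
    ]
    return means, var
```

Because everything is normal, no integral has to be computed numerically: the normalization constant is absorbed into the closed-form mean and variance.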

What about the M-step?

On the M-step we're trying to maximize the expected value of

the logarithm of the joint distribution.

So we can rewrite this formula: we can swap summation and

expected value, and we'll get a sum with respect to the objects in the data set

of the expected value, with respect to q of t_i,

of the logarithm of p of x_i given t_i, times p of t_i.
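
Written out, the M-step objective after swapping the sum and the expectation could look as follows. The symbols N (number of data points), X, and T (all observations and all latent variables) are introduced here for illustration.

```latex
\max_{\theta} \; \mathbb{E}_{q(T)} \log p(X, T \mid \theta)
  = \max_{\theta} \; \sum_{i=1}^{N} \mathbb{E}_{q(t_i)}
    \log \bigl[ p(x_i \mid t_i, \theta) \, p(t_i) \bigr]
```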

What is this?

Well, p of x_i given t_i is some normalization constant

times the exponent of something, since it's a normal distribution.

And p of t_i is also some normalization constant times an exponent, since

it's also a normal distribution.

So first of all we can notice that since the logarithm of a product is a sum of logarithms,

and since the normalization constant z doesn't depend on t_i at all,

then we can rewrite it as follows.

So we don't have to find the expected value of the logarithm of 1 divided by z.

Because it doesn't depend on t_i, and

its expected value just equals the logarithm of 1 divided by z itself, right?

Then the second term here is the expected value of the logarithm of a product of two exponents.

And the second exponent is the prior.

It's the exponent of minus t_i squared divided by 2,

in the one-dimensional case, so if t_i is one-dimensional.

And the first exponent is of some quadratic function of t_i,

which basically measures the distance between the actual point x_i and

the projection of t_i into the two-dimensional space, or

a higher-dimensional space in general.

So this thing, if you calculate it,

is some quadratic function with respect to t_i.

So the logarithm of this product of exponents is a quadratic function with respect to t_i.

And in general, computing the expected value here can be hard.

But in this case, q is a normal distribution, and

the expected value of some quadratic function with

respect to a normal distribution is not that hard.

So we can easily compute this expected value analytically.
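
As a worked instance, assume a one-dimensional t_i whose posterior q(t_i) is normal with mean m_i and variance s^2, and isotropic noise with variance \sigma^2, weight vector w, and shift b. These symbols are assumptions made for illustration, not notation taken from the lecture. Only the first two moments of q are needed:

```latex
\mathbb{E}_{q}[t_i] = m_i, \qquad \mathbb{E}_{q}[t_i^2] = m_i^2 + s^2

\mathbb{E}_{q}\!\left[ -\frac{\lVert x_i - w t_i - b \rVert^2}{2\sigma^2} - \frac{t_i^2}{2} \right]
  = -\frac{\lVert x_i - b \rVert^2 - 2 m_i \, w^\top (x_i - b) + (m_i^2 + s^2) \lVert w \rVert^2}{2\sigma^2}
    - \frac{m_i^2 + s^2}{2}
```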

Then what is left is actually some concave

function with respect to the parameters.

Because from the expected value of t_i,

or the expected value of t_i squared, we'll get just some constants with respect to the parameters.

And we'll have to maximize it with respect to the parameters theta.

And this can be done analytically, so

we can just compute the gradient, set it to 0, and solve for theta.
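
To make the two steps concrete, here is a hedged sketch of the whole EM loop in pure Python. It assumes a one-dimensional latent variable, isotropic noise sigma2 times the identity, and a shift b fixed at the sample mean. The function names and the closed-form updates (obtained by setting the gradient to zero under these assumptions) are illustrative and not taken from the lecture.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def e_step(xc, w, sigma2):
    """Posterior mean of each t_i and the shared posterior variance."""
    m_scalar = dot(w, w) + sigma2
    return [dot(w, x) / m_scalar for x in xc], sigma2 / m_scalar


def m_step(xc, means, var):
    """Closed-form maximizer of the expected log joint over w and sigma2."""
    n, d = len(xc), len(xc[0])
    second = [m * m + var for m in means]  # E[t_i^2] under q
    denom = sum(second)
    # Zero gradient in w: w = sum_i x_i E[t_i] / sum_i E[t_i^2].
    w = [sum(x[j] * m for x, m in zip(xc, means)) / denom for j in range(d)]
    ww = dot(w, w)
    # Zero gradient in sigma2: average expected squared residual.
    resid = sum(
        dot(x, x) - 2.0 * m * dot(w, x) + s * ww
        for x, m, s in zip(xc, means, second)
    )
    return w, resid / (n * d)


def fit_ppca_1d(xs, n_iter=200, seed=0):
    """EM for PPCA with a 1-D latent; returns (w, b, sigma2)."""
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    b = [sum(x[j] for x in xs) / n for j in range(d)]    # sample mean
    xc = [[x[j] - b[j] for j in range(d)] for x in xs]   # centered data
    w = [rng.gauss(0.0, 1.0) for _ in range(d)]          # random start
    sigma2 = 1.0
    for _ in range(n_iter):
        means, var = e_step(xc, w, sigma2)
        w, sigma2 = m_step(xc, means, var)
    return w, b, sigma2
```

Calling `fit_ppca_1d(data)` on a list of equal-length lists of floats returns a direction `w` that, under these assumptions, should align with the leading principal direction of the data.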

So this was a kind of hand-wavy explanation of why EM is not hard in this case.

And as we already discussed,

this probabilistic formulation of PCA allows you to do a few cool things.

So first of all, you can extend your model to be able to handle missing data,

by just considering these missing values to be latent variables and

then applying EM. And you just have to extend the scheme a little bit

to handle these new latent variables.
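
One way to write this extension down, assuming each x_i is split into an observed part and a missing part (the superscripts obs and miss are notation introduced here, not from the lecture): the E-step posterior is taken jointly over t_i and the missing coordinates, and the M-step maximizes the expected log joint as before.

```latex
q(t_i, x_i^{\text{miss}}) = p(t_i, x_i^{\text{miss}} \mid x_i^{\text{obs}}, \theta)

\max_{\theta} \; \sum_{i=1}^{N} \mathbb{E}_{q(t_i, x_i^{\text{miss}})}
  \log p(x_i^{\text{obs}}, x_i^{\text{miss}}, t_i \mid \theta)
```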

Well, second of all, it gives you a straightforward way to build

an iterative scheme to compute the PCA.

So even if you don't have missing values, in some cases,

if you have, for example, a really high-dimensional space originally

and you want to project into a really low-dimensional space,

the exact PCA can be slow, and it can be faster to use some iterative scheme.

And of course there are several iterative schemes for

doing PCA coming from the linear algebra community.

But to derive them, well, you have to know linear algebra.

You have to think carefully, and it's not that straightforward.

But expectation maximization just gives you

a straightforward way to derive some iterative scheme,

which can be more efficient than the original PCA in some cases, like when the

dimensionality of the original data is much higher than that of the low-dimensional subspace.

So you don't have to think that much; you can just apply EM and see what happens.

Another cool feature of probabilistic PCA is that, now that you have this baseline

probabilistic model, you can do a mixture of them

if the flexibility of your original PCA is not enough.

So you can easily extend your model to be more flexible, and

then not everything is analytical any more.

But you can still apply EM and compute some solution, so you can

train the model.

And finally, it sometimes helps with tuning

hyperparameters to treat everything probabilistically,

because it allows you to tune the hyperparameters by tracking the

log-likelihood on a validation set, which is sometimes not feasible

with non-probabilistic models in unsupervised problems.

And for example, you can choose between using the full covariance matrix sigma and

using just a diagonal approximation of it.

They have very different numbers of parameters.

And you can choose between these two by just considering the

log-likelihood on the validation set.
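
The selection idea can be sketched in a few lines of Python. Note that this sketch is a simplified stand-in and not the PPCA model itself: it fits a plain two-dimensional Gaussian with either a full or a diagonal covariance on training data and compares the average log-likelihood on held-out data. All function names are hypothetical.

```python
import math


def fit_gaussian(xs, diagonal):
    """ML mean and 2x2 covariance; off-diagonal forced to 0 if `diagonal`."""
    n = len(xs)
    mu = (sum(x[0] for x in xs) / n, sum(x[1] for x in xs) / n)
    sxx = sum((x[0] - mu[0]) ** 2 for x in xs) / n
    syy = sum((x[1] - mu[1]) ** 2 for x in xs) / n
    sxy = 0.0
    if not diagonal:
        sxy = sum((x[0] - mu[0]) * (x[1] - mu[1]) for x in xs) / n
    return mu, (sxx, sxy, syy)


def avg_log_likelihood(xs, mu, cov):
    """Average log density of a 2-D Gaussian over the points in xs."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    total = 0.0
    for x in xs:
        dx, dy = x[0] - mu[0], x[1] - mu[1]
        # Quadratic form with the inverse of the 2x2 covariance.
        quad = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        total += -0.5 * (2.0 * math.log(2.0 * math.pi) + math.log(det) + quad)
    return total / len(xs)


def choose_covariance(train, val):
    """Return 'full' or 'diagonal', whichever scores higher on `val`."""
    scores = {}
    for name, diagonal in (("full", False), ("diagonal", True)):
        mu, cov = fit_gaussian(train, diagonal)
        scores[name] = avg_log_likelihood(val, mu, cov)
    return max(scores, key=scores.get)
```

The same comparison applies to any pair of probabilistic models: fit each on the training split, then keep the one with the higher held-out log-likelihood.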

And in principle you can

improve your solution

to this unsupervised problem.

[SOUND]