In the last video of the last week,

we talked about Gaussian mixture models and how

they can be interpreted as models with a hidden state.

That gives the component that generates a given data point.

In this case, model estimation amounts to estimating both the hidden state s,

and the means and variances of all Gaussian components.

We also said that this problem can be solved using

the EM algorithm that iterates between two steps.

The E step estimates the hidden variables given the observed variables.

The M step maximizes the lower bound

on the likelihood of the data by tuning all parameters in the model.

Now, I want to walk you through a few other latent variable models,

which will eventually pave the way to

our discussion of reinforcement learning in the second lesson of this week.

First, let's talk about factor analysis.

Let's assume we have data in the form of T observations of an N-dimensional vector y.

Factor analysis seeks a decomposition of the signal y

into a weighted sum of some hidden or latent variables x,

which are assumed to be Gaussian with zero mean and unit variance, plus

an N-dimensional white noise epsilon, which has a diagonal covariance matrix psi.

The equation reads y equals lambda x plus epsilon, where lambda is a matrix of size N by K,

and x is a K-dimensional vector.

Lambda is called the factor loading matrix as it gives weights

of different factors or components of X in the final observed value.

Now, because both x and epsilon are Gaussian, y will also be Gaussian with

zero mean, and its covariance can be expressed as lambda times lambda transposed plus psi.
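To make this generative picture concrete, here is a minimal numpy sketch of the factor model, with the sizes N, K, and T made up for this example. It samples y = lambda x + epsilon and checks that the empirical covariance of y matches lambda lambda transposed plus psi.

```python
import numpy as np

rng = np.random.default_rng(0)

N, K, T = 5, 2, 200_000  # illustrative sizes for this sketch

Lambda = rng.normal(size=(N, K))            # factor loading matrix (N x K)
noise_vars = rng.uniform(0.1, 0.5, size=N)  # diagonal of the noise covariance Psi
psi = np.diag(noise_vars)

# Generate data: y = Lambda x + epsilon, with x ~ N(0, I) and diagonal-covariance noise
x = rng.normal(size=(T, K))
eps = rng.normal(size=(T, N)) * np.sqrt(noise_vars)
y = x @ Lambda.T + eps

# The implied covariance of y is Lambda Lambda^T + Psi
model_cov = Lambda @ Lambda.T + psi
empirical_cov = np.cov(y, rowvar=False)
print(np.max(np.abs(empirical_cov - model_cov)))  # small for large T
```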

Now, what is the point of such a replacement of

one Gaussian variable by a sum of two such variables?

Well, one reason for this is that, as we will see,

such a decomposition of y provides a compact set of model parameters.

Indeed, the number of free parameters needed to describe a covariance matrix of

a general N-dimensional vector y is N times N plus 1 divided by 2.

But the factor model relies on only N times K parameters for the matrix lambda,

plus N more parameters to describe the variances of the components of epsilon.

This makes N times K plus 1 parameters in total, which is much smaller than the first number.
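The parameter counting above can be checked in a couple of lines; the sizes N and K here are just illustrative.

```python
N, K = 100, 5  # illustrative sizes

# Free parameters in a general N x N symmetric covariance matrix: N(N+1)/2
full_cov_params = N * (N + 1) // 2

# Factor model: N*K loadings plus N diagonal noise variances, i.e. N(K+1)
factor_params = N * K + N

print(full_cov_params, factor_params)  # 5050 vs 600
```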

Also, we can compute the distribution of

hidden factors in factor analysis conditional on the observed data.

If we take a small number of factors,

this can also provide a low-dimensional representation of your data.
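The conditional distribution of the hidden factors given y is Gaussian, and its mean follows from standard Gaussian conditioning. Here is a minimal sketch of that posterior mean, E[x|y] = lambda transposed times (lambda lambda transposed plus psi) inverse times y, with sizes and parameters made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 6, 2  # illustrative sizes

Lambda = rng.normal(size=(N, K))  # hypothetical factor loadings
psi = np.diag(np.full(N, 0.2))    # hypothetical diagonal noise covariance

y = rng.normal(size=N)  # one observed N-dimensional data point

# Posterior mean of the hidden factors given y (standard Gaussian conditioning):
#   E[x | y] = Lambda^T (Lambda Lambda^T + Psi)^{-1} y
Sigma_y = Lambda @ Lambda.T + psi
x_mean = Lambda.T @ np.linalg.solve(Sigma_y, y)

print(x_mean.shape)  # a K-dimensional summary of the N-dimensional y
```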

There are also some subtle points about factor analysis.

The first point is that as it stands,

the factor model does not give a unique solution for lambda and x for given y.

To see this, let's assume that

the two given variables lambda and x are such that they provide the best fit to the data y.

Now, assume that we have an arbitrary orthogonal matrix U,

so that U times U transposed equals a unit matrix.

Now, because U times U transposed equals the identity matrix,

we can insert this product in between lambda and x in this equation.

But then, we can rename the product lambda U as

lambda hat, and also rename the product U transposed times x as x hat.

So, we get that the factor decomposition has

the same form as above, but with different parameters lambda hat and x hat.

It means that lambda and x are not unique for a given y.
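This rotation ambiguity is easy to verify numerically: for any orthogonal U, the loadings lambda and lambda U imply exactly the same covariance of y. A small sketch, using an arbitrary 2D rotation as U:

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 4, 2  # illustrative sizes

Lambda = rng.normal(size=(N, K))

# An arbitrary orthogonal matrix U (here a 2D rotation by an arbitrary angle)
theta = 0.7
U = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

Lambda_hat = Lambda @ U  # rotated loadings

# Both loading matrices contribute the same term Lambda Lambda^T to the
# covariance of y, so the observed y cannot distinguish them
print(np.allclose(Lambda @ Lambda.T, Lambda_hat @ Lambda_hat.T))  # True
```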

To resolve this, some constraints need to be added to the definition of the factor model.

Usually, the factor loading matrix is constrained to be orthogonal.

One motivation for this is provided by a link

that exists between factor analysis and the PCA.

Namely, if we take the noise covariance matrix to be proportional to a unit matrix,

then it turns out that

the maximum likelihood estimation of the model produces the result that

the factor loading matrix should be a matrix of eigenvectors

of the covariance matrix of y, stored column-wise.

Factor analysis model estimation can be done using the EM algorithm.

In the E-step, it estimates the hidden factors

x while keeping the model parameters fixed from the previous step.

In the M-step, it adjusts the model parameters

by maximizing the lower bound on the log likelihood.

Another latent variable model that I

wanted to briefly discuss here is called Probabilistic PCA.

Probabilistic PCA is a special case of factor analysis when we

take the noise covariance matrix to be proportional to a unit matrix.

Such noise is called isotropic.

Probabilistic PCA provides a probabilistic generalization

of the conventional PCA exactly as its name suggests.

It can come in very handy in many practically important situations.

For example, when your data has missing values, or when

only certain components of the data vector are missing at some dates.

Both these situations are very commonly encountered in finance.

Probabilistic PCA can work with such incomplete data

or it can be used to fill in missing values in data.

Also, because it's probabilistic,

it has far fewer issues with outliers and noise in the data than the regular PCA.

Finally, the conventional PCA is recovered from the results

of probabilistic PCA if you take the limit of sigma going to zero.
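For illustration, here is a sketch of the closed-form maximum likelihood solution of probabilistic PCA (following Tipping and Bishop's classic result): sigma squared is estimated as the average of the discarded eigenvalues of the sample covariance, and the loadings are the top eigenvectors scaled by the excess eigenvalues, so letting sigma go to zero recovers the plain PCA directions. The sizes and the "true" parameters below are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(3)
N, K, T = 5, 2, 50_000  # illustrative sizes

# Synthetic data from a probabilistic PCA model with isotropic noise
W_true = rng.normal(size=(N, K))
sigma2_true = 0.1
x = rng.normal(size=(T, K))
y = x @ W_true.T + np.sqrt(sigma2_true) * rng.normal(size=(T, N))

# Closed-form ML fit: eigendecompose the sample covariance
S = np.cov(y, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending

sigma2_ml = eigvals[K:].mean()  # average of the discarded eigenvalues
W_ml = eigvecs[:, :K] * np.sqrt(eigvals[:K] - sigma2_ml)

# As sigma2 -> 0, W_ml -> eigenvectors scaled by sqrt(eigenvalues): plain PCA
print(round(sigma2_ml, 3))  # close to the true noise level 0.1
```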

Again, probabilistic PCA can be estimated using

the EM algorithm for maximum likelihood estimation.

In the E-step, the algorithm makes inference of

the hidden state while keeping the model parameters fixed from the previous iteration.

In the M-step, it keeps the distribution over hidden nodes fixed and

optimizes over the model parameters by maximizing the lower bound on the log likelihood.

Okay. This concludes our quick excursion into latent variable models.

Let's quickly check what we learned and then move on.