0:00

[MUSIC]

Are ice cream sellers evil?

Probably not, well, at least not all of them.

But I can totally imagine a situation where the price of the ice cream goes

up whenever the temperature outside goes up.

And if it's indeed the case, we can see a plot like this.

Here on the x-axis we have temperature, and

on the y-axis we have the price of the ice cream.

And each data point corresponds to some particular day: we measured the temperature,

we asked some ice cream seller about his price, and

we plotted this data point on the two-dimensional plane.

So we can see that these two variables are strongly correlated here and

related to each other.

Can we exploit this closeness, this relatedness, of these two random variables?

Well, we may say that these two variables are so

related that you can use one to measure the other.

For example, if you want to know the temperature outside, but

you forgot your thermometer and also your smartphone,

you can ask your closest ice cream dealer for his price and

compute the temperature from that.

Which basically means that these two numbers are so

related that you don't have to use two.

You can as well use just one of them, and compute the other from the first one.

Or, to put it a little bit differently, you can draw a line which

goes through your data and is kind of aligned with your data.

Then you can project each data point you have on this line.

And this way, instead of two numbers to describe each data point,

you now can use one, the position on this line.

And this way you will not lose much.

So how much information do you lose when you project points?

If you look at the lengths of the projections, where each blue data

point is projected onto the corresponding orange one,

you see the lengths are not large, so you keep most of

the information in your data set by projecting on this line.

And now, instead of this two-dimensional data,

you can use a one-dimensional projection.

So you can use just the position of

this line as your description of the data point.
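This projection onto a line can be sketched in a few lines of NumPy. The temperature and price numbers below are made up for illustration; the line is found as the direction of largest variance of the centered data (the first right singular vector), which is what PCA does.

```python
import numpy as np

# Hypothetical synthetic data: temperature (x-axis) vs. ice cream price
# (y-axis); the exact numbers are made up for illustration.
rng = np.random.default_rng(0)
temp = rng.uniform(10, 35, size=100)
price = 0.1 * temp + 1.0 + rng.normal(0.0, 0.2, size=100)
X = np.column_stack([temp, price])          # the blue points, shape (100, 2)

# Center the data and take the direction of largest variance
# (first right singular vector) as the line.
mean = X.mean(axis=0)
Xc = X - mean
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
direction = Vt[0]                           # unit vector along the line

t = Xc @ direction                          # one number per point: position on the line
X_proj = mean + np.outer(t, direction)      # the orange points: projections back in 2D
```

Each point is now described by the single number `t`, and `X_proj` shows how little is lost by going back to 2D.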

And it's just another way to say that these two random variables are so

connected that you don't have to use two.

You may as well use just one to describe both of them, and

this is exactly the idea of dimensionality reduction.

So you have two-dimensional data and you project it into one dimension,

trying to keep as much information as possible.

And one of the most popular ways to do it is called

principal component analysis, or PCA, for short.

And PCA tries to find the best possible linear transformation,

which projects your two-dimensional data into 1D.

Or more generally, your multidimensional data into lower dimensions,

while keeping as much information as possible.

So PCA is cool.

It gives you an optimal solution to this kind of problem.

It has an analytical solution, so

you can just write down the formula for the solution of the PCA problem, and

this analytical formula is very fast to implement.

So if you give me 10,000 dimensional points,

I can return you back the same points projected to ten dimensions, for

example, while keeping most of the information.

And I can do it in milliseconds, so it's really fast.
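As a rough sketch of that analytical solution: one SVD of the centered data gives the projection directly. The dimensions here are smaller made-up numbers (500 points, 100 dimensions down to 10) just to keep the example quick.

```python
import numpy as np

# A sketch of PCA's analytical solution via SVD: project
# 100-dimensional points down to 10 dimensions (random made-up data).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 100))             # 500 points in 100 dimensions

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
Z = (X - mean) @ Vt[:10].T                  # the same points in 10 dimensions
```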

But sometimes people are still not happy enough with this PCA,

and try to formulate PCA in probabilistic terms. Why?

Well, formulating your usual problem in probabilistic terms

may give you some benefits, like being able to handle missing data, for example.

So in the original paper that proposed this probabilistic version of PCA,

they try to project some multidimensional data in two dimensions,

so you can now plot this data on two-dimensional plane.

And then they tried to obscure some of the data, so

introduce missing values into it.

They threw away some parts of the features, and

then they projected this data set with missing values again.

And you can see that these two projections don't differ that much,

which means that we don't lose that much information by throwing

away some parts of the data, which is really cool, right?

We were able to handle these missing values, and

the solution doesn't change much when we introduce them.

So we're really robust to missing values.

By the way, the paper where they proposed this probabilistic principal

component analysis is really good.

So check it out if you have time and

if you want to know more details about this model.

So let's try to derive the main ideas behind this probabilistic

principal component analysis in the following few slides.

So first of all, it's natural to call this low-dimensional

representation of your data, so in this example the one-dimensional

position of each orange data point, a latent variable.

Because it's something you don't know,

you don't observe it directly, and it somehow causes your data.

So the position of your orange data point on the line, this ti,

it influences where the data point will end up on the two-dimensional plane.

So it influences the position of the observed point, right?

So it's natural to introduce this latent variable model where you have ti,

which causes xi.

And you have to define some prior for ti, and

why not just set it to standard normal?

This will just mean that your projections,

your low dimension projections,

will be somewhere around 0 and will have variance around 1.

Which, why not?

It's a nice property to have.

Now we have to define the likelihood, so

the probability of xi given ti. How are xi and ti connected?

So how is this one-dimensional data related to the two-dimensional data?

Well, if you look at the orange two-dimensional point,

the projection of x,

it equals some vector times the position on

this one-dimensional line plus some shift vector.

So we can linearly transform from this one-dimensional line

to two-dimensional space and get these orange projected points.

Or more generally, we can multiply ti by some matrix W, and

then add some bias vector, b, and we'll get our orange projections, xi.

And this W and b will be our parameters,

which we aim to learn from data.

Okay, but these are the orange points, right?

How can we recover the blue points, the original data?

8:05

Well, it's kind of, I don't know how to do it.

I mean, it's impossible to exactly say where the blue point will be if you

know the orange point, because you don't know how much information you lost, right?

But you don't have to say it exactly,

you can just model it somehow probabilistically.

So let's say that the blue point, xi, which we observe,

is just orange point plus some random noise, which is centered around 0.

And has some covariance matrix sigma, which we'll also treat as a parameter.

This way we're kind of saying that our blue observed

data points are the same as the projection of our

one-dimensional data into 2D plus some Gaussian noise.

Which means that we don't actually know where the blue points occur, but

we expect them to be somewhere around the orange points, around the projections.

Okay, so we have a latent variable model like this,

so ti causes xi, and we have defined the model fully.

So we have prior of ti is standard normal, and we have likelihood.

So xi given ti is also some normal distribution.
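The generative story of this model is easy to sample from. A minimal sketch, with made-up illustrative values for W, b, and an isotropic noise scale sigma:

```python
import numpy as np

# Sampling from the pPCA latent variable model (2D data, 1D latent):
#   t_i ~ N(0, 1),   x_i | t_i ~ N(W t_i + b, sigma^2 I).
# W, b, and sigma here are made-up illustrative values.
rng = np.random.default_rng(2)
W = np.array([[2.0], [0.5]])                # 2x1 loading matrix
b = np.array([1.0, 3.0])                    # shift (bias) vector
sigma = 0.1                                 # noise standard deviation

t = rng.normal(size=(1000, 1))              # latent positions on the line (the prior)
X = t @ W.T + b + sigma * rng.normal(size=(1000, 2))  # observed blue points
```

Each row of `X` is an observed point scattered around its orange projection `W t_i + b`.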

Now we want to train this kind of model.

So we want to find the parameters with, for example,

maximum likelihood estimation.

Well, first of all, as usual, we will assume that the likelihood factorizes,

so it equals the product of the likelihoods of the individual objects,

the data points.

And then we can rewrite this marginal likelihood of

an individual object by marginalizing out ti.

So it's the joint distribution, p of xi and ti, and then we have to sum out ti.

But previously we had sums, now we have an integral,

because this latent variable ti is continuous.

And to sum it out it means to integrate it out.

Note that in general, this integral is intractable, and it's really hard to

optimize this function, because we can't even compute it at any given point.

The integral is intractable.

So it's really cool that the EM algorithm allows you to optimize these kinds of functions.

Although you sometimes can't even compute them at any given point,

in this particular situation you don't need that.

So everything is normal.

Everything is conjugate here,

which means that you can analytically integrate this latent variable ti out.

So you can now do this integral, and it will also be a normal

distribution with some parameters, which you can look up in Wikipedia.
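Concretely, integrating t out gives the Gaussian marginal p(x) = N(x | b, W Wᵀ + σ²I). A sketch that sanity-checks this closed form against samples from the generative model, with the same made-up W, b, and sigma as before:

```python
import numpy as np

# Marginal of x after integrating the latent t out analytically:
#   p(x) = N(x | b, W W^T + sigma^2 I)
# (here sigma^2 I because we assume isotropic noise for simplicity).
W = np.array([[2.0], [0.5]])
b = np.array([1.0, 3.0])
sigma = 0.1

marginal_cov = W @ W.T + sigma**2 * np.eye(2)

# Compare against the empirical covariance of generative-model samples.
rng = np.random.default_rng(3)
t = rng.normal(size=(100000, 1))
X = t @ W.T + b + sigma * rng.normal(size=(100000, 2))
sample_cov = np.cov(X.T)                    # should be close to marginal_cov
```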

And then you have a product of Gaussians, and

you can analytically compute the optimal parameters.

So you can take the logarithm of this thing.

You can compute the gradient, and then you can set this gradient equal to 0 and

find the maximum likelihood parameters analytically.

11:16

And somewhat unexpectedly, we will find out that

the optimal parameters of this probabilistic

model are given by exactly the same formulas as PCA.

So look what happened.

We started with PCA, we interpreted it probabilistically.

We found the maximum likelihood parameters for this probability model,

and they turned out to be the same as the original formulas for PCA.

It's kind of unsettling, because we spent this whole,

I don't know, ten minutes, and we didn't get anything useful from that.

We got the same formulas as PCA, but

it turns out that this probabilistic interpretation is still useful.

So here we don't need the EM algorithm at all,

because we have everything analytical and nice.

But if we change the model a little bit,

then we will not be able to compute anything analytically anymore.

But with EM, we'll still be able to train it.

So let's say you introduce missing values.

You do not observe some part of your xis, then you have more latent

variables than you used to have, and you can't find

the maximum likelihood parameters analytically anymore.

But you can still apply EM, and this will give you some valid solution.

So in the next video,

we will talk a little bit about how to apply the EM algorithm in this case.