0:00

Now, we have tackled the problem of learning a model's structure or parameters

in the case of complete data. We're now going to move to what turns out

to be a much harder situation where we're trying to learn when we have only

partially observed data. This situation arises in a variety of

settings. It arises in scenarios where

some variables are just never observed;

they're hidden or latent. It also occurs when some variables are

just missing, because some measurements weren't taken.

It turns out, as we'll see, that these settings pose significant challenges,

both in terms of the foundations, defining the learning task in a

reasonable way, and from a computational perspective, where the

computational issues that arise in this incomplete data setting are considerably

more challenging. I mentioned latent variables.

Let's try to argue why we might care about latent variables.

So one reason is that latent variables can often give rise to sparser, and

therefore easier to learn, models. So let's imagine that this is my true

network, G star, where we have three variables

leading into this variable, H, and then the three variables at the bottom, and if

all variables are binary, then this is a network that can be parameterized with

seventeen independent parameters. But now let's imagine that I've decided

that H is latent and I'm just going to learn a network over the observable

variables, which are the x's and the y's. And so what is the network that correctly

captures the structure of the distribution P over x1, x2, x3, y1, y2,

and y3? And it turns out that this network, if you think about it, has,

first, because H is not there, an edge from every X to every Y.

And furthermore, because the Y's are no longer conditionally independent

given the X's, they're only conditionally independent given the H that I don't

observe, I also have edges between the Y's directly. So the spaghetti actually

turns out to look like this, with a total of 59 parameters in the network.
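The two parameter counts can be checked with a quick sketch; the counting helper below is illustrative, not from the lecture.

```python
# Illustrative check of the parameter counts from the example.
# A binary node with k binary parents needs 2**k independent parameters
# (one free parameter per configuration of its parents).

def num_params(parent_counts):
    return sum(2 ** k for k in parent_counts)

# G*: x1, x2, x3 are roots; H has parents x1, x2, x3; each y has parent H.
with_latent = num_params([0, 0, 0, 3, 1, 1, 1])

# Marginalizing out H: x's are roots; y1 <- x1, x2, x3;
# y2 <- x1, x2, x3, y1; y3 <- x1, x2, x3, y1, y2.
without_latent = num_params([0, 0, 0, 3, 4, 5])

print(with_latent)     # 17
print(without_latent)  # 59
```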

So by dropping this one latent variable, I've created a model that is much harder

to learn. Now, of course, learning a model with

latent variables is by itself a problematic situation, but it may well

be worth the tradeoff. So the other reason why we might care

about learning latent variables is because they might be interesting.

They might provide us with an interesting characterization of structure in the

data, and I'll give you details of that in a later module but for the moment just

as a teaser, imagine that we have a data set of 3D point clouds

that are scans of a human body, and we would like to discover from that

the limb structure of the person to whom the scans

correspond. That is, we want to identify clusters in the data, clusters in the

point cloud, that correspond to body parts.

And so we want to basically end up with an output where each point has a latent

variable representing which body part it belongs to.

3:52

So, having motivated why we might care about missing data, let's think about

some of the complexities that arise. So, let's imagine that somebody gives us

this sequence over here and says, you know, here's these question marks that

correspond to missing data. How do we treat this?

And the answer is, if you don't know why these data are missing, you have no idea

how to proceed. And so to understand, let's consider two different scenarios.

The first one is, an experimenter is asked to toss a coin, and occasionally

the coin misses the table and drops on the floor, and the experimenter is, you

know, too tired to go crawl under the table to see what happened, so they

don't record the value of the coin in the cases where it fell on the floor.

Case two is, the coin is tossed, but the experimenter doesn't like tails.

For some reason, tails, you know, give them the heebie-jeebies, and so

tails are sometimes not reported. Now, these two cases really should

give rise to very different estimation procedures if we are trying to learn from

this data set. Specifically in the first case, we should

probably just ignore the question marks and just learn from the sequence of

observed instances, H, T, H, H, because for the other ones, the fact that they

are missing doesn't tell us anything about the coin. In case two, on the other hand, we can't

really ignore the missing measurements. We need to learn from the sequence H, T,

T, T, H, T, H, because ignoring the missing values is effectively ignoring

something that is predominantly or entirely tails, and so we would get

incorrect estimates if we just ignored them.
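The two estimation strategies can be sketched in a few lines. The sequence below is one reading of the example's data (the observed tosses H, T, H, H with three missing entries that complete to H, T, T, T, H, T, H); treat it as illustrative.

```python
# A coin-toss sequence with '?' marking a missing toss (this particular
# reading of the example's data is an assumption).
data = ['H', 'T', '?', '?', 'H', '?', 'H']

# Case 1 (coin fell on the floor): missingness carries no information,
# so we drop the '?'s and estimate from H, T, H, H.
observed = [d for d in data if d != '?']
p_heads_case1 = observed.count('H') / len(observed)   # 3/4

# Case 2 (experimenter hides tails): every '?' is known to be a tail,
# so we estimate from the completed sequence H, T, T, T, H, T, H.
completed = ['T' if d == '?' else d for d in data]
p_heads_case2 = completed.count('H') / len(completed)  # 3/7

print(p_heads_case1, p_heads_case2)
```

The gap between the two estimates (0.75 versus roughly 0.43) is exactly the bias we would incur by ignoring a non-random missingness mechanism.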

6:25

We define, for each variable Xi, an observability variable Oi that takes the value

one when Xi is observed, and zero otherwise. And so we always know whether we

observed the variable or not, and so Oi is always observed.

And now we're going to add a new set of random variables, which are also always

observed. These are the variables that we're going

to call Yi, which have the same value space

as Xi, except that there is also an "I didn't get to observe it" value. And so

in the real scenario, we basically get to observe

the Y's, we get to observe the O's, and the X's are not observed.

Now, the Y's are a deterministic function of the X's and the O's.

So Yi is equal to Xi when Oi is one, that is, when Xi is observed,

and Yi is equal to the "not observed" value when Oi is zero.

So in the cases where I have Oi equal to one,

I can reconstruct the value of Xi; but in the cases where I don't have

the observation, I can't. And so this is a way of just defining

the observability pattern that I have.

With this set of variables, I can now model the two different

scenarios that we had before. In this scenario, which corresponds to

the coin falling on the ground every once in a while, we have a separate model over

here that represents our observability pattern, and we see that a variable is

sometimes observed by chance, and that the target and observed value Y depends

on X and on O but there is no interaction between the value of the coin and whether

I end up observing it or not. By comparison, in the case where the

experimenter doesn't like tails, we see that the true value of the X

variable affects whether it's observed or not.

And so we have an edge from X to O. So in which of these cases can

we ignore the missing data mechanism and focus only on the likelihood of the stuff

that I get to observe? And the answer is, one can define a

notion called missing at random. Missing at random is a way for me to

say, I can ignore the mechanism for the observability and focus only on this

piece over here. So one can show that it suffices, for

focusing only on the likelihood, that this distribution over X,

Y, and O has the following characteristic: that the observation

variables O are independent of the unobserved X's,

9:47

which we're going to denote h, given the observed values y, which are my data

instances. Which means that if you tell me the

values that you observe, then the fact that something may or may not have been

observed doesn't carry any additional information.

And this is a little bit of a tricky notion, so let's try and give an example.

Imagine that a patient comes into the doctor's office, and the doctor

chooses what set of tests to perform. For example, the doctor chooses to

perform or not perform, say, a chest x-ray.

The fact that the doctor didn't choose to perform a chest x-ray is probably

because the patient didn't come in with a deep cough or some other symptoms that

suggested tuberculosis or pneumonia. And therefore the test wasn't performed.

So the observation, or lack thereof, of a chest x-ray,

the fact that a chest x-ray doesn't exist in my patient record, is probably an

indication that the patient didn't have tuberculosis or pneumonia.

So these are not independent. So in that model we do not have the

missing at random assumption holding, because the observability pattern

tells me something about the disease, which is the unobserved variable that I

care about. On the other hand, if I have in my medical record things like the

primary complaint that the patient came in with, for example, a broken leg,

then, at that point, given that the primary complaint was a broken leg, I

already know that the patient likely didn't have tuberculosis or pneumonia, and,

therefore, given that observed feature, that observed variable which is the primary

complaint, the observability pattern no longer gives me any information about the

variables that I didn't observe.

scenario that is missing at random and a scenario that isn't missing at random.

For the purposes of our discussion, we're going to make the

missing at random assumption from here on.

What's the next complication, with the case of incomplete data?

It turns out that the likelihood can have multiple global maxima.

Intuitively, that's almost obvious.

Because if you have a hidden variable that has two values, zero and one,

the values zero and one don't mean anything.

We could rename them one and zero and just invert everything.

And it would basically give us an exactly equivalent model to the one with

zero and one, because the names don't mean anything.

And so that immediately means that I have a reflection of my likelihood

function that occurs when I rename the values.

And it turns out that this is not something that happens just in this case;

when we have multiple hidden variables, the problem only becomes worse, because

the number of global maxima becomes

exponentially large in the number of hidden variables.

And so now we have a function with exponentially many reflections of itself.

And it turns out that this can also occur when you have missing data, not just

hidden variables. So even if all I have are data where

only some occurrences of a variable are missing their values, even that

can give me multiple local and global maxima.
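The label-swapping symmetry is easy to verify numerically. Here is a minimal sketch for a single binary hidden variable H with one binary child Y; the parameter values are made up.

```python
# Relabeling the hidden variable's values leaves the observed-data
# likelihood unchanged, so the two parameterizations are reflections of
# each other in the likelihood surface. Parameter values are illustrative.
theta_h = [0.3, 0.7]           # P(H=0), P(H=1)
theta_y1_given_h = [0.9, 0.2]  # P(Y=1 | H=0), P(Y=1 | H=1)

def observed_likelihood(ys, th, ty):
    # H is never observed, so it is summed out for every instance.
    p_y1 = th[0] * ty[0] + th[1] * ty[1]
    like = 1.0
    for y in ys:
        like *= p_y1 if y == 1 else 1.0 - p_y1
    return like

data = [1, 0, 1, 1, 0]
original = observed_likelihood(data, theta_h, theta_y1_given_h)
swapped = observed_likelihood(data, theta_h[::-1], theta_y1_given_h[::-1])
print(abs(original - swapped) < 1e-12)  # True: identical likelihood
```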

So, to understand this in a little more depth, let's go back to the

comparison between the likelihood in the complete data case and the likelihood in

the incomplete data case. So here is a simple model where I have

two variables x and y with x being a parent of y.

And I have three instances, and if we just go ahead and write down the complete

data likelihood, it turns out to have the following beautiful form, which we've

already seen before: the product of the probabilities of the

three instances (we've omitted writing the parameters for

clarity). And that's going to be equal to

the probability of x0, y0 given the parameters for the first instance, times the

second instance, times the third instance. And the point is, this ends up being a

nice, decomposable function of the parameters,

in terms of a product, which, if we take the log, ends up being a sum.

The likelihood decomposes by variables, and it decomposes

within CPDs. What about the incomplete data case?

Let's make our life a little bit more complicated: whereas before we had

these complete instances, now notice that both of these instances have an

incomplete observation of the variable X.

And now let's write down the likelihood function, in this case.

Well the likelihood function, is now the probability of Y0, which is the first

data instance, times the probability of X0Y1, which is the second data instance,

times another probability of y0. So since P(y0) appears twice, we've

squared this term over here. And the probability of y0 is the sum over

x of the probability of x, y0, because you have to consider both possible

ways of completing the data, for the two values of x: x0 and x1.

And so if we unravel this expression inside the parentheses, it ends up looking

like this: theta x0 times theta y0 given x0, plus theta x1 times theta y0

given x1. And the important observation about this

expression is that it is not a product of parameters in the model, which means we

cannot take its log and have it decompose over the parameters,

because the log of a summation doesn't decompose.
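The structure of this computation can be sketched directly for the three instances from the example, (?, y0), (x0, y1), (?, y0); the parameter values below are made up, and the point is the sum over completions, not the numbers.

```python
# Incomplete-data likelihood for the network X -> Y, both binary.
# Parameter values are illustrative.
theta_x = {0: 0.6, 1: 0.4}            # P(X=x)
theta_y = {(0, 0): 0.7, (0, 1): 0.3,  # P(Y=y | X=x)
           (1, 0): 0.2, (1, 1): 0.8}

def p_y(y):
    # X unobserved: sum over both completions (a tiny sum-product step).
    return sum(theta_x[x] * theta_y[(x, y)] for x in (0, 1))

# Instances: (?, y0), (x0, y1), (?, y0)  ->  L = P(y0) * P(x0, y1) * P(y0)
likelihood = p_y(0) * theta_x[0] * theta_y[(0, 1)] * p_y(0)
print(likelihood)
```

Each `p_y(0)` factor is the sum theta_x0 * theta_y0|x0 + theta_x1 * theta_y0|x1 from the slide, and it is exactly this sum whose log refuses to decompose.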

And so that means that our nice decomposition properties of the

likelihood function have disappeared in the case of incomplete data.

It does not decompose by variables: notice that we have a theta

for the X variable sitting in the same expression as an entry from the P(Y

given X) CPD. It does not decompose within CPDs, and

even computing this likelihood function actually requires that we do a sum

product computation. So it requires effectively a form of

probabilistic inference. So what do both of these properties that we

talked about in the previous slides imply

about the likelihood function?

Before, our likelihood function had the form of these gray lines over here;

so, for example, like this: this is a likelihood function in a complete data

scenario. When we have a case of

incomplete data, we're effectively summing up the probability of all

possible completions of the unobserved variables, and so the overall

likelihood function ends up being

a summation of likelihood functions that correspond to the different ways that I

had to complete the data, with this as one

such summand. So the likelihood function ends up being

a sum of these nice concave, well, log-concave, likelihood

functions; but the point is, when you add

them all up, it doesn't look so nice at all.

It ends up having multiple modes, and it's very much harder to deal with.

The second problem that we have, in addition to multimodality, is the fact

that the parameters start being correlated with each other.

So if you remember, when we were doing the case of complete data.

we had the likelihood function being composed as a product of little

likelihoods for the different parameters. What happens when we have an incomplete,

data scenario? So, when you look at this, you can see,

for example, that when X is not observed,

18:54

you have an active v-structure that goes from theta Y given X, through Y, all the

way to theta X. And so, intuitively, that suggests to us

that there is a correlation and interaction between the values that I

choose for theta Y given X, and for theta X.

And when you think about the intuition for that, it makes perfect sense.

Because for example, if theta X chooses to make X0 very, very likely.

Then, most of the instances where X is unobserved will be assigned to the X0

case. And that's basically going to assign those data instances to the X0

case, and that is going to change the

estimates of theta Y given X0 relative to theta Y given X1, because most of

the instances now correspond to X0 rather than to X1.
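That coupling can be seen directly from Bayes' rule: the posterior over a missing X depends on theta X, so the soft assignment of incomplete instances, and hence the resulting estimate of theta Y given X, shifts when theta X shifts. A small sketch with made-up numbers:

```python
# P(X=0 | Y=y) for the network X -> Y; ty[(x, y)] = P(Y=y | X=x).
# Parameter values are illustrative.
ty = {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8}

def posterior_x0(y, theta_x0):
    # Bayes' rule: the completion of a missing X depends on theta_x0.
    num = theta_x0 * ty[(0, y)]
    den = num + (1 - theta_x0) * ty[(1, y)]
    return num / den

# Making X=0 more likely a priori pushes the same incomplete instance
# harder toward the X0 bucket, which changes the counts that would be
# used to estimate theta Y given X0 versus theta Y given X1.
print(round(posterior_x0(0, 0.5), 3))  # 0.778
print(round(posterior_x0(0, 0.9), 3))  # 0.969
```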

That's the correlation between them. And to see that correlation manifesting, let's

look at some graphs. So what we're seeing here is actually the correlation between

two entries in the CPD theta Y given X: over here we see theta Y given X0, and

here is theta Y given X1. What you see here is the contour plot of

the likelihood function. It has eight data points and zero missing

measurements. And you can see that this is a nice,

product likelihood function with a nice peak in the middle.

And there's no interaction between the two parameters.

But as we start to get more and more missing measurements, you can see that the

contour plot starts to deform.

And even with three missing measurements you can see that there is significant

interaction between the value that I end up picking for theta Y given X1 and the

value that I end up picking for theta Y given X0.

So, to summarize: incomplete data is something that arises very

often in practice, and it raises multiple challenges and issues. How the

missing values were generated, what makes them missing, turns out to be very

important. There is the fact that certain components of the model are just

unidentifiable, because there are several equally good solutions; so if you pick

the best one, you'd better realize that there are others that are equally good

out there. And finally, the complexity of the likelihood function is another

significant complication when trying to deal with incomplete data.