which we're going to denote h, given the observed values y, which are my data instances. This means that if you tell me the values that you observe, then the fact that something may or may not have been observed doesn't carry any additional information.
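Stated formally, the condition being described is a conditional independence (this is a reconstruction, writing $\mathcal{O}$ for the observability pattern):

$$P(\mathcal{O} \mid \mathbf{h}, \mathbf{y}) \;=\; P(\mathcal{O} \mid \mathbf{y}), \qquad \text{i.e.,} \qquad (\mathcal{O} \perp \mathbf{H} \mid \mathbf{Y})$$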

And this is a little bit of a tricky notion, so let's try and give an example.

Imagine that a patient comes into the doctor's office, and the doctor chooses what set of tests to perform. For example, the doctor chooses to perform or not perform, say, a chest X-ray.

The doctor probably didn't choose to perform a chest X-ray in the case where the person didn't come in with a deep cough or some other symptoms that suggested tuberculosis or pneumonia, and therefore the test wasn't performed. So the observation, or lack thereof, of a chest X-ray, that is, the fact that a chest X-ray doesn't exist in my patient record, is probably an indication that the patient didn't have tuberculosis or pneumonia.

So these are not independent. In that model, the missing at random assumption does not hold, because the observability pattern tells me something about the disease, which is the unobserved variable that I care about. On the other hand, suppose I have in my medical record things like the primary complaint that the patient came in with, for example, a broken leg. Then, given that the primary complaint was a broken leg, I already know that the patient likely didn't have tuberculosis or pneumonia. And therefore, given that observed variable, which is the primary complaint, the observability pattern no longer gives me any information about the variables that I didn't observe. And so that is the difference between a scenario that is missing at random and a scenario that isn't missing at random.

For the purposes of our discussion, we're going to make the missing at random assumption from here on.

What's the next complication with the case of incomplete data? It turns out that the likelihood can have multiple global maxima. So, intuitively, that's almost obvious.

Because if you have a hidden variable that has two values, zero and one, the values zero and one don't mean anything. We could rename them one and zero and just invert everything, and it would give us a model exactly equivalent to the one with zero and one, because the names don't mean anything. And so that immediately means that I have a reflection of my likelihood function that occurs when I rename the values.

And it turns out that this is not something that happens just in this case. When we have multiple hidden variables, the problem only becomes worse, because the number of global maxima becomes exponentially large in the number of hidden variables. And so now we have a function with exponentially many reflections of itself. And it turns out that this can also occur when you have missing data, not just hidden variables. So even if all I have are data where only some occurrences of a variable are missing their values, even that can give me multiple local and global maxima.
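Here is a minimal sketch of the relabeling symmetry just described, for a single binary hidden variable H with an observed binary child Y (the model, parameter values, and function names are illustrative assumptions, not from the lecture):

```python
import numpy as np

def log_likelihood(y_data, p_h1, p_y1_given_h):
    """Log-likelihood of the observed y's, marginalizing out the hidden H."""
    ll = 0.0
    for y in y_data:
        # Sum over both completions of the hidden variable H.
        p_y = sum(
            (p_h1 if h == 1 else 1.0 - p_h1)
            * (p_y1_given_h[h] if y == 1 else 1.0 - p_y1_given_h[h])
            for h in (0, 1)
        )
        ll += np.log(p_y)
    return ll

y_data = [1, 0, 1, 1, 0]
theta = (0.3, {0: 0.9, 1: 0.2})    # P(H=1) and P(Y=1 | H=h)
swapped = (0.7, {0: 0.2, 1: 0.9})  # the same model with the labels of H exchanged

# The two parameterizations assign identical likelihood to every dataset,
# so any maximum of the likelihood comes with a mirror-image maximum.
print(log_likelihood(y_data, *theta))
print(log_likelihood(y_data, *swapped))
```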

So to understand that in a little more depth, let's go back to the comparison between the likelihood in the complete data case and the likelihood in the incomplete data case. So here is a simple model where I have two variables, X and Y, with X being a parent of Y.

And I have three instances, and if we just go ahead and write down the complete data likelihood, it turns out to have the following beautiful form, which we've already seen before: the product of the probabilities of the three instances, where we've omitted writing the parameters for clarity. And that's going to be equal to a product of individual parameters: for an instance $(x^0, y^0)$, for example, the term $\theta_{x^0}\,\theta_{y^0 \mid x^0}$, and similarly for the second instance and the third instance. And the point is, this ends up being a nice decomposable function of the parameters.
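Written out in full (a reconstruction of the equation being described, with $x[m]$ and $y[m]$ denoting the values of $X$ and $Y$ in the $m$-th instance):

$$L(\Theta : \mathcal{D}) \;=\; \prod_{m=1}^{3} P\big(x[m], y[m] \mid \Theta\big) \;=\; \prod_{m=1}^{3} \theta_{x[m]}\, \theta_{y[m] \mid x[m]}$$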

It's a product which, if we take the log, ends up being a sum. As a likelihood, it decomposes: it decomposes across variables, and it decomposes within the CPDs. What about the incomplete data case?

Let's make our life a little bit more complicated. Whereas before we had these complete instances, now notice that both of these instances have an incomplete observation regarding the variable X.

And now let's write down the likelihood function in this case. Well, the likelihood function is now the probability of $y^0$, which is the first data instance, times the probability of $(x^0, y^1)$, which is the second data instance, times another probability of $y^0$. Since $P(y^0)$ appears twice, we've squared this term over here. And the probability of $y^0$ is the sum over $x$ of the probability of $(x, y^0)$: you have to consider both possible ways of completing the data, for the different values of $x$, namely $x^0$ and $x^1$. And so if we unravel the expression inside the parentheses, it ends up looking like this:

$$P(y^0) \;=\; \sum_{x} P(x, y^0) \;=\; \theta_{x^0}\,\theta_{y^0 \mid x^0} \;+\; \theta_{x^1}\,\theta_{y^0 \mid x^1}$$

And the important observation about this expression is that it is not a product of parameters in the model, which means we cannot take its log and have it decompose over the parameters, because the log of a summation doesn't decompose.

And so that means that our nice decomposition properties of the likelihood function have disappeared in the case of incomplete data. It does not decompose by variables: notice that we have a theta for the X variable sitting in the same expression as an entry from the P(Y | X) CPD. It does not decompose within CPDs. And even computing this likelihood function actually requires that we do a sum-product computation, so it effectively requires a form of probabilistic inference.
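As a concrete illustration, here is a small sketch of this computation for the X → Y example above, with the two partially observed instances from the lecture; the parameter values and names are made-up assumptions:

```python
import numpy as np

# Parameters of the network X -> Y (values are illustrative).
theta_x = {0: 0.6, 1: 0.4}                # P(X = x)
theta_y_given_x = {0: {0: 0.8, 1: 0.2},   # P(Y = y | X = 0)
                   1: {0: 0.3, 1: 0.7}}   # P(Y = y | X = 1)

def p_xy(x, y):
    """Complete-instance probability: a simple product of parameters."""
    return theta_x[x] * theta_y_given_x[x][y]

def p_y(y):
    # Marginalizing out the missing X requires summing over its values:
    # a (tiny) sum-product inference computation.
    return sum(p_xy(x, y) for x in (0, 1))

# Dataset: (None, 0) means X is missing in that instance.
data = [(None, 0), (0, 1), (None, 0)]

log_lik = 0.0
for x, y in data:
    log_lik += np.log(p_xy(x, y) if x is not None else p_y(y))

# The P(y0) factor appears twice, matching the squared term in the lecture.
print(log_lik)
```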

So what do both of these properties that we talked about on the previous slides imply about the likelihood function?

Before, our likelihood function had the form of these gray lines over here; so, for example, like this: this is a likelihood function from a complete data scenario. When we have a case of incomplete data, we're effectively summing up the probabilities of all possible completions of the unobserved variables, and so the overall likelihood function ends up being a summation of likelihood functions that correspond to the different ways that I could complete the data, with this as one such summand. So the likelihood function ends up being a sum of these nice concave, well, log-concave, likelihood functions, but the point is, when you add them all up, it doesn't look so nice at all. It ends up having multiple modes, and it's very much harder to deal with.
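In symbols, the observed-data likelihood being described sums the complete-data likelihood over every possible completion $\mathbf{h}$ of the hidden or missing values (again a reconstruction, in the notation used above):

$$L(\Theta : \mathcal{D}) \;=\; P(\mathcal{D} \mid \Theta) \;=\; \sum_{\mathbf{h}} P(\mathcal{D}, \mathbf{h} \mid \Theta)$$

Each term in the sum is a nice, unimodal complete-data likelihood, but the sum of exponentially many of them is generally multimodal.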

The second problem that we have, in addition to multimodality, is the fact that the parameters start being correlated with each other. So if you remember, when we were doing the case of complete data, we had the likelihood function being decomposed as a product of little likelihoods for the different parameters. What happens when we have an incomplete data scenario? So when you look at this, you can see, for example, what happens when X is not observed.