0:00

So far we've talked a lot about the representation of probabilistic graphical models: we defined different classes of probabilistic graphical models, Bayesian networks and Markov networks, and talked about their independence assumptions. Now let's operationalize probabilistic graphical models and figure out how to use this declarative representation to answer actual queries.

It turns out there are many queries that one can use a PGM to answer, but perhaps one of the most commonly used is what's called a conditional probability query. So let's define a conditional probability query. In a conditional probability query we have some set of observations, little e, about a set of variables, big E. These are the variables that we happened to observe. We also have a particular query that we care about, which we are going to denote as a set of variables, Y. So our goal here is to compute the conditional probability of the variables Y given the evidence E = e, that is, the conditional probability P(Y | E = e).

This type of query is useful for a variety of different applications. For example, in the medical or fault diagnosis domains that we've spoken about, we might have observations about certain symptoms or test results, and we're interested in predicting the probability of different failure modes or different diseases. In the pedigree analysis example that we also talked about, we might have observations about the phenotype, or maybe even the genotype, of some individuals in the family, and are interested in reaching conclusions about other individuals.

Unfortunately, like most interesting problems, the problem of inference in probabilistic graphical models is NP-hard. So before we talk about that, let's first remind ourselves what NP-hardness is.

I'm not going to define it formally in this course.

But as a rough intuition, if a problem has been shown to be NP-hard, it means that it is extremely unlikely to have an efficient solution. Slightly more formally, it means that all algorithms that people have devised for this problem, and for a bunch of problems that are equally hard, require at the very least time exponential in the size of the representation of the problem, which means it's unlikely to be solvable for any problem bigger than a small handful of, say, variables in our context.

So which particular problems in the context of PGM inference are NP-hard? Well, first, the basic problem of exact inference in a PGM is NP-hard. So given a PGM P_Φ defined by a set of factors Φ, a particular target variable X, and a value little x, computing the probability of the event X = x is NP-hard. In fact, even a simpler special case of this problem, one where we just want to figure out whether this probability is even positive, is also NP-hard.

So one might then say, well, okay, exact inference is intractable, but what if we compromise a little bit on accuracy? What if we're just looking for an approximate answer? After all, we don't necessarily care about the tenth significant digit of this probability. Does that make the problem easier? Unfortunately, it turns out that the answer is no, and there are many different variants of NP-hardness results for approximate inference.

This is just one of them. This one says that, again, given a PGM, an event X = x, and some set of observations, the problem of finding an approximate answer p for which we are guaranteed that this approximate answer is within ε of the truth is NP-hard. And this is true for any ε less than 0.5. Note that ε = 0.5 can be obtained by simply guessing at random, or by guessing the value 0.5. So any non-trivial approximation for this kind of conditional probability query is also intractable.

This is a rather depressing observation, and it might make us want to give up on using PGMs for inference. But it turns out that one shouldn't get depressed too quickly, because, importantly, this is a worst-case result.

That means we can construct these bizarre PGMs for which exponential time is the best we can do, but it doesn't mean that in the general case one can't do better. And the rest of the inference section of this course will be devoted to showing a variety of algorithms, whether for exact inference or for approximate inference, that can do considerably better than this worst-case result would suggest.

So first, to understand where these algorithms might be getting their power, let's drill down into the inference question a little bit more.

Here is a slightly elaborated version of our student network, where we now have additional variables. For example, the coherence of a course influences its difficulty, along with a variety of other variables that we're not going to talk about.

Inference in this model, and inference for probabilistic graphical models in general, uses the notion of factors that we've talked about before. It turns out that it's convenient to use factors because it means the algorithms apply equally well to Bayesian networks and Markov networks; it's a very useful abstraction. So let's think about this Bayesian network in the context of a set of factors. Here, for example, we had initially a prior probability over the C variable, P(C), and that translates into a factor whose scope is C. We have, for example, P(G | I, D), and that translates into a factor whose scope is G, I, and D. And in general, each one of the CPDs in this network converts to a factor over the scope of the family, that is, the variable and its parents.

So now let's assume that our goal is to compute the probability of the variable J, so we'd like to compute P(J). What we see here is the joint distribution, which we have defined using the chain rule for Bayesian networks. In order to compute P(J), what we need to do is simply eliminate, or marginalize out, all of the variables except for J, and so we end up with a summation that looks like this. This is why this problem of conditional probability inference is often called sum-product: we have a sum over a product of factors.
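As a minimal sketch of this sum-product computation, consider a hypothetical two-variable network D → G rather than the full student network; all CPD values here are made up for illustration.

```python
from itertools import product

# Toy Bayesian network D -> G with binary variables. Each CPD is
# stored as a factor: a dictionary from assignments over its scope
# to numbers. All values are made up.
P_D = {(0,): 0.6, (1,): 0.4}                 # factor with scope (D,)
P_G_given_D = {(0, 0): 0.7, (0, 1): 0.3,     # factor with scope (D, G)
               (1, 0): 0.2, (1, 1): 0.8}

# Sum-product: P(G = g) = sum_d P(D = d) * P(G = g | D = d).
# We multiply the factors and marginalize out D.
P_G = {0: 0.0, 1: 0.0}
for d, g in product([0, 1], repeat=2):
    P_G[g] += P_D[(d,)] * P_G_given_D[(d, g)]

print(P_G)  # a proper distribution over G: its values sum to 1
```

The same sum over a product of factors, run over all variables except J, is exactly what computing P(J) in the student network requires.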

7:04

The inference problem for Markov networks takes exactly the same form. Here we once again have a product of factors, and in this case the factors are the form in which the network was originally defined. So we have the factor φ1 over A, B, as well as φ2, φ3, and φ4, and if we are interested in computing the probability P(D), then once again we need to compute the product of these factors and then marginalize out.

Now, what I wrote here is not actually quite right, because in a Markov network the product of the factors isn't actually P(A, B, C, D); rather, it's P̃(A, B, C, D), which is the unnormalized measure, and in order to get the normalized measure we need to normalize, that is, divide by the partition function. So how do we deal with that if our goal is to compute P(D)? Well, the point is that if we ignore the partition function and instead compute P̃(D), we can infer that P(D) is actually equal to (1/Z) P̃(D), because the normalizing constant is a constant. And so if we've computed P̃(D), we can obtain P(D) by a simple process of renormalization of P̃(D). What about evidence?
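Before getting to evidence, here is a minimal sketch of that renormalization step on a toy Markov network over just two variables (all factor values made up):

```python
# Toy Markov network over two binary variables A and B, with two
# factors; all values are made up. The product of the factors gives
# the unnormalized measure P~.
phi_1 = {(0,): 1.0, (1,): 3.0}               # scope (A,)
phi_2 = {(0, 0): 2.0, (0, 1): 1.0,           # scope (A, B)
         (1, 0): 1.0, (1, 1): 4.0}

# P~(B = b) = sum_a phi_1(a) * phi_2(a, b): multiply the factors
# and marginalize out A, ignoring the partition function for now.
p_tilde_B = {b: sum(phi_1[(a,)] * phi_2[(a, b)] for a in (0, 1))
             for b in (0, 1)}

# Because Z is a constant, P(B) = (1/Z) * P~(B): we can recover the
# normalized distribution by renormalizing at the end.
Z = sum(p_tilde_B.values())
P_B = {b: v / Z for b, v in p_tilde_B.items()}
```

Note that Z here is computed from the already-marginalized P̃(B), which works precisely because the normalizing constant is the same for every assignment.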

Well, it turns out evidence can be handled by a simple preprocessing step of factor reduction. So here, if we're trying to compute the probability of a set of variables Y given evidence E = e, that, by definition, is the ratio between the probability of Y and the evidence, divided by the probability of the evidence: P(Y | E = e) = P(Y, e) / P(e). Let's look at the numerator of this expression, and define W to be all the variables that are neither query nor evidence.

9:26

Then we can once again consider this as a sum-product expression. So for P(Y, e), we are going to introduce the variables W into this expression: this probability is the sum of the joint probability, marginalizing out W. Now if we rewrite this expression, we can view it as a product of factors, and that's the case whether it's a Bayesian network or a Markov network. And that product of factors includes only the components of the factors that are consistent with my evidence E = e, which means I reduce the factors by the evidence.

To understand what that means, let's look at this example over here, and let's imagine, for example, that we have an observation, say A = a0, and we want to compute the probability distribution in the context of this observation. What that means is we're going to take every single one of the factors that involves A, and remove from it the entries that correspond to A = a1, because those are not going to be consistent with our observation A = a0.
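This factor reduction step can be sketched as follows, on a toy factor over A and B with made-up values:

```python
# Factor reduction by evidence: keep only the entries of a factor
# that are consistent with the observation, and drop the rest.
def reduce_factor(factor, position, observed_value):
    """Keep entries whose assignment at `position` matches the evidence."""
    return {assignment: value
            for assignment, value in factor.items()
            if assignment[position] == observed_value}

phi = {(0, 0): 5.0, (0, 1): 2.0,   # scope (A, B); index 0 is A
       (1, 0): 1.0, (1, 1): 7.0}   # made-up values

# Observe A = 0 (standing in for A = a0): the A = 1 rows disappear.
reduced = reduce_factor(phi, position=0, observed_value=0)
print(reduced)  # only the entries with A = 0 remain
```

Applying this to every factor that mentions an evidence variable is the whole preprocessing step; the remaining factors are untouched.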

Once we've reduced those factors to be consistent with the evidence, we still have a product of factors, and we can go ahead and treat it in exactly the same way as we did before. So we now have a sum, over the variables W that need to be eliminated, of the product of the reduced factors, and once again we can ignore the partition function by computing this probability and then renormalizing at the end.

This applies equally well in the context of Bayesian networks. So here we might have, say, the observation I = i and H = h.

Now this is no longer an equality, because we have not yet conditioned on the evidence. And if we want to make this an equality, we have to reduce each one of the factors involving I and H based on the evidence. So this turns into φ_I(i), which, as it happens, is a constant because there are no other variables in the scope, and similarly here and here. For H = h we do the same thing, which in this case involves only this factor over here. And now it's back to being an equality.

And if we want to compute the probability of J given

12:42

I = i and H = h, we take this summation and renormalize, just like before.

So to summarize, the sum-product algorithm can be done as follows. We convert the conditional probability into a ratio. The numerator of this ratio is a product of reduced factors, summed out over the remaining variables W. The denominator is simply the numerator, summed out over the query variables Y. And if we divide these two, we realize that we can get away with simply computing this product of reduced factors and normalizing at the end.

There are many algorithms for computing conditional probability queries. One of them involves pushing the summations into the factor product; this gives rise to an algorithm called variable elimination, which turns out to be a special case of a class of algorithms called dynamic programming. And it's a form of exact inference.
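The idea of pushing summations into the factor product can be sketched on a hypothetical chain network A → B → C (binary variables, made-up CPD values): instead of building the full joint over A, B, and C, we eliminate one variable at a time.

```python
# Chain network A -> B -> C; all CPD values are made up.
P_A = {0: 0.5, 1: 0.5}
P_B_given_A = {(0, 0): 0.9, (0, 1): 0.1,   # key (a, b)
               (1, 0): 0.2, (1, 1): 0.8}
P_C_given_B = {(0, 0): 0.7, (0, 1): 0.3,   # key (b, c)
               (1, 0): 0.4, (1, 1): 0.6}

# Push the sum over A inward and eliminate A first:
# tau_1(b) = sum_a P(a) * P(b | a).
tau_1 = {b: sum(P_A[a] * P_B_given_A[(a, b)] for a in (0, 1))
         for b in (0, 1)}

# Then eliminate B: P(c) = sum_b tau_1(b) * P(c | b).
# The full joint over (A, B, C) is never materialized.
P_C = {c: sum(tau_1[b] * P_C_given_B[(b, c)] for b in (0, 1))
       for c in (0, 1)}
```

Each elimination produces a new intermediate factor (here tau_1), which is why the cost depends on the sizes of these intermediate factors rather than on the size of the full joint.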

A generalization of this algorithm performs message passing over a graph, and it also effectively deals with summations and factor products. There are many variants of this algorithm, some of which are exact and others approximate. Here are some names of algorithms, only some of which we'll have an opportunity to discuss.

And then, finally, there's a very different class of algorithms that uses random sampling as its key technique. It samples complete instantiations, or assignments, in a variety of different ways, and uses those assignments to approximate the probability of a particular query.
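A minimal sketch of this sampling idea, using forward sampling on a hypothetical network A → B with made-up parameters:

```python
import random

# Toy network A -> B, both binary; all parameters are made up.
# We approximate P(B = 1) from complete sampled assignments.
random.seed(0)
P_A_is_1 = 0.3
P_B_is_1_given_A = {0: 0.2, 1: 0.9}

def sample_assignment():
    """Sample a complete assignment (a, b), parents before children."""
    a = 1 if random.random() < P_A_is_1 else 0
    b = 1 if random.random() < P_B_is_1_given_A[a] else 0
    return a, b

n = 100_000
estimate = sum(sample_assignment()[1] for _ in range(n)) / n
# Exact answer here is 0.7 * 0.2 + 0.3 * 0.9 = 0.41; the estimate
# converges toward it as n grows.
```

Handling evidence in sampling-based methods is more subtle (for example, by rejecting or reweighting inconsistent samples), which is part of what distinguishes the different variants.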

So here we have both exact and approximate methods, and this is an approximate method. We'll talk about some of these in the remaining sections of the course.