
One of the most elegant properties of probabilistic graphical models is the

intricate connection between the factorization of the distribution as a

product of factors and the independence properties that the distribution

needs to satisfy. Now we're going to talk about how that

connection manifests in the context of directed graphical models, or Bayesian

networks. So, let's first remind ourselves about

why independence and factorization are related to each other.

So, for example, the statement that P of X, Y is the

product of two factors, P of X and P of Y, is the definition of independence.
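This equivalence is easy to sanity-check numerically. Below is a minimal sketch (the marginal values are made up purely for illustration): build a joint as a product of two factors, then confirm that the joint's own marginals reproduce exactly that factorization.

```python
# Independence of X and Y is exactly the statement that the joint
# factorizes as P(x, y) = P(x) * P(y). Marginals below are invented.
p_x = {0: 0.3, 1: 0.7}
p_y = {0: 0.6, 1: 0.4}

# Build the joint as a product of the two factors.
p_xy = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

# Recover the marginals from the joint and confirm the factorization holds.
marg_x = {x: sum(p_xy[x, y] for y in p_y) for x in p_x}
marg_y = {y: sum(p_xy[x, y] for x in p_x) for y in p_y}

for (x, y), v in p_xy.items():
    assert abs(v - marg_x[x] * marg_y[y]) < 1e-12
```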

And at the same time, it's a factorization of the joint distribution as a product of

two factors. Similarly, one of the definitions that we

gave for conditional independence, which is that the joint distribution over X, Y

and Z is a factor over X and Z times a factor over Y and Z, is likewise

both a definition and a factorization. So, once again, independence corresponds to

factorization. So, we see that the factorization of the

distribution corresponds to independencies that hold in that

distribution. And the question is: if we know now that a

distribution P factorizes over G, can we know something about

the independencies that the distribution P must satisfy, just by looking at the

structure of the graph G? So what are the independencies that

might hold in a probabilistic graphical model?

So we talked about the notion of flow of influence in a probabilistic graphical

model, where we have, for example, the notion of an active trail that goes

up through I, down through G, and up through D if, for example,

G is observed, that is, G is in Z. And that gave us an intuition about

where probabilistic influence might flow. We can now turn this notion on its head

and ask the question: well, what happens when we tell you that there are no active

trails in the graph, that is, influence can't flow?

So we're going to make that notion formal using the notion of d-separation.

And we're going to say that X and Y are d-separated in a graph G given a set of

observations Z if there's no active trail between them given Z.

And the intuition that we'd like to demonstrate in this context is that this

notion of "influence can't flow" corresponds, much more formally, to the

rigorous notion of conditional independence in the distribution.

So let's actually prove that that's in fact the case.

So the theorem that we'd like to state is: if P factorizes over G,

and a d-separation property

holds in the graph, that is, X and Y are d-separated in the graph given Z,

so there's no active trail between them, then P satisfies the conditional

independence statement: X is independent of Y given Z.

So d-separation implies independence, if the probability distribution

factorizes over G. So, we're now going to prove the theorem,

not in its full glory, but by example,

because that example really does illustrate the main points of

the derivation. So, the assumption is that here is our

graph G, and here is the factorization of the distribution.

So, this is the factorization given by the chain rule for Bayesian networks,

a factorization that we've seen before.

And so now we'd like to prove that a d-separation statement follows

from this assumption. And the d-separation statement that

we'd like to prove follows, as an independence,

is the one that says that D is independent of S.

First, let's convince ourselves that D and S are in fact d-separated in this

graph. And so we see that there is only one

possible trail between D and S in this graph:

it goes through G. Given that G is not observed in this

case, and neither is L, the trail is not activated,

and so the two nodes are d-separated. And so we'd like to

prove that this independence holds. So what is the joint

probability distribution of D and S? Well, it's the sum over this joint

distribution over here, marginalizing out the three variables

that are not D and S, so G, L and I. So we have the sum over G, L and I of this big

product that's defined by the chain rule. And as we've previously seen, when we have

a summation like that, one of the things that we can do is push the

summations into the product, so long as we don't push them through terms that

involve the variable being summed over. So here, for example, we might have

a summation over L. And we can push in that summation because

the only term that involves L is the probability of L given G.

So if we push in the summations in this way we end up with the following

expression and we see that we have this summation over L here, a summation over G

and then finally the summation over I at the very beginning.

And what do we know about the summation over L of P of L given G?

We know that this is a probability distribution over L, and

therefore it necessarily sums to one, because

we're summing up over all of the possible values of L.

Once this term is one, we can look at the next term, which is this one.

And we can ask ourselves: this one cancels out, so what is the

sum over G of the probability of G given D and I? And that, too, is one.

And by the time we're done, we end up with the following expression.

And the important thing to notice about this expression is that this is phi one

of D and this is phi two of S. And therefore, because we

partitioned this joint distribution as a product of two factors, we end up with

something that corresponds directly to the definition of marginal independence,

therefore proving that P satisfies this independence statement.
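The same marginalization can be run numerically. The sketch below uses invented CPD values for the five variables of the student network (D, I, G, S, L) and checks that summing out G, L and I leaves P of D, S as a product of a factor in D and a factor in S, exactly as in the derivation.

```python
# Numeric version of the derivation: sum L, G and I out of the student
# network's joint and check that P(D, S) = phi1(D) * phi2(S).
# All CPD numbers below are made up purely for illustration.
p_d = [0.6, 0.4]                                             # P(d)
p_i = [0.7, 0.3]                                             # P(i)
p_g = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.9, 0.08, 0.02],
       (1, 0): [0.05, 0.25, 0.7], (1, 1): [0.5, 0.3, 0.2]}   # P(g | d, i)
p_s = {0: [0.95, 0.05], 1: [0.2, 0.8]}                       # P(s | i)
p_l = {0: [0.9, 0.1], 1: [0.6, 0.4], 2: [0.01, 0.99]}        # P(l | g)

# Marginalize g, l and i out of the chain-rule product.
p_ds = {}
for d in (0, 1):
    for s in (0, 1):
        p_ds[d, s] = sum(
            p_d[d] * p_i[i] * p_g[d, i][g] * p_s[i][s] * p_l[g][l]
            for i in (0, 1) for g in (0, 1, 2) for l in (0, 1))

# The result splits into phi1(d) = P(d) and phi2(s) = sum_i P(i) P(s | i),
# which is exactly the definition of marginal independence of D and S.
for (d, s), v in p_ds.items():
    phi1 = p_d[d]
    phi2 = sum(p_i[i] * p_s[i][s] for i in (0, 1))
    assert abs(v - phi1 * phi2) < 1e-12
```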

Having convinced ourselves that d-separation is an interesting notion,

let's look at some other d-separation statements that might hold in a graph.

So, one general statement that we can show is that any node in the graph is

d-separated from its non-descendants given its

parents. So let's look at the variable Letter.

And let's see what non-descendants it has that are not its

parents. It only has two descendants; let's mark

those in blue: this one and that one. So these

are the descendants. What about the non-descendants?

Well, here's one non-descendant, here's another, another and another.

So, those are the four non-descendants in the graph that are not parents.

So, now let's convince ourselves that Letter is d-separated from all of its

non-descendants given its parents. Let's take just an arbitrary one of

those, say SAT, and let's see if we can find a trail

from SAT to Letter that is active given the parents, which, in this case, is just

the variable Grade. Okay.

So what's one potential trail? We can go this way, up through I,

and then back down, through Grade, into Letter. But is that trail active?

No, it's not active, because Grade is observed, and so Grade blocks the trail;

in fact, it will block any trail that comes in from the top.

So any trail into Letter that comes in through its parent Grade is going to be

blocked by the fact that Grade is observed.

So that means we have to come in through the bottom.

So let's look at that. Let's look at the trail from SAT

through Job and up, again, into Letter. Well, is that

trail active? No. The reason it's not active is that

only Grade is observed, and Job, since it can't be a parent of

Letter, because it's a descendant, is necessarily not observed. And so neither

Job nor Happy is observed, and so this v-structure over

here cannot be activated, and so trails that come in from the bottom don't work

either. And so again, this is a proof by

demonstration, but

it shows the general property in a very simple way.
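This local Markov property is easy to check numerically as well. The sketch below, again with invented CPD values for the student network, builds the full joint and verifies that conditioning Letter on its non-descendant SAT, in addition to its parent Grade, changes nothing.

```python
from itertools import product

# Invented CPDs for the student network (D, I binary; G ternary; S, L binary).
p_d = [0.6, 0.4]
p_i = [0.7, 0.3]
p_g = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.9, 0.08, 0.02],
       (1, 0): [0.05, 0.25, 0.7], (1, 1): [0.5, 0.3, 0.2]}   # P(g | d, i)
p_s = {0: [0.95, 0.05], 1: [0.2, 0.8]}                       # P(s | i)
p_l = {0: [0.9, 0.1], 1: [0.6, 0.4], 2: [0.01, 0.99]}        # P(l | g)

# Full joint via the chain rule for Bayesian networks.
joint = {}
for d, i, g, s, l in product((0, 1), (0, 1), (0, 1, 2), (0, 1), (0, 1)):
    joint[d, i, g, s, l] = p_d[d] * p_i[i] * p_g[d, i][g] * p_s[i][s] * p_l[g][l]

def p_letter(g, s=None):
    """P(L = 1 | G = g), optionally also conditioning on S = s."""
    num = sum(v for (d, i, gg, ss, l), v in joint.items()
              if gg == g and l == 1 and (s is None or ss == s))
    den = sum(v for (d, i, gg, ss, l), v in joint.items()
              if gg == g and (s is None or ss == s))
    return num / den

# Conditioning on the non-descendant S changes nothing once G is given.
for g in (0, 1, 2):
    for s in (0, 1):
        assert abs(p_letter(g, s) - p_letter(g)) < 1e-12
```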

So that tells us, by the theorem that we proved on the previous slide, that if P

factorizes over G, then we know that any variable is independent

of its non-descendants given its parents, which is a kind of nice

intuition when you think about it. Because when we motivated the structure

of Bayesian networks, we basically said that what makes us pick the parents

for a node is that they are the only variables that this variable

depends on directly.

And this gives a formal semantics to that

intuition: we now have that a variable

depends only on its parents. So, now that we've defined a set of

independencies that hold for any distribution that factorizes over a

graph, we're now going to define a general notion, which is just a term for

this. So we defined d-separation in a graph G,

and we showed that when you have the d-separation property, P satisfies

the corresponding independence statement. And so now we can basically look at the

independencies that are implied by the graph G.

So I of G, which is the set of independencies that are implicit in a

graph G, is the set of all independence statements, X is independent of Y given Z,

that correspond to d-separation statements within the graph.

So this is the set of all independencies that are derived from d-separation

statements, and those are the ones that the graph implies hold for a distribution P,

as we just showed in the previous demonstration. So, now we can give this a name.

We're going to say that if P satisfies all the independencies of G, then G is an

I-map of P, and I-map stands for independency map.

And the reason it's an independency map is that, by looking at G, we get a map of

independencies that hold in P.

That is, it maps independencies that we

know to be true. So let's look at some examples of I-maps. So

let's look at two distributions. Here is one distribution, P1.


One graph is G1, which has D and I as separate variables,

and the other is G2, which has an edge between D and I.

So what independencies does G1 imply? What's I of G1?

Well, I of G1 is the independence statement that says D is

independent of I, because there's no edge between them.

What's I of G2? Well, G2 implies no independencies,

because D and I are connected by an edge, and there's no

other independence that can be implied, and so I of G2 is the empty set.

So what does that tell us? Let's look at the connection between these graphs and

these two distributions. Is G1 an I-map of P1? Well, if you look at P1, you can

convince yourself that P1 satisfies D is independent of I, so G1 is an I-map

of P1. What about G1: is G1 an I-map of P2?

Well, if we look at P2, we can, again, convince ourselves by examination that P2

doesn't satisfy D is independent of I.

So the answer is no: G1 is not an I-map of P2.

What about the other direction: what about G2?

Well, G2 makes no independence assumptions, and so

both P1 and P2 satisfy the empty set of independence assumptions.

As it happens, P1 also satisfies additional independence assumptions, but

the definition of I-map didn't require that a graph exactly

represent the independencies in a distribution, only that any independence

that's implied by the graph holds for the distribution. And so we have

that G2 is actually an I-map for both P1 and P2.
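The two I-map checks can be mirrored in code. The table values below are hypothetical stand-ins for the lecture's P1 and P2: P1 is constructed as a product of its marginals, while P2 is not.

```python
# Hypothetical stand-ins for the lecture's two distributions over (D, I):
# p1 is built as a product of its marginals, p2 is not.
p1 = {(0, 0): 0.42, (0, 1): 0.18, (1, 0): 0.28, (1, 1): 0.12}
p2 = {(0, 0): 0.50, (0, 1): 0.10, (1, 0): 0.10, (1, 1): 0.30}

def satisfies_d_indep_i(p):
    """Check the independence implied by G1: P(d, i) == P(d) * P(i)."""
    pd = {d: sum(v for (dd, i), v in p.items() if dd == d) for d in (0, 1)}
    pi = {i: sum(v for (d, ii), v in p.items() if ii == i) for i in (0, 1)}
    return all(abs(v - pd[d] * pi[i]) < 1e-12 for (d, i), v in p.items())

# G1 (no edge) implies D is independent of I: an I-map of p1 only.
assert satisfies_d_indep_i(p1) and not satisfies_d_indep_i(p2)
# G2 (edge between D and I) implies no independencies at all, so it is
# vacuously an I-map of both p1 and p2.
```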


So far, we derived independencies from knowing that P factorizes over G.

Interestingly, the converse of this theorem holds also.

So this version of the theorem says that if G is an I-map of P, then P factorizes

over G. So what does that mean?

It means that if the distribution satisfies the independence

assumptions that are implicit in the graph, then we can take that distribution

and represent it as a Bayesian network over that graph.

So once again, let's do a proof by example.

So now here we have the little graph that we're going to use.

And so now, what do we know? We're going to write down the probability,

the joint distribution over these five random variables, using this

expression, which is the chain rule for

probabilities, not the chain rule for Bayesian networks,

because we don't know yet that this is

representable as a Bayesian network. We're going to write the chain rule for

probabilities. And that has us writing it as P of D

times P of I given D, which, if you multiply the two together

using the definition of conditional probability.

These two, together, give us the probability of I, D.

If we multiply that by P of G given I and D, then we end up with the probability of G,

I, D. And so on, we can construct this

telescoping product. And altogether, by multiplying all

of these conditional probabilities,

we can reconstruct the original joint probability distribution.
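The telescoping step uses only the definition of conditional probability, so it holds for any joint distribution at all. The three-variable table below is made up and stands in for the five variables on the slide.

```python
# An arbitrary made-up joint over three binary variables, standing in for
# the five variables on the slide; no independence structure is assumed.
p = {(0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
     (1, 0, 0): 0.08, (1, 0, 1): 0.12, (1, 1, 0): 0.22, (1, 1, 1): 0.08}

def marg(keep):
    """Marginal of p over the variable positions listed in `keep`."""
    out = {}
    for k, v in p.items():
        key = tuple(k[j] for j in keep)
        out[key] = out.get(key, 0.0) + v
    return out

p_a, p_ab = marg([0]), marg([0, 1])

for (a, b, c), v in p.items():
    # P(a) * P(b | a) * P(c | a, b): each conditional's denominator cancels
    # the factor before it, so the product telescopes back to P(a, b, c).
    chain = p_a[a,] * (p_ab[a, b] / p_a[a,]) * (v / p_ab[a, b])
    assert abs(chain - v) < 1e-12
```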

Great, now what?

So we haven't used any notions of independence yet.

So where do we use that? Well, we're going to look at each one of

these terms. And we're going to see how we can

simplify it. So P of D is already as simple as it

gets. So now, let's look at the next term,

which is the probability of I given D. And let's look at this graph.

And we can see that D is a non-descendant of I,

and we're not conditioning on any parents, and so we have in this graph, as

we've already shown, that I is independent of D.

And so, using one of the definitions that we have of conditional

independence, we can erase D from the right-hand side of the conditioning bar

and just get P of I. The next term is already in the form that

we want is so we don't need to do anything with it but now let's look at

the probability of S given D, I and G. So here we have the probability of a

variable S given its parent I and two of its non-descendants, D and G.

And we know that S is independent of its non-descendants

given its parent I, and so we can erase both D and G, and so end up with the

probability of S given I. Finally, we have the last term,

probability of L, given D, I, G, and S. And here again, we have the probability

of a variable L given its parent G and a bunch of non-descendants, so we can,

again, erase the non-descendants. And so, all together, that gives us exactly

the expression that we wanted. And so now, again by example, we proved

that from these independence

statements we can derive the fact that P factorizes using the chain rule for

Bayesian networks. To summarize, we have provided two

equivalent views of the graph structure. The first is the graph as a

factorization, as a data structure that tells us how the distribution P can be

broken up into pieces. So, in this case, the fact that G allows

P to be represented as a set of factors or CPD's over the network.

The second is the I-map view: the fact that the

structure of G encodes independencies that necessarily hold in P.

Now what we've shown is that each of these actually implies the other.