0:01
We've previously shown that the graph structure encodes a set of independencies
and that that set of independencies then necessarily holds for every distribution
that can be encoded as a Bayesian network over that graph.
What we want to talk about now is the question of how to take a distribution
that satisfies a certain set of independencies and encode those independencies within
a graph's structure. How well can we take that distribution
and capture its independencies in the context of a graphical model?
So first, let's understand what the independencies of a distribution are.
So, we're going to define this notion of I of P for a distribution P
as the set of all independence statements, X is independent of Y given Z,
that hold for the distribution P. So, one can write down a potentially
exponentially large set of independence statements, and each of those is going to
be either true in a given distribution P or not.
And what we're going to do is define I of P to be the ones that are
true in that distribution. So, these are the independencies that hold in P.
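As an aside, here is a small sketch of how one could compute I of P by brute force for a small discrete distribution, checking every candidate statement X independent of Y given Z numerically. The representation (a dictionary from full assignments to probabilities) and the helper names are illustrative choices, not part of the lecture.

```python
import itertools
from collections import defaultdict

def marginal(joint, vars_):
    """Marginalize a joint (dict: full assignment tuple -> prob) onto the given variable indices."""
    out = defaultdict(float)
    for assn, p in joint.items():
        out[tuple(assn[v] for v in vars_)] += p
    return out

def cond_indep(joint, x, y, z, tol=1e-9):
    """Check (X _|_ Y | Z) by testing P(x,y,z) * P(z) == P(x,z) * P(y,z) for all values."""
    pxyz = marginal(joint, (x, y) + z)
    pxz, pyz, pz = marginal(joint, (x,) + z), marginal(joint, (y,) + z), marginal(joint, z)
    for (vx, *vzx) in pxz:
        for (vy, *vzy) in pyz:
            if vzx != vzy:
                continue
            vz = tuple(vzx)
            lhs = pxyz.get((vx, vy) + vz, 0.0) * pz.get(vz, 0.0)
            rhs = pxz[(vx,) + vz] * pyz[(vy,) + vz]
            if abs(lhs - rhs) > tol:
                return False
    return True

def independencies(joint, names):
    """Brute-force I(P): every statement (X _|_ Y | Z) that holds in the distribution.
    Exponential in the number of variables, so purely a conceptual sketch."""
    n, stmts = len(names), []
    for x, y in itertools.combinations(range(n), 2):
        others = [k for k in range(n) if k not in (x, y)]
        for r in range(len(others) + 1):
            for z in itertools.combinations(others, r):
                if cond_indep(joint, x, y, z):
                    stmts.append((names[x], names[y], tuple(names[k] for k in z)))
    return stmts

# Tiny example: two independent fair coins, so I(P) contains X _|_ Y (given nothing).
joint = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(independencies(joint, ["X", "Y"]))   # [('X', 'Y', ())]
```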
And we've already seen that if P
factorizes over a particular graph G then G is an I-map for P.
Which means that every independence that holds in G, that is, every independence
implied by the d-separation properties of G,
also holds in P. But that doesn't mean that the converse
holds. That is, there can be independencies
that hold in P, that are members of I of P, that are not encoded by the graph
structure. So,
why does that matter? Because if we have a graph that doesn't
capture some of the independencies in P, then it's unnecessarily complicated.
And conversely, the more independencies the graph encodes,
the sparser it is, and therefore, the fewer parameters we need to elicit
or learn. Fewer edges also mean that inference is
more efficient. And finally, it means that the graph is
intrinsically more informative about the properties of P.
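To make the point about parameters concrete, here is a tiny sketch (with illustrative graphs, not from the lecture) that counts the independent parameters of a table-CPD Bayesian network over binary variables: each node contributes one free number per joint assignment of its parents, so sparser graphs need far fewer parameters.

```python
def num_parameters(dag, card=2):
    """Independent parameters of a table-CPD Bayesian network whose variables all have cardinality `card`.
    `dag` maps each node to its set of parents."""
    total = 0
    for node, parents in dag.items():
        total += (card - 1) * card ** len(parents)
    return total

# Fully connected vs. sparse graph over four binary variables (parent sets are illustrative):
full   = {"A": set(), "B": {"A"}, "C": {"A", "B"}, "D": {"A", "B", "C"}}
sparse = {"A": set(), "B": {"A"}, "C": {"B"},      "D": {"C"}}

print(num_parameters(full))    # 1 + 2 + 4 + 8 = 15
print(num_parameters(sparse))  # 1 + 2 + 2 + 2 = 7
```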
So, we would like graphs that capture as many of the independence
properties of P as possible. So, what do we want in terms of sparsity?
Something that is fairly basic that we might choose to require is what's called
a minimal I-map. That is, we want an I-map for the
distribution P that at least doesn't have redundant edges.
So, for example, if we have a graph with an edge from X to Y, where Y really doesn't depend on X in
the sense that P of Y given X = 0 in the CPD is equal to P of Y given X = 1,
then we could remove this edge and still have something that was an I-map for this
distribution. And so, this edge is now redundant.
And so, from our definition, that would mean that this graph is not a minimal I-map,
because it is an I-map from which an edge can be removed.
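As a quick illustration (with made-up CPD numbers, not from the lecture), detecting this kind of redundant edge amounts to checking whether the CPD rows for Y are identical across the values of X:

```python
import numpy as np

# Hypothetical CPD P(Y | X) as a table: row i holds the distribution of Y when X = i.
# The numbers are made up; the point is that identical rows mean Y does not depend on X.
cpd_y_given_x = np.array([[0.7, 0.3],    # P(Y | X = 0)
                          [0.7, 0.3]])   # P(Y | X = 1)

edge_is_redundant = np.allclose(cpd_y_given_x[0], cpd_y_given_x[1])
print(edge_is_redundant)   # True: removing the edge X -> Y still leaves an I-map
```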
So, that might seem to be a reasonable strategy, but it turns out that it's not
sufficient in terms of ensuring that our graph is as sparse as it might be.
So, to understand why that's the case, let's look at a distribution such as the one
that corresponds to this example that we've seen before, where the student's grade G
depends on these two variables, I and D. That, in fact, is a minimal I-map for
a distribution where D and I are independent and G depends on both of
them, but it turns out this is not the only minimal I-map for this distribution.
A different minimal I-map is this one, where we have an edge that goes from D to
I, from D to G, and from G to I. Let's convince ourselves that, in fact,
this is a minimal I-map. Let's try and remove any one of those
edges, and see if we still get an I-map for the original distribution.
So if, for example, we try and remove this edge, that would
correspond to the statement that D is independent of G,
which is certainly not the case in our original distribution.
So, that one doesn't work. What about the second one?
If we try and remove the second edge, from G to I, that would correspond to the
statement that G is independent of I given D, which is also
not the case in our original distribution.
5:37
What about the final edge? The one that doesn't correspond to an
edge in our original network, this one. If we remove that one,
that one seems the most promising of the lot, but if you remove it, it would
correspond to the assumption that D is independent of I given G.
And that is exactly the dependence that is induced by intercausal
reasoning that we've seen before. So, in fact, this is a minimal I-map,
none of the edges can be removed. There is a better minimal I-map but this
is a minimal I-map. So, minimal I-maps are not necessarily
the best tools for capturing structure in a distribution.
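As an aside, here is a small numerical sketch of this example, with made-up CPD values: D and I are sampled independently, and G depends on both. It checks the one independence the original graph captures, D independent of I, and the three statements that removing each edge of the second I-map would require; none of those three hold, which is why all three edges are needed.

```python
import numpy as np

# Hypothetical student distribution: D and I are independent, and G depends on both.
# The CPD numbers are made up purely for illustration.
p_d = np.array([0.6, 0.4])                     # P(D)
p_i = np.array([0.7, 0.3])                     # P(I)
p_g = np.array([[[0.9, 0.1],                   # P(G | D=0, I=0)
                 [0.4, 0.6]],                  # P(G | D=0, I=1)
                [[0.6, 0.4],                   # P(G | D=1, I=0)
                 [0.1, 0.9]]])                 # P(G | D=1, I=1)
joint = p_d[:, None, None] * p_i[None, :, None] * p_g   # joint[d, i, g]

def indep(p_xy):
    """Is the 2-D table p_xy the product of its marginals, i.e. are the two variables independent?"""
    return np.allclose(p_xy, np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0)))

def indep_given(joint, cond_axis):
    """Are the two remaining variables independent given the variable on cond_axis?"""
    ok = True
    for v in range(joint.shape[cond_axis]):
        slice_ = np.take(joint, v, axis=cond_axis)
        if slice_.sum() > 0:
            ok = ok and indep(slice_ / slice_.sum())
    return ok

print(indep(joint.sum(axis=2)))         # D _|_ I       -> True  (the original graph's independence)
print(indep(joint.sum(axis=1)))         # D _|_ G?      -> False (so the edge D -> G is needed)
print(indep_given(joint, cond_axis=0))  # G _|_ I | D?  -> False (so the edge G -> I is needed)
print(indep_given(joint, cond_axis=2))  # D _|_ I | G?  -> False (so the edge D -> I is needed)
```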
What we'd really like is a perfect map. A perfect map is one where the
independencies in G exactly correspond to the independencies in P, so that G
perfectly captures the independencies in P.
And sure enough, if we can get that, that would be ideal.
Unfortunately, perfect maps are often hard to come by.
So, here's an example of one scenario that doesn't have a perfect map.
This is a distribution P that is actually represented by the pairwise Markov
network that we've seen before, one where we have pairwise
interactions between A and B, B and C, C and D, and A and D. So, we know that this distribution
satisfies certain independencies because of the Markov network properties.
Specifically, we know that A is independent of C given B and D, because B
and D separate A and C in the graph. And at the same time, we know that B is
independent of D given A and C. So, let's imagine that we
have a distribution P that satisfies these independencies.
So, P satisfies this pair of independencies.
And now, let's try and encode those independencies using a Bayesian network
as a perfect map.
One possible attempt would be to just try and direct the edges, say, this way.
9:52
The resulting graph captures one of our two independencies, and that's it, in terms of independencies.
And so, this, in fact, is an I-map for the distribution because in this case, I
of G is a subset of I of P, which is this set over here.
But it doesn't capture all of the independencies, it only captures one out
of the two. So, this is a distribution that does not have a perfect map as a
Bayesian network.
Let's provide another example of a distribution without a perfect map.
In fact, this is a distribution that doesn't have a
perfect map as either a Bayesian network or, in fact, as a Markov network.
And this is the famous example of the XOR, which is, as we'll see, a
counterexample to many, many things. So here, we have two random variables X1
and X2, which we're assuming are binary valued, and say
each of them takes the value 0 or 1 with probability 50/50.
Y, on the other hand, is the exclusive OR of X1 and X2,
which means that Y is equal to one if and only if exactly one of X1 or X2 is
equal to one, okay?
So, let's look at what this probability distribution P looks like.
Here it is. We have four possible
configurations that have nonzero probability,
and each of them has equal probability of 0.25.
So, X1 and X2 can each be either zero or one with probability 50/50,
and here's the value of Y, over here. But now, let's
think about what independence statements are true for this distribution.
So, most obviously looking at this graph, we see that X1 is marginally independent
of X2. But if you cover up this part
of the image and just look at the right-hand side, you'll see that
X1, X2, and Y are all symmetrical in terms of their structure.
And so, it's not difficult to verify that X1, in fact, is
independent of Y, and that X2 is independent of Y.
So, all three of these pairwise independencies hold in this distribution.
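Since this is not difficult to verify, here is a quick sketch that does so numerically: it builds the XOR joint, confirms the three pairwise independencies, and also confirms that X1 and X2 stop being independent once we condition on Y. The helper names are illustrative, not part of the lecture.

```python
import numpy as np

# The XOR distribution from the example: X1, X2 are fair coins and Y = X1 XOR X2.
# joint[x1, x2, y] has four 0.25 entries and zeros elsewhere.
joint = np.zeros((2, 2, 2))
for x1 in (0, 1):
    for x2 in (0, 1):
        joint[x1, x2, x1 ^ x2] = 0.25

def pairwise_indep(joint, keep_axes):
    """Marginalize onto two axes and test whether that pair of variables is independent."""
    drop = tuple(a for a in range(3) if a not in keep_axes)
    p = joint.sum(axis=drop)
    return np.allclose(p, np.outer(p.sum(axis=1), p.sum(axis=0)))

print(pairwise_indep(joint, (0, 1)))   # X1 _|_ X2   -> True
print(pairwise_indep(joint, (0, 2)))   # X1 _|_ Y    -> True
print(pairwise_indep(joint, (1, 2)))   # X2 _|_ Y    -> True

# But conditioning on Y couples X1 and X2 (here they become deterministically related):
p_x1x2_given_y0 = joint[:, :, 0] / joint[:, :, 0].sum()
print(np.allclose(p_x1x2_given_y0,
                  np.outer(p_x1x2_given_y0.sum(axis=1),
                           p_x1x2_given_y0.sum(axis=0))))   # False
```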
And so, the graph on the left is not, in fact, a perfect map for this
distribution, because there are independencies that hold in P that are not visible in the
graph. And, in fact, you can arrange the edges
in this graph in any which way, but you cannot get all three of these
independencies captured at the same time, because the only way to do that would be
to have X1, X2, and Y be completely disconnected and, of course, that wouldn't
be an I-map for the distribution. Now, we've talked about Bayes nets as a
perfect map. What about Markov networks as a perfect
map? The definition here is the same, except
that we've replaced G by H. So, a Markov network H is a perfect map if the
independencies induced by H exactly correspond to the independencies
in P. And can we represent all possible distributions in terms of a Markov
network as a perfect map? So, I'm sure you're all expecting at this
point for the answer to be no and sure enough, that's true.
So here is an example of a distribution that has a perfect map as
a Bayesian network, but not as a Markov network.
This is the famous V-structure example, in this case the difficulty,
intelligence, grade structure. And let's think about what we
would need, in terms of edges, for a Markov network to be an I-map of this
distribution. So clearly, we need to introduce an edge
between D and G and between I and G, because it's certainly not the case that
D and G are independent given I, or that I and G are independent given D.
And so, this would be the obvious I-map for this distribution.
But if we choose this as a candidate I-map, it would imply,
among other things, that D is independent of I given G.
And that, of course, is exactly wrong, because we know that when we condition on
G, D and I become dependent.
And so, the only I-map for this distribution is one that has all three
edges, and that loses the independence that we had over here, which is that D is
marginally independent of I. So once again, there's no perfect map for
this distribution as a Markov network. The final question that we might ask
ourselves is the extent to which a representation of a distribution is, in
fact, unique. And so, say that we can represent the
distribution using some graph, say, as a perfect map,
is that a unique representation? So, to understand that, let's look first
at the simplest example, the one where we have two variables X and Y. And here, we
have one graph, G1, that captures, in this case, no independence assumptions, so I of
G1 is equal to the empty set.
And here is G2; it looks the same except that the one edge
is inverted, and once again, this is a different graph that has exactly the same
set of independencies. So, we can see that here, we have two
distinct graphs that have the exact same set of independence assumptions.
And because of that, they can represent
16:07
exactly the same set of distributions. Which, in this case, is all distributions
over the variables X and Y. Now, one might think of this as a
degenerate example because it's a fully connected graph, but it turns out that's
not the case. There are many other situations where two
distinct graphs with edges going in different directions represent the exact
same set of independence assumptions.
So, for example, when we look at graphs over three random variables, it turns out
that of these four scenarios, one of them represents a different set of
independence assumptions than all the others, but all of the others are the
same. So, which is the odd one out?
Which of the following graphs does not encode the same independence assumptions
as the others? As I'm sure most of you realized, the
answer here is the V structure which is one that we've seen before.
And so, this one has the independence assumption that X is independent of Z
marginally, whereas these other three have the independence assumption that X is
independent of Z given Y. So, these three graphs,
again, represent the same set of independence assumptions, which is this
set over here. And furthermore, any
probability distribution that can be represented by any one of these graphs can
also be represented by the others.
So, the formal term for this notion is what's
called I-equivalence, where two graphs G1 and G2 over the same
space are said to be I-equivalent if they encode the exact same set of independence
assumptions. And in the previous example, we saw, for
example, that the chain X to Y to Z is I-equivalent to, say,
Y being a parent of both X and Z, and to this third one over here,
whereas the V-structure is not I-equivalent to them.
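As an aside, there is a standard graph-level test for I-equivalence, due to Verma and Pearl: two DAGs are I-equivalent exactly when they have the same skeleton and the same immoralities, that is, the same v-structures whose two parents are not directly connected. Here is a small sketch of that test, with the three-variable graphs from the example written as illustrative parent dictionaries (the function names are made up):

```python
from itertools import combinations

def skeleton(dag):
    """Undirected edge set of a DAG given as {node: set of parents}."""
    return {frozenset((p, c)) for c, parents in dag.items() for p in parents}

def immoralities(dag):
    """V-structures X -> Z <- Y whose parents X and Y are not themselves adjacent."""
    skel = skeleton(dag)
    return {(frozenset((x, y)), z)
            for z, parents in dag.items()
            for x, y in combinations(sorted(parents), 2)
            if frozenset((x, y)) not in skel}

def i_equivalent(g1, g2):
    """Verma-Pearl criterion: same skeleton and same immoralities."""
    return skeleton(g1) == skeleton(g2) and immoralities(g1) == immoralities(g2)

# The three-variable graphs from the example, written as parent dictionaries:
chain       = {"X": set(), "Y": {"X"}, "Z": {"Y"}}        # X -> Y -> Z
rev_chain   = {"Z": set(), "Y": {"Z"}, "X": {"Y"}}        # X <- Y <- Z
common_par  = {"Y": set(), "X": {"Y"}, "Z": {"Y"}}        # X <- Y -> Z
v_structure = {"X": set(), "Z": set(), "Y": {"X", "Z"}}   # X -> Y <- Z

print(i_equivalent(chain, rev_chain))     # True
print(i_equivalent(chain, common_par))    # True
print(i_equivalent(chain, v_structure))   # False: the V-structure is the odd one out
```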
Now, why is I-equivalence an important notion?
It's important because it tells us that there are certain aspects of the graphical
model that are unidentifiable. Which means that if we end up, for
whatever reason, thinking that, say, this is the graph that represents our
probability distribution, it could just as easily be this one or that one.
So, without prior knowledge of some kind or another, for example, that we prefer X
to be a parent of Y, there's no way for us to select among these different
choices. And it turns out that this is not an
exception, rather it's the rule.
Most graphs have a large number of I-equivalent variants.
And that turns out to be a complicating factor,
especially when we get to learning models from data.
So, in summary, we prefer to have graphs that capture as much of the structure in
I of P as possible, because they're more compact and,
therefore, easier to learn and more efficient to do inference with.
They also provide us with more insight about the distribution.
We talked about the minimal I-map as one option for sparse graphs, and we showed
that that's not a particularly good option because it might fail to capture
structure even if it's present in the distribution and even if it's
representable as a Bayes net or as a graphical model.