0:00

Having defined a set of structure scoring criteria, we now turn to the question of how to optimize that structure score over a set of possible candidate structures. And we're first going to consider the

specific case of trees. So just as a reminder, we define a

scoring function that evaluates how well a structure matches the data.

So here is a set of different structures, and here is a graphical depiction of our

score. And we would like to search for a

structure that maximizes the score. So this is an optimization problem.

where the input is a training set, a scoring function, and possible

structures. And the output is a network.

And it turns out that the key property for computational efficiency is the decomposability of the score, which is the decomposition of the score as a sum of family scores, which we've seen before.

1:03

Now, the first problem that we're going to focus on is learning either a tree or a forest. So what is a forest? A forest has at most one parent per variable; so here, for example, is one forest. And notice that a forest doesn't have to be connected. That is, it can have separate disconnected components. A tree, on the other hand, would require

that the graph be connected so we might have this edge for example as well.

And notice that it's a directed tree. So why should we care about learning trees, given that they're so restricted in their expressive power? Well, there are several different reasons why we like trees.

First of all, it turns out the math is very elegant. Now, that might not be a convincing reason, but that elegant math gives rise to what turns out to be a very efficient optimization over the set of trees, one that allows me to solve the problem extremely effectively, even for very high-dimensional problems. And then, finally, the third reason

is that trees have limited expressive power, but that also implies that their parameterization is very sparse, and because their parameterization is sparse, they are less susceptible to overfitting. And therefore, at least in cases where M, the number of training instances, is small relative to the network complexity, it turns out that a tree can provide better generalization.

Even though it doesn't capture the full expressive power that we would hope to

have in our network. So, let's now talk about the problem of

learning forests. And we're going to define a little bit of

notation. We're going to define p(i), for variable i, to be the parent of Xi, or zero in case Xi has no parent. Note that each variable has at most one parent, so this is well-defined.
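As a concrete sketch (my own representation; the variables and edges are hypothetical), a forest can be stored as exactly this p(i) map:

```python
# p[i] is the parent of X_i, or 0 if X_i has no parent (a root).
# Hypothetical forest over five variables: X_1 -> X_2, X_1 -> X_3, X_4 -> X_5.
p = {1: 0, 2: 1, 3: 1, 4: 0, 5: 4}

# Each variable has at most one parent by construction (one dict entry each),
# and every nonzero parent must itself be a variable in the forest.
assert all(parent == 0 or parent in p for parent in p.values())

roots = sorted(i for i, parent in p.items() if parent == 0)
print(roots)  # roots of the disconnected components
```

Each root starts its own disconnected component, which is exactly why a forest need not be connected.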

So let's look at our decomposable score. And, again, this can be any of the three

scores that we talked about because they all have this property.

So the score is the sum of score of individual families Xi and their parents.

And now the parents are very simple because we have only at most, one parent

per variable. So we're going to divide this into two

cases. The first case is those that have a

parent. And the second one is those that don't.

3:46

And so, for the nodes that have a parent, what we have is the score of XI given its

one parent, and for the nodes that don't have a parent, it's just the score of XI

as a standalone variable. Now, we're going to shuffle things around a little bit, and we're going to do the following: we're going to subtract this expression, the score of Xi in isolation, and then we're going to add it back in over here. So now we have a sum over all variables, i equals one to n, of the score of Xi in isolation. But for the Xi that have a parent, I've subtracted that off in the first summation, and so I've compensated for it on both sides. The important property of this expression

is that this is the same for all trees. And so it doesn't affect the comparison

between one tree and the other. And so the only expression that

I need to consider when comparing two trees is this expression over here, which

is the sum over I of the score of XI with its parent minus the score of XI without

its parent. So you can think of this as the score of the empty network, and the first summation as the improvement that a tree gives me over the empty network.
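Written out explicitly (a reconstruction from the narration, using the p(i) notation defined above), the manipulation is:

```latex
\text{score}(G:\mathcal{D})
  = \sum_{i:\,p(i)>0} \text{score}(X_i \mid X_{p(i)})
  + \sum_{i:\,p(i)=0} \text{score}(X_i)
  = \underbrace{\sum_{i:\,p(i)>0}
      \bigl[\text{score}(X_i \mid X_{p(i)}) - \text{score}(X_i)\bigr]}_{\text{improvement over the empty network}}
  + \underbrace{\sum_{i=1}^{n} \text{score}(X_i)}_{\text{same for all trees}}
```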

6:51

For BIC or BDE, we know that there is a penalty score.

So, even if introducing an edge from X-j to X-I increases the likelihood, we might

end up paying for it more in the penalty score than what we gained.

And so the weight, this weight, can actually be negative.

At which point the optimal structure might actually be a forest.

That is, it might be detrimental to our overall score to make the graph connected, that is, a tree rather than a forest.

Okay. Now a second observation.

A score, remember, satisfies score equivalence, a property that we defined before: I-equivalent structures have the same score. And all of the scores that we talked about are score equivalent. It turns out that for score-equivalent scores, we can show that the weight from i to j is equal to the weight from j to i. That is, we can compute this expression over here for j to i, or for i to j, and we're going to get the exact same value. And you can see that in the likelihood score directly, because mutual information doesn't have any kind of directionality associated with it: the mutual information of Xi and Xj is the same as the mutual information of Xj and Xi. So for the likelihood score you can just see that, and it turns out that for the other cases it also follows.
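Since the argument leans on the symmetry of mutual information, here is a small sketch (pure Python, with a helper name of my own) checking numerically that I(X;Y) = I(Y;X) on an empirical joint distribution:

```python
from collections import Counter
from math import log

def mutual_information(pairs):
    """Empirical mutual information I(X;Y) from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)                  # joint counts over (x, y)
    px = Counter(x for x, _ in pairs)       # marginal counts over x
    py = Counter(y for _, y in pairs)       # marginal counts over y
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

samples = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (0, 0), (1, 1)]
forward = mutual_information(samples)
swapped = mutual_information([(y, x) for x, y in samples])
assert abs(forward - swapped) < 1e-12  # I(X;Y) == I(Y;X): no directionality
```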

And so that means that we can define an undirected graph, where it doesn't really matter whether I go from i to j or from j to i, because the weight is the same for both. And so, for score-equivalent scores, we therefore get the following algorithm.

We can define an undirected graph whose nodes are the variables in my graphical model. And then I define a weight, w(i,j), to be the max of the edge score between Xi and Xj, and zero; and remember that it doesn't matter whether I do (i,j) or (j,i) here, we're going to get the exact same thing for score-equivalent scores. The zero is going to be a way of eliminating negative edges, so that I can optimize things efficiently without having to worry about negative edge costs. Now I can go ahead and find a forest or a

tree that has the maximal weight. And the nice thing is, it turns out that you can use any of the standard algorithms for maximum-weight spanning trees in order to do this. So, for example, either Prim's or Kruskal's algorithm can be used to solve this, and we can in fact find the solution in time that is O of n squared, which is an unavoidable cost, given that I'm considering all n squared combinations of pairs of edges.

And then, if I end up having edges of weight zero, that indicates that their potential contribution was derived from a negative component of the score. So I remove all of the edges whose weight is zero to produce a forest. And that is an O of n squared time algorithm for finding the absolutely optimal tree, or forest, relative to any of these three scores that we've defined.
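The whole procedure just described can be sketched in a few lines (a sketch only; `edge_scores` stands for the precomputed weight differences score(Xi | Xj) − score(Xi), the numbers in the toy example are made up, and the union-find helper is my own):

```python
def max_weight_forest(n, edge_scores):
    """Learn a maximum-weight forest over n variables:
    clip negative edge scores to zero (drop them), then greedily add the
    heaviest edges that don't create a cycle (Kruskal's algorithm)."""
    # Edges with negative weight can never help the total score.
    edges = [(w, i, j) for (i, j), w in edge_scores.items() if w > 0]
    edges.sort(reverse=True)              # heaviest first

    parent = list(range(n))               # union-find for cycle detection
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    forest = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                      # adding (i, j) keeps the graph acyclic
            parent[ri] = rj
            forest.append((i, j))
    return forest

# Toy example with 4 variables; edge (2, 3) has negative weight, so it is
# dropped and variable 3 remains a disconnected root (a forest, not a tree).
scores = {(0, 1): 2.5, (1, 2): 1.0, (0, 2): 0.8, (2, 3): -0.4}
print(max_weight_forest(4, scores))
```

The directions on the chosen edges can then be assigned afterwards by picking a root per component and orienting edges away from it; as the transcript notes, the optimization itself doesn't determine them.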

So let's look at an example. This is the ICU alarm network again; this is a picture of the original network, and if we apply this tree-learning algorithm, we end up with the following structure.

Where the edges that have been marked in red are edges that existed in the

original network and the blue ones are edges that are spurious, that is, they

weren't in the original network. So this blue one over here, for example,

isn't in the original network. And so we see that although most of the edges that we end up with are edges in the original network, some aren't; some are derived from correlations that went through indirect paths in the original network. And the other thing that's important to

see, looking at this, is that the inferred edges in the tree are intrinsically undirected. It doesn't mean that I can't force a direction on them, but the direction isn't determined by my optimization algorithm, and that means that I'm not able to determine what came before what in the original graph when I'm learning a tree. So, to summarize: structure learning is

an optimization problem over the combinatorial space of graphs. The decomposability property allows me to break that up as a sum of terms for different families, and that allows me to optimize, in the specific case of tree-structured networks, using standard maximum-weight spanning tree algorithms, very efficiently, in quadratic time.