0:00
We talked about how the problem of learning a general Bayes net structure can be viewed as a combinatorial optimization problem in which we are trying to find a high-scoring network structure, and how that problem is usually solved by some kind of heuristic search over the space of Bayes net structures. Let's talk now about the computational cost of this algorithm, and how we can use the property of decomposability, which we also previously used in the context of tree learning, to considerably reduce that cost.
So as a reminder, the heuristic search procedure that we discussed iteratively moves over the space of networks. If this is the network that we're currently at, we need to evaluate several different moves away from that network, which are usually local moves obtained by adding, deleting, or reversing an arc. We then score each of these subsequent networks, see which of those has the highest score, and take that move. One can also have other algorithms that don't necessarily make the greedy move at each point, but the basic idea is the same. So what is the computational cost of this algorithm? Let's do the naive computational analysis: what we would get if we were to do the naive implementation of this.
So, how many operators are there in each search step that we need to evaluate? In a Bayesian network with n nodes, we have n(n - 1) different possible directed edges. Now, each of those edges is either present in the graph or not. If the edge is present in the graph, we can either delete it or reverse it, and if it's absent, we can add it, which means that effectively we have either two or one possibilities for each of those n(n - 1) possible edges. And so, we have O(n^2) possible operators that we need to evaluate at each step of the algorithm.
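As a sketch of where the O(n^2) operator count comes from, here is a minimal enumeration of the local moves; the function name and edge representation (a set of ordered pairs) are my own, not from the lecture:

```python
from itertools import permutations

def candidate_operators(nodes, edges):
    """Enumerate the local moves from the current graph: for each ordered
    pair (u, v), propose 'add' if the edge is absent, or 'delete' and
    'reverse' if it is present."""
    ops = []
    for u, v in permutations(nodes, 2):   # the n(n - 1) ordered pairs
        if (u, v) in edges:
            ops.append(('delete', u, v))
            ops.append(('reverse', u, v))
        else:
            ops.append(('add', u, v))
    return ops

# Graph A -> C <- B: 6 ordered pairs, each yielding 1 or 2 operators.
ops = candidate_operators(['A', 'B', 'C'], {('A', 'C'), ('B', 'C')})
```

Each of the n(n - 1) ordered pairs contributes at most two operators, which is where the O(n^2) bound comes from.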
What is the cost of evaluating each of the candidate successors that we would get by taking one of these operators? Reminding ourselves that, because of the decomposability property, there are multiple components in the score, one for each variable in the network: we have n different components, and for each of those, we have to look at the sufficient statistics and compute the resulting entry in the score corresponding to that variable. Computing sufficient statistics requires a traversal over the training data, which takes O(M) time, where M is the number of training instances. So altogether, this step requires O(n * M).
Now, we haven't talked about this, but one also needs to make sure that the graph resulting from this operator is in fact acyclic, and so we need to do an acyclicity check. This is something that, in general, requires O(m) time, where little m is the number of edges in the graph.
So, altogether, if we sum up all of these different pieces of the cost, we end up with a computational cost of O(n^2 * (nM + m)) per search step, where by and large little m is dwarfed by n times M. Now, even for a network that's not necessarily that large, say 50 or 100 variables, so that n is 50 to 100, with 10,000 training instances, this can get really, really large, and generally intractable in a lot of situations.
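To make the acyclicity check concrete, here is a minimal sketch (the function name is my own): adding the edge u -> v creates a cycle exactly when u is already reachable from v, which a DFS over the existing edges detects in O(n + m) time.

```python
def creates_cycle(nodes, edges, new_edge):
    """Return True if adding new_edge = (u, v) would create a directed
    cycle, i.e. if u is already reachable from v along existing edges."""
    u, v = new_edge
    adj = {x: [] for x in nodes}
    for a, b in edges:
        adj[a].append(b)
    stack, seen = [v], set()
    while stack:                          # DFS outward from v
        x = stack.pop()
        if x == u:
            return True                   # found a path v ~> u
        if x not in seen:
            seen.add(x)
            stack.extend(adj[x])
    return False

# With A -> C -> D present, adding D -> A would close the cycle A -> C -> D -> A.
cyc = creates_cycle('ABCD', {('A', 'C'), ('C', 'D')}, ('D', 'A'))
```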
So how can we improve on this? Let's see how we can exploit the decomposability property to get significant computational savings in the search process. Let's first look at a single local operator, such as the one where we add an edge between B and D in this network, where such an edge did not exist before. And let's consider the score that we had for the original network, which in this case, because of the decomposability property, is the sum of family scores for the individual variables. So we have a component that is the score of A relative to its empty set of parents, the score of B relative to the empty set, C relative to its parents A and B, and D relative to its parent C. Let's compare that to the score of the network following the move, and we can see that the score for the new network is actually very similar to the score of the original network, because, by the same decomposability property, we can break up the score into these little components, and since most families haven't changed, those components of the score are going to be the same. Specifically, we're going to have the exact same score for A relative to its empty parent set, the same score for B, the same score for C; only this last score for D is now going to be different, because we've modified the family of D.
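As a toy illustration of decomposability (the family score below is a deliberately trivial stand-in that just penalizes parent-set size, not a real BIC or BDe score, and the function names are my own):

```python
def family_score(var, parents, data=None):
    # Trivial stand-in for a real family score (BIC, BDe, ...): it just
    # penalizes larger parent sets so the example is runnable end to end.
    return -len(parents)

def network_score(families, data=None):
    """Decomposability: the network score is a sum of per-family terms."""
    return sum(family_score(v, ps, data) for v, ps in families.items())

before = {'A': set(), 'B': set(), 'C': {'A', 'B'}, 'D': {'C'}}
after  = {'A': set(), 'B': set(), 'C': {'A', 'B'}, 'D': {'B', 'C'}}
# Only D's family changed, so only D's term in the sum differs.
```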
But that immediately suggests that we don't really need to recompute these earlier components of the score, because they haven't changed; we only need to compute the last component, the one corresponding to D. In fact, we can do better than that: instead of considering the absolute score, we're going to compute what's called the delta score, which is the difference between the score of the network following the change and the score of the original network. As we can see, that difference is just the difference between the scores of the two families for D: the {B, C} family in the new network versus the {C} family in the original network. And that delta score is something we can compute by looking at just a single family, ignoring the rest of the network.
The same thing happens for other local operators. For example, if we consider the deletion operator that deletes the edge between B and C, and we look at the delta score, that delta score only cares about the family of C, because that's the only family to have changed. It is, again, a difference: the score of C with the single parent A minus the score of C with the family {A, B}. So again, only one family gets affected by this change.
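A sketch of these delta scores, assuming some family_score(var, parents) function (again a toy stand-in, with hypothetical names): deletion touches one family, reversal touches two.

```python
def family_score(var, parents):
    return -len(parents)              # toy stand-in for a real score

def delta_delete(child, pa_child, removed):
    # Deleting removed -> child changes only the family of child.
    return family_score(child, pa_child - {removed}) - family_score(child, pa_child)

def delta_reverse(u, v, pa_u, pa_v):
    # Reversing u -> v changes two families: v loses parent u, u gains parent v.
    return ((family_score(v, pa_v - {u}) - family_score(v, pa_v))
            + (family_score(u, pa_u | {v}) - family_score(u, pa_u)))
```

For the deletion of B -> C above, delta_delete('C', {'A', 'B'}, 'B') compares the score of C given A against the score of C given A, B, never touching the rest of the network.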
For an edge reversal, the situation is a little bit more complicated, because we can see that by flipping the edge from B to C and making it go from C to B, there are actually two families that are affected: the B family and the C family. But that's still just two, as opposed to the entire network, and so the delta score is going to have two components: one is the delta score for the C family, and the second is the delta score for the B family. In either case, we only end up affecting one family, in the case of edge addition or edge deletion, or at most two families, in the case of edge reversal. And so that means that we only have to consider a much smaller fraction of the network when computing the delta score.
A second use of the decomposability property comes up when we consider multiple consecutive steps of the search algorithm. So let's imagine that in the previous step, we decided to take the step that added the edge from B to D, and now we have to consider the next step in the search. Well, what are the operators that are conceivable in this next step? For example, one such operator is to delete the edge from B to C. Notice that this edge-deletion operation is in fact an operator that we also considered in the previous step of the search, before we decided to add the edge from B to D. Now, why is the delta score of this operator the same as before? Notice that the family of C hasn't changed between those two cases: in both of them, when we're considering the move, C has the parents A and B. And so the delta score for that particular move is not going to change either. Specifically, if in this case the delta score is the score of C given the family A minus the score of C given the family A, B, we had exactly that same delta score in the previous step. That is, these two delta scores, in the previous step and the new step, are exactly the same, and so there's really no point in recomputing it.
So which scores do we need to recompute? The only scores that we need to recompute are the ones that were affected by the step that we just took. Specifically, if we took a step that modified the family of D, then any step that involves an additional change to the family of D will have a different delta score, because the family entering the comparison is now different. However, moves involving families that were not affected by the last move don't need to be rescored. So to summarize, we only need to recompute delta scores for families that changed in the last move.
So let's summarize the computational cost of this procedure. What's the cost per move?
Having decided to take a move, only one or two families were affected by that move, which means that only O(n) delta scores need to be recomputed, because for a given family, there are only n possible edges that impinge on that family. So only O(n) delta scores need to be computed, and for each of those, we need to compute sufficient statistics, which takes O(M) time. So altogether, we have O(n * M) as the cost of this step, which is a factor of n^2 better than the O(n^3 * M) that we had for the original naive computation.
Now, this tells us the cost after we pick a move. What about the cost of picking a move? Naively, we might say that there are O(n^2) possible operators that we can consider in any given move, and so we need to evaluate each of them and pick the best one. But, in fact, we can do considerably better by using clever data structures. Specifically, we can maintain a priority queue of operators sorted by their delta scores. When we rescore those O(n) operators in the step over here, we need to modify the scores of those operators and reinsert them into the priority queue in their appropriate place. That's a computation that requires O(n log n) time, because there are only O(n) of them and each reinsertion costs O(log n). And once we have done that, the best operator will be at the top of the queue, which we can then identify and take in constant time. And so this priority queue takes us from the O(n^2) time of traversing the whole set of operators to something that requires only O(n log n).
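One way to sketch this priority queue in code, using Python's heapq (a min-heap, so delta scores are negated); this is a hypothetical implementation, and the lazy skipping of stale entries when an operator gets rescored is one standard way to handle reinsertion:

```python
import heapq

heap, current = [], {}                   # current maps operator -> latest delta

def push(op, delta):
    """(Re)score an operator; any older heap entry for op becomes stale."""
    current[op] = delta
    heapq.heappush(heap, (-delta, op))   # O(log n) per reinsertion

def pop_best():
    """Pop the highest-delta operator, skipping stale entries lazily."""
    while heap:
        neg, op = heapq.heappop(heap)
        if current.get(op) == -neg:      # entry still current?
            del current[op]
            return op, -neg
    return None

push(('add', 'B', 'D'), 4.0)
push(('delete', 'B', 'C'), 1.5)
push(('add', 'B', 'D'), 2.0)             # rescored: the 4.0 entry is now stale
```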
And so, altogether, we've actually saved a considerable cost in both of these steps.
It turns out that one can get an even higher level of computational efficiency based on a different observation. In most network learning algorithms, the plausible families, the ones that have some chance of being selected, are variations on a theme. That is, for a given variable A, there might be some number of variables B1, B2, up to Bk, for a very small k, that are reasonable to be selected as parents for A.
And so, how do we exploit that property to get computational savings? It turns out there are two different ways one can utilize this. The first is that, because we have the same set of variables constantly being considered as candidate families, we can take sufficient statistics that we computed in one step and reuse them in a different step if we cache them, because we're likely to encounter the same family more than once. We might encounter B as a possible parent of A, and A as a possible parent of B, and for both of these, the sufficient statistics over A, B are going to be utilized. And so, if we compute these sufficient statistics once and then cache them, we don't have to recompute them again.
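A minimal sketch of such a cache, keyed by the unordered set of variables so that the A, B counts are shared between the two directions (the names here are my own):

```python
from collections import Counter

_stats_cache = {}

def sufficient_stats(variables, data):
    """Joint counts over a set of variables, cached so that the counts
    computed when scoring B as a parent of A are reused when scoring A
    as a parent of B."""
    key = frozenset(variables)
    if key not in _stats_cache:
        cols = sorted(key)               # canonical column order
        _stats_cache[key] = Counter(tuple(row[c] for c in cols) for row in data)
    return _stats_cache[key]

data = [{'A': 0, 'B': 1}, {'A': 0, 'B': 1}, {'A': 1, 'B': 0}]
stats = sufficient_stats(['A', 'B'], data)    # computed by a pass over data...
same = sufficient_stats(['B', 'A'], data)     # ...and this is a cache hit
```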
That turns out to make a huge difference in the computational cost of this algorithm. The second way in which this can be used is that, if we can identify in advance the set B1 up to Bk of reasonable parents to consider for A, we can restrict the search to that set in advance and not consider other parents at all, which reduces our complexity from O(n) to O(k), where k is some bound on the number of plausible parents that we're willing to consider. Now, this is a heuristic, in the sense that this restriction can actually change the outcome of our algorithm; it's not just a way of reducing the cost. But it turns out to be a very good approximation in practice.
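For instance, one might rank candidate parents by some pairwise relevance measure and keep only the top k per variable; this sketch, including its names and the use of a relevance table as a stand-in for something like mutual information, is hypothetical:

```python
def candidate_parents(var, others, relevance, k):
    """Keep only the k most relevant candidate parents for var, shrinking
    the per-variable candidate set from O(n) to O(k)."""
    ranked = sorted(others, key=lambda o: relevance(var, o), reverse=True)
    return ranked[:k]

# Toy pairwise relevance table standing in for, e.g., mutual information.
rel = {('A', 'B'): 0.9, ('A', 'C'): 0.1, ('A', 'D'): 0.5}
top = candidate_parents('A', ['B', 'C', 'D'], lambda v, o: rel[(v, o)], k=2)
```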
To summarize: it turns out that even the fairly simple heuristic structure search that we employ, such as greedy hill climbing, can get expensive when n is large, because the naive implementation has a cubic dependence on n. But we can exploit the decomposability property, which we also exploited in the context of tree learning, to get several orders of magnitude of reduction in the cost of the search. We can also exploit the recurrence of plausible families at multiple steps of the search algorithm to get further computational savings, and to restrict the search space for better computational properties.