0:00

We talked about Bayes net structural learning as the problem of optimizing a

scoring function over the space of network structures.

And we also talked about how the likelihood score which is the simplest

score that one could come up with has significant issues relative to

overfitting. We're now going to look at a different

score that's derived from Bayesian principles.

And we're going to see how that score though it has some surface similarities

to the likelihood score is a considerably better score in terms of its avoiding of

overfitting. So the Bayesian score is derived from the

Bayesian paradigm that says that anything that we have uncertainty over need to

have a probability distribution over it. And so if we have uncertainty over graphs

then we need to have a prior over graphs. And if we have uncertainty over

parameters we need to have uncertainty over parameters, a probability

distribution. And, so if we now define our optimization

problem as that of finding, the graph G that maximizes the probability of G g

given D. Now if we rewrite the probability of G

given D using Bayes rule. we end up with, the following expression.

P of D given G * P of G / P of D. So let's look at each of those terms

separately. The first of these is a term called the

marginal likelihood. P of D given G is the probability of the

data, given the graph. The second of these terms, P of G is a

prior over graph structures. And, the final term P of D, the

denominator is the, is called the marginal probability of the data.

Now, importantly this marginal probability of the data is independent of

G, and so it's not going to affect, the choice of which graph we select, and

therefore we can ignore it in the context of a model selection problem that is of

finding the single, graph or a graph that maximizes the score.

And so we're going to define the Bayesian score, score B of, G, of relative to a

data set D, as the log of the numerator of this expression, so the log of the

marginal likelihood plus the log of the prior over graphs.

Now we might think, that this score is going to avoid over fitting because of

its use of the prior over refs. And although that prior can play a role,

actually it turns out to be a significantly less important role than

the first of these terms, which is the marginal likelihood.

So, let's look at the marginal likelihood in a little more depth.

So, the marginal likelihood, p of d given g, is not as it happens, the same as the

log likelihood. Because that marginal likelihood is

something that integrates over all possible settings of the parameters.

And so from a mathematical perspective, what we're going to do is we're going to

introduce a variable, theta G into the probability expression.

And then we're going to integrate it out which is the same as summing it out.

Only it's a continuous space, so we're going to use integrals.

So P of D given G is the interval over all possible model parameterizations

theta G of the probability of d given g and theta g.

Times the probability of theta G given G. Now the first of these two terms is the

likelihood which is exactly the the component that we used in the log

likelihood score. But, importantly, unlike the likelihood

score, we're not computing P of D with, given G.

And theta G just for the maximum likelihood parameters, theta had G.

Rather we're computing this probability averaged over all possible parameter

settings, which gives us a considerably less optimistic assessment of the

probability of the data given a particular structure, because we have to

take into consideration not just a single parameter of setting that is geared

exactly toward our data set, which is the state of half set parameters, but rather

averaging out over all possible parameter settings theta G using the prior over

parameters. So, that less optimistic assessment is

sort of one intuition as to why this might not over fit as much.

But it turns out there is another perhaps even more intuitive explanation as to why

this square reduces over fitting. So let's look at the, this marginal

likelihood term, P of D given G. And just write, rewrite it as the

probability of all the instances X1 of the XM given G.

And what we're going to do is, we're going to now break up this joint

distribution using the chain rule for probabilities.

Not the chain rule for Bayesian networks. Just the chain rule for probabilities and

we can write that as the probability of X1 given G times the probability of X2

given X1 and G tomes the probability of X3 given X1 and X2 times blah, blah, blah

up to the probability of XM given all of the previous ones

X1 up to XM -1 and G. And if we look at each of these terms

each of these is effectively making a prediction over an unseen instance say XM

given X1 up to XM - 1. So you can think of it as almost doing

some kind of cross validation or generalization or estimate of

generalization ability because we're estimating the ability to predict an

unseen instance given the previous ones. And so the probability of D in some sense

incorporates into it, some kind of analysis of generalization ability.

Now you might say well surely the likelihood does the, the standard

likelihood score does exactly the same thing.

But a little bit more thought reveals that if we want to do this kind of

analysis for the maximum likelihood set of parameters that maximum likelihood set

of parameters theta hat G depends on all of D.

All of the instances D, in D, and so we can't break it down in

this way, because because D of, because the, if we had theta had G on the right

hand side, it's already incorporating into it, all of the instances including

the unseen ones, which is yet another intuition as to why that like, the, the

maximum core tends to over fit. Now,

the maximum, now the Bayesian score, might look a little bit scary, because it

has all this integral in it, and, and who knows how we would ever compute it.

It turns out that for the case of multinomial Bayesian networks the the

likelihood, the Bayesian score, can actually be written in closed form using

function called the gamma function. The gamma function is as written over

here it's also integral but it turns out that the gamma function is really just a

continuous extension of the factorial function.

Because gamma X satisfies the quality gamma X is = to x * gamma x - 1.

And most computers have an implementation of the gamma function in the.

So with the gamma function we can actually rewrite the probability of D

given G as a product that looks as follows.

The first is first we have a product over all variables i.

And then we have two different products included in this big this big outer

product. The first is a product over the parent

set UI of multiplying over all possible values of the parent set.

And here we have an expression involving gamma functions.

Where this expression involves both the Dirichlet prior parameters.

The alphas, as well as the sufficient statistics.

So the first product enumerates overall possible assignments to the parents.

And then we have an inner product, yet another inner product which innumerates

overall possible values of the variable itself.

so the variable, all possible values XIJ of the variable XI.

And once again we have gamma terms that involve both the prior, as well as the

sufficient statistics. And the, while this expression might look

still a little bit scary, it it's something that one can just plug into the

computer and, and compute quite easily. Another valuable property of this

marginal likelihood, if we look at it a little bit more and we also take its

logarithm is we see that this expression is, which was originally a product over

variables i of some, complicated expression relating to xi.

Is actually something that if we take the log turns into a sum over i of the family

score something that could be viewed as the family score for variable xi and, its

parent set of together. And so, just like other scores that we've

seen before. This this scoring function can decomposes

as the sum over i of terms involving only xi and its parents.

And we'll see that this property can be quite important from a computational

perspective. How do we sum up building up.

From the marginal likelihood the second term in this expression is log of P of G.

And so in order to accommodate for that, we need to have a prior over P of G.

And there's a variety of different priors of people views.

it turns out that a quite common choice is to simply make P of G constant and

that although it doesn't impose explicitly any penalty on complexity, it

turns out that it works quite well because of the penalty imposed on

complexity by the marginal likelihood. But if I want to build in additional

penalties on complexity, we can make the prior something that exponentially

penalizes the number of edges or penalizes the number of parameters and

that induces additional scarcity. Now importantly, we don't actually want

to define a prior probability over graph structures.

And make sure that it correctly normalizes over all possible structures.

And even more so over all possible acyclic structures but fortunately we

don't have to because the normalizing constant in this distribution P of G.

Is this constant across different networks, and, and can therefore be

ignored completely. And we only need to consider the term the

changes with the changes with the graph structure and ignore the normalizing

constant or partition function. That's the structure prior.

What about the parameter prior? The parameter prior that we use in the

context of the Bayesian score the one that's most commonly used is something

called the B to E prior. And the B to E prior is something that

we've actually seen before when we were talking about parameter estimation for

Bayesian networks. And as a reminder, the B to E prior is

defined by two components, one is an equivalent sample size alpha which is the

total number of instances that we might have seen in some imaginary universe and

the second is a probability distribution p0 typically encoded by a prior Bayesian

network. B0, which encodes sort of our prior

beliefs about about the universe. And, the

And so we define now the imaginary counts that we might have seen for a particular

combination of values for little xi. In particular assignments for its parents

to be, alpha which is the total number of counts that we've seen times the

probability of that particular event. Xi, PAI of G.

In the network, in a prior network B0. Now an important note is that the parents

of the variable xi, in the graph G is not the same as the parents of xi in the

prior network B0. In fact, in many cases, we choose the

network in the network B0 to be one that has no edges where all the variables are

independent. but we compute the probability

distribution in b0, and then use that to compute the the, hyper parameters alpha,

that are used in the context of the of them of the learning problem for for

for given graph G. And the important thing is now, that the

single network be zero provides us with priors for all of the candidate networks,

so we don't need to ellicit, priors for exponentially many networks, we have a

the single network B0, and a single equivalent sample size and we can use

that to compute the parameter prior P of theta given G over all possible networks

that we're interested in evaluating. Now, other than it's convenience, why

this prior as oppose to others. It turns out that one can show that this

prior the unique. Prior.

Over, multinomial Bayesian networks with the important property that if two

networks are i equivalent, then they have the same Bayesian score.

That is, if we use a different set of Dirichlet priors, other than one that

fits this mold. We would have a situation where you might

have two networks G and G prime that were i equivalent that have, different

Bayesian scores and there's really, no justification for that, in terms of,

incoorporating that into the parameter priors, because, these networks are

completely equivalent in their ability to represent the probability distribution,

or the same set of probability distributions so why would we, have one

of them, have a different Bayesian scores than others or rather, if we do have some

prior knowledge that one of these graphs is is more suitable than the other we

should put that into our structure prior. Not into our parameter prior,

one interesting property of the BDE score has to do with its asymptotic behavior.

So, let's consider what happens to the BDE score as, or in fact the Bayesian

score in general, as M goes to infinity. and it turns out that as M goes to

infinity, a network G with Dirichlet priors satisfies the following, in terms

of the marginal likelihood. That the law of the marginal likelihood

is equal to the sum of these three terms. The first of these is just a log

likelihood of the data given the maximum likelihood of parameters.

16:11

Which we've already seen has some good properties in terms of trying to fit the

data. But on the other hand, is also liable to

overfit. The second term, however, is log of m

over two where m is the number of instances times then G which is the

number of independent parameters. Which as we've seen in the context of a

multinomial distribution is the number of entries in the distribution minus one.

And this term has a negative sign which means that we are going, this log, the

log of the marginal likelihood is going to be decreased as we increase the number

of parameters and so there is a tension between those two terms.

The first of these tries to have more complicated networks in order to maximize

the fit to data, as we've already seen. And the second tries to reduce the

complexity of the model by increasing, decreasing the number of parameters.

The third term, is this term O of one, which indicates in formal notation, that

this third term is effectively constant relative to M which means it doesn't grow

with M. With the number of instances which means

that as the number of instances grows, only these first two terms are going to

play a role in terms of evaluating which structure is going to be selected.

Now these first two terms, as it happens, have a name and that, name is something,

called the BIC score, which really looks just add the likelihood component and the

penalty term, and as and in a different part of this course we talk about the BIC

score, and some of the properties that it has, specifically, for example, the fact

that as M grows to infinity, the score is consistent which means that the graph or

one of its i the, if the correct graph for one of its I equivalent graphs is

going to have the highest score among all possible graphs that we're considering.

So this is yet another demonstration that the Bayesian score is a way of avoiding

over fitting. Because, at the large sample limit, we

will learn a correct graph. Either the correct graph or one of its I

equivalent, structures. To summarize, the Bayesian score uses

Bayesian principles. Specifically averaging out over

parameters that we have uncertainty over, to avoid over-fitting.

the Bayesian score is most often instantiated as a particular variant

called BDE. Which requires that we assess a prior

network in order to compute the prior over parameters.

But that also allows us to, a natural place to incorporate prior knowledge into

the learning algorithm. And the BDE prior has the important

property that I equivalent networks have the same score.

The Bayesian score is at the large sample limit equivalent to a different score

called VIC which we analyse seperatley in this course. And as a consequence of this

equivalence one can show that it's asymptotically consitent in that it

learns the correct network as M goes to infinity.