0:00

We talked about Bayes net structure learning as the task of optimizing a score over a space of network structures, and one of the big design choices that we have to make in this context is which scoring function to use.

Our first attempt at that was to use the likelihood score, but as we've already seen, the likelihood score is very susceptible to overfitting and will invariably learn the most complicated network that it can, given the constraints that we impose on our space. We now consider a different approach in which, rather than constraining the space, or maybe in addition to constraining the space, we're also going to impose a penalty on the complexity of the structure that we learn, so as to force the learning algorithm to trade off model complexity with the fit to data.

And specifically, the score we're going to look at right now is something called the BIC score.

So, let's look at one way of penalizing complexity, which is the one that the BIC

score uses. The BIC score has two terms in it.

The first is just the likelihood of the graph and its maximum likelihood parameters relative to the data. This is a familiar term: it represents the same thing as the likelihood score. So this term is just score L of G relative to D, and were we to use this in isolation, we would get the same overfitting behavior.

But we add to this a penalty term; this is the second term here on the right. And that term subtracts log of M over two, times the dimension of G.

So let's understand what these different pieces are.

M is the number of training instances, and the dimension of G is the number of independent parameters in the network.
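Putting the pieces together, the score just described can be written as follows (using the standard notation, which may differ slightly from the symbols on the slide: ℓ for the log-likelihood and θ̂_G for the maximum likelihood parameters):

```latex
\mathrm{score}_{BIC}(G : \mathcal{D})
  = \ell(\hat{\theta}_G : \mathcal{D}) \;-\; \frac{\log M}{2}\,\mathrm{Dim}[G]
```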

So, we talked about the concept of independent parameters in the context of multinomial networks. As a reminder, a multinomial distribution has one fewer independent parameter than the number of entries in the multinomial, and from that we can compute the number of independent parameters for any multinomial network. So this basically counts the number of degrees of freedom that we have in the network in terms of estimating independent parameters in it.
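As a concrete illustration of that count, here is a small sketch (my own code, not from the lecture; the helper name `dim_g` is made up). Each CPD P(X | Pa_X) contributes (|Val(X)| − 1) free parameters for every joint assignment to the parents:

```python
from math import prod

def dim_g(card, parents):
    """Dim[G]: number of independent parameters of a multinomial network.

    card: dict mapping variable -> cardinality |Val(X)|
    parents: dict mapping variable -> list of its parent variables
    Each CPD contributes (|Val(X)| - 1) * prod(|Val(U)| for U in Pa_X).
    """
    return sum((card[x] - 1) * prod(card[u] for u in parents[x])
               for x in card)

# Toy network X -> Y with binary X and ternary Y:
# P(X) has 1 free parameter, P(Y | X) has 2 per value of X, so Dim[G] = 5.
print(dim_g({"X": 2, "Y": 3}, {"X": [], "Y": ["X"]}))  # -> 5
```

Note that `prod` over an empty parent list is 1, so a root variable contributes exactly |Val(X)| − 1 parameters.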

And so these two terms balance each other out.

We've already seen that the term on the left, the log likelihood, tries to push towards a better fit to the training data, whereas the term on the right, the penalty term, is going to try and keep the number of independent parameters, and therefore the network complexity, down.

And so, this score provides us with some kind of trade-off between fit to data and model complexity. Now, this is one choice in terms of this trade-off, and in fact there are other scores that use a different trade-off between those two terms, but this one is very commonly used in practice and has several distinct motivations, some of which we'll talk about and others not.

One comment that's worth making, though, concerns the negation of the score: if we negate this entire expression, the result is often called the MDL criterion, where MDL stands for Minimum Description Length. And so, in fact, this notion of minimum description length is an information-theoretic justification for this score, and the other justification is one that's derived from a more Bayesian criterion, which is why BIC stands for Bayesian Information Criterion.

4:07

Let's look at the asymptotic behavior of this penalized score.

We've already seen that in the context of the likelihood score, it really doesn't

matter how much training data we have. We will almost always pick the most

densely connected network that our assumptions allow.

This is no longer the case when we have this penalized score.

So, to understand that, let's break down the first of these two terms, which is the likelihood score, and remind ourselves that, at least in the context of multinomial networks, the likelihood score can be rewritten in the following way. So this is the breakdown of the likelihood score, and it has these two terms: the first is the number of data instances M, times the sum over the variables Xi of the mutual information between Xi and its parents in the network.

And that mutual information is relative to the empirical distribution P-hat.

The second term in the likelihood score is also M times a sum over the variables, this time of the entropy of each variable, again in the empirical distribution. And, as we've already discussed before, this term is independent of G, and so it doesn't affect the choice of which structure is selected, because it's the same for all structures.

And so we have these two terms that are playing off against each other.

We have the term over here, the red term, which is M times the sum of the mutual information. And we have the second term, the blue term, which is the log of M over two, times the number of independent parameters in G. Now, if we consider these two terms, we see that the mutual information term grows linearly with M, whereas the complexity term grows logarithmically with M.
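A quick numeric sanity check of that growth-rate claim (my own illustration, not from the lecture): the penalty coefficient (log M)/2 becomes vanishingly small relative to M as the dataset grows.

```python
from math import log

for m in [10, 1000, 100000]:
    penalty_weight = log(m) / 2    # coefficient multiplying Dim[G]
    ratio = penalty_weight / m     # penalty relative to the fit term's M scale
    print(m, round(penalty_weight, 2), ratio)
```

The printed ratio shrinks steadily, which is why larger datasets tolerate more complex structures.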

And so, as we get more and more data instances, we put more emphasis on the fit to data and less emphasis on the model complexity. So, intuitively, as we have more data instances, we'd be more amenable to learning more complicated structures. That property gives rise to a very important result that we can state regarding the BIC score.

And that is a result called consistency. Consistency tells us what behavior we would get, what network we would learn, as the number of samples grows to infinity. And so here we're going to assume that the data is actually generated from a particular true structure G star. So there is a G star, because otherwise the result can't really be stated. And what the consistency property says is that as M grows to infinity, the true structure G star is going to be the one that maximizes the score.

Now, that by itself is not quite right, as we can see, because the true structure G star might have several other structures that are I-equivalent to it, and we've already seen that the likelihood score, and in fact, it turns out, the penalty term as well, are the same for I-equivalent networks. And so it's not just the true structure alone that will maximize the score; it's the true structure or any other structure that's I-equivalent with it. But as far as we're concerned, that's fine,

because we've effectively learned the correct representation of the probability

distribution. So to understand why this result is true,

we're going to give just a very high-level intuitive argument.

We're not going to prove the theorem. It's a little bit tricky to prove

completely formally. so let's first consider the case of why

this is not going to overfit. That is, why we're not going to have

spurious edges learned in maximizing the score, as the number of instances grows

to infinity. So here we go, we go back to we got back

and look at this formula over here. And we see that as the n grows to

infinity, p hat is going to approach p star, where p star is my true underlying

distribution. And so what we're going to have in this

first term over here is effectively the mutual information relative to p star

between Xi and its parents. And in P star, you don't get any benefit from adding additional parents, spurious parents. That is, the mutual information in P star for spurious parents is not going to grow, because in the true underlying distribution P star there is no additional dependence, no additional correlation, to exploit.

And so, at that point, we're going to have that the more complicated structure is about the same as G star in terms of this first term, but the spurious edges are going to cost us in terms of the number of parameters in the blue term. And so these spurious edges will not contribute to the data likelihood term, but will be penalized more, and so we will choose the sparser network that corresponds to G star.

Conversely, why do we not under-fit? That is, why do we eventually learn all the correct edges? That is because the data likelihood term tells us that edges that are required, that is, edges that are in G star (or the I-equivalent network), matter: if we don't have them there, this mutual information term will be lower than it could be, and so we will have a higher score by including those edges in the model. And because this mutual information term grows linearly with M, versus the penalty term, which grows logarithmically, eventually the first term will dominate, and it will be beneficial for the learning algorithm to add that required edge in G star.

So this is a high-level, very hand-wavy argument, but at least it gives an intuition as to why consistency holds. So to summarize: the BIC score is a very simple score that trades off model complexity with the fit to data, and it therefore has the important property of asymptotic consistency, which means that if the data were actually generated by a network G star that is a perfect map for the distribution, then either G star or networks I-equivalent to it will have the highest score as M grows to infinity.
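Tying the whole lecture together, here is a self-contained sketch (my own code, not from the lecture; the names `bic`, `loglik`, and `dim_g` are made up) that computes the BIC score for two candidate structures over binary X and Y, and shows the trade-off in action on toy data:

```python
from collections import Counter
from math import log, prod

def loglik(data, parents):
    """Maximum-likelihood log-likelihood of fully observed discrete data.

    data: list of dicts {variable: value}
    parents: dict mapping variable -> list of its parents (defines G)
    """
    total = 0.0
    for x in parents:
        # Counts of (parent assignment, child value) pairs, and marginal
        # counts of each parent assignment, from which the ML CPD follows.
        joint = Counter((tuple(d[u] for u in parents[x]), d[x]) for d in data)
        par = Counter(tuple(d[u] for u in parents[x]) for d in data)
        total += sum(c * log(c / par[pa]) for (pa, _), c in joint.items())
    return total

def dim_g(card, parents):
    """Dim[G]: number of independent parameters in the network."""
    return sum((card[x] - 1) * prod(card[u] for u in parents[x])
               for x in card)

def bic(data, card, parents):
    m = len(data)
    return loglik(data, parents) - (log(m) / 2) * dim_g(card, parents)

card = {"X": 2, "Y": 2}
empty = {"X": [], "Y": []}      # G0: no edges
edge = {"X": [], "Y": ["X"]}    # G1: X -> Y

# When X and Y are independent in the data, the edge adds a parameter
# without improving the likelihood, so BIC prefers the empty graph ...
indep = [{"X": x, "Y": y} for x in (0, 1) for y in (0, 1)] * 2
print(bic(indep, card, empty) > bic(indep, card, edge))   # -> True

# ... but when Y is strongly dependent on X, the likelihood gain
# outweighs the penalty and BIC prefers the edge.
dep = [{"X": v, "Y": v} for v in (0, 1)] * 4
print(bic(dep, card, edge) > bic(dep, card, empty))       # -> True
```

This mirrors the consistency intuition above at small scale: spurious edges are rejected by the penalty, while genuinely dependent variables earn their edge through the likelihood term.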