1:28

So, let's look at a simple example to see what

the likelihood score looks like. So here, we have a distribution over two

random variables X and Y, and let's consider two graphs.

G0 on the left has no edge between X and Y,

and G1 on the right has an edge from X to Y.

And the Y to X case would be symmetrical, so we're just going to look at these two

graphs. The likelihood score of this

graph relative to D, by the decomposition of the likelihood function

that we've already seen, is the sum over all instances m of the log of the

parameter for the value of X in the m-th data instance,

plus the log of the parameter for the Y value in the m-th data

instance. So, this is just the decomposition of the

log likelihood that we use when we talked about parameter estimation.

Similarly, when we look at G1, we have a very similar formula where, again, we sum

over all instances. And for the X that has no parents, we

have log of theta hat Xm where, again, Xm is the value of X in the mth data

instance. And for the Y value, we have log of theta

hat of Ym given Xm. And in both cases, theta hat are the

maximum likelihood parameters in the respective graphs.

So, on the left, it's theta hat for the graph G0,

and on the right, it's the maximum likelihood theta hat for the graph G1.
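
As a compact restatement of what was just said in symbols, using x[m] and y[m] for the values of X and Y in the m-th instance and hats for the maximum likelihood estimates in each graph:

```latex
\[
\mathrm{score}_L(\mathcal{G}_0 : \mathcal{D}) \;=\; \sum_{m=1}^{M} \Big( \log \hat\theta_{x[m]} + \log \hat\theta_{y[m]} \Big),
\qquad
\mathrm{score}_L(\mathcal{G}_1 : \mathcal{D}) \;=\; \sum_{m=1}^{M} \Big( \log \hat\theta_{x[m]} + \log \hat\theta_{y[m] \mid x[m]} \Big).
\]
```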

Now, let's refine this analysis a little bit more by converting it to a somewhat

different notation. First, we're going to look at the

difference between these two likelihood scores.

So, we're going to look at the score for G1

and subtract away the score for G0, and then we're going to see if that difference is

positive or negative. That is, if we prefer G1 to G0 or not.

So, notice that the first term,

the X term, is the same in both of these graphs.

And so, it cancels out. And so, what we have left over is the sum

over m of log of theta hat Ym given Xm minus log of theta hat of Ym.
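
In symbols, after the X term cancels:

```latex
\[
\mathrm{score}_L(\mathcal{G}_1 : \mathcal{D}) - \mathrm{score}_L(\mathcal{G}_0 : \mathcal{D})
\;=\; \sum_{m=1}^{M} \Big( \log \hat\theta_{y[m] \mid x[m]} - \log \hat\theta_{y[m]} \Big).
\]
```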

Now, we can reformulate this by looking at the

sufficient statistics. And specifically, we're going to have the

term theta hat Ym given Xm occur every time we have a particular value Y

and a particular value X. So, we're going to sum up over all values

little x and little y, and the parameter log of theta hat y given x is going to

occur m of xy times, where m of xy is the sufficient statistic, or the count, for

that particular configuration of events in the data set D.

The second term, the one that we're subtracting, is a

sum over y of m of y log of theta hat of y,

where again, m of y is a sufficient statistic.
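
Grouping the per-instance terms by the values they take, with M[x,y] and M[y] denoting the counts in D:

```latex
\[
\sum_{m=1}^{M} \log \hat\theta_{y[m] \mid x[m]} \;=\; \sum_{x,y} M[x,y] \, \log \hat\theta_{y \mid x},
\qquad
\sum_{m=1}^{M} \log \hat\theta_{y[m]} \;=\; \sum_{y} M[y] \, \log \hat\theta_{y}.
\]
```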

Now, we're going to rewrite this in terms of this distribution over here called P

hat. P hat is what's called the empirical

distribution. It's what happens when we look at our

data set D, and just look at the frequencies of different events. So, m of

xy is simply the number of data instances, m, times P hat of xy, because

this is the frequency of the event and this is the total number of data

instances and so, m of xy is just the product of those two.

So, we can write the first term as m times the sum over xy of P hat of xy, log of theta hat of y given x.

And interestingly, we can also rewrite the maximum likelihood estimation

parameters in terms of P hat as well, because theta hat of y given x is simply

the fraction of the y cases among all the x cases, which is exactly the

same as P hat of y given x. Similarly, the second term turns into m

times the sum over y of P hat of y, log of P hat of y.

So, that has now converted all of the expressions, the m's and the theta's,

into a single vocabulary, which is this empirical distribution P hat.
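
Putting those substitutions together, with M[x,y] = M · P̂(x,y), θ̂ of y given x = P̂(y | x), and θ̂ of y = P̂(y), the difference becomes:

```latex
\[
\mathrm{score}_L(\mathcal{G}_1 : \mathcal{D}) - \mathrm{score}_L(\mathcal{G}_0 : \mathcal{D})
\;=\; M \sum_{x,y} \hat P(x,y) \log \hat P(y \mid x) \;-\; M \sum_{y} \hat P(y) \log \hat P(y).
\]
```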

So now let's take the M out of the equation.

And furthermore, look at the sum over y as the sum over pairs xy.

So we've artificially introduced a variable x into the second summation,

sum over y, P hat of y. And now, we're going to rewrite that as

sum over xy, P hat of xy. And because x doesn't appear in the

expression log P hat of y, we can do that because sum over x of

P hat of xy is equal to P hat of y. So, looking at these two expressions

together, we can now move around some things by utilizing properties of the

logarithm and derive that this is equivalent to the following

equation. This is M times the sum over xy, P hat of

xy log P hat of xy divided by P hat of x, P hat of y.
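
That is:

```latex
\[
\mathrm{score}_L(\mathcal{G}_1 : \mathcal{D}) - \mathrm{score}_L(\mathcal{G}_0 : \mathcal{D})
\;=\; M \sum_{x,y} \hat P(x,y) \, \log \frac{\hat P(x,y)}{\hat P(x)\,\hat P(y)}.
\]
```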

And the reason why that's the case is that P hat of y given x is equal to P hat

of x, y divided by P hat of x.

And now, by moving around the logarithms,

we get exactly this expression. Now importantly, this expression here

inside the summation over here has a name.

And that summation is called the mutual information,

and is usually denoted as I sub P hat, in this case the distribution P hat between

x and y. So why is this called mutual

information? It's because it measures, if you

look at this, the average distance between the joint distribution of x and y,

P hat of xy in the numerator, relative to the distribution we would get if

the distribution were a product of marginals.

So, P hat of x times P hat of y. So, you can think of this term inside the

log as a relative error if you will, a distance between the joint distribution

and the product of the marginals. And now, we're taking the average of the

log of this expression, weighted by how frequent the different cases xy are.

And so, you can think of it as an average distance between the joint

distribution and the one that we would get if this was a product of marginals.

So, this is an information theoretic quantity which, as I said, is called the

mutual information.
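
As a concrete illustration (a minimal sketch, not part of the lecture; the tiny data set below is made up), the empirical mutual information, and with it the score difference M · I_P̂(X;Y), can be computed directly from the counts in D:

```python
# Minimal sketch (illustrative only): compute I_Phat(X; Y) from paired samples
# by forming the empirical distribution P hat and applying the definition
#   I_Phat(X;Y) = sum_{x,y} Phat(x,y) * log( Phat(x,y) / (Phat(x) * Phat(y)) ).
from collections import Counter
from math import log

def empirical_mutual_information(xs, ys):
    M = len(xs)
    joint = Counter(zip(xs, ys))          # counts M[x,y]
    cx = Counter(xs)                      # counts M[x]
    cy = Counter(ys)                      # counts M[y]
    mi = 0.0
    for (x, y), m_xy in joint.items():
        p_xy = m_xy / M                   # Phat(x,y)
        p_x, p_y = cx[x] / M, cy[y] / M   # Phat(x), Phat(y)
        mi += p_xy * log(p_xy / (p_x * p_y))
    return mi

# Made-up data set D with M = 6 instances of (X, Y).
xs = [0, 0, 1, 1, 1, 0]
ys = [0, 0, 1, 1, 0, 1]
M = len(xs)
print(M * empirical_mutual_information(xs, ys))  # score_L(G1) - score_L(G0)
```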

And so, if we now generalize this analysis to an arbitrary network, it turns

out that the likelihood score for any graph can be

rewritten as the number of data instances M times the sum over all variables i of

the mutual information between a node and its parents in the graph,

minus a term that is constant relative to the graph structure.
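
In symbols, with Pa of X_i in G denoting the parents of X_i in the graph and H_P̂ the entropy under the empirical distribution:

```latex
\[
\mathrm{score}_L(\mathcal{G} : \mathcal{D})
\;=\; M \sum_{i} I_{\hat P}\!\big(X_i \,;\, \mathrm{Pa}_{X_i}^{\mathcal{G}}\big)
\;-\; M \sum_{i} H_{\hat P}(X_i).
\]
```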

The second term is M times the sum of the entropies of the individual variables,

and this term doesn't depend on G. Now, why is this a significant result?

It's significant because it tells us that the value of a network, the score of the

network if you use the likelihood score, is higher if Xi is

correlated with its parents, which is a very intuitive property. The

more a variable is correlated with its parents, the better the network structure scores.

Which means, we would want to pick as a parent of a variable one that is

highly correlated with it. Which is kind of exactly the behavior

that you would intuitively hope for. So, this is a good result because it

tells us that the parents that we pick for a variable are the ones that

have the highest correlation with it, the ones that have the highest mutual

information. And that's a very satisfying result.

However, as we now show, this mutual information result also has some negative

consequences. So, to understand that, let's go back to

our simple example and let's look at the difference between the scores of these

two graphs. The graph G1 on the left that has the

edge, and the graph G0 on the right that doesn't.

And let's look at the difference between those two scores.

That difference, if it's positive, suggests that we should pick the graph

G1. And if it's negative, it tells us that the

maximum likelihood score will pick the graph G0.

And that difference is M, the number of samples, times the mutual information

between the variables X and Y in the empirical distribution P hat.

Now, a well-known result from information theory is that the mutual information,

this quantity over here, is always non-negative, for any distribution P.

Furthermore, this mutual information is equal to zero,

that is, this inequality turns into an equality, if and only if X and Y are

actually independent in the distribution relative to which we're computing the

mutual information, which in this case is the distribution P

hat. Now, even if X and Y were actually

independent in the original distribution, the one that generated the training

instances, it is still very unlikely that we will achieve exact and perfect

independence in the empirical distribution just because of statistical

fluctuations in the set of samples that are generated. And so, even if X and Y

are independent, it is almost never the case that they are independent in P

hat. Which means that in almost all of the

cases, this mutual information between X and Y

is going to be greater than zero, almost always.
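
A small simulation (an illustrative sketch, not from the lecture) makes this concrete: even when X and Y are generated independently, the empirical mutual information essentially never comes out exactly zero.

```python
# Illustrative sketch: draw X and Y independently, yet observe that the
# empirical mutual information I_Phat(X;Y) is (almost) always strictly positive.
import numpy as np

def empirical_mi(xs, ys):
    """I_Phat(X;Y) for two discrete samples, via the empirical joint distribution."""
    vx, xi = np.unique(xs, return_inverse=True)
    vy, yi = np.unique(ys, return_inverse=True)
    joint = np.zeros((len(vx), len(vy)))
    np.add.at(joint, (xi, yi), 1.0)            # counts M[x,y]
    joint /= joint.sum()                       # Phat(x,y)
    px = joint.sum(axis=1, keepdims=True)      # Phat(x)
    py = joint.sum(axis=0, keepdims=True)      # Phat(y)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
M, runs = 1000, 200
positive = sum(
    empirical_mi(rng.integers(0, 2, M), rng.integers(0, 2, M)) > 0
    for _ in range(runs)
)
print(f"{positive} of {runs} runs had I_Phat(X;Y) > 0")  # typically all of them
```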

12:46

Which tells us that adding this edge can never hurt and almost always helps.

And that's true not just in this simple example, but in other cases as well.

That is, it almost always is better in terms of the likelihood score to have

more edges rather than fewer edges. Adding edges never decreases the likelihood

score. Which means that in general except for

very unusual circumstances, the likelihood score will be maximized for

the fully-connected network. So, that gives rise to a very significant

overfitting effect. Because, as we've already seen,

the more edges we have, the more difficult it is to fit the parameters, because we

have fragmentation of our data set into these tiny little buckets each of which

has a very small number of instances in it.

So, how do we avoid over-fitting? It turns out that there are several

different strategies that are typically employed.

The first is to restrict the hypothesis space to basically prevent the algorithm

from over-fitting. And we can do that by restricting the

number of parents that we allow per node, or some kind of restriction on the number of

parameters. The former is usually easier to enforce, and so

that's a more common strategy. A somewhat more flexible strategy is to

use a scoring function that makes a better set of trade-offs.

That is, where there is a penalty on the complexity of the model at the same time

that we're trying to get a good fit to the training data.
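
One standard example of such a trade-off from the literature (mentioned here only as an illustration; it is not derived in this clip) is the BIC score, which subtracts from the log-likelihood a penalty that grows with the number of independent parameters Dim[G]:

```latex
\[
\mathrm{score}_{BIC}(\mathcal{G} : \mathcal{D})
\;=\; \ell(\hat\theta_{\mathcal{G}} : \mathcal{D}) \;-\; \frac{\log M}{2}\,\mathrm{Dim}[\mathcal{G}].
\]
```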

And that's a more flexible strategy than a hard constraint on the model

complexity. Because if there is a very strong signal

for some correlation between a pair of variables, that can eventually outweigh

the complexity penalty and allow that edge to be added in.

Whereas, if we have a hard constraint, we might

never be able to learn the appropriate model, simply because

it's not going to fit into the restricted

hypothesis space. These complexity-penalizing scores

generally fall into two categories. There are ones that explicitly penalize

complexity, and then there's the class of

scores called the Bayesian scores which, as we'll see, average over all possible

parameter values, following the Bayesian paradigm that says that anything we have

uncertainty over, we should have a distribution over.

And as we'll see, that gives rise naturally to a score that avoids

overfitting. To summarize, the likelihood score is a

score that evaluates a model G by looking at the log-likelihood of the data

relative to G using the MLE parameters for G.

And, that set of parameters was optimized to maximize the likelihood of D.

And therefore, it is very well geared to trying to capture the exact

characteristics of the data, for better and for worse.

So, for better, it gives us a very nice information theoretic interpretation of

the set of edges that are chosen, and the set of parents that are chosen for a

given variable, in terms of the dependencies

that we encode in the graph G. Conversely, that same characterization

also shows us that we're guaranteed to overfit the model to the training data,

unless we somehow impose constraints, or otherwise penalize model complexity.