I should say that this is a pseudo-realistic learning problem, because the instances that one samples from a network are always cleaner than the instances that one gets in a real-world data set. In a real-world scenario, it is rarely the case that the network whose structure you're trying to learn has the exact same structure as the true underlying distribution from which the data were generated. And so this is a much cleaner scenario, but it's still useful and indicative. So what we see here are the results of learning as a function of the x-axis, which is the number of samples; the y-axis is a distance function between the true distribution and the learned distribution. That distance function is not something we're going to talk about in detail at the moment; it's the notion called relative entropy, also known as KL divergence. But what we need to know about it for the purposes of the current discussion is that it's zero when the distributions are identical, and positive otherwise.
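Just to make that property concrete, here is a minimal Python sketch (the function and the toy distributions are my own illustration, not part of the lecture) of computing the KL divergence between two discrete distributions:

import numpy as np

def kl_divergence(p, q):
    # D(P || Q) = sum_x P(x) * log(P(x) / Q(x))
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0 when the distributions are identical
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # positive otherwise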

So, what we see here is that the blue line corresponds to maximum likelihood estimation. And we can see several things about the blue line. First of all, it's very jagged, there are a lot of bumps in it, and second, it's consistently higher than all of the other lines. Which means that with maximum likelihood estimation, although the distance does continue to get lower as we get more data, even with as many as five thousand data points we still haven't gotten close to the true underlying distribution. Conversely, let's see what happens with Bayesian estimation. These are all Bayesian estimation with a uniform prior and different equivalent sample sizes. So this is using a prior network with a uniform distribution and different values of alpha. And what we see here is that the line for alpha equals five, that's the green line, and the line for alpha equals ten are almost sitting directly on top of each other, and they're both considerably lower than all of the other lines, including the maximum likelihood estimate. As we increase the prior strength, so that we have a firmer belief in the uniform prior, we can see that we move a little bit away, and now the performance becomes a little worse. But notice that by around 2,000 data points we're already pretty close to where we were for an equivalent sample size of five. For an equivalent sample size of 50, which is this dark blue line, it takes a little bit longer to converge, and it doesn't quite make it. But even with an equivalent sample size of 50, which is pretty high, you get convergence to the correct distribution much faster than you do with maximum likelihood estimation.

So, to summarize: in Bayesian networks, if we're doing Bayesian parameter estimation, and we're willing to stipulate that the parameters are independent a priori, then they're also independent in the posterior, which allows us to maintain the posterior as a product of posteriors over individual parameters.
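Written out (a sketch in generic notation, not copied from the slides), parameter independence means the posterior over all the CPD parameters, given the data $\mathcal{D}$, factors as

$P(\theta \mid \mathcal{D}) = \prod_{i} \prod_{u_i \in \mathrm{Val}(\mathrm{Pa}_{X_i})} P(\theta_{X_i \mid u_i} \mid \mathcal{D})$,

so the parameters of each CPD row can be updated on their own.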

For multinomial Bayesian networks, we can go ahead and perform Bayesian estimation using the exact same sufficient statistics that we used for maximum likelihood estimation, which are the counts corresponding to a value of the variable and a value of its parents. And whereas in the context of maximum likelihood estimation we would simply use the formula on the left, in the case of Bayesian estimation we are going to use the formula on the right, which has exactly the same form, only it also accounts for the hyperparameters. And in order to do this kind of process, we need a choice of prior, and we showed how that can be effectively elicited using both a prior distribution, specified, say, as a Bayesian network, as well as an equivalent sample size.
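As a rough illustration of those two estimators (a sketch with my own variable names and toy counts, not taken from the lecture), the maximum likelihood estimate is M[x,u] / M[u], while the Bayesian estimate adds the Dirichlet hyperparameters alpha[x,u], obtained here from a uniform prior network and an equivalent sample size:

import numpy as np

def mle_cpd(counts):
    # counts[u, x] holds M[x, u]; MLE: theta_{x|u} = M[x, u] / M[u]
    return counts / counts.sum(axis=1, keepdims=True)

def bayesian_cpd(counts, alpha):
    # Bayesian estimate: theta_{x|u} = (M[x, u] + alpha[x, u]) / (M[u] + alpha[u])
    return (counts + alpha) / (counts + alpha).sum(axis=1, keepdims=True)

# Toy example: binary X with one binary parent U, equivalent sample size 10
counts = np.array([[3.0, 1.0],   # M[x, u] for U = 0
                   [0.0, 6.0]])  # M[x, u] for U = 1
alpha = 10 * np.ones_like(counts) / counts.size  # alpha[x, u] = ESS * P'(x, u), uniform P'

print(mle_cpd(counts))              # gives probability 0 to the unseen (X = 0, U = 1) case
print(bayesian_cpd(counts, alpha))  # smoothed toward the uniform prior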