0:47

Now, if we're trying to make a prediction over the value of a variable X that depends on the parameter theta, this is just an inference problem. The probability of x is simply the probability of x given theta times the prior over theta, marginalizing, in this case, corresponding to an integration over the value of theta. And that gives us this integral over here.

So, if we plug through the integral, what we're going to get is the following form. I'm not going to go through the integration by parts that's required to actually show this, but it's really a straightforward consequence of the properties of integrals of polynomials.

In this case, we have that the probability that x takes the particular value xi is 1/Z times the integral over all of the parameters theta of theta i, which is the probability, given the parameterization theta, that x takes the value little xi, times this thing over here, which is the prior. We multiply the two together and integrate out over the parameter vector theta, which in this case is a k-dimensional parameter vector. And it turns out that when one does that, you end up with alpha i over the sum over all j of alpha j, a quantity typically known as alpha.

And so we end up with a case where the prediction over the next instance represents the fraction of the instances that we've seen, as represented in the hyperparameters of the Dirichlet, where we have x equal to little xi. So if alpha i represents the number of instances that we've seen where the variable took the value little xi, the prediction very naturally is simply the fraction of the instances with that property. And so once again, we see that there is a natural intuition for the hyperparameters as representing the notion of counts.
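This prediction rule can be sketched in a few lines of Python; the function name and the numbers are illustrative, not from the lecture.

```python
# A minimal sketch of the Dirichlet predictive rule: under a Dirichlet
# prior with hyperparameters alpha_1..alpha_k, the probability that X
# takes value x_i is alpha_i / alpha, where alpha = sum_j alpha_j.

def dirichlet_predictive(alphas):
    """Return P(X = x_i) for each outcome i under a Dirichlet(alphas) prior."""
    alpha = sum(alphas)  # the equivalent sample size
    return [a / alpha for a in alphas]

# Hyperparameters acting as counts: outcome 3 was "seen" three times as
# often as outcome 1, so it gets three times the predictive probability.
print(dirichlet_predictive([1.0, 2.0, 3.0]))  # [1/6, 2/6, 3/6]
```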

3:25

Now, let's put these two results together and think about Bayesian prediction as the number of data instances that we have grows. So here we have a parameter theta, which initially was distributed as a Dirichlet with some set of hyperparameters. Let's imagine that we've seen m data instances, x1 up to xm, and now we have the (m+1)st data instance and want to make a prediction about it.

So the problem that we're trying to solve is now the probability of the (m+1)st data instance, given the m instances that we've seen previously. Once again, we can plug that into a probabilistic inference equation: this is going to be the probability of the (m+1)st data instance given everything, including theta, times the probability of theta given x[1] up to x[m]. So we've introduced the variable theta into this probability, and we're marginalizing out over the variable theta. Well, one thing that immediately follows from the structure of the probabilistic graphical model here is that x[m+1] is conditionally independent of all of the previous x's given theta. And so we can cancel those from the right-hand side of the conditioning bar, which gives us, over here, the probability of x[m+1] given theta, and over here, the probability of theta given x[1] up to x[m].

5:06

And so now let's think about the blue expression over here, which is just the posterior over theta given D, which is x1 up to xm. We've already seen what that looks like: as we showed just on the previous slide, it is simply a Dirichlet whose hyperparameters are alpha 1 plus M1 up to alpha k plus Mk. And so now we're making a prediction of a single random variable from a Dirichlet that has a certain set of hyperparameters, which was the thing we showed on the slide just before that. The prediction is simply the hyperparameter corresponding to the outcome xi as a fraction of the sum of all of the hyperparameters. Where, again, just to introduce notation, alpha is equal to the sum of the alpha i's, and M to the sum of the Mi's.
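Putting the two results together, the posterior predictive can be sketched as follows; the helper and its example numbers are illustrative, not from the lecture.

```python
# Sketch of the Bayesian posterior predictive: a Dirichlet(alphas) prior
# updated with observed counts M_1..M_k gives
#   P(x[m+1] = x_i | x[1..m]) = (alpha_i + M_i) / (alpha + M).

def posterior_predictive(alphas, counts):
    """P(next observation = x_i) for each i, given prior alphas and data counts."""
    total = sum(alphas) + sum(counts)  # alpha + M
    return [(a + m) / total for a, m in zip(alphas, counts)]

# Dirichlet(2, 2) prior, with 6 ones and 2 zeros observed:
print(posterior_predictive([2, 2], [6, 2]))  # [(2+6)/12, (2+2)/12] = [2/3, 1/3]
```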

6:11

Now notice what happens here. This parameter alpha that we just defined, which is the sum over all of the alpha i's, is known as the equivalent sample size. It represents the number of, if you will, imaginary samples that I would have seen prior to receiving the new data x1 up to xm. Now look what happens if we multiply alpha by a constant. Say we double all of our alpha i's: then we're going to let the Mi's affect our estimate a lot less than for smaller values of alpha. And so the larger the alpha, the more confidence we have in our prior, and the less we let our data move us away from that prior. So let's look at an example of the influence that this might have. Let's go back to binomial data, or a Bernoulli random variable, and take the simplest example where the prior is uniform for theta in [0,1]. We've previously seen that that corresponds to a Dirichlet with hyperparameters (1, 1).

7:30

So this is a general-purpose Dirichlet distribution, in this case with hyperparameters (1, 1), and let's imagine we get five data instances, of which we have four ones and one zero. Now think about the difference between what maximum likelihood estimation and Bayesian estimation give you as the prediction for the next, sixth coin toss. For maximum likelihood estimation we have four heads and one tail, so the maximum likelihood estimate is four fifths, and that's going to be the prediction for the sixth instance. The Bayesian prediction, on the other hand, remember, is going to be the hyperparameter alpha 1 plus M1 divided by alpha plus M, which in this case is one plus four divided by two plus five, and that gives us 5/7.
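The arithmetic in this example is easy to check directly (a quick sketch, not the lecturer's code):

```python
# Five tosses of a Bernoulli variable: four ones, one zero,
# with a uniform Dirichlet(1, 1) prior.
mle = 4 / 5                  # maximum likelihood: fraction of ones
bayes = (1 + 4) / (2 + 5)    # (alpha_1 + M_1) / (alpha + M)
print(mle)    # 0.8
print(bayes)  # 0.7142857142857143, i.e. 5/7
```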

8:39

Now let's look more qualitatively at the effect on the predictions for the next instance after seeing certain amounts of data. For the moment, we're going to assume that the ratio between the number of 1s and the number of 0s is fixed, so that we have one 1 for every four 0s, and that's the data that we're getting. Now let's see what happens as a function of the sample size, as we get more and more data, all of which satisfies this particular ratio. Here we're playing around with different strengths of our equivalent sample size, but we're fixing the ratio of alpha 1 to alpha 0 to represent, in this case, the 50% level. So our prior is uniform, but of greater and greater strength. Each line draws the posterior over the parameter, or rather, equivalently, the prediction for the next data instance over time. This little green line down at the bottom represents a low alpha, and that means that even for fairly small amounts of data, say twenty data points, we are fairly close to the data estimates. On the other hand, for this bluish line here, the alpha is high, and that means it takes more time for the data to pull us to the empirical fraction of heads versus tails.
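The pull of the data against the prior can be sketched numerically; the counts and hyperparameters below are illustrative, not the ones plotted on the slide:

```python
# Same uniform prior mean (0.5), different equivalent sample sizes alpha.
# Observed counts: M_1 = 20 ones, M_0 = 80 zeros (empirical fraction 0.2).
for a1, a0 in [(1, 1), (50, 50)]:
    alpha = a1 + a0                    # equivalent sample size
    pred = (a1 + 20) / (alpha + 100)   # (alpha_1 + M_1) / (alpha + M)
    print(alpha, round(pred, 3))
# alpha = 2   -> 0.206 : the data dominates quickly
# alpha = 100 -> 0.35  : the strong prior keeps us closer to 0.5
```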

Now let's look at varying the other parameter. We're going to fix the equivalent sample size and just start out with different priors. We can see that we now get pulled down to the 0.2 value that we see in the empirical data, and the further away from it we start, the longer it takes to actually get pulled down to the data estimate. But in all cases, we eventually get convergence to the value in the actual data set. From a pragmatic perspective, it turns out that Bayesian estimates provide us with a smoothness, where the random fluctuations in the data don't cause quite as much random jumping around as they do, for example, in maximum likelihood estimates. So if what we have here is the actual value of the coin toss at different points in the process, you can see that this light blue line, which corresponds to maximum likelihood estimation, basically bounces around quite a bit, especially in the low-data regime, whereas the estimates that use a prior are considerably smoother and less subject to random noise.
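The smoothing effect can be reproduced with a small simulation (a sketch; the seed, bias, and prior below are arbitrary choices, not the data behind the plot):

```python
import random

random.seed(0)
a1, a0 = 5, 5          # Dirichlet prior with equivalent sample size 10
p_true = 0.2           # true probability of a 1
ones, mle_track, bayes_track = 0, [], []
for m in range(1, 201):
    ones += random.random() < p_true
    mle_track.append(ones / m)                       # jumps around for small m
    bayes_track.append((a1 + ones) / (a1 + a0 + m))  # smoother early on

# Size of the largest single-sample step each estimate takes early on:
print(max(abs(mle_track[i] - mle_track[i - 1]) for i in range(1, 20)))
print(max(abs(bayes_track[i] - bayes_track[i - 1]) for i in range(1, 20)))
```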

In summary, Bayesian prediction combines two types of, you might call them, sufficient statistics. There are the sufficient statistics from the real data, but there are also sufficient statistics from the imaginary samples that contribute to the Dirichlet distribution, these alpha hyperparameters, and the Bayesian prediction effectively makes the prediction about the new data instance by combining both of these. Now, as the amount of data increases, that is, at the asymptotic limit of many data instances, the term that corresponds to the real data samples is going to dominate, and therefore the prior is going to become vanishingly small in terms of the contribution that it makes. So in the limit, the Bayesian prediction is the same as maximum likelihood estimation.

12:50

But initially, in the early stages of estimation, before we have a lot of data, the priors actually make a significant difference. And we see that the Dirichlet hyperparameters basically determine both our prior beliefs, initially, before we have a lot of data, as well as the strength of those beliefs, that is, how long it takes for the data to outweigh the prior and move us towards what we see in the empirical distribution. But importantly, as we've seen here in very simple examples, and as we'll see later on when we talk about learning with Bayesian networks, it turns out that this Bayesian learning paradigm is considerably more robust in the sparse-data regime, in terms of its generalization ability.