0:00

But we've previously talked about the importance of

allowing CPD representations that encode additional structure in the local

dependency model of a variable on its parents.

And we talked about the case of tree CPDs, which allow us to depend on different variables in different contexts.

But none of that helps us deal with the situation that we used as a motivation for this, which is where we have a variable such as, for example, cough, that depends on multiple different factors: pneumonia, flu, tuberculosis, bronchitis, and so on. This doesn't lend itself, say, to a tree CPD, because it's not the case that you depend on one only in certain contexts and not in others. Really, you depend on all of them, and all of them contribute something to the probability of exhibiting a cough. So one way of capturing that kind of interaction is a model called the noisy OR CPD.

The noisy OR is best understood by considering a slightly larger graphical model where we break down the dependency of Y on its parents, X1 up to Xk, by introducing a set of intervening variables. So let's imagine that Y is, again, the cough variable, and the Xi's are different diseases, for example. What we're doing here is introducing an intermediate variable Zi that captures the event that disease Xi, if present, causes a cough by itself; so Z1 is the event that X1 by itself causes Y. You can think of each of the diseases as a noisy transmitter: if you have the disease, say if X1 is true, then Z1 says, fine, X1 succeeded in its intent to make Y true. X2 has its own little filter called Z2, and Z2 makes that same decision relative to X2. So ultimately, Y is true if someone succeeded in making it true.

5:02

So that's the noisy OR CPD.
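The model just described can be sketched in a few lines of code. This is a minimal illustration, not from the lecture itself: the per-cause success probabilities `lam` and the optional `leak` term are hypothetical parameters, with `lam[i]` playing the role of the chance that the transmitter Zi fires when Xi is present.

```python
# A minimal sketch of a noisy-OR CPD. lam[i] is the (hypothetical)
# probability that cause X_i, when present, succeeds on its own in
# turning Y on; leak is the probability Y turns on with no cause active.

def noisy_or(lam, x, leak=0.0):
    """P(Y = 1 | X_1..X_k = x) under a noisy-OR model."""
    p_all_fail = 1.0 - leak
    for lam_i, x_i in zip(lam, x):
        if x_i:                        # only active causes get a chance...
            p_all_fail *= 1.0 - lam_i  # ...and each fails independently
    return 1.0 - p_all_fail

# With no active causes and no leak, Y is off with certainty:
print(noisy_or([0.9, 0.5, 0.3], [0, 0, 0]))  # -> 0.0
# With the first two causes on, Y fails only if both transmitters fail:
print(noisy_or([0.9, 0.5, 0.3], [1, 1, 0]))  # -> 1 - 0.1 * 0.5 = 0.95
```

Note that Y is false exactly when every active transmitter independently fails, which is what the product over the active causes computes.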

And you can generalize this to a much broader notion of independence of causal influence. It's called independence of causal influence because it assumes that you have a bunch of causes for a variable, and each of them acts independently to affect the truth of that variable. So there are no interactions between the different causes: they each have their own separate mechanism, and ultimately it's all aggregated together in a single variable Z, from which the truth of Y is then determined from the aggregate effect of all of the Zi's of the different causes.
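That general structure — independent per-cause mechanisms whose outputs are combined by a single deterministic aggregator — can be sketched as follows. All names here are illustrative, and the choice of aggregator is what distinguishes the members of this family.

```python
import random

# Independence of causal influence, sketched: each cause X_i passes
# through its own independent mechanism to produce Z_i, and a
# deterministic aggregator (OR, AND, MAX, ...) combines the Z_i.

def sample_ici(x, mechanisms, aggregate, rng=None):
    rng = rng or random.Random(0)
    # Each mechanism acts on its own cause, independently of the others.
    z = [m(x_i, rng) for m, x_i in zip(mechanisms, x)]
    return aggregate(z)

# Noisy transmitter: an active cause fires with probability p.
def transmitter(p):
    return lambda x_i, rng: int(x_i and rng.random() < p)

mechs = [transmitter(0.9), transmitter(0.5)]
y_or  = sample_ici([1, 1], mechs, aggregate=max)  # noisy OR: any Z_i on
y_and = sample_ici([1, 1], mechs, aggregate=min)  # noisy AND: all Z_i on
```

Over binary Zi's, `max` behaves as an OR and `min` as an AND, so both models fall out of the same template by swapping the aggregation function.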

So, one example of this is the noisy OR that we've already seen, but it easily generalizes to a broad range of other cases. There are noisy ANDs, where the aggregation function is an AND. There are noisy MAXes, which apply in the nonbinary case, when causes might not just be turned on or off but rather have different extents of being turned on, and Z is then the maximal extent of the independent effect of each cause. And so on; there's a large range of different models, all of which fit into this family. The noisy OR is probably the one that's most commonly used, but the others have also been used in other settings. One model that might not immediately be seen to fit into this framework, but actually does, is the sigmoid CPD. So what's a sigmoid CPD?

A sigmoid CPD says that each Xi induces a continuous contribution Zi, which is just the value Wi times Xi, where Wi is a weight that parameterizes this edge. It tells us, sort of, how much force Xi is going to exert on making Y true. If Wi is zero, it tells us that Xi exerts no influence whatsoever. If Wi is positive, Xi is going to make Y more likely to be true, and if Wi is negative, it's going to make Y less likely to be true. All of these influences are aggregated together in the expression for the variable Z, which effectively adds up all of the different influences, plus an additional bias term, W0.

And now we need to turn this into the probability of the variable Y, which is the variable that we care about. In order to do that, we're going to pass this continuous quantity Z, which is a real number between negative infinity and infinity, through a sigmoid function. The sigmoid function is defined as follows; it's a function that some of you have seen before in the context of machine learning, for example. The sigmoid takes the continuous value Z, exponentiates it, and then divides by one plus that exponential of Z.
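This definition, sigmoid(Z) = e^Z / (1 + e^Z), is short enough to write down directly; the values below are just sample points for illustration.

```python
import math

# The sigmoid (logistic) function from the lecture:
# sigmoid(z) = e^z / (1 + e^z), always strictly between 0 and 1.
def sigmoid(z):
    return math.exp(z) / (1.0 + math.exp(z))

print(sigmoid(0.0))    # -> 0.5: no net push either way
print(sigmoid(-10.0))  # close to 0: strong negative influence
print(sigmoid(10.0))   # close to 1: strong positive influence
```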

And since e^Z is a positive number, this gives us a number that is always in the interval (0, 1). And if we look at what this function looks like, it looks like this.

So, this is the sigmoid function. The x axis here is the value Z, and the y axis is the value of the sigmoid. You can see that as Z gets very negative, the probability goes to zero; as Z gets very high, the probability gets close to one; and there's an interval in the middle where intermediate values are taken. So this is a kind of squashing function that squashes the values at both ends. Let's look at the behavior of the sigmoid CPD as a function of its parameters.

So here is a case where all of the Xi's have the same parameter W. One axis shows the value of this parameter W, and the other shows the number of Xi's that are true. Let's look at the second axis first: the more parents that are true, the more parents that are on, the higher the probability of Y being true. And this holds for any value of W, because these are all positive influences; the more parents are true, the more things are pushing Y to take the value true. The other axis is the weight, and we can see that for low weights you need an awful lot of X's to get Y to be true, but as W increases, Y becomes true with far fewer positive influences. The graph on the right is what we get when we basically just increase the amplitude of the whole system: we multiply both W and W0 by a factor of ten. What happens is that the exponent gets pushed to extreme values much more quickly; Z effectively gets multiplied by a factor of ten, and that means that the transition becomes considerably sharper.

That gives us a little bit of intuition for how the sigmoid function and the sigmoid CPD behave. So what are some examples of an application of this? I showed this network in an earlier part of this course: it's the CPCS network, and it was developed here at Stanford Medical School for diagnosis of internal diseases. Up here we have things that represent predisposing factors.

11:28

And there's actually a fairly eclectic range here. For example, one predisposing factor is intimate contact with small rodents, because that's a contributing factor for hantavirus. And so there's a whole range of predisposing factors.

Down here in the middle we have diseases, and down at the bottom we have symptoms and test results. Now, as I previously mentioned, there are approximately 500 variables in this network, and they take on average about four values each. So the total number of entries in a joint distribution over this space would be approximately four to the 500 different parameters, which is clearly an intractable number. If we take this distribution represented with the network shown in this diagram, we get considerable sparsification, and the factorized form has approximately 134 million parameters, which is still much too many to have a human estimate. By using, as in this case, a noisy MAX CPD, they brought the number of parameters down to about 8,000 total parameters for this network, which is a much more tractable number to deal with.
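The arithmetic behind those three numbers is worth seeing once. The figures below are the rounded counts quoted in the lecture, not exact counts for CPCS:

```python
# Parameter counts as quoted: ~500 variables, ~4 values each.
full_joint = 4 ** 500 - 1     # entries in an explicit joint table
factorized = 134_000_000      # quoted for the sparsified, factorized form
noisy_max  = 8_000            # quoted after switching to noisy-MAX CPDs

print(len(str(full_joint)))       # over 300 digits: hopelessly intractable
print(factorized // noisy_max)    # the further reduction from noisy MAX
```

So the factorization alone takes us from a number with hundreds of digits down to about 10^8, and the noisy-MAX structure buys another factor of more than ten thousand on top of that.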