0:00

So, now we're getting into Bayesian networks. And we're finally going to start talking

about the actual representations that are going to be the bread and butter of what

we're going to describe in this class. And, so we're going to start by defining

the basic semantics of a Bayesian network and how it's constructed from a set of

factors. So let's start by looking at a running

example that will accompany us throughout a large part of at least the first part of

this course and this is what we call the student example.

So in the student example, we have a student who's taking a class for a

grade, and we're going to use the first letter of the word to denote the name

of the random variable just like we did in previous examples.

So, here, the random variable is going to be G.

Now, the grade of the student obviously depends on how difficult the course that

he or she is taking and the intelligence of the student.

So that gives us in addition to G, we also have D and I. And we're going to

add a couple of extra random variables just to make things a little bit more

interesting. So we're going to assume that the student has taken the SAT,

so he may or may not have scored well on the SAT,

so that's another random variable, S. And then finally, we also have the

recommendation letter, L, that the

student gets from the instructor of the class.

Okay? And we're going to grossly oversimplify this problem by basically

binarizing everything except for grades. So everything has only two values

except for grade, which has three, and this is only so I can write things compactly.

This is not a limitation of the framework, it's just so that the

probability distributions don't become unmanageable.

Okay. So now, let's think about how we can construct the dependencies in

this probability distribution. Okay.

So, let's start with the random variable grade.

I'm going to put G in the middle and ask what the grade of the student

depends on. And it seems, just, you know, from a

completely intuitive perspective, it seems clear that the grade of the student

depends on the difficulty of the course and on the intelligence of the student.

And so we already have a little baby Bayesian network with three random

variables. Let's now take the other random variables

and introduce them into the mix. So, for example, the SAT score of the

student doesn't seem to depend on the difficulty of the course or on the grade

that the student gets in the course. The only thing it's likely to depend on

in the context of this model is the intelligence of the student.

And finally, caricaturing the way in which instructors write recommendation

letters. We're going to assume that the quality of

the letter depends only on the student's grade,

but the professor's teaching, you know, 600 students or maybe 100,000 online

students. And so, the only thing that one can say about the student is by looking

at their actual grade record, and so, regardless of anything else, the

quality of the letter depends only on the grade.

Now, this is a model of the dependencies, it's only one model that one can

construct for these dependencies. So, for example, I could easily imagine

other models, for instance, ones that have students who

are brighter taking harder courses in which case, there might be potentially an

edge between I and D. But we're not going to use that model,

so let's erase that because we're going to stick with a simpler model for the

time being. But, this is only to highlight the fact

that a model is not set in stone, it's a representation of how we believe the

world works. So, here is the model drawn out a little

bit more nicely than the picture before.

And now let's think about what we need to do in order to turn this into a

representation of a probability distribution, because right now, all it is is a bunch

of nodes stuck together with edges, and so how do we actually get this

to represent the probability distribution?

And the way in which we're going to do that is we're going to annotate each of the

nodes in the network with what's called a CPD.

So, we previously defined a CPD. CPD, just as a reminder,

is the abbreviation we're using here for a conditional probability distribution.

And each of these is a CPD,

so we have five nodes, we have five CPDs. Now, if you look at some of these CPDs,

they're kind of degenerate, so for example, the difficulty CPD isn't

actually conditioned on anything. It's just an unconditional probability

distribution that tells us, for example, that courses are only 40%

likely to be difficult and 60% likely to be easy.

Here is a similar unconditional probability distribution for intelligence.

Now this gets more interesting when you look at the actual conditional

probability distributions. So here, for example, is a CPD that we've

seen before; this is the CPD over the grades A, B, and C.

So, here is the conditional probability distribution that we've already seen

before for the probability of grade given intelligence and difficulty, and we've

already discussed how each of these rows necessarily sums to one because it's a

probability distribution over the variable grade. And we have two other

CPDs here, in this case, the probability of SAT

given intelligence and the probability of letter given grade.

So, just to write this out completely, we have P of D,

P of I, P of G given I, D, P of L given G, and P of S given I.

And that now is a fully parameterized Bayesian network and what we'll show next

is how this Bayesian network produces a joint probability distribution over these

five variables. So, here are my CPDs and what we're going

to define now is the chain rule for Bayesian networks and that chain rule

basically takes all these little CPDs and

multiplies them together, like that.

Now, before we think of what that means, let us first note that this is actually a

factor product in exactly the same way that we just defined.

So here, we have five factors, they have overlapping scopes and what we

end up with is a factor product that gives us a big, big factor whose scope is

five variables. So what does that translate into when we

apply the chain rule for Bayesian networks in the context of the particular

example? So let's look at this particular

assignment and remember there's going to be a bunch of these different assignments

and I'm just going to compute the probability of this one.

So the probability of d0, i1, g3, s1, and l1, well, so the first thing we need is

the probability of d0 and the probability of d0 is 0.6.

7:18

The next thing from the next factor is the probability of i1,

which is 0.3. What about the next one?

The next one now is a conditional factor and for that, we need the probability of

g3, because we have g3 here, so we want this column over here. And which row

do we want? We want the row that corresponds to d0 and i1, which

is this row, so 0.02.

Continuing on, we now have g3, so

we want this row, and we want the probability of l1, so we want this entry,

0.01. And finally, here, we have the

probability of s1 given i1, so that would be this one over here, 0.8.

And in this exact same fashion, we're going to end up defining all of the

entries in this joint probability distribution.
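As a quick check of that arithmetic, here is a minimal Python sketch that multiplies the same five CPD entries quoted in the walk-through:

```python
# Chain rule for Bayesian networks on one assignment:
# P(d0, i1, g3, s1, l1) = P(d0) * P(i1) * P(g3|i1,d0) * P(l1|g3) * P(s1|i1)
p = (0.6      # P(d0): the course is easy
     * 0.3    # P(i1): the student is intelligent
     * 0.02   # P(g3 | i1, d0): a C despite being smart in an easy course
     * 0.01   # P(l1 | g3): a strong letter despite the C
     * 0.8)   # P(s1 | i1): a good SAT score
print(p)      # about 2.88e-05
```

Each factor is one entry from one CPD; there are 2 × 2 × 3 × 2 × 2 = 48 such assignments in total, and computing each the same way fills in the whole joint distribution.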

Great. So, what does that give us as a

definition? A Bayesian network is a directed acyclic

graph. Acyclic means that it has no cycles, that is,

you can't traverse the edges and get back to where you started.

This is typically abbreviated as a DAG, and we're going to use the letter G to

denote directed acyclic graphs. And the nodes of this directed acyclic

graph represent the random variables that we're interested in reasoning about, the

X1 up to Xn. And for each node in the graph, Xi, we

have a CPD that denotes the dependence of Xi on its parents in the graph G. Okay?

So, this is a set of variables, so this would be something like the probability of

G given I and D. So,

the Xi would be G, and the parents of Xi would be I and D.

And the BN, this Bayesian network, represents a joint probability

distribution via the chain rule for Bayesian networks,

which is the rule that's written over here.
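As an illustration of that definition (a hypothetical minimal representation of my own, not code from the course), the chain rule can be computed as a product over the nodes, where each node contributes one CPD entry selected by the values of its parents. Here it is on a tiny made-up two-node network X → Y:

```python
# Minimal BN: parent lists plus CPDs mapping (parent values) -> distribution.
# The network and its numbers are made up for illustration.
parents = {"X": (), "Y": ("X",)}
cpds = {
    "X": {(): {"x0": 0.4, "x1": 0.6}},
    "Y": {("x0",): {"y0": 0.9, "y1": 0.1},
          ("x1",): {"y0": 0.2, "y1": 0.8}},
}

def joint_prob(assignment):
    """Chain rule for Bayesian networks: P(X1..Xn) = prod_i P(Xi | Pa(Xi))."""
    p = 1.0
    for var in parents:
        parent_vals = tuple(assignment[u] for u in parents[var])
        p *= cpds[var][parent_vals][assignment[var]]
    return p

# P(x1, y0) = P(x1) * P(y0 | x1) = 0.6 * 0.2
print(joint_prob({"X": "x1", "Y": "y0"}))
```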

And this is just a generalization of the rule that we wrote on the previous slide,

where we multiplied one CPD for each variable.

Here, again, we're multiplying, over all the variables

i, the CPD for each of the Xi's, and this is once again a factor

product. So, that's great, and I just defined the

joint probability distribution, but who's to say that that joint

probability distribution is even a legal probability distribution?

So what do you have to show in order to demonstrate that something is a joint

probability distribution? The first thing you have to show is that it's

greater than or equal to zero, because probability distributions better be

non-negative. And so, in order to do that, we need to

show that for this distribution P. And this, as it happens,

is quite trivial, because P is a product of factors.

10:36

And if you multiply a bunch of non-negative factors, you get a

non-negative factor. That's fairly straightforward.

Next one's a little bit trickier. The other thing that we have to show in

order to demonstrate that something is a legal distribution is to prove that it

sums to one. So how do we prove

that this probability distribution sums to one?

So, let's actually work through this and one of the reasons for working through

this is not just to convince you that I'm not lying to you, but more importantly,

the techniques that we're going to use for this simple argument are going to

accompany us throughout much of the first part of the course.

And so, it's worth going through this in a little bit of detail.

But I'm going to show this in the context of the example distribution that

we just saw, rather than in general, where we'd just end up cluttered with notation; but the idea is

still exactly the same. So,

let's sum up over all possible assignments D, I, G, S, L of this

probability, of this, hopefully, probability distribution that I just

defined, the P of D, I, G, S, and L. Okay?

So we are going to break this up using the chain rule for Bayesian networks,

because that's how we defined this distribution P.

Okay. So here is the magic trick that we use

whenever we deal with distributions like that.

It's to notice that when we have the summation over several random variables

here, and inside the summations, we have a bunch of factors,

each of which only involves a small subset of

the variables. And when you have a situation like that, you can push

in the summations. So specifically, we're going to start out

by pushing in the summation over L. And which factors involve L? Only the last

one. So, we can move in the summation over L

to this position ending up with the following expression.

So notice that now the summation is over here,

that's the first trick. The second trick is the definition of

CPDs and we observe here that the summation over L is summing up over all

possible mutually exclusive and exhaustive values of the variable L in this

conditional probability distribution P of L given G.

And that means, basically, we're summing up over a row of

the CPD and that means that the sum has to be one.

And so this term effectively cancels out, which gives us this,

and now we can do exactly the same thing. We can take S and move it over here, and

that gives us that, and this, too, is a sum over a row of a CPD, so this, too,

cancels and can be written as one. We could do the same with G, and again,

this has to be equal to one and so we can cancel this and so on and so forth.

And so, by the end of this, you're going to

have cancelled all of the variables in the summation, and what remains is one.
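That argument can also be verified by brute force. The sketch below enumerates all 48 assignments and checks that the chain-rule products sum to one; it uses the CPD values the transcript quotes, plus assumed standard values for the entries it doesn't quote.

```python
from itertools import product

# Student-network CPDs; entries not quoted in the lecture are assumed.
P_D = {"d0": 0.6, "d1": 0.4}
P_I = {"i0": 0.7, "i1": 0.3}
P_G = {("i0", "d0"): {"g1": 0.30, "g2": 0.40, "g3": 0.30},
       ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.70},
       ("i1", "d0"): {"g1": 0.90, "g2": 0.08, "g3": 0.02},
       ("i1", "d1"): {"g1": 0.50, "g2": 0.30, "g3": 0.20}}
P_S = {"i0": {"s0": 0.95, "s1": 0.05}, "i1": {"s0": 0.20, "s1": 0.80}}
P_L = {"g1": {"l0": 0.10, "l1": 0.90},
       "g2": {"l0": 0.40, "l1": 0.60},
       "g3": {"l0": 0.99, "l1": 0.01}}

# Sum the chain-rule product P(D)P(I)P(G|I,D)P(S|I)P(L|G) over all assignments.
total = sum(P_D[d] * P_I[i] * P_G[i, d][g] * P_S[i][s] * P_L[g][l]
            for d, i, g, s, l in product(P_D, P_I, ("g1", "g2", "g3"),
                                         ("s0", "s1"), ("l0", "l1")))
print(round(total, 10))  # 1.0
```

The enumeration works for exactly the reason given in the proof: summing out each variable in turn sums a CPD row, which is one.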

Now, as I said, this is a very simple proof, but these tools of pushing in

summations and removing variables that are not relevant are going to turn out to

be very important. Okay.

So now, let's just define a little bit of terminology that will, again, accompany

us later on. We're going to say that a distribution P

factorizes over G, meaning we can represent it using the graph G, if we can encode it

using the chain rule for Bayesian networks.

So the distribution factorizes over a graph G if I can represent it in

this way as a product of these conditional probabilities.
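Written out symbolically (a restatement of the rule just described, not a new formula), the chain rule that this factorization refers to is:

```latex
P(X_1, \ldots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}_{\mathcal{G}}(X_i)\bigr)
```

For the student network, this reads P(D, I, G, S, L) = P(D) P(I) P(G | I, D) P(S | I) P(L | G).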

14:40

So let me just conclude this with a little example which is arguably the

very first example of Bayesian networks ever.

It was an example of Bayesian networks before anybody even called them Bayesian

networks. And it was actually defined by

statistical geneticists who were trying to model the notion of genetic

inheritance in a population. And so, here is an example of genetic

inheritance of blood types, so let me just give a little bit of background on

the genetics of this. So, in this very simple, oversimplified example, a person's

genotype is defined by two values, because we have two copies of each

chromosome in this case.

And so, for each chromosome, we have three possible values:

either the A value, the B value, or the O value.

These are familiar to many of you from blood types,

which is what we're modelling here. And so the total number of genotypes that

we have is listed over here, we have the AA genotype,

AB, AO, BO, BB, and OO, so we have a total of six distinct

genotypes. However, we have fewer phenotypes,

because it turns out that AA and AO manifest exactly the same way, as blood

type A, and BO and BB manifest as blood type B, and so we have only four

phenotypes. So how do we model genetic inheritance in

a Bayesian network? Here is a little Bayesian network for

genetic inheritance. We can see that each individual, say Marge,

has a genotype variable, which takes on one of the six values that

we saw before. And the phenotype, in this case the

blood type, depends only on the person's genotype.

We also see that a person's genotype depends on the genotype of her parents,

because after all, she inherits one chromosome from her mother and one

chromosome from her father. And so,

a person's individual genotype just depends on those two variables.

And so this is a very nice and very compelling Bayesian network, because

everything just fits in beautifully, in terms of

things really depending only on the variables that we stipulate in

the model that they depend on.
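The two kinds of CPD in this network can be sketched concretely. This is a standard-genetics illustration rather than code from the lecture, and the function names are my own:

```python
from itertools import product

def phenotype(genotype):
    """Blood type from a genotype: A and B are dominant, O is recessive."""
    if genotype == "AB":
        return "AB"
    if "O" in genotype and genotype != "OO":
        return genotype.replace("O", "")   # AO -> A, BO -> B
    return genotype[0]                     # AA -> A, BB -> B, OO -> O

def child_genotype_dist(mother, father):
    """P(child genotype | parents): one allele drawn uniformly from each parent."""
    dist = {}
    for a_m, a_f in product(mother, father):
        g = "".join(sorted(a_m + a_f))     # canonical unordered allele pair
        dist[g] = dist.get(g, 0.0) + 0.25
    return dist

print(phenotype("AO"))                   # A
print(child_genotype_dist("AO", "BO"))   # each of AB, AO, BO, OO with prob 0.25
```

Note that the phenotype CPD here happens to be deterministic, while the inheritance CPD is genuinely probabilistic; together they have exactly the structure of the network described above.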