0:02

So as we said at the very beginning, there are two main families of graphical

models. There's those that are based on directed

graphs, directed acyclic graphs and those that are based on undirected graphs.

The undirected graphical models are typically called Markov networks; they're

also called Markov random fields. We're going to start by talking about the

simplest subclass of those which is pairwise Markov networks and then we're

going to generalize it. So let's look again at a toy example just

to illustrate what's going on. This is an example of four people who are

studying together in study pairs. And notice that Alice and Charles don't

get along. And, you know, Bob and Debbie had a bad

breakup so they don't talk to each other either.

And so really, we only have the study pairs that are marked by the edges on

this diagram. The random

variables here indicate whether the students have a particular misconception,

because the material was a little confusing. So each random variable says:

does the student have this particular misconception or not? And

the intuition here is that if two students study together, then they kind of

influence each other. So, for example, if Alice and Bob study

together, then this edge indicates that if one of them has the

misconception, the other one is likely to have the misconception.

Now, this doesn't fit neatly into the purview of a directed graph, because the

influence flows in both directions, and so you can't really point an arrow from

Alice to Bob or from Bob to Alice and so we're going to use an undirected graph to

represent this. That's great, but how do you parametrize

an undirected graph? Because you no longer have the notion of a conditional

probability distribution: there's no variable that's conditioned and one that

you condition on. So how do we do this?

So we're going to use the general notion of a factor, which we

defined previously. And notice that this really is a general

factor, in that the numbers are not even within the range [0, 1].

Now what do these factors mean? These factors have many names.

They're called affinity functions or compatibility functions.

2:24

So what these numbers mean is the local happiness of the variables A and B

taking a particular joint assignment. So here we can see

that the happiest assignment, as far as A and B are concerned in isolation of

everything else, is A0, B0. This is the case where neither

student has the misconception; that's the happy assignment.

We see that the second happiest assignment is A1, B1, where again the

students agree, and in this case they both have the misconception.

And finally, the other two, in the middle, are the least happy of all.

Now, this is a local happiness, and we have similar notions of happiness for the

other pairs in the graph. So in this case, we see not only that

there is a strong sentiment in favor of agreement, but that it's much stronger than in the

A, B case. So B and C really like to agree with each

other. You know, it's very difficult for

them to have opposing opinions, okay?

On the other hand Charles and Debbie like to argue with each other all the time.

And so, if one of them says that it's going to rain today, the other one is

going to say that it's sunny today. And so really, you can see that the

preferred assignments in their local factor are the ones where they disagree

with each other. Okay?

And again, A and D like to agree. So this is sort of, you know, describing

the overall state by a bunch of little pieces. And to put these pieces

together to define a joint probability distribution, we are going to use the

notion of a product of factors: we're going to take

all these factors and multiply them together.

That's great. Except that this is in no way,

shape, or form, a probability distribution,

because its numbers aren't even in the interval [0, 1], which is why you'll notice

there is a little tilde on top of the P. This tilde denotes an unnormalized measure,

4:38

okay? So how do we turn an unnormalized measure

into a probability distribution? Well, we normalize it.

Actually, sorry, before that: here is the unnormalized measure, so that you can see

it here; this is just to sort of highlight the point.

How do we turn this unnormalized measure into a probability distribution?

We normalize it, and that's the normalization here.

It has a name. It's called the partition function, for

historical reasons that come from its origins in statistical physics, and

I'm not even going to describe why it's called that, but that's what it's called.

But you can think of it simply as the normalizing constant that is going to

make all of these sum to one. So we're going to get it by simply

summing up all these entries and it's going to give us the value z.

And if we divide all of these entries by z, we get a normalized probability

distribution and that is the probability distribution as defined by this graph.
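To make this construction concrete, here is a minimal sketch in Python. The factor values are an assumption based on the ones shown on the slides for this example (30, 5, 1, 10 for phi1, and so on); any positive "local happiness" values would work the same way.

```python
from itertools import product

# Factor tables for the study-pair example; keys are (first value, second value).
# These numbers are assumed from the slides; they are not special.
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}    # phi1(A, B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}  # phi2(B, C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}  # phi3(C, D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}  # phi4(D, A)

# Unnormalized measure P~(a, b, c, d): the product of all four factors.
p_tilde = {}
for a, b, c, d in product([0, 1], repeat=4):
    p_tilde[(a, b, c, d)] = phi1[(a, b)] * phi2[(b, c)] * phi3[(c, d)] * phi4[(d, a)]

# Partition function Z: the sum of all entries of the unnormalized measure.
Z = sum(p_tilde.values())

# Normalized joint distribution: divide every entry by Z.
p = {assignment: value / Z for assignment, value in p_tilde.items()}

print(Z)                # 7201840 for these factor values
print(sum(p.values()))  # 1.0 after normalization
```

The normalization step is the same no matter how many factors there are: multiply, sum, divide.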

5:35

Now let's think about what these factors mean, and let's think about this

factor phi one of A, B, which is this local happiness between A and B.

And let's think about how it relates to the probability distribution.

So we might think that this is the marginal probability of A and B in the

joint distribution, or maybe it's the conditional distribution of A given B, or

maybe B given A, or maybe it's the joint probability of A and B given C or D.

The answer is none of the above. Let's go back and look at what this

actually means in this particular context. So here we have the set of factors that

we used to construct this distribution. And here, trust me, is the

marginal probability of A and B as defined by the set of factors, phi.

I'm going to use a little phi subscript here, to denote the fact that it was

derived from the set of factors phi.

7:04

Let's compare this distribution to the factor phi one.

We can see that this does not respect the fact that A and B like to

agree with each other. Here, A and B like to agree with each

other by a lot. I mean, remember, this is

three times higher than the next highest value.

Here, that probability is only 0.13, and the other high-value assignment on this

side, the one that had 10, has only 0.04. What is the single highest assignment

here? It's this one.

7:46

Let's think about why that is. This probability distribution is

constructed by multiplying all four of these factors, and if you think about

what's going on here, you see that B really, really likes to agree with C, so

these guys are really closely tied together.

And A

and D similarly like to agree. So these are really, really closely tied

together. These guys, C and D, strongly like to

disagree. They like to have opposite values.

Now, all three of these factors are actually stronger; that is, the

differences between the assignments are bigger than in phi one.

So where are you going to break the cycle?

You can't have D agreeing with A, A agreeing with B, B agreeing with C and C

disagreeing with D. It doesn't work.

And so somewhere this cycle, this loop, has to be broken, and the place

where it gets broken is A and B, because it's the weaker factor.

So the A and B probability is actually some kind of complicated aggregate of

these different factors that are used to compose the Markov network.
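We can check this numerically. Here is a sketch, again assuming the factor values from the slides: we compute the marginal over A and B by summing the normalized joint over C and D, and compare it to phi1.

```python
from itertools import product

# Factor values assumed from the slides for this example.
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}    # phi1(A, B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}  # phi2(B, C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}  # phi3(C, D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}  # phi4(D, A)

# Unnormalized joint and partition function.
p_tilde = {(a, b, c, d): phi1[(a, b)] * phi2[(b, c)] * phi3[(c, d)] * phi4[(d, a)]
           for a, b, c, d in product([0, 1], repeat=4)}
Z = sum(p_tilde.values())

# Marginal P(A, B): sum the normalized joint over C and D.
p_ab = {(a, b): sum(p_tilde[(a, b, c, d)] for c, d in product([0, 1], repeat=2)) / Z
        for a, b in product([0, 1], repeat=2)}

# phi1's happiest assignment is (0, 0), but the marginal's most probable
# assignment is (0, 1): the stronger B-C, C-D, D-A factors break the cycle at A-B.
print(max(phi1, key=phi1.get))  # (0, 0)
print(max(p_ab, key=p_ab.get))  # (0, 1)
```

So the marginal really is a complicated aggregate of all four factors, not a rescaled copy of phi1.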

And this is actually an important point, because it is going to come back and

haunt us in later parts of the course. There isn't a natural mapping between the

probability distribution and factors that are used to compose it.

You can't just look at the probability distribution and say, aha,

this piece of it is what [INAUDIBLE].

This is in direct contrast to a Bayesian network, where the

factors were all conditional probabilities, and you could just look at the

distribution and compute them. Here you can't do that, and that often turns out to

affect things like how we can learn these factors from data, because you

can't just extract them directly from the probability distribution.

So, with that intuition, we can now go ahead and

define a pairwise Markov network. And I'm defining it explicitly, because

pairwise Markov networks are a sufficiently commonly used subclass of

general Markov networks that it's worth giving them their

own place. So a pairwise Markov network

is an undirected graph whose nodes are the random variables

X1 up to Xn. And we have edges, Xi connecting to Xj,

and each one of them is associated with a factor,

also known as a potential, phi IJ of Xi, Xj.

Okay? That's a pairwise Markov network.

And here's an example of a slightly

larger Markov network. This is a Markov network that is in the

form of a grid. This is the kind of network

that's used, for example, when we're doing various operations on images, because

then the variables correspond to pixels, for example.

And this is the Markov network that corresponds to image segmentation

when we're using superpixels, in which case it's no longer a regular grid.
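The definition generalizes directly: given any undirected graph and one factor per edge, the joint distribution is the normalized product of the edge factors. Here is a sketch of that general construction, demonstrated on a tiny 2x2 grid over binary variables; the "agreement" factor values below are invented purely for illustration.

```python
from itertools import product

def pairwise_mn_joint(n_vars, edge_factors):
    """Joint distribution of a pairwise Markov network over binary variables.

    edge_factors maps an edge (i, j) to a table {(xi, xj): positive value}.
    Returns a dict from full assignments to normalized probabilities.
    """
    p_tilde = {}
    for x in product([0, 1], repeat=n_vars):
        value = 1.0
        for (i, j), table in edge_factors.items():
            value *= table[(x[i], x[j])]  # multiply in every edge potential
        p_tilde[x] = value
    Z = sum(p_tilde.values())  # partition function
    return {x: v / Z for x, v in p_tilde.items()}

# A 2x2 grid: variables 0-1 on the top row, 2-3 on the bottom row.
# These "like to agree" values are made up for illustration only.
agree = {(0, 0): 10, (0, 1): 1, (1, 0): 1, (1, 1): 10}
grid_edges = {(0, 1): agree, (2, 3): agree,   # horizontal edges
              (0, 2): agree, (1, 3): agree}   # vertical edges

p = pairwise_mn_joint(4, grid_edges)
print(sum(p.values()))    # 1.0: it is a proper distribution
print(max(p, key=p.get))  # (0, 0, 0, 0), tied with (1, 1, 1, 1): all-agree wins
```

A grid over image pixels is exactly this, just with many more variables and one such factor per neighboring pixel pair.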