0:00

We've spent the entire course talking about probabilistic graphical models,

we've talked about variants of probabilistic graphical models on the

representation side, different inference algorithms that can be used to answer

various types of questions about probabilistic graphical models, and

algorithms that can be used to learn probabilistic graphical models from data.

Now let's take a step back and think about probabilistic models and look at

what they might do for us. So, first,

why probabilistic graphical models? So probabilistic graphical models are in

some sense the synthesis or the marriage of foundational ideas from statistics and

important techniques from computer science.

On the statistics side, we get sound probabilistic foundations that tell us

about probability distributions and how they can be used to think about

uncertainty and various manipulations on the probability distribution such as

conditioning and techniques such as value of information or decision making that

can be used to build on top of probability distributions to make

decisions. From the computer science side, we get

the idea of taking this probability distribution over very high dimensional

space and using ideas such as graphs as a data structure, and exploiting algorithms that allow us to manipulate these high-dimensional objects very efficiently. And it is these ideas from computer science that allowed us to take these sound

probabilistic foundations and make them applicable to the kinds of complex real

world problems that we're now facing in many scenarios.

Now, a key component of the ability to take a high dimensional probability

distribution and consider it as an object is the fact that we have

now obtained a declarative representation of our knowledge about the world.

So that declarative representation, which is a model, a probabilistic graphical

model, is a stand-alone unit that we can consider separately from any algorithm that

we then use to manipulate it. Now that's important because it allows us

to refine the model separately from refining the algorithm that we're

using. So we can develop the model, and consider

whether it's the appropriate one for the application.

And then we can apply one or multiple different algorithms to try and answer

questions regarding that model. It also allows us to separate out the

model construction phase from the inference phase, and the model

construction can involve one or both of elicitation from a domain expert, or learning from data. And we talked extensively about learning methods, and

also to some extent about elicitation methods.

Now, when would we apply probabilistic graphical models?

So we would apply probabilistic graphical models when we have noisy data and

uncertainty. That's pretty much true in any

application. So that, that's pretty much a given.

They're also particularly useful when we have lots of prior knowledge that we want to embed into our model. It's often easier to do this in the context of probabilistic graphical models than in a variety of other approaches. For example, in many traditional machine learning settings it's actually quite difficult to incorporate prior knowledge in as natural a way as one can do here.

Probabilistic graphical models are particularly useful when we

wish to reason about multiple interrelated variables.

So, if you're trying to predict just a single variable probabilistic graphical

models might not be the ideal solution, and there might be a range of other methods that are equally or more competitive from, say,

traditional machine learning. But when you're trying to reason about

multiple variables simultaneously, then the ability to incorporate the

interrelationships between them is often very powerful, and here,

probabilistic graphical models can really shine.

Another place where probabilistic graphical models bring significant value is when we want to construct richly structured models from modular building blocks. So when we're trying, for example, to do fault diagnosis or medical diagnosis, we often have these little building blocks of

models that can recur across different models, for different diagnostic problems

and this architecture allows us to take these building blocks and construct very

rich models from those very quickly by reusing components.

4:44

So, we've talked about probabilistic graphical models in terms of the set of

design choices: there are design choices on the representation side, on the inference algorithm that we choose to use, and on the learning algorithm that we might adopt, should learning be

required. It turns out that these design choices,

although the declarative representation allows us to separate them and consider

each of them in isolation, and develop each of them separately, they're not actually

totally separate. And it's important to consider the

interactions between them. So on the representation side it's clear

that which representation we choose affects the cost of both inference and

learning. We've talked about differences in the

cost of inference for different types of graphical models, and also we saw that in the learning setting there are very different learnability and complexity issues when you're learning, say, directed or undirected models.

The inference algorithm is also interrelated with everything else.

So the inference algorithm, for example, is used as a subroutine in many of the

learning algorithms that we showed, for example in MRF or CRF learning even in

the complete data case and in any setting when we're learning with incomplete data.

And so, thinking about the complexity and properties of our inference algorithm is

critical in designing good learning algorithms.
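To see concretely why inference sits inside learning even with complete data, here is a minimal made-up example (a three-variable chain MRF with a single shared parameter; the model and data are invented for illustration): the gradient of the log-likelihood is the observed feature count minus the expected feature count, and computing that expectation is exactly an inference call.

```python
import itertools
import math

# Tiny illustrative pairwise MRF over three binary variables in a chain,
# with one shared parameter theta on an "agreement" feature. This is a
# made-up minimal sketch, not a general-purpose learner.
EDGES = [(0, 1), (1, 2)]

def feature_count(assign):
    """f(x) = number of edges whose endpoints agree."""
    return sum(assign[i] == assign[j] for i, j in EDGES)

def expected_count(theta):
    """Inference step: E_theta[f], here by brute-force summation over all states."""
    states = list(itertools.product((0, 1), repeat=3))
    weights = [math.exp(theta * feature_count(s)) for s in states]
    z = sum(weights)
    return sum(w * feature_count(s) for s, w in zip(states, weights)) / z

def log_likelihood_gradient(theta, data):
    """Gradient of the average log-likelihood with respect to theta:
    observed feature count minus expected feature count.
    The second term is why MRF/CRF learning needs inference."""
    observed = sum(feature_count(s) for s in data) / len(data)
    return observed - expected_count(theta)

data = [(0, 0, 0), (1, 1, 1), (0, 0, 1)]
print(log_likelihood_gradient(0.0, data))  # observed 5/3 minus expected 1.0
```

In a realistic model the brute-force sum is intractable, which is precisely why the complexity of the inference algorithm drives the cost of learning.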

And we also know that some inference algorithms are only usable in certain

types of models, and so if we're aiming for a particular inference algorithm, that

affects our choice of representation. And finally the learning algorithm also

imposes significant constraints on the modeling.

Because our ability to identify the correct patterns from the data imposes significant constraints on, for example, the expressive power of our model.

And only certain types of patterns might be identifiable given the amount of data

we have, and the type of data we have, complete or incomplete, for example.

So let's look at one example of some of these trade-offs in the context of image

segmentation. So here one of the first design decisions

that we need to make is whether we're going to try and model this as a Bayes

net, that is, a directed graphical model, or as an MRF or a CRF, where we're trying to predict the segment labels, the superpixel labels S_i, given features X. So we're trying to model P of S given X, rather than a joint distribution. What are some of the tradeoffs that we

might want to consider for this kind of decision?
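Before weighing those tradeoffs, it may help to pin down the object a CRF defines here. A minimal sketch (the two-superpixel model, the potentials, and the numbers are all invented for illustration): P of S given X is obtained by normalizing node and edge scores over the labelings S only, never over the features X.

```python
import itertools
import math

# Hypothetical toy CRF over two adjacent superpixels with binary labels.
def node_score(s, x):
    # Made-up feature-dependent potential: prefer label 1 when x is large.
    return s * x

def edge_score(si, sj):
    # Encourage adjacent superpixels to share a label.
    return 1.5 if si == sj else 0.0

def crf_conditional(x):
    """P(S | X = x) for the 2-node model: normalize over labelings only."""
    labelings = list(itertools.product([0, 1], repeat=2))
    scores = {s: math.exp(node_score(s[0], x[0]) + node_score(s[1], x[1])
                          + edge_score(s[0], s[1]))
              for s in labelings}
    z = sum(scores.values())  # partition function Z(x) depends on x
    return {s: v / z for s, v in scores.items()}

dist = crf_conditional(x=(2.0, -1.0))
print(dist)  # probabilities over the four joint labelings, summing to 1
```

The key point for the discussion that follows: because the normalization is conditioned on x, the features themselves never need a probabilistic model of their own.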

Well, one is the naturalness of the model. So, initially, just thinking about this:

It seems fairly clear that a directed model where we have directed edges from

one superpixel to the other doesn't seem like a particularly natural choice in this context, because which direction would we pick?

And what implications would this have if we used, for example, directed edges from the top right corner of the image to the bottom right corner of the image? It just doesn't seem to make sense.

And so a more natural modeling paradigm here seems to be the MRF or the CRF.

But that still leaves us the question of which of those should we use.

And here for example, one possible design criterion that we ought to consider is

that we might want to use a very rich feature representation.

That is, these features X should be richly structured and capture different gradients of texture and color across the image, for example, different histograms. These rich features are often highly

correlated with each other, and we know that correlated features are hard to model, in the sense of capturing their correlation structure correctly. And so in this case a CRF seems like it might be a better choice, because it avoids the issue of how we model the correlation structure between all these rich features. It allows us to use rich features without falling into the trap of incorrectly modeling the correlations, which might diminish our performance. Now another related question that we

need to consider is the cost of inference in this kind of model, and that comes

back and influences the modeling. So, for example, for certain types of models, as we've discussed, when the structure of the potentials in the MRF or the CRF is associative or regular.

9:49

Then, we know that we can apply very efficient inference algorithms that

utilize min-cuts and other combinatorial graph algorithms that are much, much faster than applying, say, message passing schemes.

So that's all very well, but then it imposes on us the constraint of making sure that our potentials in fact are associative, and that does have implications about the ability of our model to correctly capture the structure in the distribution. So, for example, it is very easy to capture, using associative potentials, the fact that we expect adjacent pixels to have the same class, to have the same label.

That's very easy to capture using this type of potential. But not quite so easy to capture is, for example, the tendency, not a hard constraint but a tendency, that cows are on top of grass, as opposed to on top of water or on top of road or on top of sky. That's a potential that goes between one class and another, and it's very difficult to come up with an associative function that actually captures these kinds of patterns.
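The contrast just described can be made concrete with a small sketch (labels and weights are invented for illustration): an associative, Potts-style potential rewards agreement between any pair of adjacent labels, while a cow-above-grass preference rewards one specific pair of different labels, which the agreement-only form cannot express.

```python
def potts_potential(si, sj, weight=2.0):
    """Associative (Potts) potential: any agreement is rewarded equally.
    Potentials of this form admit the fast min-cut style inference
    discussed above."""
    return weight if si == sj else 0.0

# A hypothetical inter-class preference: 'cow' above 'grass' is likely,
# 'cow' above 'water' is not. This rewards a specific pair of *different*
# labels, so it cannot be written as an agreement-only potential.
def context_potential(label_above, label_below):
    preferences = {("cow", "grass"): 2.0, ("cow", "water"): -2.0}
    return preferences.get((label_above, label_below), 0.0)

print(potts_potential("cow", "cow"))      # agreement rewarded: 2.0
print(potts_potential("cow", "grass"))    # any disagreement: 0.0
print(context_potential("cow", "grass"))  # a specific disagreement rewarded: 2.0
```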

Another thing to consider is training cost. So here, in fact, MRFs or CRFs are more expensive than Bayesian networks. Now, we might still in this case want to take the hit because of the benefits on the modeling side. But we've already seen that, in the case of complete data, training a Bayesian network is considerably more computationally efficient than training an MRF or a CRF, and so it's a place where we're taking a hit for a design decision we made on the representational side. And then finally,

if we have to learn with missing data, that sets up yet another constraint that we want to think about, and one that might not be obvious going in. So say that we have a scenario where the segment labels are not observed. That is, we're trying to learn to segment

12:15

In an unsupervised way. So from unsupervised data.

Well, in this case, the S's are unobserved; they are latent, which means that optimizing P of S given X doesn't even make sense, because S is never observed. And so in this case we simply cannot learn a CRF in this context, because the objective of optimizing the labels S given the features is completely undefined in this setting, and so at this point we have to resign ourselves to learning using a different representational scheme. And so this is another case where we see

the intertwining between different choices that we make and different

constraints that we have, in that, in this case, learning with missing data interacts

with the choice of representation in potentially unexpected ways.

13:19

Now, one important component,

one important benefit, of probabilistic graphical models and their modularity,

is that they allow us to mix and match different ideas within the modeling

framework. And the reason why that's important is

because it gives us a much greater richness of options than considering each individual topic that we learned as a solution in and of itself.

We can often combine those in interesting ways.

So to see one example of that let's go back to our image segmentation that we

just talked about, and think about models that actually mix directed and undirected edges. So here, for example, if we're trying to learn image segmentation from unsupervised data, we might have

undirected edges over the labels S, as an MRF, just as we see in this diagram,

which is the natural directionality or rather the lack of directionality that we

discussed before. But we have directed edges from the S's

to the features X_i. And this is a model that is half undirected: it's undirected at the level of the S's, and directed going from the S's to the X's.
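A minimal sketch of what this half-directed structure buys for learning (the labels, the 1-D feature, and the per-class summary are all invented for illustration): because the edges into each X_i are directed, the conditional P(X_i given S_i) can be estimated per class, completely independently of the undirected potentials over the S's.

```python
from collections import defaultdict

# Hypothetical training pairs (s_i, x_i): a class label per superpixel and a
# 1-D feature. With directed S -> X edges, P(X | S) is fit per class,
# independently of the undirected potentials over the S's.
data = [("grass", 0.9), ("grass", 1.1), ("cow", 3.0), ("cow", 3.2), ("cow", 2.8)]

def fit_class_conditionals(pairs):
    """Fit a (mean, count) summary of P(X | S = s) for each class separately.
    No joint optimization over the whole model is needed for this part."""
    grouped = defaultdict(list)
    for s, x in pairs:
        grouped[s].append(x)
    return {s: (sum(xs) / len(xs), len(xs)) for s, xs in grouped.items()}

params = fit_class_conditionals(data)
print(params["grass"][0])  # mean feature for 'grass', approximately 1.0
print(params["cow"][0])    # mean feature for 'cow', approximately 3.0
```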

Now why is that good? It's good because, first, it gives us a

meaningful optimization objective; that is, we're trying to optimize, using our model parameters and our model structure, the ability to explain the images that we've seen using the segment characteristics or the class label characteristics. So in that respect it's a generative model that, by forcing us to in some sense generate the observed data, can allow us to learn something about the statistics or the patterns in the distribution. But by making this model

directed, as opposed to undirected, which we could have done by doing an MRF over the entire set, which would also have been a generative model, we actually greatly facilitate the learning, because each of these conditionals, P of X_i given S_i, can be learned separately, without trying to optimize over the model as a whole, and so it is considerably more efficient. So this is a case where mix and match

is a better strategy than a unified model that is entirely directed or

entirely undirected. Another type of mix and match can occur

on the inference side. We learned about a variety of different

inference algorithms, and we might be tempted to just take a model and just

throw it into a black box of one of those inference algorithms or another.

But it turns out to be much more beneficial in certain cases, to consider

applying different inference algorithms to different pieces of the model.

So, for example, we might use something like belief propagation or Markov chain

Monte Carlo on some subset of variables, but on certain sets of variables we can

do exact inference which gives rise to much more accurate results over those

subsets and perhaps makes the convergence properties of our algorithms considerably

better as well as the accuracy that we get.

So let's take an example. This is a bipartite MRF that we have seen before, where we have a set of variables A on the one side and B on the other, as we have seen before. And let's assume that there's fairly dense connectivity between the variables, and maybe even full connectivity between the A's and the B's. Now, let's think which algorithm we might

want to use for this kind of model. If there is a very small set of A's or a very small set of B's, then we can use exact inference. But that's not usually the case, and so we might have to resort to approximate inference.

One possibility is to apply belief propagation, but if you look at this

graph, you can see that it has a very large number of fairly tight loops.

And we know that tight loops are a problem for belief propagation, and so we might be

worried about the quality of the convergence and the quality of the

answers that we get. Another approach is to use something like

Markov chain Monte Carlo, such as Gibbs sampling.

But in this case, if we have a lot of A's and a lot of B's, then we're sampling over a very high-dimensional space. And we know that sampling methods in high-dimensional spaces have, again, limitations in terms of the rate of

convergence and the quality of the answers.

So one possibility that we might consider here is a variant of Markov chain Monte Carlo where we sample, for example, only over the A's, say there are fewer A's than B's. So we sample over the A's, and then for each assignment over the A's, so for each assignment A1

18:25

up to AK, which is one of our particles, we maintain a distribution over the B's, in closed form.

So we're effectively doing exact inference over the B's and sampling only

over the A's which is going to considerably improve the rate of

convergence as well as the statistical properties of our estimator.

And this is just one example; we can talk about other examples, such as variants of belief propagation in which we construct larger clusters and run exact inference within each one. There are many, many such examples where mixing different inference algorithms can be tremendously beneficial.
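Here is a minimal sketch of the hybrid sampler just described, on a made-up bipartite model (two binary A variables, one binary B variable, and an invented agreement potential): Gibbs sampling runs over the A's only, while the B side is always handled in closed form.

```python
import itertools
import random

random.seed(0)  # for reproducibility of this sketch

# Hypothetical bipartite MRF: two binary A's, one binary B, with joint
# distribution proportional to phi(a1, b) * phi(a2, b).
def phi(a, b):
    return 2.0 if a == b else 1.0

def score_A(a1, a2):
    """Marginalize B out exactly (Rao-Blackwellization): sum over b."""
    return sum(phi(a1, b) * phi(a2, b) for b in (0, 1))

def gibbs_over_A(steps=2000):
    """Gibbs-sample the A's only; B never has to be sampled."""
    a = [0, 0]
    counts = {assign: 0 for assign in itertools.product((0, 1), repeat=2)}
    for _ in range(steps):
        for i in (0, 1):
            other = a[1 - i]
            w = [score_A(v, other) for v in (0, 1)]  # score_A is symmetric
            a[i] = 0 if random.random() < w[0] / (w[0] + w[1]) else 1
        counts[tuple(a)] += 1
    return counts

def exact_B_given_A(a1, a2):
    """Closed-form P(B | A): exact inference over the B side."""
    w = [phi(a1, b) * phi(a2, b) for b in (0, 1)]
    return [x / sum(w) for x in w]

print(exact_B_given_A(0, 0))  # B agreeing with both A's: [0.8, 0.2]
```

Because B is integrated out rather than sampled, the chain moves in a smaller space, which is the source of the improved convergence and estimator variance mentioned above.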

Mix and match on the learning side is equally useful.

So again, we can apply different learning algorithms to different parts of the

model. So, going back to our image segmentation

example, and now we're assuming we're back in the case where we're doing

learning from fully observed data. So we're going to consider a CRF. It's very common to train a high-quality, easily trainable model, such as a support vector machine or some kind of logistic regression, for P of S given X. And we can train that very efficiently.

We can use sophisticated machine learning tricks like kernel methods or various

other ideas to get really high accuracy node potentials.
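A minimal sketch of that first stage (the data, the 1-D feature, and plain gradient ascent as the trainer are all stand-in assumptions): fit a logistic regression for P of S_i given X_i, whose predictions can then be frozen and used as node potentials while CRF learning fits only the higher-order potentials.

```python
import math

# Hypothetical supervised pairs: a 1-D feature x_i and a binary label s_i.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(pairs, lr=0.5, epochs=500):
    """Fit P(S=1 | x) = sigmoid(w*x + b) by stochastic gradient ascent
    on the log-likelihood. A stand-in for whatever trainer one would use."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, s in pairs:
            p = sigmoid(w * x + b)
            w += lr * (s - p) * x
            b += lr * (s - p)
    return w, b

w, b = train_logreg(data)

# These probabilities would be frozen and plugged in as the CRF's node
# potentials; CRF learning then only fits the pairwise potentials.
def node_potential(s, x):
    p1 = sigmoid(w * x + b)
    return p1 if s == 1 else 1.0 - p1

print(node_potential(1, 2.0) > 0.9)  # confidently class 1 for large x: True
```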

And then we fix those, and use CRF learning for the higher-order, say the pairwise or even higher-order, potentials. And it turns out that when you do that, it's actually less susceptible in many cases to overfitting, and it's much, much faster to train. So by putting together different pieces

of functionality, different ideas, it turns out that we can build some very nice integrated applications. And I just want to end this discussion with one example of a cool application that puts together a whole range of different ideas into a single framework. So this is

a work from Microsoft Research, which has a longstanding tradition in the use of

probabilistic graphical models, and it puts together ideas from speech

recognition, from diagnosis, and from decision making under uncertainty.

So let's watch the video and then talk about it briefly.

21:04

Hi, thanks for coming to the triage kiosk. I am in touch with the best doctors in

the world. Are you here for your child or yourself?

No, for my child. So, how can I help you?

She's been having diarrhea. Oh, no.

Sorry to hear that. Hi there.

What's your name? Manu. And how old are you?

Five. So, five years old, right?

Yeah. Almost six.

Okay. Six years old.

Have you felt nauseous? I don't know.

So what we can see here is the use of value of information to figure out which

questions ought to be asked, that are going to be the most useful for improving

the quality of the diagnosis and the ability to make decisions in this Bayesian network that we see over here on the right.

I'm sorry. Has your child been vomiting at all?

No. Okay.

So let's see, has there been a fever? Slight.

Have you been feeling tired lately? Yes.

Let's see, how about abdominal pains? He's not been complaining about it.

Has your tummy been hurting? Yes.

Oh, I'm really sorry. I hope it will heal better soon. Another

component here is the use of the current model, the current scenario, to guide the speech recognition system, so as to get better-quality results from the speech recognition system. So instead of doing totally unguided speech recognition, one uses the current scenario, as the system is understanding it, to guide the priors on what the person is probably saying. Okay. Do you happen to know if your child

has been losing weight? No.

That's okay. Have you noticed whether your child's

stool looks funny? Is it black or darker than usual? It is darker than usual. Okay, let's see.

Considering everything you've told me, I'd like to schedule an appointment with

the doctor just to be sure. However, not to worry.

I am not that concerned at this point. So to summarize.

Probabilistic graphical models provide us with an integrated, coherent framework

for reasoning and learning in complex uncertain domains.

And people over the years have developed an enormous suite of tools that can be

used for accomplishing different tasks within this framework.

And we can put them together within a single unified framework to solve much

harder problems than any single tool can do alone.

We've seen, through the different examples in this course, that this framework is used in a huge range of applications, and there are many more in which it can be used. And so overall, we can see that there are

a lot of places where one can continue to work in this paradigm,

both in utilizing it for applications that might be of interest and for new ideas that come up, and also in improving the foundational methods for representation, inference, and learning, which could allow us to address yet new problems.