0:00

A very different paradigm for inference in graphical models is the use of what are called sampling, or particle-based, methods.

In these methods, rather than trying to manipulate the exponentially or even infinitely large probability distribution as a whole, we randomly sample instances from that distribution and then use those instances as a sparse representation. We can then use those instances to

estimate various quantities that we care about regarding the statistics of the

overall distribution. Before we show how this is used in the

context of graphical models, let's go ahead and see how this is applied in a simple setting. So first of all, how do we use samples

for estimating quantities that we care about?

So imagine that somehow (and we're going to talk about how that might happen) somebody gives us, or we manage to construct, a data set D which consists of a bunch of samples from a distribution P. And for the moment we're going to assume

that these samples are what's called IID, which stands for independent and identically distributed.

And now, let's imagine that we're trying to compute, or estimate (and this is all going to be approximate), some quantity of the distribution P. And so, for the moment, let's focus on the really simple case where these are binary-valued random variables. So the distribution P that we're sampling from is one where the probability that X = 1 is p.

And so, think of this like tossing a coin.

Now, what is a reasonable estimator for this parameter p that we're trying to estimate, the probability that X comes up heads? There's a fairly intuitive answer here: this estimator, which we're going to call P sub D, is simply obtained by looking at all of the samples that we got and counting the fraction of ones.

2:23

And if you think about it, that's a perfectly reasonable approximation.

Right? Now, more generally, for any distribution P and any function f whose statistics we're trying to estimate, we can approximate the expectation of the function f relative to the distribution P in terms of the average of the value of f on the samples. And this is often called the Empirical

Expectation. So this is how, if we had some way (which we have not yet talked about) of constructing or sampling from a distribution, we might use those samples to approximate various properties of the distribution P, whether it's the probability of an event or the expectation of a function.

And note, by the way, that the expectation of a function subsumes the probability of an event, because you can also take the expectation, for example, of an indicator function, and that corresponds directly to a probability. So, for example, this probability can be viewed as the expectation, relative to P, of the indicator function for X = 1.
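To make this concrete, here is a minimal sketch of these empirical estimates in Python; the coin bias of 0.3 and the function f are made-up stand-ins for illustration, not values from the lecture.

```python
import random

random.seed(0)

# Draw M IID samples from a Bernoulli distribution with P(X = 1) = p.
p_true = 0.3   # arbitrary "true" parameter, for illustration only
M = 10_000
samples = [1 if random.random() < p_true else 0 for _ in range(M)]

# The estimator P sub D: the fraction of ones among the samples.
p_hat = sum(samples) / M

# The empirical expectation of any function f:
#   E_P[f] is approximated by (1/M) * sum over m of f(x[m]).
def f(x):
    return x ** 2  # any function of the sample

f_hat = sum(f(x) for x in samples) / M

# P(X = 1) is the expectation of the indicator function 1{X = 1},
# so the empirical expectation of the indicator recovers p_hat.
ind_hat = sum(1 for x in samples if x == 1) / M

print(p_hat, f_hat, ind_hat)  # all three are close to 0.3 here
```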

9:48

And once again we have an exponential bound on the error, which is written over here. And once again, the number of samples appears in the exponent, which makes us happy.

We have the same type of epsilon squared term, which shows the dependence on the tolerance that's required. But here we have one other term over here, which is the actual value p that we're trying to

estimate. So this gives you a bound on the error, that is, a bound on how far away you are from p, in terms of the tolerance epsilon and the probability with which you get a bad sample set. We can now turn this around and say, for a given tolerance and a given error probability that we want: if we want to guarantee that we're within epsilon, with probability greater than one minus delta, we need this many samples.

And that's fairly straightforward algebra: you simply say that this bound is at most delta, and then just take logs and move things around, and it all comes out to be exactly this. And for the Chernoff bound we have a similar expression, and that gives us exactly the same kind of bound on M as a function of epsilon and delta, and in this case p as well.
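To spell out that algebra, here is a small sketch; it assumes the standard forms of these bounds, the Hoeffding bound P(|P_D - p| > eps) <= 2e^(-2M eps^2) and the Chernoff bound P(P_D outside p(1 +/- eps)) <= 2e^(-M p eps^2 / 3), which match the dependence described here, so the constants below are those of that assumption.

```python
import math

def hoeffding_samples(epsilon, delta):
    # Additive bound: require 2 * exp(-2 * M * eps^2) <= delta.
    # Taking logs and rearranging: M >= ln(2 / delta) / (2 * eps^2).
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))

def chernoff_samples(epsilon, delta, p):
    # Multiplicative bound: require 2 * exp(-M * p * eps^2 / 3) <= delta.
    # Rearranging: M >= 3 * ln(2 / delta) / (p * eps^2).
    # Note the true value p sitting in the denominator.
    return math.ceil(3 * math.log(2 / delta) / (p * epsilon ** 2))

# Within 0.01 of p, with probability at least 0.99:
print(hoeffding_samples(epsilon=0.01, delta=0.01))        # 26,492 samples
# Within 1% of p (relative), for a rare event with p = 0.01:
print(chernoff_samples(epsilon=0.01, delta=0.01, p=0.01)) # ~15.9 million samples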

So this looks great. Right?

You give me an epsilon, which is your error tolerance, and a delta, which is the probability of being wrong that you're willing to accept, and I can say: if you sample this many samples M, then I can give you those probabilistic guarantees. They're not deterministic, but they're pretty solid. Why is this not a perfect solution to our

inference problems? Because each of these has significant

limitations. So let's think about the first, which is our additive bound. And let's imagine that you're going to a doctor and you're asking: what is the probability that I have some horrible disease? Well, that probability, hopefully for you, is still pretty low. So maybe, if you're unlucky, it's, I don't know, 1% or 2%. Well, consider an additive error epsilon on 1%: the epsilon that you need to get something that's meaningfully bounded when the true probability p is 1% needs to be really, really small. You can't do epsilon equals 0.01, because that could move you up from 1% to 2%.

And that's a factor of two increase in your probability.

And that's something that people really care about, the difference between 1% and 2%. So you'd need something that's more like 0.0001, or maybe 0.00001, depending on how confident you wanted to feel. And now this epsilon squared over here is beginning to look pretty scary in terms of the number of samples that are required to get a reasonable bound.

So you might come and say: well, fine, let's use the Chernoff bound, because that gives me errors that are relative to p. So now if p is small, then by all means I can go ahead and use a multiplicative factor of p: I can say p plus or minus 1% of p. Well, unfortunately, there's no free lunch, because if p is small, notice that it appears here in the denominator, and so once again we need a number of samples that could potentially be quite large when we're dealing with small probabilities.

So the main message from this is that sampling-based estimation is a reasonable thing to do when p is not too small. When p is not too small, this works fine; when p begins to get smaller, the tractability of this is more in doubt.

Now that we understand when we might expect this to work, let's think about how

we might apply it in the context of Bayesian Networks.

So here is our little baby network that we've used for the student example.

And what we'd like to do is generate samples from this distribution. So this is, remember, the distribution P(D, I, G, S, L). And we'd like to sample from this high-dimensional distribution.

Not that high, but still. And the way in which this is done is actually very natural when you think about the causal intuitions, or forward flow, of a Bayesian network.

We start out, say, by sampling the difficulty variable.

And the difficulty variable is sampled from this distribution over here, which is 0.6 versus 0.4. And so we use the trick that we showed a

couple slides ago. And say, it comes out to be zero.

I'm going to write down the zero over here.

Now I'm going to sample I: I toss a coin with probability 0.7 versus 0.3, and say it comes out i1. Now I get to sample grade.

And because I previously sampled difficulty and intelligence,

I know exactly what distribution grade needs to be sampled from.

It's the distribution for i1, d0. And so I now sample one of the three values of grade from the distribution in this row that I've picked out in the CPD. And say it comes out g1.
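As a sketch of this forward-sampling process in Python: the 0.6/0.4 and 0.7/0.3 numbers are the ones quoted above, while the grade, SAT, and letter CPDs below are placeholder values, not necessarily those of the actual student example.

```python
import random

def sample_discrete(probs, values):
    # The trick from a couple of slides ago: draw u ~ Uniform[0, 1)
    # and walk along the cumulative distribution until it passes u.
    u, cum = random.random(), 0.0
    for value, prob in zip(values, probs):
        cum += prob
        if u < cum:
            return value
    return values[-1]  # guard against floating-point round-off

# Placeholder CPDs (only P(D) and P(I) are quoted in the lecture).
GRADE_CPD = {
    ("i0", "d0"): [0.30, 0.40, 0.30],
    ("i0", "d1"): [0.05, 0.25, 0.70],
    ("i1", "d0"): [0.90, 0.08, 0.02],
    ("i1", "d1"): [0.50, 0.30, 0.20],
}
SAT_CPD = {"i0": [0.95, 0.05], "i1": [0.20, 0.80]}
LETTER_CPD = {"g1": [0.10, 0.90], "g2": [0.40, 0.60], "g3": [0.99, 0.01]}

def sample_student():
    # Sample in topological order; each variable's CPD row is fully
    # determined by the values already sampled for its parents.
    d = sample_discrete([0.6, 0.4], ["d0", "d1"])
    i = sample_discrete([0.7, 0.3], ["i0", "i1"])
    g = sample_discrete(GRADE_CPD[(i, d)], ["g1", "g2", "g3"])
    s = sample_discrete(SAT_CPD[i], ["s0", "s1"])
    l = sample_discrete(LETTER_CPD[g], ["l0", "l1"])
    return {"D": d, "I": i, "G": g, "S": s, "L": l}
```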

17:04

And so I want to use this process, which is very naturally called forward sampling. Say we want to estimate the probability of some assignment, little y, to some set of query variables Y. Then what we're going to do is we're

going to generate a bunch of samples from that Bayes net.

As many as we think is adequate. And then, if I'm interested in some particular event, I simply compute the fraction of my samples where Y = y.

And I can use that same procedure for computing other expectations: if I have any other function of the sample, I can compute the function on each sample and then average it out in exactly the same way that we showed on the first slide. And now we can go ahead and apply

the bounds that we showed before. So for an additive bound we have this

many samples that we need. And for a multiplicative bound, we have

this many samples that we need. And these are just copying over the two

bounds that we showed before. So, that's great.
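Continuing the sketch above, estimating P(Y = y) is then just counting; the particular queries below are arbitrary examples, not ones from the lecture.

```python
def estimate_query(query, num_samples=10_000):
    # Fraction of forward samples that match the query assignment y.
    hits = 0
    for _ in range(num_samples):
        sample = sample_student()
        if all(sample[var] == val for var, val in query.items()):
            hits += 1
    return hits / num_samples

print(estimate_query({"L": "l1"}))             # estimate of P(L = l1)
print(estimate_query({"G": "g1", "S": "s1"}))  # estimate of P(G = g1, S = s1)
```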

But that, notice, gave me queries without evidence. And what do we do if we want to now ask a question not just about the probability of Y, but the probability of Y given some evidence, E = e? Well, so now,

it's not that easy to sample from a Bayesian network if I have evidence that's not at the root. If I have evidence that's at the root, then that's fine: I know the value of that variable, and I can just stick that value in and sample everything else as before.

But if I have a variable that I observe somewhere in the middle or the bottom of the network, then what happens when I get there? Do I sample it? Do I ignore it? It seems a bit unclear. And so the simple solution to that, and we'll talk about others, is an algorithm called rejection sampling. And what it does is the most naive thing that you could possibly imagine.

I generate samples from the Bayes net, and then, if they don't agree with my evidence, I throw them out.

20:42

So now let's go back to, for example, a medical diagnosis setting, or for that matter the message decoding example that we talked about previously. How likely is your average evidence? So let's think about medical diagnosis.

A person comes in, and they have their age, and their weight, and their gender, and, you know, 32 symptoms and ten test results, and the fact that they went on a trip to Europe last week.

And all of these are pieces of evidence. And the question is how likely is that

configuration of evidence? And the answer is vanishingly small.

How likely is a random person off the street to exhibit this exact combination? Almost nil. And similarly with the message example: how likely are you to get a particular configuration of noisy bits? Exponentially small also. So the number of samples needed grows exponentially with the number of observed variables.

The more variables you've observed, the lower the probability of the evidence, and this is an exponential decay. And so that means that if you observe more than a couple of variables, basically

this is not a good algorithm. To summarize: algorithmically, this is a very simple procedure. It's very easy to generate samples from a Bayesian network, and we showed how all these pieces fit together.

And once you've done that, there are really elegant and theoretically compelling epsilon-delta bounds, as we've shown before; but their usefulness,

unfortunately, is limited. The additive bounds tend to be useless

for low-probability events. The multiplicative bounds

are not good either, because the number of samples grows as a function of one

over the probability of the event that we're trying to estimate.

And all of this is completely irrelevant as soon as you have evidence.

Because as soon as you have evidence, the number of required samples grows exponentially with the number of observed variables, which means this is not a feasible solution for most practical settings.

And one final note that's worth highlighting.

This is, I think, the only case in the entire section on inference where I talk about Bayesian networks specifically. And that is because forward sampling, unlike any of the other inference algorithms that we are going to talk about, is just not feasible for Markov networks: there's no notion of starting at the root; there is no root. There is no variable that you would naturally start with whose probability you actually know. And so we don't apply forward sampling to Markov networks.