0:00

[MUSIC]

In this video,

we're going to cover sampling from one-dimensional distributions.

And let's start with the simplest case, a one-dimensional discrete distribution.

So say we have a random variable like this,

which takes the value of a1 with probability 0.6,

value a2 with probability 0.1, and the value a3 with probability 0.3.

How can we sample from it?

Well, we have to start from something.

It's hard to generate randomness from nothing on a deterministic computer.

So we're going to use a function that generates a uniform random number on the interval from 0 to 1.

And this function is included in almost any programming language's standard library.

So it's usually there, and we can use it as the basis for our further sampling techniques.

But anyway, we have this function.

How can we generate a sample from this discrete distribution?

Well, let's start with an interval and decompose it into three chunks,

where the first chunk has length 0.6,

the second chunk has length 0.1, and the third chunk has length 0.3.

And let's say that we're going to generate a random number in this interval.

And let's say that if it lands in the first chunk,

then we will say that we have just generated the value a1.

So let's generate a random number from our function, which amounts to throwing a random point onto this interval.

And let's say it lands somewhere around 0.9, for example.

It means we have just generated the value a3.

So, why does it work?

Well, because if you throw a random point onto an interval,

the probability that it lands in some chunk is proportional to the length of that chunk.

And the length of the first chunk is 0.6,

which is exactly the probability with which we wanted a1 to be generated, and so on.

So this is indeed a correct procedure for generating random numbers

from discrete distributions with a finite number of values.

And the bottom line here is that it's really easy to generate samples

from discrete distributions.
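As a sketch in Python (with the values and probabilities taken from the example above), this chunk-based procedure might look like this:

```python
import random

values = ["a1", "a2", "a3"]   # the three possible outcomes
probs = [0.6, 0.1, 0.3]       # their probabilities from the example

def sample_discrete(values, probs):
    """Split [0, 1) into chunks of lengths probs[0], probs[1], ...
    and return the value whose chunk a uniform draw lands in."""
    u = random.random()        # uniform number on [0, 1)
    cumulative = 0.0
    for value, p in zip(values, probs):
        cumulative += p
        if u < cumulative:
            return value
    return values[-1]          # guard against floating-point round-off
```

Drawing many samples and counting the outcomes should give frequencies close to 0.6, 0.1, and 0.3.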

But note that here we had just three values.

If we had, for example, 1 million values, or 1 billion values, it would be much harder.

Because we would have to spend time proportional to

1 billion to split this interval into 1 billion chunks of

appropriate sizes, and then to find in which particular chunk the random value landed.

But if you have a finite number of discrete outcomes and this number is not that large,

say 1,000, then you can easily and

efficiently generate samples from this distribution.

Okay, let's move to continuous distributions.

So let's say, to begin with, that we want to

sample a point from a Gaussian like this one.

How can we do it?

Well, one way is to use the central limit theorem,

which says that if we sum up several independent random variables,

we'll get something approximately Gaussian.

So if we just throw, say, 12 random numbers from the uniform

distribution on an interval from 0 to 1 and then add them all together,

we'll get something with expected value 6.

Because the expected value of each x_i is 0.5, the center of the interval.

And the sum of 12 x_i's will lie between 0 and 12.

And the center of this interval is 6.

So we have to subtract 6 to make the expected value of this

distribution 0.

And in this way, our distribution will be actually not from minus infinity to

plus infinity like in the Gaussian case, but rather from minus 6 to 6.

So it's indeed an approximation; it can't give you an exact Gaussian,

but it's a pretty close one.

Although it can never generate, for example, the number 100,

the actual Gaussian is also extremely unlikely to produce 100.

So we can be okay with losing these highly improbable parts of the sample space.

And you may ask, why 12 here?

Well, just to make the variance of the resulting

random variable z equal to 1, as for a standard Gaussian:

each uniform x_i has variance 1/12, so the sum of 12 of them has variance exactly 1.

If you used, for example, not 12 but 100 values,

you would have to divide the sum by something to make its variance 1.
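A minimal sketch of this 12-uniforms trick in Python:

```python
import random

def approx_standard_normal():
    """Approximate a standard Gaussian sample via the central limit
    theorem: each uniform on [0, 1) has mean 0.5 and variance 1/12,
    so the sum of 12 of them has mean 6 and variance 1."""
    return sum(random.random() for _ in range(12)) - 6.0

samples = [approx_standard_normal() for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
```

The empirical mean and variance should come out close to 0 and 1, though the sketch can never produce values outside [-6, 6].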

Or if you don't want to use this kind of loose approximation,

you can ask your favorite programming language for a function that

generates numbers from a Gaussian directly, and it probably has one.

It's a pretty standard thing for a programming language's library

to generate numbers from a standard Gaussian.

And this is more efficient and

more accurate than relying on the central limit theorem.

So, okay, Gaussians are easy.

But what if we have a more complicated distribution like this and

we don't know how to sample from it?

Again, as we discussed in the previous video, this can happen if, for example,

it is a posterior distribution over some parameters or latent variables, and

we may know it only up to a normalization constant.

So what do we do in this situation?

6:14

Well, we know how to sample from Gaussians.

So let's try to use it.

Let's upper-bound our distribution with some Gaussian times

some constant, for example 2 in this case.

The constant is needed because one normalized distribution cannot

upper-bound another one everywhere without being multiplied by something.

Okay, so we have a distribution P of x, the blue curve,

which we want to sample from.

And we have the orange curve, which is 2 times some Gaussian,

which we can sample from.

So how can we generate samples from the blue curve?

Well, let's start with generating a sample from the orange curve, which we can do.

Now, with high probability, we can get a number around 0,

because that's where the orange density is high.

So there will be quite a few points there.

But the blue density is not high there, it's kind of a local minimum,

so we have to somehow correct for this.

We have to, for example, throw away some of the points which were generated in this

region where the actual blue probability is low.

So let's do just that.

Let's reject some of the points generated from the orange curve.

And what is the rule for rejecting a point?

Well, first of all, it should be probabilistic.

Because you can't just reject all points in some region.

Otherwise, the resulting distribution of your

accepted points will have 0 probability in this region.

So we have to reject some of the points in some kind of probabilistic manner.

And the probability of keeping a particular point,

the point x tilde here, should be proportional to the ratio of

the height of the blue curve to the height of the orange curve at that point.

So if in the current point the two curves coincide,

they have the same values, you don't want to reject anything.

You are happy to accept the point which was just generated.

If for example, at some region, the orange curve is high, but

the blue curve is 0, then you don't want to accept anything.

You want to reject all the points in these regions.

If it's somewhere in between, if for example, like here,

the ratio of the blue curve to the orange one is around 0.7,

then we expect to keep 0.7, so

70% of the points sampled in this region.

So we can say that we have generated x tilde, which is from the orange curve.

Then we generate its y coordinate,

which is from 0 to the orange curve at the current point.

And then we'll reject this point if it appears

in the red region, so above the blue curve.

And we'll accept this point if it appears below the blue curve.
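The whole accept/reject loop can be sketched in Python. The target density p(x) below (the "blue curve") is a hypothetical bimodal mixture of two narrow Gaussians, chosen so that 2 times the standard Gaussian density upper-bounds it, matching the constant 2 of the example:

```python
import math
import random

def gauss_pdf(x, mu=0.0, sigma=1.0):
    """Density of the Gaussian N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def p(x):
    """Hypothetical target density (the 'blue curve'): a bimodal
    mixture satisfying p(x) <= 2 * N(0, 1)(x) everywhere."""
    return 0.5 * gauss_pdf(x, -1.0, 0.5) + 0.5 * gauss_pdf(x, 1.0, 0.5)

M = 2.0  # constant such that p(x) <= M * q(x) for the proposal q = N(0, 1)

def rejection_sample():
    while True:
        x = random.gauss(0.0, 1.0)                 # sample from the orange curve
        y = random.uniform(0.0, M * gauss_pdf(x))  # vertical coordinate under it
        if y <= p(x):                              # below the blue curve: accept
            return x
        # above the blue curve (red region): reject and try again
```

On average about 1/M = 50% of the proposals are accepted here.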

9:33

You may ask, why does it work?

Well, if you look at the points that were generated by this procedure, before

rejecting anything, they will be uniformly distributed under the orange curve.

When we apply our rejection procedure,

we throw away the points that appeared

above the blue curve and below the orange one.

And what is left are points that are uniformly distributed under the blue

curve.

And it's natural then that the x coordinates

of these points are distributed as the blue curve suggests.

So that's great, it's a procedure to generate samples from any

distribution you can upper bound with something you can sample from.

But how efficient is it?

So if you want to generate one point, how many points will you have to

generate from the orange curve before you accept one of them?

Well, the acceptance probability is just 1 divided by the constant:

one-half in our example, and 1 divided by M in the general case,

so you need about M proposals per accepted point.

And the reason is that this is the ratio between

the area under the blue curve, which is 1 because it's a probability distribution,

and the area under the orange curve,

which is M because it's M times a probability distribution.

So, informally, if we throw points into some area and

then reject some of them, the probability of accepting

a point is proportional to the area where we accept.

And note that we can use this method even if we know our distribution only

up to a normalization constant, as usually happens in probabilistic modeling.

In this case, if you can still upper-bound your unnormalized distribution,

you can say that the constant, M tilde,

just absorbs the unknown normalization constant.

So you don't know the normalization, but

you can upper bound the distribution with some constant and this will still work.
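For instance, here is a sketch with a hypothetical unnormalized density, assuming the bound M tilde = 2.7 on the ratio to the standard Gaussian proposal (which holds for this particular choice); the normalization constant is never computed:

```python
import math
import random

def tilde_p(x):
    """Unnormalized target density: we never compute its
    normalization constant Z = integral of exp(-x**4)."""
    return math.exp(-x ** 4)

def q(x):
    """Proposal density: standard Gaussian N(0, 1)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

M_TILDE = 2.7  # assumed bound: tilde_p(x) <= M_TILDE * q(x) for all x

def sample_unnormalized():
    while True:
        x = random.gauss(0.0, 1.0)
        # accept with probability tilde_p(x) / (M_TILDE * q(x))
        if random.random() * M_TILDE * q(x) <= tilde_p(x):
            return x
```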

So, as a summary, this method is called rejection sampling.

And it's kind of universally applicable, so it's really often that you can

upper bound your distribution and generate samples from it like this.

But the downside is that if q and p are too different from each other,

then you have to choose a large constant M, and

then you will have to reject most of the points.

And so if, for example, M is 100, then to generate just one point from your actual

distribution, the blue curve, you'll have to generate around 100 points, and

on average just one of them will be accepted.

And another problem with this approach is that if you

apply it to the multidimensional case, it will still work,

but the constant M will probably be very, very large.

So it's kind of exponential in the number of dimensions.

And it's really improbable that you will be able to upper-bound your complicated

distribution in ten-dimensional space with a constant smaller than, for

example, 1 million.

And having a constant 1 million means that to generate just one point,

you have to wait a million computational cycles, which is a lot.

So this is a method for one dimension, or maybe a few dimensions,

but not for thousand-dimensional spaces.

[MUSIC]