0:00

In this video I'm going to talk about some advanced material. It's not really appropriate for a first course on neural networks, but I know that some of you are particularly interested in the emergence of deep learning. And the content of this video is mathematically very pretty, so I couldn't resist putting it in. The insight that stacking up restricted Boltzmann machines gives you something like a sigmoid belief net can actually be seen without doing any math, just by noticing that a restricted Boltzmann machine is actually the same thing as an infinitely deep sigmoid belief net with shared weights. Once again, weight sharing leads to something very interesting.

I'm now going to describe a very interesting explanation of why layer-by-layer learning works.

It depends on the fact that there is an equivalence between restricted Boltzmann machines, which are undirected networks with symmetric connections, and infinitely deep directed networks in which every layer uses the same weight matrix. This equivalence also gives insight into why contrastive divergence learning works. So an RBM is really just an infinitely deep sigmoid belief net with a lot of shared weights.

The Markov chain that we run when we want to sample from an RBM can be viewed as exactly the same thing as a sigmoid belief net.

So here's the picture. We have a very deep sigmoid belief net; in fact, infinitely deep. We use the same weights at every layer. We have to have all the V layers be the same size as each other, and all the H layers be the same size as each other, but V and H can be different sizes. The distribution generated by this very deep network with replicated weights is exactly the equilibrium distribution that you get by alternating between sampling from P of V given H and P of H given V, where both P of V given H and P of H given V are defined by the same weight matrix W. And that's exactly what you do when you take a restricted Boltzmann machine and run a Markov chain to get a sample from the equilibrium distribution.

So a top-down pass starting from infinitely far up in this directed net is exactly equivalent to letting a restricted Boltzmann machine settle to equilibrium; they both define the same distribution. The sample you get at v0 if you run this infinite directed net would be an equilibrium sample of the equivalent RBM.

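As a concrete sketch of that equivalence, here is the alternating Markov chain for a small RBM in NumPy. The weight matrix, layer sizes, and step count are all hypothetical, just for illustration; each pair of updates corresponds to one more pair of layers in the infinite directed net:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    # Draw binary states with the given Bernoulli probabilities.
    return (rng.random(p.shape) < p).astype(float)

def gibbs_sample(W, n_visible, n_steps=500):
    """Alternate sampling from p(h|v) and p(v|h), both defined by the
    same weight matrix W. Running many steps is the RBM settling to
    equilibrium, i.e. a top-down pass through the infinite directed
    net; the final v plays the role of v0."""
    v = sample(0.5 * np.ones(n_visible))
    for _ in range(n_steps):
        h = sample(sigmoid(v @ W))      # p(h | v)
        v = sample(sigmoid(h @ W.T))    # p(v | h)
    return v

W = rng.normal(0.0, 0.1, size=(6, 4))   # 6 visible units, 4 hidden units
v0 = gibbs_sample(W, n_visible=6)
```

With small random weights the chain mixes quickly, which matters later in the argument about why contrastive divergence can ignore the higher layers.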
Now let's look at inference in an infinitely deep sigmoid belief net. In inference we start at v0, and then we have to infer the state of h0. Normally this would be a difficult thing to do because of explaining away. If, for example, hidden units k and j both had big positive weights to visible unit i, then we would expect that when we observe that i is on, k and j become anti-correlated in the posterior distribution. That's explaining away. However, in this net, k and j are completely independent of one another when we do inference given v0. So the inference is trivial: we just multiply v0 by the transpose of W, put whatever we get through the logistic sigmoid, and then sample. And that gives us binary states of the units in h0. But the question is, how could they possibly be independent, given explaining away?

The answer to that question is that the model above h0 implements what I call a complementary prior. It implements a prior distribution over h0 that exactly cancels out the correlations of explaining away. So for the example shown, the prior will implement positive correlations between k and j; explaining away will cause negative correlations, and those will exactly cancel. So what's really going on is that when we multiply v0 by the transpose of the weights, we're not just computing the likelihood term. We're computing the product of a likelihood term and a prior term, and that's what you need to do to get the posterior. It normally comes as a big surprise to people that when you multiply by W transpose, what you compute is the product of the likelihood and the prior, that is, the posterior. So what's happening in this net is that the complementary prior implemented by all the stuff above h0 exactly cancels out explaining away, which makes inference very simple.

And that's true at every layer of this net, so we can do inference for every layer and get an unbiased sample at each layer, simply by multiplying v0 by W transpose. Then once we've computed the binary state of h0, we multiply that by W, put that through the logistic sigmoid, and sample, and that will give us a binary state for v1, and so on all the way up. So generating from this model is equivalent to running the alternating Markov chain on a restricted Boltzmann machine to equilibrium, and performing inference in this model is exactly the same process in the opposite direction.

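A minimal sketch of that upward inference pass (the helper names and layer count are mine, not from the lecture): multiply by the transpose of the generative weights, squash through the logistic, sample, and repeat.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

def infer_upward(v0, W, n_pairs=3):
    """Inference in the infinite directed net. With W of shape
    (n_visible, n_hidden), 'v @ W' implements multiplying by the
    transpose of the downward generative weights; 'h @ W.T' then gives
    the next visible layer up. Returns sampled states [h0, v1, h1, v2, ...]."""
    states, v = [], v0
    for _ in range(n_pairs):
        h = sample(sigmoid(v @ W))     # infer h_k from v_k
        v = sample(sigmoid(h @ W.T))   # infer v_(k+1) from h_k
        states += [h, v]
    return states

W = rng.normal(0.0, 0.1, size=(6, 4))
v0 = sample(0.5 * np.ones(6))
states = infer_upward(v0, W)
```

Note that this is literally the same alternation as the Gibbs chain used for generation, just started from data at the bottom instead of from infinitely far up.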
This is a very special kind of sigmoid belief net in which inference is as easy as generation. So here I've shown the generative weights that define the model, and also their transposes, which are the weights we use to do inference.

And now what I want to show is how we get the Boltzmann machine learning algorithm out of the learning algorithm for directed sigmoid belief nets. So the learning rule for a sigmoid belief net says that we should first get a sample from the posterior; that's what Sj and Si are, samples from the posterior distribution. And then we should change the generative weight in proportion to the product of the presynaptic activity Sj and the difference between the postsynaptic activity Si and the probability of turning on i given all the binary states in the layer that Sj is in. Now if we ask how we compute Pi, something very interesting happens. If you look at inference in this network on the right, we first infer a binary state for H0. Once we've chosen that binary state, we then infer a binary state for V1 by multiplying H0 by W, putting the result through the logistic, and then sampling. So think about how Si1 was generated.

It was a sample from what we get if we put H0 through the weight matrix W and then through the logistic. And that's exactly what we'd have to do in order to compute Pi0: we'd have to take the binary activities in H0 and, going downwards now through the green weights W, compute the probability of turning on unit i given the binary states of its parents. So the point is, the process that goes from H0 to V1 is identical to the process that goes from H0 to V0. And so Si1 is an unbiased sample of Pi0. That means we can replace it in the learning rule.

So we end up with a learning rule that looks like this: since we have replicated weights, each of these lines is the term in the learning rule that comes from one of those green weight matrices. For the first green weight matrix, the learning rule is the presynaptic state Sj0 times the difference between the postsynaptic state Si0 and the probability, given the binary states in H0, of turning on unit i. We could call that probability Pi0, but a sample with that probability is Si1, and so an unbiased estimate of the derivative can be got by plugging Si1 into that first line of the learning rule. Similarly, for the second weight matrix, the learning rule is Si1 into Sj0 minus Pj0, and an unbiased estimate of Pj0 is Sj1. So that's an unbiased estimate of the learning rule for the second weight matrix. And if you just keep going for all the weight matrices, you get an infinite series, and all the terms except the very first term and the very last term cancel out. So you end up with the Boltzmann machine learning rule, which is just Sj0 into Si0, minus Sj-infinity into Si-infinity.

So let's go back and look at how we would learn an infinitely deep sigmoid belief net.

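Written out for a single weight $w_{ij}$, the series just described looks like this, with one term per replicated weight matrix and each probability replaced by its unbiased sample:

```latex
\Delta w_{ij} \;\propto\;
    s_j^0\,(s_i^0 - s_i^1)
  \;+\; s_i^1\,(s_j^0 - s_j^1)
  \;+\; s_j^1\,(s_i^1 - s_i^2)
  \;+\;\cdots
```

Expanding the products, every cross term cancels with its neighbor (for example $-s_j^0 s_i^1$ cancels $+s_i^1 s_j^0$), leaving only the first and last terms:

```latex
\Delta w_{ij} \;\propto\; s_j^0 s_i^0 \;-\; s_j^\infty s_i^\infty
```

which is exactly the Boltzmann machine learning rule: pairwise statistics at the data minus pairwise statistics at equilibrium.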
We would start by making all the weight matrices the same: we tie all the weight matrices together, and we learn using those tied weights. Now that's exactly equivalent to learning a restricted Boltzmann machine. The diagram on the right and the diagram on the left are identical; we can think of the symmetric arrow in the diagram on the left as just a convenient shorthand for an infinite directed net with tied weights. So we first learn that restricted Boltzmann machine. Now, we ought to learn it using maximum likelihood learning, but actually we're just going to use contrastive divergence learning. We're going to take a shortcut.

Once we've learned the first restricted Boltzmann machine, what we could do is freeze the bottom-level weights. We'll freeze the generative weights that define the model, and we'll also freeze the weights we're going to use for inference to be the transpose of those generative weights. So we freeze those weights. We keep all the other weights tied together, but now we're going to allow them to be different from the weights in the bottom layer, while still all tied to each other. Learning the remaining tied weights is exactly equivalent to learning another restricted Boltzmann machine, namely a restricted Boltzmann machine with H0 as its visible units, V1 as its hidden units, and where the data is the aggregated posterior across H0. That is, if we want to sample a data vector to train this network, what we do is put in a real data vector V0, do inference through those frozen weights, get a binary vector at H0, and treat that as data for training the next restricted Boltzmann machine. And we can go up for as many layers as we like.

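The greedy recipe above can be sketched as follows. The training loop, learning rate, epoch count, and layer sizes here are illustrative assumptions, and CD1 is used as the shortcut the lecture mentions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

def train_rbm_cd1(data, n_hidden, n_epochs=20, lr=0.1):
    """One-step contrastive divergence for a single RBM (minimal sketch)."""
    W = rng.normal(0.0, 0.01, size=(data.shape[1], n_hidden))
    for _ in range(n_epochs):
        v0 = data
        h0 = sample(sigmoid(v0 @ W))        # positive phase
        v1 = sample(sigmoid(h0 @ W.T))      # one reconstruction step
        h1 = sigmoid(v1 @ W)                # negative phase (probabilities)
        W += lr * (v0.T @ h0 - v1.T @ h1) / len(data)
    return W

def train_stack(data, layer_sizes):
    """Greedy layer-by-layer training: learn an RBM, freeze its weights,
    then use the binary states it infers (samples from the aggregated
    posterior) as the data for the next RBM."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W = train_rbm_cd1(x, n_hidden)
        weights.append(W)                   # freeze this layer's weights
        x = sample(sigmoid(x @ W))          # data for the next RBM up
    return weights

data = (rng.random((20, 6)) < 0.5).astype(float)
stack = train_stack(data, [5, 4])
```

Each appended weight matrix corresponds to untying one more layer of the infinite directed net from the still-tied weights above it.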
And when we get fed up, we just end up with a restricted Boltzmann machine at the top, which is equivalent to saying that all the weights in the infinite directed net above there are still tied together, but the weights below have now all become different.

Now, the explanation of why the inference procedure was correct involved the idea of a complementary prior created by the weights in the layers above. But of course, when we change the weights in the layers above but leave the bottom layer of weights fixed, the prior created by those changed weights is no longer exactly complementary. So now our inference procedure, using the frozen weights in the bottom layer, is no longer exactly correct. The good news is, it's nearly always very close to correct, and with the incorrect inference procedure we still get a variational bound on the log probability of the data.

The higher layers have changed because they've learned a prior for the bottom hidden layer that's closer to the aggregated posterior distribution, and that makes the model better. So changing the higher-layer weights makes the inference that we're doing at the bottom hidden layer incorrect, but gives us a better model. And if you look at those two effects, we can prove that the improvement you get in the variational bound from having a better model is always greater than the loss you get from the inference being slightly incorrect. So in this variational bound, you win when you learn the weights in the higher layers, assuming that you do it with correct maximum likelihood learning.

So now let's go back to what's happening in contrastive divergence learning. We have the infinite net on the right, and we have a restricted Boltzmann machine on the left.

And they're equivalent: if we were to do maximum likelihood learning for the restricted Boltzmann machine, it would be maximum likelihood learning for the infinite sigmoid belief net. But what we're going to do is cut things off. We're going to ignore the small derivatives for the weights that you get in the higher layers of the infinite sigmoid belief net. So we cut it off where that dotted red line is. And now if we look at the derivatives, the derivatives we're going to get look like this. They've got two terms. The first term comes from that bottom layer of weights; we've seen that before, the rule for the bottom layer of weights is just that first line here. The second term comes from the next layer of weights; that's this line here. We need to compute the activities in H1 in order to compute the Sj1 in that second line, but we're not actually computing derivatives for the third layer of weights. And when we take those first two terms and combine them, we get exactly the learning rule for one-step contrastive divergence. So what's going on in contrastive divergence is that we're combining weight derivatives for the lower layers, and ignoring the weight derivatives in the higher layers.

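Concretely, keeping only the first two lines of the infinite series and combining them gives:

```latex
s_j^0\,(s_i^0 - s_i^1) \;+\; s_i^1\,(s_j^0 - s_j^1)
  \;=\; s_j^0 s_i^0 \;-\; s_i^1 s_j^1
```

The $-s_j^0 s_i^1$ and $+s_i^1 s_j^0$ terms cancel, and what remains is the one-step contrastive divergence update: pairwise statistics at the data minus pairwise statistics after one full step of reconstruction.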
The question is, why can we get away with ignoring those higher derivatives? When the weights are small, the Markov chain mixes very fast; if the weights are zero, it mixes in one step. And if the Markov chain mixes fast, the higher layers will be close to the equilibrium distribution, i.e. they will have forgotten what the input was at the bottom layer. And now we have a nice property: if the higher layers are sampled from the equilibrium distribution, we know that the derivatives of the log probability of the data with respect to the weights must average out to zero. That's because the current weights in the model are a perfect model of the equilibrium distribution; the equilibrium distribution is generated using those weights, and if you want to generate samples from the equilibrium distribution, those are the best possible weights you could have. So we know the derivatives there are zero.

As the weights get larger, we might have to run more iterations of contrastive divergence, which corresponds to taking into account more layers of that infinite sigmoid belief net. That will allow contrastive divergence to continue to be a good approximation to maximum likelihood, and so if we're trying to learn a density model, it makes a lot of sense: as the weights grow, you run CD for more and more steps. If there's a statistician around, you can give them a guarantee that in the limit you'll run CD for infinitely many steps, and then you have an asymptotic convergence result, which is the thing that keeps statisticians happy. Of course, it's completely irrelevant, because you'll never reach a point like that.

There is, however, an interesting point here. If our purpose in using CD is to build a stack of restricted Boltzmann machines that learn multiple layers of features, it turns out that we don't need a good approximation to maximum likelihood. For learning multiple layers of features, CD1 is just fine. In fact, it's probably better than doing maximum likelihood.