0:00

In this video, I'll go into more detail about how we can speed up the Boltzmann

Machine Learning Algorithm by using cleverer ways of keeping Markov chains

near the equilibrium distribution, or by using what are called mean field methods.

The material is quite advanced and so it's not really part of the course.

There won't be any quizzes on it and it's not on the final test.

You can safely skip this video. It's included for people who are really

interested in how to get deep Boltzmann machines to work well.

There are better ways of collecting the statistics than the method that Terry

Sejnowski and I originally came up with. If we start from a random state, it may

take a long time to reach thermal equilibrium.

0:46

Also, there's no easy test for whether you've reached thermal equilibrium, so we

don't know how long we need to run for. So, the idea is why not start from

whatever state you ended up in last time you saw that particular data vector?

So, we remember the interpretation of the data vector in the hidden units, and we

start from there. This stored state, the interpretation of

the data vector, is called a particle. Using particles that persist gives us a

warm start and it has a big advantage. If we were at equilibrium before and we

only updated the weights a little bit, it'll only take a few updates of the units

in a particle to bring it back to equilibrium.

1:47

So, here's the method for collecting statistics introduced by Radford Neal in

1992. In the positive phase, you have a set of

data specific particles, one or a few per training case. And each particle has a

current value that's the configuration of the hidden units plus which data vector it

goes with. You sequentially update all the hidden

units a few times in each particle with the relevant data vector clamped.

2:31

In the negative phase, you keep a set of fantasy particles.

That is, these are global configurations. And again, after each weight update, you

sequentially update all the units in each fantasy particle a few times.

Now, you're updating the visible units as well.

2:50

And for every connected pair of units, you average SiSj over all the fantasy

particles. The learning rule is then the change in

the weights is proportional to the average you got with data, averaged over all

training data, minus the average you got with the fantasy particles when nothing

was clamped. This works better than the learning rule

that Terry Sejnowski and I introduced, at least for full batch learning.
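
As a rough illustration of the learning rule just described, here is a minimal NumPy sketch. It assumes each particle is stored as a row of binary unit states (visible and hidden together) and that every pair of units is connected; the function names, the toy particles, and the learning rate are hypothetical and not from the lecture.

```python
import numpy as np

def pairwise_statistics(states):
    """Average s_i * s_j over a batch of binary state vectors (one per row)."""
    states = np.asarray(states, dtype=float)
    return states.T @ states / states.shape[0]

def weight_update(data_particles, fantasy_particles, learning_rate=0.01):
    """Change in weights: learning_rate * (<s_i s_j>_data - <s_i s_j>_fantasy)."""
    positive = pairwise_statistics(data_particles)     # data vector clamped
    negative = pairwise_statistics(fantasy_particles)  # nothing clamped
    delta = learning_rate * (positive - negative)
    np.fill_diagonal(delta, 0.0)  # assumption: no self-connections
    return delta

# Toy example: 3 units, two data-specific particles and two fantasy particles.
data_particles = np.array([[1, 1, 0], [1, 1, 1]])
fantasy_particles = np.array([[0, 1, 1], [0, 0, 1]])
delta_w = weight_update(data_particles, fantasy_particles, learning_rate=0.1)
```

In this toy case units 0 and 1 are on together more often with data than in the fantasies, so their connection is strengthened, while the connection between units 1 and 2 is left unchanged.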

3:24

However, it's difficult to apply this approach to mini batches.

And the reason is that, by the time we get back to the same data vector, if we're

using mini batch learning, the weights would have been updated many times.

So, the stored data-specific particle for that data vector won't be anywhere near

thermal equilibrium anymore. The hidden units won't be in thermal

equilibrium with the visible units of the particle given the new weights.

3:57

And again, we don't know how long we're going to have to run for, before we get

close to equilibrium again. So, we can overcome this by making a

strong assumption about how we understand the world.

It's a kind of epistemological assumption.

4:16

We're going to assume that when a data vector is clamped, the set of good

explanations, that is, states of the hidden units that act as interpretations of that

data vector, is unimodal. That means we're saying that, for a given

data vector, there aren't two very different explanations for that data

vector. We assume that for sensory input, there is

one correct explanation. And if we have a good model of the data,

our model will give us one energy minimum for that data point.

4:50

This is a restriction on the kinds of models we're willing to learn.

We're going to use a learning algorithm that's incapable of learning models in

which a data vector has many very different interpretations.

Provided we're willing to make this assumption, we can use a very efficient

method for approaching thermal equilibrium or an approximation to thermal

equilibrium, with the data. It's called a mean field approximation.

5:23

So, if we want to get the statistics right, we need to update the units

stochastically and sequentially. And the update rule is that the probability of

turning on unit i is the logistic function of the total input it receives

from the other units and its bias, where Sj, the state of another unit, is a

stochastic binary thing. Now, instead of using that rule, we could

say, we're not going to keep binary states for unit i, we're going to keep a real

value between zero and one which we call a probability.

And that probability at time t + 1 is going to be the output of the logistic

function, where the total input is the bias plus the

sum of all the probabilities at time t times the weights.

So, we're replacing this stochastic binary thing by a real-valued probability.
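
Written out, the two update rules described above look roughly like this. The symbols are assumptions chosen for illustration: b_i is the bias of unit i, w_ij is the weight between units i and j, and sigma is the logistic function.

```latex
\begin{align}
p(s_i = 1) &= \sigma\Big(b_i + \sum_j s_j\, w_{ij}\Big),
  \qquad \sigma(x) = \frac{1}{1 + e^{-x}} \\
p_i^{t+1} &= \sigma\Big(b_i + \sum_j p_j^{t}\, w_{ij}\Big)
\end{align}
```

The first line is the stochastic rule with binary states s_j; the second is the mean field rule, in which each s_j is replaced by the real-valued probability p_j from the previous time step.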

6:27

And that's not quite right because this stochastic binary thing is inside a

non-linear function. If it was a linear function, things would

be fine. But because the logistic is non-linear, we

don't get the right answer when we put probabilities instead of fluctuating

binary things inside. However, it works pretty well.

6:52

It can go wrong by giving us biphasic oscillations because now we're going to be

updating everything in parallel. And we can normally deal with those by

using what's called damped mean field, where we compute that p_i at time t + 1,

but we don't go all the way there.

We go to a point in between where we are now, and where this update wants us to go.

So, in damped mean field, we'll go to lambda times the place we are now, plus

one minus lambda times the place the update rule tells us to go to.
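
The damped update just described can be sketched as follows. This is a toy illustration under assumptions, not the lecture's code: the two-unit network with strong mutual inhibition, the starting point, and the value lambda = 0.5 (called `damping` here) are all hypothetical choices made to show the oscillation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field_step(p, W, b, damping=0.0):
    """One parallel mean field update; damping plays the role of lambda."""
    target = sigmoid(b + W @ p)  # where the plain update rule wants to go
    return damping * p + (1.0 - damping) * target

# Hypothetical toy network: two units that strongly inhibit each other.
W = np.array([[0.0, -10.0], [-10.0, 0.0]])
b = np.array([5.0, 5.0])

def run(damping, steps=100):
    """Iterate the update and report the final state and the last change."""
    p = np.array([0.9, 0.9])
    for _ in range(steps):
        previous, p = p, mean_field_step(p, W, b, damping)
    return p, np.max(np.abs(p - previous))

p_plain, change_plain = run(damping=0.0)    # undamped parallel updates
p_damped, change_damped = run(damping=0.5)  # damped updates
```

With these particular numbers the undamped run keeps flipping between a value near 0 and a value near 1, while the damped run settles at 0.5 for both units.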

And that will kill oscillations. Now, we can get an efficient mini batch

learning procedure for Boltzmann machines, and this is what Russ Salakhutdinov

realized. In the positive phase, we can initialize

all probabilities at 0.5. We can clamp a data vector on the visible

units, and we can update all the hidden units in parallel using mean field until

convergence. And for mean field, you can recognize

convergence: it's when the probabilities stop changing.
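
A minimal sketch of this positive phase, assuming a small model with visible-to-hidden weights `W_vh`, symmetric hidden-to-hidden weights `W_hh`, and hidden biases `b_h`. These names, the damping value, the tolerance, and the toy sizes are all hypothetical and not taken from the lecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def positive_phase_mean_field(v, W_vh, W_hh, b_h,
                              damping=0.5, tol=1e-6, max_iters=1000):
    """Mean field hidden probabilities with the data vector v clamped."""
    p = np.full(b_h.shape, 0.5)  # initialize all probabilities at 0.5
    for _ in range(max_iters):
        # Parallel update of all hidden units from the clamped data and
        # the current hidden probabilities.
        target = sigmoid(b_h + v @ W_vh + p @ W_hh)
        p_new = damping * p + (1.0 - damping) * target
        if np.max(np.abs(p_new - p)) < tol:  # probabilities stopped changing
            return p_new
        p = p_new
    return p

# Hypothetical toy model: 4 visible units, 3 hidden units, small random weights.
rng = np.random.default_rng(0)
W_vh = 0.1 * rng.standard_normal((4, 3))
W_hh = 0.1 * rng.standard_normal((3, 3))
W_hh = (W_hh + W_hh.T) / 2.0  # symmetric hidden-to-hidden weights
np.fill_diagonal(W_hh, 0.0)   # no self-connections
b_h = np.zeros(3)
v = np.array([1.0, 0.0, 1.0, 1.0])
p_hidden = positive_phase_mean_field(v, W_vh, W_hh, b_h)
```

Unlike the stochastic updates, this gives an explicit stopping test: the loop ends once no probability moves by more than `tol`.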

8:14

In the negative phase, we do what we were doing before.

We keep a set of fantasy particles, each of which has a value that's a global

configuration. And after each weight update, we

sequentially update all the units in each fantasy particle a few times.

8:31

And then, for every connected pair of units, we average SiSj, these stochastic

binary things, over all fantasy particles. And the difference in those averages is

the learning rule. That is, we change the weights by an

amount proportional to that difference. If we want to make the updates for the

fantasy particles more parallel, we can change the architecture of the Boltzmann

machine. So, we're going to have a special

architecture that allows alternating parallel updates for the fantasy

particles. We have no connections within a layer, and

we have no skip-layer connections, but we allow ourselves lots of hidden layers.

9:24

And, it's really a general Boltzmann machine with lots of missing connections.

If all those skip-layer connections were present,

we wouldn't really have layers at all; it would just be a general Boltzmann machine.

But, in this special architecture, there's something nice we can do.

9:43

We can update the states of, for example, the first hidden layer and the third hidden

layer, given the current states of the visible units and the second hidden layer.

And then, we can update the states of the visible units and the second hidden layer.

And then, we can go back and update the other states,

and we can go backwards and forwards like this.
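
The back-and-forth update can be sketched like this, assuming fantasy particles are stored one per row and that layer k connects only to layers k - 1 and k + 1. The layer sizes, weight scale, and function names are hypothetical. Because there are no connections within a layer and no skip-layer connections, every unit in the odd layers depends only on units in the even layers, so all of them can be resampled at once.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_layers(layers, weights, biases, indices, rng):
    """Resample the given layers from the current states of their neighbours."""
    for k in indices:
        total_input = biases[k].copy()
        if k > 0:                    # input from the layer below
            total_input = total_input + layers[k - 1] @ weights[k - 1]
        if k < len(layers) - 1:      # input from the layer above
            total_input = total_input + layers[k + 1] @ weights[k].T
        prob_on = sigmoid(total_input)
        layers[k] = (rng.random(prob_on.shape) < prob_on).astype(float)

def alternating_gibbs_step(layers, weights, biases, rng):
    """One sweep: odd layers (h1, h3) first, then even layers (visible, h2)."""
    n = len(layers)
    update_layers(layers, weights, biases, range(1, n, 2), rng)
    update_layers(layers, weights, biases, range(0, n, 2), rng)

# Hypothetical sizes: visible, h1, h2, h3; 5 fantasy particles stored as rows.
rng = np.random.default_rng(0)
sizes = [6, 4, 3, 2]
weights = [0.1 * rng.standard_normal((a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(size) for size in sizes]
layers = [rng.integers(0, 2, size=(5, size)).astype(float) for size in sizes]
for _ in range(10):
    alternating_gibbs_step(layers, weights, biases, rng)
```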

And so, we can update half the states of all the units in parallel and that'll be a

correct update. So, one question is, if we have a deep

Boltzmann machine like that trained by using mean field for the positive phase

and updating fantasy particles by alternating between even layers and odd

layers for the negative phase, can we learn good models of things like the MNIST

digits, or indeed, more complicated things?

So, one way to tell whether you've learned a good model is after learning, you remove

all the input and you just generate samples from your model.

So, you run the Markov chain for a long time until it's burned in, and then you

look at the samples you get. So, Russ Salakhutdinov used a deep

Boltzmann machine to model MNIST digits using mean field for the positive phase,

and alternating updates of the layers of the particles for the negative phase.

And the real data looks like this. And the data that he got from his model

looks like this. You can see, they're actually fairly

similar. The model is producing things very like

the MNIST digits so it's learned a pretty good model.

11:25

So here's a puzzle. When he was learning that, he was using

mini-batches with 100 data examples and also he was using 100 fantasy particles,

the same 100 fantasy particles for every mini-batch.

And the puzzle is, how can we estimate the negative statistics with only 100 negative

examples to characterize the whole space? For all interesting problems, the global

configuration space is going to be highly multimodal.

12:06

There's an interesting answer to this. The learning interacts with the Markov

chain that's being used to gather the negative statistics, that is, the one that's

used to update the fantasy particles, and it interacts with it to make it have a

much higher effective mixing rate. That means, we cannot analyze the learning

by thinking of it being an outer loop that updates the weights,

and an inner loop that gathers statistics with a fixed set of weights.

The learning is affecting how effective that inner loop is.

12:42

The reason for this is that whenever the fantasy particles outnumber the positive

data, the energy surface is raised, and this has an effect on the mixing rate of

the Markov chain. It makes the fantasies rush around

hyperactively, and they move around much faster than the

mixing rate of the Markov chain defined by the current static weights.

13:10

If there's a mode in the energy surface that has more fantasy particles than data,

the energy surface will be raised until the fantasy particles escape from that

mode. So, the mode on the left has four fantasy

particles and only two data points. So, the effect of the learning is going to be

to raise the energy there. And that energy barrier might be much too

high for a Markov chain to be able to cross, so the mixing rate will be very

slow. But, the learning will actually spill

those red particles out of that energy minimum by raising the minimum.

The minimum will get filled up and the fantasy particles will escape and go off somewhere

else, to some other deep minimum. So, we can get out of minima that the

Markov chain would not be able to get out of, at least, not in a reasonable time.

14:03

So, what's going on here is the energy surface is really being used for two

different purposes. The energy surface represents our model,

but it's also being manipulated by the learning algorithm to make the Markov

chain mix faster. Or rather, to have the effect of a

faster-mixing Markov chain. Once the fantasy particles have filled up

one hole, they'll rush off to somewhere else and deal with the next problem.

An analogy for them is that they're like investigative journalists who rush in to

investigate some nasty problem. As soon as the publicity has caused that

problem to be fixed, instead of saying, okay, everything is okay now,

they rush off to find the next nasty problem.