0:00

In the previous video, I showed how a Boltzmann machine can be

used as a probabilistic model of a set of binary data vectors.

In this video we're finally going to get around to the Boltzmann machine learning

algorithm. This is a very simple learning algorithm which has an elegant theoretical justification, but in practice it turned out to be extremely slow and noisy, and just wasn't practical.

And for many years, people thought that Boltzmann machines would never be

practical devices. Then we found several different ways of

greatly speeding up the learning algorithm.

And now the algorithm is much more practical, and has, in fact, been used as

part of the winning entry for a million dollar machine learning competition, which

I'll talk about in a later video. The Boltzmann machine learning algorithm is an unsupervised learning algorithm. Unlike the typical use of backpropagation, where we have an input vector and we provide it with a desired output, in Boltzmann machine learning we just give it the input vector. There are no labels. What the algorithm is trying to do is build a model of a set of input vectors, though it might be better to think of them as output vectors.

1:15

What we want to do is maximize the product of the probabilities, that the Boltzmann

machine assigns to a set of binary vectors, namely the ones in the training set. This is equivalent to maximizing the sum

of the log probabilities that the Boltzmann machine assigns to the training

vectors. It's also equivalent to maximizing the

probability that we'd obtain exactly the N training cases if we ran the Boltzmann machine in the following way: we let it settle to its stationary distribution with no external input, then sample a visible vector, and repeat that N different times.

2:07

Now let's consider the main reasons why the learning could be difficult.

This is probably the most important reason.

If you consider a chain of units, a chain of hidden units here, with visible units attached to the two ends, and if we use a training set that consists of (1,0) and (0,1), in other words, we want the two visible units to be in opposite states, then the way to achieve that is by making sure that the product of all those weights is negative.

So, for example, if all of the weights are positive, turning on the first visible unit will, via W1, tend to turn on the first hidden unit. And that will tend to turn on the second

hidden unit, and so on. And the fourth hidden unit will tend to

turn on the other visible unit. If one of those weights is negative, then

we'll get an anti-correlation between the two visible units.

3:01

What this means is that if we're thinking about learning weight W1, we need to know about other weights. To know how to change W1, we need to have information about W3, because if W3 is negative, what we want to do with W1 is the opposite of what we want to do with W1 if W3 is positive.

3:28

So given that one weight needs to know about other weights in order to change in even the right direction, it's very surprising that there's a very simple learning algorithm, and that the learning algorithm only requires local information.

3:44

So it turns out that everything that one weight needs to know about all the other

weights and about the data is contained in the difference of two correlations.

Another way of saying that is: if you take the log probability that the Boltzmann machine assigns to a visible vector v, and ask about the derivative of that log probability with respect to a weight Wij, it's the difference of the expected value of the product of the states of units i and j when the network settles to thermal equilibrium with v clamped on the visible units (that is, how often i and j are on together when v is clamped on the visible units and the network is at thermal equilibrium), minus the same quantity when v is not clamped on the visible units. Because the derivative of the log probability of a visible vector is this simple difference of correlations, we can make the change in the weight be proportional to the expected product of the activities, averaged over all visible vectors in the training set (that's what we call the data term), minus the product of the same two activities when you're not clamping anything and the network has reached thermal equilibrium with no external interference.
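The learning rule just described can be written as a short sketch (a hypothetical helper, assuming you already have binary states sampled at equilibrium in both phases):

```python
import numpy as np

def boltzmann_weight_update(data_states, model_states, learning_rate=0.01):
    """Sketch of the Boltzmann machine learning rule.

    data_states:  (num_samples, num_units) binary states sampled at thermal
                  equilibrium with training vectors clamped on the visible units.
    model_states: (num_samples, num_units) binary states sampled at thermal
                  equilibrium with nothing clamped.
    Returns delta_w proportional to <s_i s_j>_data - <s_i s_j>_model.
    """
    corr_data = data_states.T @ data_states / len(data_states)
    corr_model = model_states.T @ model_states / len(model_states)
    return learning_rate * (corr_data - corr_model)
```

The first term is the Hebbian "storage" part; the second term is the unlearning part that keeps the weights from blowing up.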

5:11

So this is a very interesting learning rule.

The first term in the learning rule says raise the weights in proportion to the

product of the activities the units have when you're presenting data.

That's the simplest form of what's known as a Hebbian learning rule.

Donald Hebb, a long time ago, in the 1940s or 1950s, suggested that synapses in the

brain might use a rule like that. But if you just use that rule, the synapse

strengths will keep getting stronger. The weights will all become very positive,

and the whole system will blow up. You have to somehow keep things under

control, and this learning algorithm is keeping things under control by using that

second term. It's reducing the weights in proportion to

how often those two units are on together, when you're sampling from the model's

distribution. You can also think of this as the first

term is like the storage term for a Hopfield Net.

And the second term is like the term for getting rid of spurious minima.

And in fact, this is exactly the right way to think about it. This rule tells you exactly how much unlearning to do.

6:26

Well, the probability of a global configuration at thermal equilibrium, that

is once you've let it settle down, is an exponential function of its energy.

The probability is proportional to e to the minus the energy.

6:41

So when we settle to equilibrium we achieve a linear relationship between the

log probability and the energy function. Now, the energy function is linear in the

weights. So, we have a linear relationship between

the weights and the log probability. And since we're trying to manipulate log

probabilities by manipulating weights, that's a good thing to have.

It's a log-linear model. In fact, the relationship is very simple: the derivative of the energy with respect to a particular weight Wij is just the product of the two activities that that weight connects.
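A minimal sketch of that linearity (biases omitted for simplicity, symmetric weights with a zero diagonal assumed), checking numerically that perturbing a weight changes the negative energy by the product of the two activities it connects:

```python
import numpy as np

def energy(weights, states):
    """Energy of a global configuration, biases omitted:
    E = -sum_{i<j} w_ij * s_i * s_j, with `weights` symmetric and
    zero on the diagonal; the 0.5 avoids double counting."""
    return -0.5 * states @ weights @ states

# The (negative) energy is linear in each weight:
# d(-E)/dw_ij = s_i * s_j, the product of the two connected activities.
W = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 0.5],
              [2.0, 0.5, 0.0]])
s = np.array([1.0, 0.0, 1.0])

eps = 1e-6
W2 = W.copy()
W2[0, 2] += eps   # perturb the symmetric pair together,
W2[2, 0] += eps   # since w_02 and w_20 are one parameter
numeric_grad = (-energy(W2, s) + energy(W, s)) / eps
assert abs(numeric_grad - s[0] * s[2]) < 1e-5
```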

7:23

So what's happening here is that the process of settling to thermal equilibrium is propagating information about the weights. We don't need an explicit backpropagation stage.

We don't need an explicit back propagation stage.

We do need two stages. We need to settle with the data.

And we need to settle with no data. But notice that the network is behaving in pretty much the same way in those two phases.

A unit deep within the network is doing the same thing, just with different boundary conditions. With backprop, the forward pass and the backward pass are really rather different. Another question you could ask is: what's that negative phase for? I've already said it's like the unlearning

we do in a Hopfield net to get rid of spurious minima.

But let's look at it in more detail. The equation for the probability of a visible vector is that it's a sum over all hidden vectors of e to the minus the energy of that visible and hidden vector together, normalized by the same quantity summed over all pairs of visible and hidden vectors.
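In symbols, the equation just described is (with $u$ and $g$ ranging over all visible and hidden vectors in the normalizer):

```latex
p(v) \;=\; \frac{\sum_{h} e^{-E(v,h)}}{\sum_{u}\sum_{g} e^{-E(u,g)}}
```

The first term in the learning rule works on the numerator; the second term works on the denominator, the partition function.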

So if you look at the top line, what the first term in the learning rule is doing is decreasing the energy of terms in that sum that are already large, and it finds those terms by settling to thermal equilibrium with the vector v clamped, so that it can find an h that goes nicely with v, that is, one that gives a nice low energy with v. Having sampled those vectors h, it then changes the weights to make that energy even lower.

9:03

The second phase in the learning, the negative phase, is doing the same thing,

but for the partition function. That is, the normalizing term on the

bottom line. It's finding global configurations,

combinations of visible and hidden states that give low energy,

And therefore, are large contributors to the partition function.

And having found those global configurations, it tries to raise their energy so that they can contribute less. So the first term is making the top line

big, and the second term is making the bottom line small.

9:43

Now in order to run this learning rule, you need to collect those statistics.

You need to collect what we call the positive statistics, those are the ones

when you have data clamped on the visible units, and also the negative statistics,

those are the ones when you don't have data clamped and that you're going to use

for unlearning. An inefficient way to track these

statistics was suggested by me and Terry Sejnowski in 1983.

And the idea is, in the positive phase, you clamp a data vector on the visible units, you set the hidden units to random binary states, and then you update the hidden units one at a time until the network reaches thermal equilibrium at a temperature of one.

10:29

We actually did that by starting at a high temperature and reducing it, but that's

not the main point here. And then once you reach thermal

equilibrium, you sample how often two units are on together.

So you're measuring the correlation of i and j with that visible vector clamped.

You then repeat that over all the visible vectors, so that the correlation you're sampling is averaged over all the data. Then in the negative phase, you don't

clamp anything. The network is free from external

interference. So, you set all of the units, both visible

and hidden, to random binary states. And then you update the units, one at a

time, until the network reaches thermal equilibrium, at a temperature of one.

Just like you did in the positive phase. And again, you sample the correlation of every pair of units i and j, and you repeat that many times.
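The two phases just described can be sketched with Gibbs sampling (hypothetical helpers; the unit ordering, the fixed number of update sweeps standing in for "settling to equilibrium", and the convention that the visible units are the first indices are all assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(states, weights, free):
    """Update each unit listed in `free`, one at a time, at temperature one:
    a unit turns on with probability sigmoid(its total input).
    `weights` is symmetric with a zero diagonal."""
    for i in free:
        total_input = weights[i] @ states
        states[i] = float(rng.random() < sigmoid(total_input))
    return states

def sample_correlations(weights, clamped_visible=None, n_units=6, sweeps=50):
    """One statistics-gathering run. Positive phase: a data vector is clamped
    on the visible units (the first indices, by assumption) and only the
    hidden units update. Negative phase (clamped_visible=None): nothing is
    clamped and all units start random and update freely."""
    states = rng.integers(0, 2, size=n_units).astype(float)
    if clamped_visible is not None:
        states[: len(clamped_visible)] = clamped_visible
        free = range(len(clamped_visible), n_units)  # hidden units only
    else:
        free = range(n_units)                        # everything updates
    for _ in range(sweeps):   # crude stand-in for settling to equilibrium
        gibbs_sweep(states, weights, free)
    return np.outer(states, states)  # one sample of the products s_i * s_j
```

The learning rule then averages many positive-phase samples (over the whole training set), averages many negative-phase samples, and changes each weight in proportion to the difference.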

Now, it's very difficult to know how many times you need to repeat it, but certainly in the negative phase you expect the energy landscape to have many different minima that are fairly separated and have about the same energy.

The reason you expect that is we're going to be using Boltzmann machines to do

things like model a set of images. And you expect there to be reasonable

images, all of which have about the same energy.

And then very unreasonable images, which have much higher energy.

And so you expect a small fraction of the space to be these low energy states.

And a very large fraction of the space to be these bad high energy states.

If you have multiple modes, it's very unclear how many times you need to repeat

this process to be able to sample those modes.