0:00

In this video, I'm going to explain how adding noise can help systems escape from

local minima. And, I'm going to show what you have to do

to the units in a Hopfield net to add noise in the appropriate way.

I'm now going to introduce the idea that we can find better minima by using noise.

So, a Hopfield net always makes decisions that reduce the energy or, if it doesn't change the

state of the unit, the energy stays the same.

This makes it impossible to climb out of a local minimum.

So, if you look at the landscape here. If we get into the local minimum A,

there's no way we're going to get over the energy barrier to get to the better

minimum B because we can't go uphill in energy.

If we add random noise, we can escape from poor minima, especially minima that are

shallow, that is, ones that don't have big energy barriers around them.

It turns out, rather than using a fixed noise level, the most effective strategy

is to start with a lot of noise which allows you to explore the space on a

coarse scale and find the generally good regions of the space, and then to decrease

the noise level. With a lot of noise, you can cross big

barriers. As you decrease the noise level, you start

concentrating on the best nearby minima. If you slowly reduce the noise, so the

system ends up in a deep minimum, that's called simulated annealing.
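The procedure just described — start with lots of noise, reduce it slowly, and settle into a deep minimum — can be sketched generically. This is a minimal illustration on a made-up one-dimensional landscape; the Metropolis-style acceptance rule, the geometric cooling schedule, and all the specific numbers are my choices for illustration, not anything from the lecture.

```python
import math
import random

def anneal(energy, neighbor, x0, t_start=10.0, t_end=0.01,
           alpha=0.95, steps_per_t=100, seed=0):
    """Generic simulated annealing.

    Start at a high temperature, where big energy barriers are easy to
    cross, then cool geometrically so the search concentrates on the
    best nearby minima.  Returns the lowest-energy state visited.
    """
    rng = random.Random(seed)
    x, t = x0, t_start
    best, best_e = x0, energy(x0)
    while t > t_end:
        for _ in range(steps_per_t):
            y = neighbor(x, rng)
            d = energy(y) - energy(x)
            # Metropolis rule: always accept downhill moves; accept
            # uphill moves with probability exp(-dE / T), so uphill
            # moves are likely at high T and rare at low T.
            if d <= 0 or rng.random() < math.exp(-d / t):
                x = y
                if energy(x) < best_e:
                    best, best_e = x, energy(x)
        t *= alpha  # slowly reduce the noise level
    return best

# Toy landscape: a shallow minimum at x = 2 (energy 1) and a deeper
# minimum at x = 8 (energy 0), separated by an energy barrier.
def E(x):
    return min((x - 2.0) ** 2 + 1.0, (x - 8.0) ** 2)

def step(x, rng):
    return min(10.0, max(0.0, x + rng.uniform(-1.0, 1.0)))

best = anneal(E, step, x0=2.0)  # start stuck in the shallow minimum
```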

And this idea was proposed by Kirkpatrick at around the same time as

Hopfield nets were proposed. So, the reason simulated annealing works is

that the temperature, in a physical system, or in a simulated system with an

energy function, affects the transition probabilities.

So, in a high temperature system, the probability of going uphill from B to A is

lower than the probability of going downhill from A to B.

But it's not much lower. In effect, a high temperature flattens the

energy landscape. The little black dots here are meant to be particles.

And what we are imagining is particles moving about according to the transition

probabilities that you get with an energy function and a temperature.

And this might be a typical distribution if you run the system at a high

temperature, where it's easier to cross barriers, but it's also hard to stay in a

deep minimum once you've got there. If you run the system at a much lower

temperature, then your probability of crossing barriers

gets much smaller but your ratio gets much better.

So, the ratio of the probability of going from A to B versus the probability of

going from B to A is much better in the low temperature system.

And so, if we run it long enough, we would expect all of the particles to end up in

B. But if we just run it for a long time at

low temperature, it will take a very long time for particles to escape from A.

And it turns out a good compromise is to start at a high temperature and gradually

reduce the temperature. The way we get noise into a Hopfield net is

to replace the binary threshold units by binary stochastic units that make biased

random decisions. And the amount of noise is controlled by

something called temperature, which you'll see in a minute in the

equation. Raising the noise level is equivalent to

decreasing all the energy gaps between configurations.
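The rule being described is, in its standard form, the logistic function of the energy gap: the unit turns on with probability 1 / (1 + exp(-delta E / T)). Here's a sketch in Python, with variable and function names of my choosing.

```python
import math
import random

def p_on(delta_e, t):
    """Probability that a binary stochastic unit adopts state 1.

    delta_e: the unit's energy gap, E(unit off) - E(unit on).
    t:       temperature.  Raising t scales the gap down, flattening
             the probability toward 0.5 -- which is why raising the
             noise level is equivalent to decreasing all the energy
             gaps between configurations.
    """
    if t == 0:
        # Zero-temperature limit: a deterministic binary threshold
        # unit that always adopts the lower-energy state.
        return 1.0 if delta_e > 0 else 0.0
    return 1.0 / (1.0 + math.exp(-delta_e / t))

def update(delta_e, t, rng=random):
    """Biased random decision: on with probability p_on, else off."""
    return 1 if rng.random() < p_on(delta_e, t) else 0
```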

4:09

As we lower the temperature, depending on the sign of delta E, the unit

will become either more and more firmly on or more and more firmly off.

At zero temperature, which is what we use in a Hopfield net,

the sign of delta E determines whether the right hand side goes to zero

or goes to one. That is, with T at zero, the right hand side will be exactly zero

or one. And so, the unit will behave

deterministically and that's a binary threshold unit.

It will always adopt whatever of the two states is the lowest energy.

So, the energy gap is the one we saw on a previous slide: it's just the difference in the

energy of the whole system depending on whether unit i is off or unit i is

on. Although simulated annealing is a very

powerful method for improving searches that get stuck in local optima, and

although it was influential in leading Terry Sejnowski and me to the ideas behind

Boltzmann machines, it's actually a big distraction from understanding Boltzmann

machines. So, I'm not going to talk about it anymore

in this course even though it's a very interesting idea.

And, from now on, I'm going to use binary stochastic units that have a temperature

of one. That is, they use the standard logistic

function of the energy gap. So, one concept that you need to

understand in order to understand the learning procedure for Boltzmann machines,

is the concept of thermal equilibrium. And because we're setting the temperature

to one, this is the concept of thermal equilibrium at a fixed temperature.

It's a difficult concept. Most people think that it means the system has settled

down and isn't changing anymore. That's normally what equilibrium means. But it's

not the states of the individual units that are settled down.

6:10

The individual units are still rattling around at thermal equilibrium, unless the

temperature is zero. The thing that settles down is the probability distribution over

configurations. That's a difficult concept the first time you meet it, and so I'm

going to give you an example. The probability distribution settles to a

particular distribution called the Stationary Distribution.

The stationary distribution is determined by the energy function of the system.

And, in fact, in the stationary distribution, the probability of any

configuration is proportional to e to the minus its energy.
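Written out (this is the standard Boltzmann distribution, with the temperature set to one; the notation is mine), the stationary probability of a configuration v is:

```latex
p(v) \;=\; \frac{e^{-E(v)}}{\sum_{u} e^{-E(u)}}
```

where the sum in the denominator runs over all possible configurations u, so it just normalizes the probabilities to sum to one.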

A nice intuitive way to think about thermal equilibrium is to imagine a huge

ensemble of identical systems that all have exactly the same energy function.

So, imagine a very large number of stochastic Hopfield nets all with the same

weights. Now, in that huge ensemble, we can define

the probability of configuration as the fraction of the systems that are in that

configuration. So, now we can understand what's happening

as we approach thermal equilibrium. We can start with any distribution we like

over all these identical systems. We could make them all be in the same

configuration. So, that's the distribution with a probability of one on one

configuration and zero on everything else. Or we could start them off with an

equal number of systems in each possible configuration.

So that's a uniform distribution. And then, we're going to keep applying our

stochastic update rule, which, in the case of a stochastic

Hopfield net, would mean: you pick a unit and you look at its

energy gap. And you make a random decision based on

that energy gap about whether to turn it on or turn it off.

Then, you go and pick another unit, and so on.

We keep applying that stochastic rule. And after we've run the systems stochastically

in this way, we may eventually reach a situation where

the fraction of the systems in each configuration remains constant.

In fact, that's what will happen if we have symmetric connections.

That's the stationary distribution that physicists call thermal equilibrium.
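This ensemble picture can be simulated directly. Here is a minimal sketch, assuming a tiny two-unit stochastic Hopfield net with one symmetric weight; the net, the weight value, the ensemble size, and the number of updates are my illustrative choices, not anything from the lecture. Every system starts in the same configuration, and after enough stochastic updates the fraction of systems in each configuration settles down to the stationary distribution.

```python
import math
import random

def update_system(s, w, rng, t=1.0):
    """One stochastic update of a 2-unit net: pick a unit, compute its
    energy gap, and make a biased random decision about its new state."""
    i = rng.randrange(2)
    # With energy E(s) = -w * s[0] * s[1], the energy gap for unit i is
    # delta_e = E(unit i off) - E(unit i on) = w * (state of the other unit).
    delta_e = w * s[1 - i]
    p_on = 1.0 / (1.0 + math.exp(-delta_e / t))
    s[i] = 1 if rng.random() < p_on else 0

rng = random.Random(0)
w = 2.0
# A big ensemble of identical systems, all started in the same
# configuration (0, 0): a distribution with probability one on (0, 0).
ensemble = [[0, 0] for _ in range(20000)]

for sweep in range(60):
    for s in ensemble:
        update_system(s, w, rng)

# The fraction of systems in each configuration is what settles down.
counts = {}
for s in ensemble:
    counts[tuple(s)] = counts.get(tuple(s), 0) + 1
fractions = {c: n / len(ensemble) for c, n in counts.items()}
# At equilibrium these fractions approach e^(-E) / Z: here the
# configuration (1, 1) has energy -2 and each other one has energy 0,
# so p(1, 1) = e^2 / (3 + e^2), roughly 0.71.
```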

Any given system keeps changing its configuration.

We apply the update rule, and the states of its units will keep

flipping between zero and one. But, the fraction of systems in any

particular configuration doesn't change. And that's because we have many, many more

systems than we have configurations. So, here's an analogy just to help with

the concept. Imagine a very large casino in Las Vegas

with lots of card dealers. And, in fact, we have many more than 52 factorial card

dealers. We start with all the card packs in the standard order that they come from

the manufacturer. Let's suppose that order starts with the ace of spades, then the king of spades,

then the queen of spades. And then, the dealers all start shuffling.

And they do random shuffles, they don't do fancy shuffles that bring them back to the

same order again. After a few shuffles, there's still a good

chance that the king of spades will be next to the queen of spades in any given

pack. So, the packs have not yet forgotten where

they started. Their initial order is still influencing

their current order. If we keep shuffling, eventually the

initial order will be irrelevant. The packs will have forgotten where they

started. And, in fact, in this example, there will

be an equal number of packs in each of the 52 factorial possible orders.

Once this has happened, if we carry on shuffling,

there'll still be an equal number of packs in each of the 52 factorial orders.

That's why it's called equilibrium. It's because the fraction in any one

configuration doesn't change, even though the individual systems are

still changing. The thing that's wrong with this analogy

is that once we've reached equilibrium here, all configurations have equal energy.

And so, they all have the same probability.

In general, we're interested in reaching equilibrium for systems where some

configurations have lower energy than others.