0:00

In this video, I'm going to talk about the storage capacity of Hopfield nets.

Their ability to store a lot of memories is limited by what are called spurious

memories. These occur when two nearby energy minima

combine to make a new minimum in the wrong place.

Attempts to remove these spurious minima eventually led to a very interesting way

of doing learning in things considerably more complicated than a basic Hopfield

net. At the end of the video,

I'll also talk about a curious historical rediscovery where the physicist trying to

increase the capacity of Hopfield nets, rediscovered the perceptron convergence

procedure. Off to Hopfield, invented Hopfield nets as

memory storage devices. The field became obsessed by the storage

capacity of a Hopfield net. Using Hopfield Storage Rule for a totally

connected net, the capacity is about 0.15N memories.

That is, if you have N binary threshold units, the number of memories you can

store is about 0.15N before memories start getting confused with one another.

So that's the number you can store and still hope to retrieve them sensibly.

1:35

This doesn't make efficient use of the bits that are required to store the

weights. In other words, if you look at how many

bits the computer is using to store the weights, it's using well over 0.15N

squared2 bits to store the weights. And therefore, this kind of distributed

memory and local energy minima is not making efficient use of the bits in the

computer. We can analyze how many bits we should be

able to store if we were making efficient use of the bits in the computer.

Those N squared weights and biases in the net.

And after storing M memories, each connection weight has an integer value in

the range -M to M. That's because we increase it by one or decrease it by one

each time we store a memory, assuming that we used states of -one and one.

Now, of course, not all values will be equiprobable, so we could compress that

information. But ignoring that, the number bits it

would take us to store a connection rate in a naive way is log 2M + one, Cuz that's

the number of alternative connection rates and that's a log to the base two.

And so, the total number of bits of computer memory that we use is of the

order of N squared log 2M + one. So, notice that, that scales

logarithmically with M. Whereas, if you store things in the way

that Hopfield suggests, you get this constants 0.15 instead of something this

scale is logarithmically. So, we're not so worried about the fact

that the constant is a lot less than two, What we're worried about is this

logarithmic scaling. That shows we ought to be able to do

something better. If we ask, what limits the capacity of a

Hopfield net? What is it that causes it to break down? Then, its merging of energy

minima. So, each time we memorize a binary

configuration, we hope that we'll create a new energy minima.

So, we might have a state space for all the states of the net being depicted

horizontally here, and the energy being depicted vertically.

And, we might have one en, energy minimum for the blue pattern and another for the

green pattern. But, if those two patents are nearby, what

will happen is we won't get two seperate minima. They'll merge to create one

minimum at an intermediate location. And that means, we can't distinguish those two

seperate memories, and indeed we'll recall something, that's a blend of them rather

than the individual memories. That's what limits the capacity of a

Hopfield net, that kind of merging of nearby minima.

4:22

One thing I should mention is this picture, is a big misrepresentation. The

states of a Hopfield matter really, the corners of a hyper cube. And, it's not

very good to show, the corners of a hyper cube, as if they were continous one

dimensional horizontal space. One very interesting idea that came out of

thinking about how to improve the crest of the Hopfield net is the idea of

unlearning. This was first suggested by Hopfield,

Feinstine and Palmer, who suggested the following strategies.

4:55

You left a net settle from a random initial state, and then you do unlearning.

That is whatever binary state it settles to, you apply opposite of the storage

rule. I think you can see that with the previous

example, that red merge minimum. If you let the net settle there and did

some unlearning on that merge minimum, you'd get back the two separate minima cuz

you'd pull up that red point. So, by getting rid of deep spurious

minima, we can actually increase the memory capacity.

5:38

Francis Crick, one of the discovers of the structure of DNA, and Graham Micherson,

proposed that unlearning might be what's going on during REM sleep, that is Rapid

Eye Movement sleep. So, the idea was that during the day, you

store lots of things, and you'll get spurious minima.

Then at night, you put the network in a random state, you settle to a minimum,

And you unlearn what you settled to. And that actually explains a big puzzle.

This is a puzzle that doesn't seem to puzzle most people that study sleep but it

ought to. Each night, you go to sleep and you dream

for several hours. When you wake up in the morning, those dreams are all gone. Well,

they're not quite all gone. The dream you had just before you woke up, you can get

into short term memory and you'll remember it for a while. And if you think about it,

you might remember it for a long time. But, we know perfectly well that if we'd

woken you up at other times in the night, you'd have been having other dreams, and

in the morning their just not there. So, it looks like you're simply not

storing what you're dreaming about, and the question is, why? In fact, why do you

bother to dream at all? Dreaming is paradoxical and that the state

of your brain looks extremely like the state of your brain when you're awake,

except that it's not being driven by real input. It's being driven by a relay

station just after the real input called the thalamus.

7:20

But, there's another problem with unlearning, which is more mathematical

problem, Which is, how much unlearning should we do?

Now, given more you've seen in the school so far, a real solution to that problem

will be to show that unlearning is part of the process of fitting a model to data.

And, if you do maximum likelihood fitting of that model, then unlearning will

automatically come out and fit into the model.

And what's more, you'll know exactly how much unlearning to do.

8:07

Before we get to that, I want to talk a little bit about ways that physicists

discovered for increasing the capacity of the Hopfield net.

As I said, this was a big obsession with the field.

I think it's because physicists really love the idea that math they already know

might explain how the brain works. That means, post doctoral fellows in

physics who can't get a job in physics might be able to get a job in

neuroscience. So, there are a very large number of

papers published in physics journals about Hopfield and their storage capacity.

Eventually, a very smart student called, Elizabeth Gardner, figured out that

there's actually a much better storage rule if you were concerned about capacity.

And that it would use the full capacity of the weights.

And I think this storage rule will be familiar to you.

8:56

Instead of trying to store vectors in one go, what we're going to do is we're going

to cycle through the training set many times. So, we lose our nice online

property that you only have to go through the data once. But in return, we're going

to gain, more efficient storage. What we going to do is we going to use the

perceptual convergent procedure to train each unit to have the correct state given

the states of all the other units in that global vector that we want to store.

So, you take your net, you put it into the memory state you want to store, and then

you take each unit separately and say, would this unit adopt the state I want for

it, given the states of all the other units?

If it would, you leave its incoming weights alone.

If it wouldn't, you change its incoming weights in the weights specified by

convergence procedures. And notice, these would be integer changes to the weights.

9:55

You may have to do this several times, and of course, if you give it to many

memories, this won't converge. You only get convergence with a perceptron

convergence procedure if there is a set of weights that will solve the problem.

But assuming there is, this is much more efficient way to store memories in a

Hopfield net. This technique is also being developed in

another field, statistics. And statisticians call the technique

pseudo-likelihood. The idea is to get one thing right given

all the other things. And so, with high dimensional data, if you

want to build a model of it, the idea is you build a model that tries to get the

value on one dimension right given the values on all the other dimensions.

The main difference between the perceptron convergence procedure that's normally used

and pseudo-likelihood is that, in the Hopfield net, the weights are symmetric.

So, we have to get two sets of gradients for each weight and average them.

But, apart from that, the way to use the full capacity of a Hopfield net is to use

the perceptron convergence procedure and to go through the data several times.