0:00

In this video I'm going to describe echo state networks.

These use a clever trick to make it much easier to learn a recurrent neural network. They initialize the connections in the recurrent neural network in such a way that it has a big reservoir of coupled oscillators. So if you provide input to it, it converts that input into the states of these oscillators, and then you can predict the output you want from the states of these oscillators. The only thing you have to learn is how to couple the output to the oscillators.
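
To make that concrete, here's a minimal sketch (illustrative names and shapes, not code from the video) of the two pieces: a state update whose weights stay fixed, and a readout that is the only thing learned.

```python
import numpy as np

def reservoir_step(h, x, W_res, W_in):
    # Fixed, random recurrent dynamics: the "reservoir of coupled oscillators".
    return np.tanh(W_res @ h + W_in @ x)

def readout(h, W_out):
    # The only learned mapping: from the reservoir state to the output.
    return W_out @ h
```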

0:43

This entirely gets rid of the problem of learning hidden-to-hidden connections, or even input-to-hidden connections. However, to get these networks to be good at complicated tasks, you need a very big hidden state. As we'll see at the end of the video, there's no reason not to use the initialization that was carefully designed for echo state networks, and then to use backpropagation through time with momentum to train the networks to be even better at the tasks they're doing.

One interesting and quite recent idea about training recurrent neural networks is to not train the hidden-to-hidden connections at all, but to just fix them randomly, and hope that you can learn sequences just by training the way they affect the outputs.

1:48

So a very simple way to train a feedforward neural network is to make the early layers of feature detectors just be random. You put in sensibly sized random weights, and then all you learn is the last layer, so that you're learning a linear model from the activities of the hidden units in the last layer to the outputs. And of course it's much faster to learn a linear model. This relies on the idea that a big random expansion of the input vector can often make it easy for a linear model to fit the data, when it couldn't fit the data well just looking at the raw inputs.
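
Here's a hedged illustration of that feedforward version (the data, sizes, and scales are made up, not from the video): a fixed random expansion of the input followed by a linear readout fitted by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # 200 examples of a 10-dimensional input
y = np.sin(X).sum(axis=1)             # some non-linear target

# Fixed random first layer: expand 10 inputs into 500 random features.
W_rand = rng.normal(size=(10, 500))
H = np.tanh(X @ W_rand)

# Learn only the last layer: a linear model from random features to the output.
W_out, *_ = np.linalg.lstsq(H, y, rcond=None)
print(((H @ W_out - y) ** 2).mean())  # training error of the linear readout
```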

In the little neural network shown here, those red weights are fixed at random. They expand the input vector, and then, using that expanded representation, we try to fit a linear model. This actually has some quite strong similarities with support vector machines, which are really just a very efficient way of doing this. So those same ideas, many years later, were recycled for recurrent neural networks.

The idea is to give the input-to-hidden connections and the hidden-to-hidden connections random values that are carefully chosen, and just learn the final layer of hidden-to-output connections. The learning is then very simple if you use linear output units, and it can be done extremely fast.

This approach is only ever going to work if you set the random connections very carefully, so that the recurrent neural network doesn't die out with no activity and doesn't explode.

So the way they set the random connections in an echo state network is to set the hidden-to-hidden weights so that the length of the activity vector stays about the same after each iteration. For those of you used to linear systems and matrices, you're setting it so that the spectral radius is one. That is, the biggest eigenvalue of the matrix of hidden-to-hidden weights is one, or it would be one if it was a linear system, and you want to achieve the same property in a non-linear system. If you set those weights to be about the right magnitude, then an input can echo around in the recurrent state for a long time.
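
A minimal sketch of that initialization (the matrix size is just illustrative): draw a random hidden-to-hidden matrix and rescale it so that its spectral radius, the magnitude of its largest eigenvalue, is about one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden = 300

W_res = rng.normal(size=(n_hidden, n_hidden))
spectral_radius = np.max(np.abs(np.linalg.eigvals(W_res)))
W_res *= 1.0 / spectral_radius  # largest eigenvalue magnitude is now ~1,
                                # so activity neither dies out nor explodes
```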

4:20

So instead of having lots of medium-sized weights, we have a few quite large weights, and nearly all the weights are zero in the hidden-to-hidden connections. What this does is make a lot of loosely coupled oscillators, so information can hang around in one part of the net without being propagated to other parts of the net too quickly.

It's also important to choose the scale of the input-to-hidden connections very carefully. Those connections need to drive the states of the loosely coupled oscillators, but they mustn't wipe out the information that those oscillators contain about the recent history.
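
Continuing the sketch (the sparsity level and input scale shown here are assumptions one would tune, not values from the video): make most hidden-to-hidden weights zero, leave a few fairly large ones, and give the input-to-hidden weights a modest scale so the input drives the oscillators without erasing their recent history.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_in = 300, 1

# Sparse hidden-to-hidden weights: ~90% zeros, the rest fairly large,
# giving loosely coupled groups of units.
mask = rng.random((n_hidden, n_hidden)) < 0.1
W_res = rng.normal(size=(n_hidden, n_hidden)) * mask
W_res *= 1.0 / np.max(np.abs(np.linalg.eigvals(W_res)))  # spectral radius ~1

# Input-to-hidden weights with a small scale, so inputs nudge the
# oscillators rather than wiping out what they remember.
W_in = rng.normal(scale=0.1, size=(n_hidden, n_in))
```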

5:01

Fortunately, the learning is very fast in echo state networks, so we can afford to experiment with the scales of the important connections. You could think of it as a little learning loop that's just learning the scales of those connections, and it's doing it by a sort of feedback that involves the experimenter. It also helps to learn the level of sparseness that's needed in the hidden-to-hidden connections, and again, because the learning is so fast, you can afford to experiment with that.

5:35

That's important, because it's often necessary to do those experiments to get the system to work well.

So I'm now going to show you a simple example, taken from the web, of an echo state network. It has an input sequence, which is a real value that varies with time and specifies the frequency of a sine wave for the output of the echo state network. So you'd like this thing to generate sine waves, and the input is going to specify the frequency. The target output sequence is a sine wave with the frequency specified by the input. And it's going to be learned simply by fitting a linear model that takes the states of the hidden units and from those tries to predict the correct scalar output value.
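
Here is a hedged, self-contained sketch of that learning step (the sizes, scales, input signal, and ridge penalty are assumptions, not details from the video): drive a fixed reservoir with the frequency signal, collect the hidden states, and fit a linear readout to the target sine wave.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, T = 300, 5000

# Fixed random reservoir and input weights (sparse, spectral radius ~1, small input scale).
W_res = rng.normal(size=(n_hidden, n_hidden)) * (rng.random((n_hidden, n_hidden)) < 0.1)
W_res *= 1.0 / np.max(np.abs(np.linalg.eigvals(W_res)))
W_in = rng.normal(scale=0.1, size=(n_hidden,))

# Input: a slowly varying frequency. Target: a sine wave at that frequency.
freq = 0.01 + 0.02 * (1 + np.sin(2 * np.pi * np.arange(T) / 1000))
target = np.sin(2 * np.pi * np.cumsum(freq))

# Run the reservoir and collect its hidden states.
H = np.zeros((T, n_hidden))
h = np.zeros(n_hidden)
for t in range(T):
    h = np.tanh(W_res @ h + W_in * freq[t])
    H[t] = h

# The only learning: a ridge-regularised linear model from states to output.
ridge = 1e-6
W_out = np.linalg.solve(H.T @ H + ridge * np.eye(n_hidden), H.T @ target)
print(((H @ W_out - target) ** 2).mean())  # training error of the readout
```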

So here's a picture, taken from Scholarpedia, of an echo state network doing this task. The input signal is the desired frequency of the sine wave. The output signal after it's learned, or the teacher signal when it's learning, is a sine wave with the frequency specified by the input. The stuff in the middle is a big dynamical reservoir: the inputs coming from the input signal drive those loosely coupled oscillators and cause complicated dynamics that go on for a long time. The output weights are learning to map that complicated dynamics to the particular dynamics you want for the output. All the other pictures are showing you the actual dynamics of individual units inside the dynamical reservoir.

One thing to notice is that there are also connections from the output back to the reservoir. Those aren't always needed, but they help to tell the reservoir what has been produced so far.
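
In a sketch like the ones above, that feedback is just one more fixed random matrix (here called W_back, an illustrative name) feeding the previous output back into the state update.

```python
import numpy as np

def reservoir_step_with_feedback(h, x, y_prev, W_res, W_in, W_back):
    # W_back is fixed and random too; it lets the reservoir "hear"
    # what the network has produced so far.
    return np.tanh(W_res @ h + W_in @ x + W_back @ y_prev)
```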

So here's an example of what the system actually produces after it's learned. You can see that at the beginning it's producing a sine wave that's in phase. At the end, it's producing a sine wave of the right frequency, but the phase is wrong, and that's because we weren't telling it what phase the sine wave should be in. So it's satisfying the requirement of producing an appropriate frequency.

There are some very good aspects of echo state networks. They can be trained very fast, because they just fit a linear model. They also demonstrate how important it is to initialize the hidden-to-hidden weights sensibly. And they can do quite impressive modeling of one-dimensional time series; that's where they excel. They can look at a time series for a while, and then predict it very well a long time into the future.

8:31

What they're not so good at is modeling high-dimensional data, like frames of acoustic coefficients or frames of video. In order to model data like that, they need many more hidden units than a recurrent neural network where you train the hidden-to-hidden connections.

Recently, Ilya Sutskever tried something which is fairly obvious, which is to initialize a recurrent neural network using all the tricks developed by the people doing echo state networks. Once you've done that, you know you could learn quite well just by learning the hidden-to-output connections. But then, presumably, you could learn even better if you also learn to make the hidden-to-hidden weights better. So Ilya tried using the echo state network initialization but then training with backpropagation through time. He used rmsprop with momentum, and he discovered that that is actually a very effective way to train recurrent neural networks.
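
Here is a hedged sketch of that combination (the shapes, learning rate, and data layout are placeholders, and PyTorch is just one convenient way to get backpropagation through time; this is not Ilya's exact setup): initialize a vanilla RNN's recurrent weights echo-state style, then train all the weights with rmsprop plus momentum.

```python
import torch
import torch.nn as nn

n_in, n_hidden, n_out = 1, 300, 1
rnn = nn.RNN(n_in, n_hidden, batch_first=True)
readout = nn.Linear(n_hidden, n_out)

# Echo-state-style initialization of the recurrent weights:
# sparse, rescaled to spectral radius ~1, with a modest input scale.
with torch.no_grad():
    W = torch.randn(n_hidden, n_hidden) * (torch.rand(n_hidden, n_hidden) < 0.1).float()
    W *= 1.0 / torch.linalg.eigvals(W).abs().max()
    rnn.weight_hh_l0.copy_(W)
    rnn.weight_ih_l0.mul_(0.1)

params = list(rnn.parameters()) + list(readout.parameters())
opt = torch.optim.RMSprop(params, lr=1e-4, momentum=0.9)

def train_step(x, y):            # x, y: (batch, time, 1) tensors
    states, _ = rnn(x)           # backpropagation through time over the whole sequence
    loss = nn.functional.mse_loss(readout(states), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```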