0:09

The algorithm is really quite simple once you have seen the equivalence between a

recurrent neural network and a feed forward neural network that has one layer

for each time step. I'll also talk about ways of providing

input, and desired outputs, to recurrent neural networks.

0:49

The key to understanding how to train a recurrent network is to see that a

recurrent network is really just the same as a feed forward network, where you've

expanded the recurrent network in time. So the recurrent network starts off in

some initial state, shown at the bottom there at time zero.

And then it uses the weights on its connections to get a new state, shown at

time one. It then uses the same weights again to

get another new state, and it uses the same weights again to get another new

state and so on. So it's really just a feed forward

network, where the weights are constrained to be the same at every layer.
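To make that unrolling concrete, here is a minimal numpy sketch (the names `W_hh`, `W_xh` and the tanh nonlinearity are my own illustrative choices, not the lecture's notation): the same weight matrices are applied at every time step, exactly like a feed forward net whose layers share weights.

```python
import numpy as np

def unrolled_forward(x_seq, h0, W_hh, W_xh):
    """Run a simple recurrent net forward by unrolling it in time.
    The same W_hh and W_xh are reused at every time step, so this is
    a feed forward net whose layer weights are constrained to be equal."""
    states = [h0]
    for x in x_seq:
        # one "layer" of the unrolled net; same weights every time
        states.append(np.tanh(W_hh @ states[-1] + W_xh @ x))
    return states

rng = np.random.default_rng(0)
W_hh = rng.standard_normal((3, 3)) * 0.1
W_xh = rng.standard_normal((3, 2)) * 0.1
h0 = np.zeros(3)                                # initial state at time zero
xs = [rng.standard_normal(2) for _ in range(4)]
states = unrolled_forward(xs, h0, W_hh, W_xh)   # one state per time slice
```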

1:39

Now backprop is good at learning when there are weight constraints.

We saw this for convolutional nets and just to remind you, we can actually

incorporate any linear constraint quite easily in backprop. So we compute the

gradients as usual, as if the weights were not constrained.

And then we modify the gradients, so that we maintain the constraints.

2:04

So if we want W1 to equal W2, we start them off equal, and then we need to make

sure that the change in W1 is equal to the change in W2.

And we do that by simply taking the derivative of the error with respect to W1,

and the derivative of the error with respect to W2, and adding or averaging them, and then

applying the same quantity for updating both W1 and W2.
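As a sketch of that constraint trick (the variable names are mine): compute the two gradients as if W1 and W2 were free, then apply their average to both, so weights that start equal stay equal.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.standard_normal((2, 2))
W2 = W1.copy()                 # start off satisfying the constraint W1 == W2

# pretend these came from an ordinary backward pass, ignoring the constraint
dW1 = rng.standard_normal((2, 2))
dW2 = rng.standard_normal((2, 2))

g = (dW1 + dW2) / 2.0          # average (the sum works too, up to a factor)
lr = 0.1
W1 -= lr * g                   # the same update is applied to both copies,
W2 -= lr * g                   # so they remain equal after the step
```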

2:28

So if the weights started off satisfying the constraints they'll continue to

satisfy the constraints. The backpropagation through time algorithm

is just the name for what happens when you think of a recurrent net as a layered feed

forward net with shared weights, and you train it with backpropagation.

So, we can think of that algorithm in the time domain.

The forward pass builds up a stack of activities at each time slice.

And the backward pass peels activities off that stack and computes error derivatives

at each time step as it goes backwards. That's why it's called backpropagation

through time. After the backward pass we can add

together the derivatives at all the different time steps for each particular

weight. And then change all the copies of that

weight by the same amount which is proportional to the sum or average of all

those derivatives. There is an irritating extra issue.
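Putting the forward stack and the backward peel together, a minimal backpropagation-through-time sketch might look like this (a tanh unit with squared-error targets at every step; all names here are illustrative assumptions, not the lecture's notation):

```python
import numpy as np

def bptt(x_seq, targets, h0, W):
    # forward pass: push the activities for each time slice onto a stack
    hs = [h0]
    for x in x_seq:
        hs.append(np.tanh(W @ hs[-1] + x))
    # backward pass: peel activities off the stack, computing error
    # derivatives at each time step as we go back
    dW = np.zeros_like(W)
    dh = np.zeros_like(h0)
    for t in reversed(range(len(x_seq))):
        dh = dh + (hs[t + 1] - targets[t])   # squared-error derivative here
        dz = dh * (1 - hs[t + 1] ** 2)       # back through the tanh
        dW += np.outer(dz, hs[t])            # add in this time step's copy
        dh = W.T @ dz                        # pass the derivative back a layer
    # dW is the sum over all copies of the shared weight; every copy is
    # then changed by the same amount, proportional to this sum
    return dW

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 3)) * 0.1
xs = [rng.standard_normal(3) for _ in range(4)]
ts = [np.zeros(3) for _ in range(4)]
dW = bptt(xs, ts, np.zeros(3), W)
```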

If we don't specify the initial state of all the units, for example, if some of

them are hidden or output units, then we have to start them off in some particular

state. We could just fix those initial states to

have some default value like 0.5, but that might make the system work not quite as

well as it would if it had some more sensible initial value.

So we can actually learn the initial states.

We treat them like parameters rather than activities and we learn them the same way

as we learn the weights. We start off with an initial random guess

for the initial states. That is the initial states of all the

units that aren't input units. And then at the end of each training sequence we back

propagate through time all the way back to the initial states.

And that gives us the gradient of the error function with respect to the

initial state. We then just adjust the initial states by

following that gradient. We go downhill in the gradient, and that

gives us new initial states that are slightly different.
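A toy sketch of that procedure (a one-step net with a squared-error target; every name here is an illustrative assumption): backpropagate to get the gradient with respect to the initial state, then step downhill on it.

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((3, 3)) * 0.5
target = np.array([0.5, -0.5, 0.0])
h0 = rng.standard_normal(3) * 0.1        # initial random guess for the state

def error(h0):
    h1 = np.tanh(W @ h0)
    return 0.5 * np.sum((h1 - target) ** 2)

def initial_state_grad(h0):
    # backpropagate all the way back to the initial state
    h1 = np.tanh(W @ h0)
    dz = (h1 - target) * (1 - h1 ** 2)
    return W.T @ dz                      # dE/d(initial state)

before = error(h0)
for _ in range(200):
    h0 -= 0.1 * initial_state_grad(h0)   # go downhill in the gradient
after = error(h0)
```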

4:29

There's many ways in which we can provide the input to a recurrent neural net.

We could, for example, specify the initial state of all the units.

That's the most natural thing to do when we think of a recurrent net, like a feed

forward net with constrained weights. We could specify the initial state of just

a subset of the units, or we can specify the states at every time step of a

subset of the units and that's probably the most natural way to input sequential

data. Similarly, there are many ways we can specify

targets for a recurrent network. When we think of it as a feed forward

network with constrained weights, the natural thing to do is to specify the

desired final states for all of the units. If we're trying to train it to settle to

some attractor, we might want to specify the desired states not just for the final

time step but for several time steps. That will cause it to actually settle down

there, rather than passing through some state and going off somewhere else.
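A sketch of that idea (a free-running tanh net with squared-error targets; all names are my assumptions): give the same desired state at each of the last k time steps, and the derivatives from those steps simply add in during the backward pass.

```python
import numpy as np

def bptt_late_targets(W, h0, T, target, k):
    """Specify the desired state for the last k of T time steps, pushing
    the net to settle at the target rather than just pass through it."""
    hs = [h0]
    for _ in range(T):
        hs.append(np.tanh(W @ hs[-1]))       # free-running forward pass
    dW = np.zeros_like(W)
    dh = np.zeros_like(h0)
    for t in reversed(range(T)):
        if t >= T - k:                       # a target at each late step;
            dh = dh + (hs[t + 1] - target)   # its derivative just adds in
        dz = dh * (1 - hs[t + 1] ** 2)
        dW += np.outer(dz, hs[t])
        dh = W.T @ dz
    return dW

rng = np.random.default_rng(4)
W = rng.standard_normal((3, 3)) * 0.1
dW = bptt_late_targets(W, rng.standard_normal(3), T=6, target=np.zeros(3), k=3)
```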

So by specifying several states at the end, we can force it to learn attractors

and it's quite easy as we back propagate to add in derivatives that we get from

each time step. So the backpropagation starts at the top,

with the derivatives for the final time step.

And then as we go back through the layer before the top we add in the derivatives

for that time step, and so on. So it's really very little extra effort to

have derivatives at many different layers. Or we could specify the desired activity of

a subset of units which we might think of as output units.

And that's a very natural way to train a recurrent neural network that is meant to

be providing a continuous output.