0:00

In this video, I am going to give an overview of various types of models that

have been used for sequences. I'll start with the simplest kind of

model, the autoregressive model, which just tries to predict the next term of

the sequence from previous terms. I'll talk about more elaborate variants of

them using hidden units. And then I'll talk about more interesting

kinds of models that have hidden state and hidden dynamics.

These include linear dynamical systems and hidden Markov models.

Most of these are quite complicated kinds of models, and I don't expect you to

understand all the details of them. The main point of mentioning them is to be

able to show how recurrent neural networks are related to models of that

kind. When we're using machine learning to model

sequences, we often want to turn one sequence into another sequence.

For example, we might want to turn English words into French words or we might want

to take a sequence of sound pressures and turn it into a sequence of word identities

which is what's happening in speech recognition.

1:13

Sometimes we don't have a separate target sequence, and in that case we can get a

teaching signal by trying to predict the next term in the input sequence.

So the target output sequence is simply the input sequence with an advance of one

time step. This seems much more natural than trying

to predict one pixel in an image from all the other pixels or one patch of an image

from the rest of the image. One reason it probably seems more natural

is that for temporal sequences, there is a natural order to do the predictions in.

Whereas for images it's not clear what you should predict from what.

But in fact a similar approach works very well for images.
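The one-step-ahead setup described above can be sketched very simply; the toy sequence here is a made-up illustration, not from the lecture:

```python
import numpy as np

# Toy sequence; in speech it might be sound pressures, in text word ids.
seq = np.array([0.1, 0.5, 0.2, 0.9, 0.4])

# The target sequence is just the input sequence advanced by one time step:
# at time t, the model is asked to predict seq[t + 1].
inputs = seq[:-1]    # what the model sees
targets = seq[1:]    # what it should predict
```

Every position in the sequence yields a training pair, so no separate labels are needed.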

1:58

When we predict the next term in a sequence, it blurs the distinction

between supervised and unsupervised learning that I made at the beginning of the

course. So we use methods that were designed for

supervised learning to predict the next term.

But we don't require a separate teaching signal.

So in that sense, it's unsupervised. I'm now going to give a quick review of

some of the other models of sequences, before we get on to using recurrent neural

nets to model sequences. So a nice simple model for sequences that

doesn't have any memory is an autoregressive model.

What that does is take some previous terms in the sequence and try to predict the

next term basically as a weighted average of previous terms.

The previous terms might be individual values or they might be whole vectors.
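A linear autoregressive model of this kind can be sketched as follows; the order and the weights are made-up illustrative numbers (in practice the weights would be fitted to data, for example by least squares):

```python
import numpy as np

# Illustrative weights for an order-3 model; in practice they would be
# fitted to data, e.g. by least squares.
w = np.array([0.2, 0.3, 0.5])

def predict_next(history, w):
    """Predict the next term as a weighted sum of the last len(w) terms."""
    return float(w @ history[-len(w):])

seq = np.array([1.0, 2.0, 3.0, 4.0])
next_term = predict_next(seq, w)   # 0.2*2.0 + 0.3*3.0 + 0.5*4.0 = 3.3
```

The model has no memory beyond the window of terms it is handed: the prediction depends only on the last few values.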

And a linear autoregressive model would just take a weighted average of those to

predict the next term. We can make that considerably more

complicated by adding hidden units. So in a feedforward neural net, we might

take some previous input terms, put them through some hidden units, and predict the

next term. Memoryless models are only one subclass

of models that can be used for sequences. We can think about ways of generating

sequences, and one very natural way to generate a sequence is to have a model

that has some hidden state which has its own internal dynamics.

So, the hidden state evolves according to its internal dynamics, and the hidden

state also produces observations, and we get to see those observations.

That's a much more interesting kind of model.

4:04

If the dynamics of the hidden state is noisy and the way it generates outputs

from its hidden state is noisy, then by observing the output of a generative model

like this, you can never know for sure what its hidden state was.

The best you can do is to infer a probability distribution over the space of

all possible hidden state vectors. You can know that it's probably in some

part of the space and not another part of the space, but you can't pin it down

exactly. So with a generative model like this, if

you get to observe what it produces, and you now try to infer what the hidden state

was, in general that's very hard, but there are two types of hidden state model

for which the computation is tractable. That is, there's a fairly straightforward

computation that allows you to infer the probability distribution over the hidden

state vectors that might have caused the data.

Of course, when we do this and apply it to real data,

we're assuming that the real data is generated by our model.

So that's typically what we do when we're modeling things.

We assume the data was generated by the model and then we infer what state the

model must have been in, in order to generate that data.

5:23

The next three slides are mainly intended for people who already know about the two

types of hidden state model I'm going to describe.

The point of the slides is so that I make it clear how recurrent neural networks

differ from those standard models. If you can't follow the details of the two

standard models, don't worry too much. That's not the main point.

5:50

So one standard model is a linear dynamical system.

It's very widely used in engineering. This is a generative model that has real

valued hidden state. The hidden state has linear dynamics,

shown by those red arrows on the right. And the dynamics has Gaussian noise, so

that the hidden state evolves probabilistically.

6:15

There may also be driving inputs, shown at the bottom there, which directly influence

the hidden state. So the inputs influence the hidden state

directly, and the hidden state determines the output. To predict the next output of a

system like this, we need to be able to infer its hidden state.
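A linear dynamical system of this sort can be sketched as a generative process; the matrices and noise scales below are made-up illustrative numbers, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0.9, 0.1],
              [0.0, 0.9]])        # linear dynamics of the 2-D hidden state
C = np.array([[1.0, 0.0]])        # linear map from hidden state to 1-D output

def generate(T, q=0.1, r=0.1):
    """Roll the system forward T steps, returning the noisy observations."""
    h = np.zeros(2)
    ys = []
    for _ in range(T):
        h = A @ h + rng.normal(scale=q, size=2)          # noisy linear dynamics
        ys.append(C @ h + rng.normal(scale=r, size=1))   # noisy observation
    return np.array(ys)

ys = generate(50)
```

Only the observations `ys` are visible; the hidden trajectory `h` is exactly what inference has to recover.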

And these kinds of systems are used, for example, for tracking missiles.

In fact, one of the earliest uses of Gaussian distributions was for trying to

track planets from noisy observations. Gauss actually figured out that, if you

assume Gaussian noise, you could do a good job of that.

7:00

One nice property that a Gaussian has is that if you linearly transform a Gaussian,

you get another Gaussian. Because all the noise in a linear dynamical

system is Gaussian, it turns out that the distribution over

the hidden state given the observation so far, that is given the output so far, is

also a Gaussian. It's a full covariance Gaussian, and it's

quite complicated to compute what it is. But it can be computed efficiently.

And there's a technique called Kalman filtering.

This is an efficient recursive way of updating your representation of the hidden

state given a new observation. So, to summarize:

given observations of the output of the system, we can't be sure what hidden state

it was in, but we can estimate a Gaussian distribution over the possible hidden

states it might have been in. Always assuming, of course, that our model

is a correct model of the reality we're observing.

8:06

A different kind of hidden state model that uses discrete distributions rather

than Gaussian distributions, is a hidden Markov model.

And because it's based on discrete mathematics, computer scientists love

these ones. In a hidden Markov model, the hidden state

consists of a one-of-N choice.

So there are a number of things called states. And the system is always in exactly one of

those states. The transitions between states are

probabilistic. They're controlled by a transition matrix

which is simply a bunch of probabilities that say, if you're in state one at time

one, what's the probability of you going to

state three at time two? The output model is also stochastic.

So, the state that the system is in doesn't completely determine what output

it produces. There's some variation in the output that

each state can produce. Because of that, we can't be sure which

state produced a given output. In a sense, the states are hidden behind

this probabilistic veil, and that's why they're called hidden.
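Generating from such a model can be sketched like this; the transition and output probabilities are made-up numbers for a 3-state, 2-output HMM:

```python
import numpy as np

rng = np.random.default_rng(0)

trans = np.array([[0.7, 0.2, 0.1],    # trans[i, j] = P(next state j | state i)
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
emit = np.array([[0.9, 0.1],          # emit[i, k] = P(output k | state i)
                 [0.5, 0.5],
                 [0.1, 0.9]])

def sample(T, start=0):
    """Walk the chain for T steps, emitting one stochastic output per step."""
    s, states, outputs = start, [], []
    for _ in range(T):
        s = int(rng.choice(3, p=trans[s]))             # probabilistic transition
        states.append(s)
        outputs.append(int(rng.choice(2, p=emit[s])))  # stochastic output
    return states, outputs

states, outputs = sample(20)
```

An observer sees only `outputs`; the sequence `states` stays behind the probabilistic veil.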

9:19

Historically, the reason hidden units in a neural network are called hidden is

because I liked this term. It sounded mysterious, so I stole it for

neural networks. It is easy to represent the probability

distribution across n states with n numbers.

So, the nice thing about a hidden Markov model, is we can represent the probability

distribution across its discrete states. So, even though we don't know

what state it's in for sure, we can easily represent the probability distribution.

9:53

And to predict the next output from a hidden Markov model, we need to infer what

hidden state it's probably in. And so we need to get our hands on that

probability distribution. It turns out there's an easy method based

on dynamic programming that allows us to take the observations we've made and from

those compute the probability distribution across the hidden states.
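That dynamic-programming computation is the forward algorithm; here is a sketch for a made-up 2-state, 2-output model (the probabilities are illustrative, not from the lecture):

```python
import numpy as np

def forward_filter(outputs, trans, emit, prior):
    """P(state | outputs so far), computed recursively by dynamic programming."""
    alpha = prior * emit[:, outputs[0]]
    alpha /= alpha.sum()                          # normalize to a distribution
    for y in outputs[1:]:
        alpha = (trans.T @ alpha) * emit[:, y]    # propagate, then reweight
        alpha /= alpha.sum()
    return alpha                                  # distribution after last output

trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
prior = np.array([0.5, 0.5])
posterior = forward_filter([0, 0, 1], trans, emit, prior)
```

The cost is linear in the sequence length, because each step reuses the distribution from the previous step instead of summing over all state paths.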

Once we have that distribution, there is a nice elegant learning algorithm for hidden

Markov models, and that's what made them so appropriate for speech.

And in the 1970s, they took over speech recognition.

10:30

There's a fundamental limitation of HMMs. It's easiest to understand this

limitation, if we consider what happens when a hidden Markov model generates data.

At each time step when it's generating, it selects one of its hidden states.

So if it's got n hidden states, the temporal information stored in the hidden

state is at most log(n) bits. So that's all it knows about what it's

done so far. So now let's consider how much information

a hidden Markov model can convey to the second half of an utterance it produces

from the first half. So imagine it's already produced the first

half of an utterance. And now it's going to have to produce the

second half. And remember, its memory of what it said

for the first half is in which of the n states it's in.

So its memory only has log n bits of information in it.

To produce the second half that's compatible with the first half, we must

make the syntax fit. So for example, the number and tense must

agree. It also needs to make the semantics fit.

It can't have the second half of the sentence be about something totally

different from the first half. Also, the intonation needs to fit, so it

would sound very silly if the intonation contour completely changed halfway through

the sentence. There's a lot of other things that also

have to fit: the accent of the speaker,

the rate they're speaking at, how loudly they're speaking,
And the vocal tract characteristics of the speaker.

All of those things must fit between the second half of the sentence and the first

half. And so if you wanted a hidden Markov model

to actually generate a sentence, the hidden state has to be able to convey all

that information from the first half to the second half.

12:28

Now the problem is that all of those aspects could easily come to a hundred

bits of information. So the first half of the sentence needs to

convey a hundred bits of information to the second half and that means that the

hidden Markov model needs 2^100 states, and that's just too many.
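The arithmetic behind that claim is just logarithms; a tiny check (the 1024-state example is mine, not from the lecture):

```python
import math

# n hidden states can carry at most log2(n) bits about the past.
assert math.log2(1024) == 10        # 1024 states -> only 10 bits of memory

# Conversely, carrying 100 bits would require 2**100 states.
states_needed = 2 ** 100            # roughly 1.27e30 states
```

So the memory of an HMM grows only logarithmically with the number of states, which is why matching even modest information requirements blows up the state count.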

12:46

So that brings us to recurrent neural networks.

They have a much more efficient way of remembering information.

They're very powerful because they combine two properties. First, they have distributed

hidden state. That means several different units can be

active at once. So they can remember several different

things at once. They don't just have one active unit.

They're also nonlinear. You see, a linear dynamical system has a

whole hidden state vector. So it's got more than one value at a time,

but those values are constrained to act in a linear way so as to make inference easy.

In a recurrent neural network, we allow the dynamics to be much more complicated.
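A minimal recurrent update shows both properties at once: a real-valued vector of hidden units (distributed state) and a nonlinearity in the dynamics. The sizes and random weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

W_hh = rng.normal(scale=0.5, size=(4, 4))   # hidden-to-hidden weights
W_xh = rng.normal(scale=0.5, size=(4, 1))   # input-to-hidden weights

def rnn_step(h, x):
    """Nonlinear hidden dynamics: all units update together each step."""
    return np.tanh(W_hh @ h + W_xh @ x)

h = np.zeros(4)                             # distributed hidden state vector
for x in ([1.0], [0.5], [-1.0]):
    h = rnn_step(h, np.array(x))
```

Unlike the one-of-N state of an HMM, all four units carry information simultaneously, and the tanh makes the dynamics nonlinear.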

13:34

With enough neurons and enough time, a recurrent neural network can compute

anything that can be computed by a computer.

It's a very powerful device. So linear dynamical systems and hidden

Markov models are both stochastic models.

That is the dynamics and the production of observations from the underlying state

both involve intrinsic noise. And the question is, do models need to be

like that? Well, one thing to notice is that the

posterior probability distribution over hidden states in either a linear dynamical

system or hidden Markov model is a deterministic function of the data that

you've seen so far. That is, the inference algorithm for these

systems ends up with a probability distribution, and that probability

distribution is just a bunch of numbers, and those numbers are a deterministic

function of the data so far. In a recurrent neural network, you get a

bunch of numbers that are a deterministic function of the data so far.

And it might be a good idea to think of those numbers, which constitute the hidden

state of a recurrent neural network, as being very like the probability

distribution for these simple stochastic models.
15:09

So what kinds of behavior can recurrent neural networks exhibit? Well, they can oscillate. That's obviously good for things like

motion control where, when you're walking, for example, you want a regular

oscillation, which is your stride. They can settle to point attractors.

That might be good for retrieving memories.

And later on in the course we'll look at Hopfield nets where they use the settling

to point attractors to store memories. So the idea is you have a sort of rough

idea of what you're trying to retrieve. You then let the system settle down to a

stable point and those stable points correspond to the things you know about.

And so by settling to that stable point you retrieve a memory.

15:50

They can also behave chaotically if you set the weights in the appropriate regime.

Often, chaotic behavior is bad for information processing, because in

information processing, you want to be able to behave reliably.

You want to achieve something. There are some circumstances where it's a

good idea. If you're up against a much smarter

adversary, you probably can't outwit them, so it might be a good idea just to behave

randomly. And one way to get the appearance of

randomness is to behave chaotically. One nice thing about RNNs, which, a

long time ago, I thought was going to make them very powerful, is that an RNN

could learn to implement lots of little programs, using different subsets of its

hidden state. And each of these little programs could

capture a nugget of knowledge. And all of these things could run in

parallel, and interact with each other in complicated ways.

16:50

Unfortunately the computational power of recurrent neural networks makes them very

hard to train. For many years, we couldn't exploit the

computational power of recurrent neural networks.

There were some heroic efforts. For example, Tony Robinson managed to make

quite a good speech recognizer using recurrent nets.

He had to do a lot of work implementing them on a parallel computer built out of

transputers. And it was only recently that people

managed to produce recurrent neural networks that outperformed Tony Robinson's