0:00

I'm Hanlin, from Intel.

In this lecture, we will discuss recurrent neural networks.

We will start with a review of important topics from previous lectures, and then provide an overview of how RNNs operate.

We will then discuss word embeddings, and then take a look at some of the key recurrent neural network architectures out there, including LSTM and GRU networks.

As we now know, training a neural network consists of randomly initializing the weights, fetching a batch of data, and then forward-propagating it through the network.

We can then compute a cost, which is used to backpropagate and determine our weight updates: how do we change our weights to reduce the cost?

In this example, the network is being trained to recognize handwritten digits using the MNIST dataset.

Backpropagation uses the chain rule to compute the partial derivative of the cost with respect to all the weights in the network.

Using gradient descent, or a variant of gradient descent, we can take a step towards a better set of weights with a lower cost.

All of these operations can be computed efficiently as matrix multiplications.

So why do we need recurrent neural networks?

Feed-forward neural networks make a couple of assumptions. For example, they assume independence among the training examples: after a training example has passed through the network, its state is discarded; all we keep are the gradient updates.

But in many cases, such as with sequences of words, there is temporal dependence and contextual dependence that needs to be explicitly captured by the model.

Also, most feed-forward neural networks assume an input vector of a fixed length. For example, in all of our previous slides, the images were all fixed in size across a batch. However, text or speech can vary greatly in size, sometimes by an order of magnitude, and that is variability that cannot be captured by feed-forward networks.

Instead, RNNs introduce an explicit modeling of sequentiality, which allows them to capture both short-range and long-range dependencies in the data. Through training, such a model can also learn and adapt to the time scales in the data itself.

So RNNs are often used to handle any type of data where we have variable sequence lengths, and where contextual and temporal dependencies have to be captured in order for the task to be accomplished successfully.

The building block of recurrent neural networks is the recurrent neuron. What I am showing here is a simple affine layer, similar to what I had shown previously, where the output of the unit is the input multiplied by a weight matrix.

To turn this into a recurrent neuron, we add a recurrent connection, such that the activity at a particular time t depends on its activity at the previous time t − 1, multiplied by a recurrent weight matrix W_R.

And so now we have our equations, which are exactly the same as what we had seen before, except we have this additional term to represent that the activity depends on the activity of the previous time step, as determined by the weight matrix W_R.

Additionally, we add an output to this model, y(t), which takes the activity of that recurrent neuron and passes it through a matrix W_Y to provide the output.

And so here you have the fundamental building block of a recurrent neuron.
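The recurrent neuron's equations can be sketched in a few lines of NumPy. This is a minimal illustration, not the lecture's exact notation: the shapes, the tanh nonlinearity, and the bias term are assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wr, Wy, b):
    """One step of a simple recurrent neuron (illustrative shapes and names)."""
    # Hidden activity depends on the current input AND the previous hidden state.
    h_t = np.tanh(Wx @ x_t + Wr @ h_prev + b)
    # Read out an output from the hidden state through W_Y.
    y_t = Wy @ h_t
    return h_t, y_t

rng = np.random.default_rng(0)
Wx = rng.standard_normal((4, 3)) * 0.1   # input-to-hidden weights
Wr = rng.standard_normal((4, 4)) * 0.1   # recurrent (hidden-to-hidden) weights W_R
Wy = rng.standard_normal((2, 4)) * 0.1   # hidden-to-output weights W_Y
b  = np.zeros(4)

h = np.zeros(4)                          # initial hidden state
for x in rng.standard_normal((5, 3)):    # a length-5 input sequence
    h, y = rnn_step(x, h, Wx, Wr, Wy, b)
print(h.shape, y.shape)                  # (4,) (2,)
```

Note that the same three weight matrices are reused at every time step; only the hidden state h carries information forward.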

So how do we train such a network? We can unroll the recurrent neural network into a feed-forward one.

Here's what I mean by that. Here is our same recurrent neuron from before, taking its input x(t) and multiplying it by a weight matrix to provide h(t), the hidden representation. We can read out from this hidden representation by passing it through a weight matrix W_Y to get the output y(t). And importantly, we have an arrow, W_R, to represent the fact that the subsequent time step's activity, h(t+1), depends on what came before it.

And so here you can see the network unrolled into a feed-forward network in time, except that the weights W_R spanning these different layers of the feed-forward network are tied together. Once we've unrolled this recurrent neural network into a feed-forward network, we can apply the same backpropagation equations that I had shown previously to train it.
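To make the tied-weight idea concrete, here is a minimal scalar backpropagation-through-time sketch, using a linear RNN h_t = W_R·h_{t−1} + x_t. It is purely illustrative (the values and the cost C = h_T are made up); the key point is that because W_R is shared across every unrolled step, its gradient is the sum of the per-step contributions.

```python
# Forward pass of a tiny scalar linear RNN: h_t = Wr * h_{t-1} + x_t.
Wr = 0.9
xs = [1.0, 0.5, -0.3]
hs = [0.0]                       # hs[t] is the hidden state after t inputs
for x in xs:
    hs.append(Wr * hs[-1] + x)

# Suppose the cost is simply C = h_T, so dC/dh_T = 1; push it back through time.
dC_dh = 1.0
dC_dWr = 0.0
for t in reversed(range(len(xs))):
    dC_dWr += dC_dh * hs[t]      # contribution from step t (weights are tied)
    dC_dh *= Wr                  # gradient flows back through W_R to step t-1

# Analytically, h_3 = Wr**2 + 0.5*Wr - 0.3, so dC/dWr = 2*Wr + 0.5 = 2.3.
print(dC_dWr)                    # 2.3
```

Summing contributions across the unrolled steps is exactly what "tying the weights together" means for the gradient.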

Now that you understand how an RNN works, we can look at how they're used.

Here is an example where we use an RNN to learn a language model, such that it correctly predicts the next letter in a word based on the previous letters.

So you can imagine a world where we have a vocabulary of four letters, and the input here is one-hot encoded. Here we have the input character 'h', and the values in these gray boxes represent the hidden activations. At every time step, we read out a prediction of the next character.

Here, the model is quite confident that the next character is going to be 'e'. And as we apply more and more characters, it can begin to predict additional characters at each next time step.

So this is nothing more than a language model, where a model is learning to predict what happens next based on the sequence of characters that it had seen before.
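The character-level setup above can be sketched as follows. This is a toy, untrained model over the four-letter vocabulary {'h','e','l','o'}; the weight shapes and names are assumptions, and a trained network would assign high probability to the correct next character rather than the near-uniform distribution shown here.

```python
import numpy as np

vocab = ['h', 'e', 'l', 'o']
rng = np.random.default_rng(1)
Wx, Wr, Wy = (rng.standard_normal(s) * 0.1
              for s in [(8, 4), (8, 8), (4, 8)])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h = np.zeros(8)
for ch in "hell":                      # feed the prefix one character at a time
    x = np.eye(4)[vocab.index(ch)]     # one-hot encoding of the character
    h = np.tanh(Wx @ x + Wr @ h)
    p = softmax(Wy @ h)                # P(next character | characters so far)

print(dict(zip(vocab, p.round(3))))    # untrained, so roughly uniform
```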

Here's an example of that same language model, but now applied to words, where you have the input "cash flow is high", and the model is predicting the next word in that sentence based on what has been seen before.

You will also notice in this prediction that there is a level of ambiguity. When the model sees "its cash flow is", then, depending on the state of the market, you can see that its output predictions in this green box are roughly split between "low" and "high", which makes sense, as both are syntactically correct. The difference in probabilities between the two may be related to the context of the text, or to the data that it has previously been trained on.

Recurrent neural networks can also be merged with convolutional neural networks to produce an image-captioning network. As seen here, after we provide the image as the input, the network can learn to generate a caption, such as "a group of people shopping at an outdoor market". Here, the input to the RNN could be the feature representation from one of the last few layers of the convolutional neural network.

Training recurrent neural networks is very similar to training a feed-forward network. Using the chain rule, you can determine the partial derivative of the cost with respect to each weight in the network. In contrast to feed-forward networks, we now have costs associated with every single time step, and so what we do is combine the gradients across these time steps, as shown here.

The issue of vanishing and exploding gradients can be quite problematic for recurrent neural networks. Especially with deep networks, the weight W_R is repeatedly multiplied in at each time step. In that way, the magnitude of your gradients is proportional to the magnitude of W_R to the power of t. What this means is that if the weight is greater than one, the gradients can explode, whereas if it is less than one, they can vanish.
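You can see this scaling numerically in the scalar case. A recurrent weight only slightly above one blows up over a long sequence, and one slightly below one shrinks toward zero:

```python
# |W_R| ** t for a few sequence lengths: w > 1 explodes, w < 1 vanishes.
for w in (1.1, 0.9):
    print(w, [w ** t for t in (1, 10, 50, 100)])

assert 1.1 ** 100 > 1e4      # exploding: grows by orders of magnitude
assert 0.9 ** 100 < 1e-4     # vanishing: shrinks by orders of magnitude
```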

That is also going to depend a lot on the activation function in the hidden node. With a rectified linear unit, it is easier to imagine, for example, an exploding gradient, whereas with a sigmoid activation function, the vanishing gradient problem becomes much more pressing.

It is actually quite easy to detect exploding gradients during training: you simply see the cost explode. To counteract that, one can simply clip the gradients at a particular threshold, thus forcing the gradients to be within a certain bound. Additionally, optimization methods such as RMSprop can adaptively adjust the learning rate depending on the size of the gradient itself.
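Gradient clipping can be sketched in a few lines. This shows one common variant, clipping by the gradient's L2 norm; the function name is made up for illustration:

```python
import numpy as np

def clip_by_norm(grad, threshold):
    """Rescale grad so its L2 norm never exceeds threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)   # preserve direction, shrink magnitude
    return grad

g = np.array([30.0, 40.0])                 # norm = 50
print(clip_by_norm(g, 5.0))                # -> [3. 4.], norm clipped to 5
```

Because the rescaling preserves the gradient's direction, a clipped update still points toward lower cost, just with a bounded step size.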

Vanishing gradients, however, are more problematic to find, because it is not always obvious when they occur or how to deal with them. One of the more popular methods of addressing this issue is actually not to use the plain RNNs that I had introduced earlier, but to use LSTM and GRU networks. And that's what we're going to discuss next.

So one way to combat the vanishing or exploding gradient problem is this very simple model here, where you have a unit that's connected to itself over time with a weight of one. Now you can see that as you unroll this network, the activity at every time step is equal to the activity at the previous time step. While you no longer have a vanishing or exploding gradient here, it's not very interesting behavior, because it just repeats itself over and over again, as if it were a sort of memory unit.

So what we're going to do is manipulate this memory unit by adding different operations: the ability to flush the memory by rewriting it, the ability to add to the memory, and the ability to read from the memory.

So what we do is take this memory, as I had shown you before, where each unit is connected to the next one with a weight of one, and we attach what's shown here as an output gate. When we want to read from a memory cell, we take the activity of that memory cell, which is a vector, pass it through a tanh function, and multiply it by a gate, shown here as o(t). This gate is a vector of numbers between zero and one, and it controls what exactly is emitted from the network.

The gate itself is an affine layer with a sigmoid activation function. During training, the weights learn to produce the right output from the model, given the hidden state of the network.

You can see this math in the equation here, where the output of the model is an activation function that wraps an affine layer, and then we do an element-wise multiplication with the tanh of the memory cell. To make this easier to express, we represent the output gate as o(t), so now you just have o(t), and then the element-wise multiplication with the tanh of the activity from the memory cell. Importantly, this output gate is simply an affine layer, very similar to what we had introduced previously.

The forget gate follows a similar approach to the output gate. We use an affine layer with a different set of weights, whose outputs are between zero and one, and we insert that gate as a multiplication between the memory cell at time t and at time t + 1. If these values are close to zero, the values in the memory cell are forgotten from one time step to the next; if they are close to one, they are maintained across time.

The input gate is used to write new data to the memory cell. It has two components: an affine layer with weights W_C and a tanh activation function, which generates a new proposed input to the memory cell; and it also contains an input gate, as shown here, which modulates the proposed input and then writes it to the memory cell.

So we can think of the next state of the LSTM, c(t+1), as how much we want to forget from the previous time step, plus a proposal for the new time step's input, multiplied by how much we want to accept this new proposal. It is important to remember here that all the values in the network are vectors, not scalars.
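The forget, input, and output gates described above can be combined into a single LSTM step. Here is a minimal NumPy sketch: the weight layout (one matrix per gate over the concatenated input and hidden state) is one common convention, not necessarily the lecture's notation, and the names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W and b hold the four gates' parameters."""
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W['f'] @ z + b['f'])        # forget gate: keep or flush memory
    i = sigmoid(W['i'] @ z + b['i'])        # input gate: accept the proposal?
    c_tilde = np.tanh(W['c'] @ z + b['c'])  # proposed new memory content
    o = sigmoid(W['o'] @ z + b['o'])        # output gate: what to emit
    c_t = f * c_prev + i * c_tilde          # element-wise gating of the memory cell
    h_t = o * np.tanh(c_t)                  # read out through the output gate
    return h_t, c_t

rng = np.random.default_rng(2)
n_in, n_hid = 3, 5
W = {k: rng.standard_normal((n_hid, n_in + n_hid)) * 0.1 for k in 'fico'}
b = {k: np.zeros(n_hid) for k in 'fico'}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.standard_normal((4, n_in)):    # a length-4 input sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)                     # (5,) (5,)
```

Every gate operation is element-wise on vectors, which is exactly the point made above: all the values in the network are vectors, not scalars.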

So here is the LSTM model, with the forget gate, the input gate, and the output gate.

And here's an example of an LSTM where the value being recorded in the memory cell is the gender identity of the speaker. You can imagine this being a very important value for conditioning the predictions of the model itself.

You can see here that when the network encounters the word "Bob", it has learned to forget the previous activity, because now there is a new gender for the speaker, and it will learn to overwrite that value with a value of one, to represent a male speaker. Then the model continues on, processing more data.

So in this example, the forget gate outputs zero, because we want to forget everything that came before, and the input gate is one, to represent the fact that we have a male speaker in this particular sentence. That is important because, when we reach the prediction phase, we can use that value to predict "his" instead of "her" as the next possible word in the sentence.

Another popular architecture is the Gated Recurrent Unit, or GRU, which is essentially a simplified LSTM. Here, all the gates are compressed into one update gate, and the input module is an affine layer that proposes the input, which is combined with the update gate to obtain the representation at the next time step. The remember (reset) gate controls how much the previous time step's representation impacts the proposal.

We've seen many scenarios where the GRU performed similarly to the LSTM, so in that way it is somewhat attractive because of its simpler representation.

Bidirectional RNNs are also recurrent neural networks, except that they connect two hidden layers running in opposite directions to the same output. With this structure, the output layer can get information from both past and future states. Additionally, you can stack these bidirectional RNNs on top of each other to obtain more complex abstractions and features of the input. These architectures are often used in speech applications.

There, the transcription of speech may depend not just on the sounds that came before, but also on the audio that comes after. A great application of this is the Deep Speech 2 model, a state-of-the-art speech transcription model that was published by Baidu several years ago.

It is important to understand what LSTM units learn after the training process is completed. This is work by Andrej Karpathy, where he trained a language-model recurrent neural network on several important texts, such as the novel War and Peace, and also on a corpus of Linux kernel source code.

He identified individual cells that are sensitive to particular properties. On the top here, you can see a cell that is sensitive to position within the line, or a cell that changes its activation based on whether or not it is inside a quote.

In all of these examples, you have the text, and the color represents the activity of the particular identified unit. You can see that in the Linux kernel dataset we have cells that robustly activate inside if-statements, or even cells that turn on inside comments or quotes in the code itself.

It is exciting to see what we can build with recurrent neural networks, and the kinds of visualizations that will become available as we try to understand what these recurrent neural networks are learning as they ingest a large corpus of natural language.