By now, you've seen most of the key building blocks of RNNs.

But, there are just two more ideas that let you build much more powerful models.

One is bidirectional RNNs,

which let you, at a given point in time, take

information from both earlier and later in the sequence,

so we'll talk about that in this video.

And the second is deep RNNs,

which you'll see in the next video.

So let's start with Bidirectional RNNs.

So, to motivate bidirectional RNNs,

let's look at this network which you've seen a few times

before in the context of named entity recognition.

And one of the problems of this network is that,

to figure out whether the third word Teddy is a part of the person's name,

it's not enough to just look at the first part of the sentence.

So to tell, if Y three should be zero or one,

you need more information than

just the first three words, because the first three words don't tell you if they're

talking about Teddy bears or about the former US president, Teddy Roosevelt.

So this is a unidirectional, or forward-direction-only, RNN.

And, this comment I just made is true,

whether these cells are

standard RNN blocks or whether they're GRU units or whether they're LSTM blocks.

But all of these blocks are in a forward only direction.

So what a bidirectional RNN, or BRNN, does

is fix this issue.

So, a bidirectional RNN works as follows.

I'm going to use a simplified example with four inputs, say a four-word sentence.

So we have four inputs.

X one through X four.

So this network's hidden layer will have a forward recurrent component.

So I'm going to call this, A one, A two,

A three and A four,

and I'm going to draw a right arrow

over that to denote this is the forward recurrent component,

and so they'll be connected as follows.

And so, each of these four recurrent units inputs the current X,

and then feeds in to

help predict Y-hat one,

Y-hat two, Y-hat three, and Y-hat four.

So, so far I haven't done anything.

Basically, we've drawn the RNN from the previous slide,

but with the arrows placed in slightly funny positions.

But I drew the arrows in

these slightly funny positions because what we're going to

do is add a backward recurrent layer.

So we'd have A one,

left arrow to denote this is a backward connection,

and then A two, backwards,

A three, backwards and A four,

backwards, so the left arrow denotes that it is a backward connection.

And so, we're then going to connect the network up as follows.

And these A backward units will be connected to each other going backward in time.

So, notice that this network defines an acyclic graph.

And so, given an input sequence, X one through X four,

the forward sequence will first compute A forward one,

then use that to compute A forward two,

then A forward three, then A forward four.

Whereas, the backward sequence would start by computing A backward four,

and then go back and compute A backward three,

and note that, since you are computing network activations,

this is not backprop; this is forward prop.

But the forward prop has part

of the computation going from left to right and

part of the computation going from right to left in this diagram.

But having computed A backward three,

you can then use those activations to compute A backward two,

and then A backward one, and then finally, having computed all your hidden activations,

you can then make your predictions.
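The two passes described above can be sketched in NumPy. This is only a minimal illustration, assuming a standard tanh RNN cell with separate weights for each direction; the names `rnn_step`, `bidirectional_activations`, and `init_params`, and the sizes used, are hypothetical and not from the lecture.

```python
import numpy as np

def rnn_step(x_t, a_prev, W_a, W_x, b):
    """One standard (tanh) RNN step; a GRU or LSTM cell could be swapped in."""
    return np.tanh(W_a @ a_prev + W_x @ x_t + b)

def bidirectional_activations(xs, fwd, bwd, n_a):
    """Compute forward and backward activations for xs of shape (T, n_x).

    fwd and bwd are (W_a, W_x, b) tuples; each direction has its own weights.
    """
    T = xs.shape[0]
    a_fwd = np.zeros((T, n_a))
    a_bwd = np.zeros((T, n_a))

    # Left to right: a_fwd[0], a_fwd[1], ..., a_fwd[T-1]
    a = np.zeros(n_a)
    for t in range(T):
        a = rnn_step(xs[t], a, *fwd)
        a_fwd[t] = a

    # Right to left: a_bwd[T-1], ..., a_bwd[0]. This is still forward prop.
    a = np.zeros(n_a)
    for t in reversed(range(T)):
        a = rnn_step(xs[t], a, *bwd)
        a_bwd[t] = a

    return a_fwd, a_bwd

def init_params(rng, n_a, n_x):
    """Small random weights (assumed initialization, for illustration only)."""
    return (rng.normal(size=(n_a, n_a)) * 0.1,
            rng.normal(size=(n_a, n_x)) * 0.1,
            np.zeros(n_a))

rng = np.random.default_rng(0)
n_x, n_a, T = 3, 5, 4          # four inputs, as in the example above
xs = rng.normal(size=(T, n_x))
fwd = init_params(rng, n_a, n_x)
bwd = init_params(rng, n_a, n_x)
a_fwd, a_bwd = bidirectional_activations(xs, fwd, bwd, n_a)
print(a_fwd.shape, a_bwd.shape)  # (4, 5) (4, 5)
```

Neither loop depends on the other, so the two directions could in principle be computed in either order, or in parallel, before the predictions are made.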

And so, for example,

to make the predictions,

your network will have something like Y-hat at time t is an activation function

applied to WY with both the forward activation at time t,

and the backward activation at time

t being fed in to make that prediction at time t.

So, if you look at the prediction at time step three, for example,

then information from X one can flow through here,

forward one to forward two,

these are all stated in the function here, to forward three to Y-hat three.

So information from X one, X two,

X three is all taken into account, whereas information from X four can flow

through A backward four to A backward three to Y-hat three.

So this allows the prediction at time three to take

as input both information from the past,

as well as information from the present which goes

into both the forward and the backward units at this step,

as well as information from the future.
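Written out, the prediction described above might look like the following. This is a sketch under two assumptions that the spoken description does not spell out: that "both being fed in" means the forward and backward activations are concatenated, and that there is a bias term b_y.

```latex
\hat{y}^{\langle t \rangle}
  = g\!\left( W_y \left[ \overrightarrow{a}^{\langle t \rangle},\
                         \overleftarrow{a}^{\langle t \rangle} \right] + b_y \right)
```

For t = 3 in the four-input example, the forward activation depends on X one through X three, and the backward activation depends on X four and X three, so Y-hat three can draw on the whole sequence.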

So, in particular, given a phrase like, "He said,

Teddy Roosevelt..." To predict

whether Teddy is a part of the person's name,

you take into account information from the past and from the future.

So this is the bidirectional recurrent neural network and these blocks

here can be not just the standard RNN block but they

can also be GRU blocks or LSTM blocks.

In fact, for a lot of NLP problems,

that is, for a lot of text or natural language processing problems,

a bidirectional RNN with LSTM blocks is commonly used.

So, if you have an NLP problem where you have the complete sentence

and you're trying to label things in the sentence,

a bidirectional RNN with LSTM blocks both

forward and backward would be a pretty reasonable first thing to try.

So, that's it for the bidirectional RNN and this is

a modification you can make to the basic RNN architecture or the GRU or the LSTM,

and by making this change you can have a model that

uses an RNN, GRU, or LSTM and is able to make

predictions anywhere even in the middle of a sequence by taking into

account information potentially from the entire sequence.

The disadvantage of the bidirectional RNN is that you do

need the entire sequence of data before you can make predictions anywhere.

So, for example, if you're building a speech recognition system,

then the BRNN will let you take into account

the entire speech utterance but if you use this straightforward implementation,

you need to wait for the person to stop talking to get

the entire utterance before you can

actually process it and make a speech recognition prediction.
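The latency issue can be illustrated with a toy sketch. It assumes a scalar tanh RNN step with made-up weights and a stand-in `microphone` generator; all names and values here are hypothetical and are not a real speech recognition pipeline.

```python
import math

def step(x_t, a_prev, w_a=0.5, w_x=1.0):
    """Toy scalar RNN step (hypothetical weights, for illustration only)."""
    return math.tanh(w_a * a_prev + w_x * x_t)

def unidirectional_stream(frames):
    """Emit one activation per incoming frame, as soon as it arrives."""
    a = 0.0
    for x_t in frames:
        a = step(x_t, a)
        yield a

def bidirectional_offline(frames):
    """Buffer the whole utterance; only then can the backward pass start."""
    xs = list(frames)  # consumes every frame before any output is produced
    fwd, a = [], 0.0
    for x_t in xs:
        a = step(x_t, a)
        fwd.append(a)
    bwd, a = [0.0] * len(xs), 0.0
    for t in reversed(range(len(xs))):
        a = step(xs[t], a)
        bwd[t] = a
    return list(zip(fwd, bwd))

def microphone(log):
    """Stand-in for an audio source; records when each frame is produced."""
    for i, x in enumerate([0.2, -0.1, 0.4, 0.3]):
        log.append(f"frame {i} arrives")
        yield x

# Unidirectional: outputs interleave with arriving frames.
log = []
for i, _ in enumerate(unidirectional_stream(microphone(log))):
    log.append(f"unidirectional output {i}")

# Bidirectional: all frames arrive first, then all outputs become available.
log_bi = []
outputs = bidirectional_offline(microphone(log_bi))
log_bi.append(f"bidirectional outputs 0..{len(outputs) - 1}")
```

In `log`, each output follows its frame immediately, while in `log_bi` every frame has to arrive before any output exists.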

So for real-time speech recognition applications,

there are somewhat more complex models that are used, rather than just

using the standard bidirectional RNN as you've seen here.

But for a lot of natural language processing applications where

you can get the entire sentence all at the same time,

the standard BRNN algorithm is actually very effective.

So, that's it for BRNNs, and in the next and final video for this week,

let's talk about how to take all of these ideas, RNNs,

LSTMs and GRUs and the bidirectional versions, and construct deep versions of them.