Very, very deep neural networks are difficult to train because

of vanishing and exploding gradient types of problems.

In this video, you'll learn about

skip connections, which allow you to take the activation from

one layer and suddenly feed it to another layer even much deeper in the neural network.

And using that, you'll build ResNet which enables you to train very, very deep networks.

Sometimes even networks of over 100 layers. Let's take a look.

ResNets are built out of something called a residual block,

so let's first describe what that is.

Here are two layers of a neural network where you

start off with some activations in layer a[l],

then goes to a[l+1], and then the activation two layers later is a[l+2].

So let's go through the steps in this computation. You have a[l],

and then the first thing you do is you apply this linear operator to it,

which is governed by this equation.

So you go from a[l] to compute z[l+1]

by multiplying by the weight matrix and adding that bias vector.

After that, you apply the ReLU nonlinearity, to get a[l+1].

And that's governed by this equation where a[l+1] is g(z[l+1]).

Then in the next layer,

you apply this linear step again,

which is governed by that equation.

So this is quite similar to this equation we saw on the left.

And then finally, you apply another ReLU operation which is

now governed by that equation where g here would be the ReLU nonlinearity.

And this gives you a[l+2].

So in other words,

for information from a[l] to flow to a[l+2],

it needs to go through all of these steps which I'm going to call

the main path of this set of layers.
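The equations pointed at on the slide are not part of this transcript. As a sketch reconstructed from the description above, the main path can be written as follows, where W and b are assumed names for the weight matrix and bias vector of each layer and g is the ReLU nonlinearity:

```latex
\begin{align*}
z^{[l+1]} &= W^{[l+1]} a^{[l]} + b^{[l+1]}, & a^{[l+1]} &= g\big(z^{[l+1]}\big),\\
z^{[l+2]} &= W^{[l+2]} a^{[l+1]} + b^{[l+2]}, & a^{[l+2]} &= g\big(z^{[l+2]}\big).
\end{align*}
```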

In a residual net,

we're going to make a change to this.

We're going to take a[l],

and just fast forward it, copy it,

much further into the neural network to here,

and just add a[l]

before applying the non-linearity, the ReLU non-linearity.

And I'm going to call this the shortcut.

So rather than needing to follow the main path,

the information from a[l] can now follow

a shortcut to go much deeper into the neural network.

And what that means is that this last equation

goes away and we instead have that the output

a[l+2] is the ReLU non-linearity g applied to z[l+2] as before,

but now plus a[l] inside the ReLU, so a[l+2] = g(z[l+2] + a[l]).

So, the addition of this a[l] here,

it makes this a residual block.
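As a minimal sketch of the residual block just described, here is a pure-Python version. The function names (relu, linear, residual_block) and the example weights are illustrative assumptions, not something from the lecture. The sketch also assumes a[l] and z[l+2] have the same dimension, so that the addition is well defined.

```python
# Minimal pure-Python sketch of one residual block.
# All names and numbers here are illustrative assumptions.

def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]

def linear(W, a, b):
    """Compute W a + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, a)) + b_i
            for row, b_i in zip(W, b)]

def residual_block(a_l, W1, b1, W2, b2):
    """Return a[l+2] = g(z[l+2] + a[l]) with g = ReLU."""
    z1 = linear(W1, a_l, b1)   # z[l+1]
    a1 = relu(z1)              # a[l+1]
    z2 = linear(W2, a1, b2)    # z[l+2]
    # Shortcut: add a[l] after the linear step, before the ReLU.
    return relu([z + s for z, s in zip(z2, a_l)])

# Hypothetical example values, chosen only to exercise the code.
a_l = [1.0, 2.0]
W1 = [[0.5, -1.0], [1.0, 0.0]]
b1 = [0.0, 0.5]
W2 = [[1.0, 1.0], [-2.0, 0.5]]
b2 = [0.5, 0.0]
a_l2 = residual_block(a_l, W1, b1, W2, b2)
print(a_l2)
```

The only difference from the plain two-layer path is the last line of residual_block, where a[l] is added to z[l+2] before the final ReLU.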

And in pictures, you can also modify this picture on

top by drawing this shortcut to go here.

And we are going to draw it as going into this second layer here

because the shortcut is actually added before the ReLU non-linearity.

So each of these nodes here

applies a linear function and a ReLU.

So a[l] is being injected after the linear part but before the ReLU part.

And sometimes instead of the term shortcut,

you also hear the term skip connection,

and that refers to a[l] just skipping over a layer or kind of skipping over

almost two layers in order to pass information deeper into the neural network.

So, the inventors of ResNet,

that would be Kaiming He, Xiangyu Zhang,

Shaoqing Ren and Jian Sun,

found that using residual blocks

allows you to train much deeper neural networks.

And the way you build a ResNet is by taking many of these residual blocks,

blocks like these, and stacking them together to form a deep network.

So, let's look at this network.

This is not a residual network;

this is called a plain network.

This is the terminology of the ResNet paper.

To turn this into a ResNet,

what you do is you add all those

skip connections, or all those shortcut connections, like so.

So every two layers end up with

that additional change that we saw on

the previous slide to turn each of these into a residual block.

So this picture shows five residual blocks stacked together,

and this is a residual network.
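The stacking described here can be sketched in pure Python as below. This is only an illustration under assumptions: the helper names, the layer width of 4, and the small random initialization are placeholders chosen for the example, and every block keeps the same dimension so the shortcut addition is valid.

```python
import random

# Sketch: five residual blocks stacked into a small residual network.
# Names, sizes, and initialization are illustrative assumptions.

def relu(v):
    return [max(0.0, x) for x in v]

def linear(W, a, b):
    return [sum(w * x for w, x in zip(row, a)) + b_i
            for row, b_i in zip(W, b)]

def residual_block(a, params):
    """Two linear+ReLU layers with a shortcut from the block input."""
    W1, b1, W2, b2 = params
    a1 = relu(linear(W1, a, b1))
    z2 = linear(W2, a1, b2)
    # Add the block input before the second ReLU.
    return relu([z + s for z, s in zip(z2, a)])

def init_block(n, rng):
    """Placeholder initialization: small random weights, zero biases."""
    def mat():
        return [[rng.uniform(-0.1, 0.1) for _ in range(n)] for _ in range(n)]
    return (mat(), [0.0] * n, mat(), [0.0] * n)

def resnet_forward(x, blocks):
    """Pass x through each residual block in turn."""
    a = x
    for params in blocks:
        a = residual_block(a, params)
    return a

rng = random.Random(0)
n = 4
blocks = [init_block(n, rng) for _ in range(5)]  # five blocks = ten layers
x = [1.0, 0.5, 0.0, 2.0]
y = resnet_forward(x, blocks)
print(y)
```

Removing the `+ s` term in residual_block would turn the same stack back into a plain network in the sense used above.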

And it turns out that if you use

your standard optimization algorithm such as

gradient descent or one of

the fancier optimization algorithms to train a plain network,

so without all the extra residual connections,

without all the extra shortcuts or skip connections I just drew in,

empirically, you find that as you increase the number of layers,

the training error will tend to decrease after

a while but then it'll tend to go back up.

And in theory as you make a neural network deeper,

it should only do better and better on the training set.

Right. So, in theory,

having a deeper network should only help.

But in practice or in reality,

having a plain network, so no ResNet,

having a plain network that is very deep means that

your optimization algorithm just has a much harder time training.

And so, in reality,

your training error gets worse if you pick a network that's too deep.

But what happens with ResNet is that even as the number of layers gets deeper,

you can have the training error kind of keep on going down,

even if we train a network with over a hundred layers.

And now some people are experimenting with networks of

over a thousand layers, although I don't see that used much in practice yet.

But by taking these activations, be it x or

these intermediate activations, and allowing them to go much deeper in the neural network,

this really helps with the vanishing and exploding gradient problems

and allows you to train

much deeper neural networks without really appreciable loss in performance,

and maybe at some point, this will plateau, this will flatten out,

and it doesn't help that much to go to deeper and deeper networks.

But ResNets are indeed effective at helping train very deep networks.

So you've now gotten an overview of how ResNets work.

And in fact, in this week's programming exercise,

you get to implement these ideas and see them work for yourself.

But next, I want to share with you better intuition, or

even more intuition, about why ResNets work so well.

Let's go on to the next video.