0:30

So let's say convolving with the first

filter gives this first 4 by 4 output, and

convolving with this second filter gives a different 4 by 4 output.

The final thing to turn this into a convolutional neural net layer,

is that for each of these we're going to add it bias,

so this is going to be a real number.

And where python broadcasting, you kind of have to add the same number so

every one of these 16 elements.

And then apply a non-linearity which for this illustration that says relative

non-linearity, and this gives you a 4 by 4 output, all right?

After applying the bias and the non-linearity.

And then for this thing at the bottom as well, you add some different bias, again,

this is a real number.

So you add the single number to all 16 numbers, and

then apply some non-linearity, let's say a real non-linearity.

And this gives you a different 4 by 4 output.

Then same as we did before, if we take this and

stack it up as follows, so we ends up with a 4 by 4 by 2 outputs.

Then this computation where you come from a 6 by 6 by 3 to 4 by 4 by 4,

this is one layer of a convolutional neural network.

So to map this back to one layer of four propagation in the standard neural

network, in a non-convolutional neural network.

Remember that one step before the prop was something like this, right?

z1 = w1 times a0, a 0 was also equal to x,

and then plus b[1].

And you apply the non-linearity to get a[1], so that's g(z[1]).

So this input here, in this analogy this is a[0], this is x3.

2:44

And these filters here,

this plays a role similar to w1.

And you remember during the convolution operation, you were taking

these 27 numbers, or really well, 27 times 2, because you have two filters.

You're taking all of these numbers and multiplying them.

So you're really computing a linear function to get this 4 x 4 matrix.

So that 4 x 4 matrix, the output of the convolution operation,

that plays a rolesimilar to w1 times a0.

That's really maybe the output of this 4 x 4 as well as that 4 x 4.

And then the other thing you do is add the bias.

So, this thing here before applying value,

this plays a role similar to z.

And then it's finally by applying the non-linearity, this kind of this I guess.

So, this output plays a role,

this really becomes your activation at the next layer.

So this is how you go from a0 to a1, as far as tthe linear operation and

then convolution has all these multipled.

So the convolution is really applying a linear operation and

you have the biases and the applied value operation.

And you've gone from a 6 by 6 by 3, dimensional a0,

through one layer of neural network to,

I guess a 4 by 4 by 2 dimensional a(1).

And so 6 by 6 by 3 has gone to 4 by 4 by 2, and so

that is one layer of convolutional net.

4:33

Now in this example we have two filters, so we had two features of you will,

which is why we wound up with our output 4 by 4 by 2.

But if for example we instead had 10 filters instead of 2,

then we would have wound up with the 4 by 4 by 10 dimensional output volume.

Because we'll be taking 10 of these naps not just two of them, and stacking them

up to form a 4 by 4 by 10 output volume, and that's what a1 would be.

So, to make sure you understand this, let's go through an exercise.

Let's suppose you have 10 filters, not just two filters, that are 3 by 3 by 3 and

1 layer of a neural network, how many parameters does this layer have?

5:21

Well, let's figure this out.

Each filter, is a 3 x 3 x 3 volume, so 3 x 3 x 3,

so each fill has 27 parameters, all right?

There's 27 numbers to be run, and plus the bias.

5:50

And then if you imagine that on the previous slide we had drawn two filters,

but now if you imagine that you actually have ten of these, right?

1, 2..., 10 of these,

then all together you'll have 28 times 10,

so that will be 280 parameters.

Notice one nice thing about this, is that no matter how big the input image is,

the input image could be 1,000 by 1,000 or 5,000 by 5,000,

but the number of parameters you have still remains fixed as 280.

And you can use these ten filters to detect features, vertical edges,

horizontal edges maybe other features anywhere even in a very,

very large image is just a very small number of parameters.

6:40

So these is really one property of convolution neural network that

makes less prone to overfitting then if you could.

So once you've learned 10 feature detectors that work,

you could apply this even to large images.

And the number of parameters still is fixed and relatively small,

as 280 in this example.

All right, so to wrap up this video let's just summarize the notation we

are going to use to describe one layer to describe a covolutional layer in

a convolutional neural network.

So layer l is a convolution layer,

l am going to use f superscript,[l] to denote the filter size.

So previously we've been seeing the filters are f by f, and

now this superscript square bracket l just denotes that this is

a filter size of f by f filter layer l.

And as usual the superscript square bracket l is the notation we're using to

refer to particular layer l.

7:39

going to use p[l] to denote the amount of padding.

And again, the amount of padding can also be specified just by saying that you want

a valid convolution, which means no padding, or

a same convolution which means you choose the padding.

So that the output size has the same height and width as the input size.

8:03

Now, the input to this layer is going to be some dimension.

It's going be some n by n by number of channels in the previous layer.

Now, I'm going to modify this notation a little bit.

I'm going to us superscript l- 1,

because that's the activation from

the previous layer, l- 1 times nc of l- 1.

And in the example so far, we've been just using images of the same height and width.

That in case the height and width might differ,

l am going to use superscript h and superscript w, to denote the height and

width of the input of the previous layer, all right?

So in layer l, the size of the volume will be nh

by nw by nc with superscript squared bracket l.

It's just in layer l, the input to this layer Is whatever you had for

the previous layer, so that's why you have l- 1 there.

And then this layer of the neural network will itself output the value.

So that will be nh of l by nw of l, by nc of l,

that will be the size of the output.

And so whereas we approve this set that the output volume size or

at least the height and weight is given by this formula,

n + 2p- f over s + 1, and then take the full of that and round it down.

In this new notation what we have is that the outputs value that's in layer l,

is going to be the dimension from the previous layer,

plus the padding we're using in this layer l,

minus the filter size we're using this layer l and so on.

And technically this is true for the height, right?

So the height of the output volume is given by this, and you can compute it

with this formula on the right, and the same is true for the width as well.

So you cross out h and

throw in w as well, then the same formula with either the height or

the width plugged in for computing the height or width of the output value.

10:36

So that's how nhl -1 relates to nhl and wl- 1 relates to nwl.

Now, how about the number of channels, where did those numbers come from?

Let's take a look, if the output volume has this depth,

while we know from the previous examples that that's equal

to the number of filters we have in that layer, right?

So we had two filters, the output value was 4 by 4 by 2, was 2 dimensional.

And if you had 10 filters and your upper volume was 4 by 4 by 10.

So, this the number of channels in the output value,

that's just the number of filters we're using in this layer of the neural network.

Next, how about the size of this filter?

Well, each filter is going to be fl by fl by 100 number, right?

So what is this last number?

Well, we saw that you needed to convolve a 6 by 6 by 3 image,

with a 3 by 3 by 3 filter.

11:43

And so the number of channels in your filter, must match the number of channels

in your input, so this number should match that number, right?

Which is why each filter is going to be f(l) by f(l) by nc(l-1).

And the output of this layer often apply devices in non-linearity,

is going to be the activations of this layer al.

And that we've already seen will be this dimension, right?

The al will be a 3D volume,

that's nHl by nwl by ncl.

And when you are using a vectorized implementation or batch gradient

descent or mini batch gradient descent, then you actually outputs Al,

which is a set of m activations, if you have m examples.

So that would be M by nHl, by nwl by ncl right?

If say you're using bash grading decent and

in the programming sizes this will be ordering of the variables.

And we have the index and the trailing examples first,

and then these three variables.

Next how about the weights or the parameters, or kind of the w parameter?

Well we saw already what the filter dimension is.

So the filters are going to be f[l] by f[l] by nc [l- 1],

but that's the dimension of one filter.

How many filters do we have?

Well, this is a total number of filters, so

the weights really all of the filters put together will have dimension given

by this, times the total number of filters, right?

Because this, Last quantity is the number of

filters, In layer l.

13:45

And then finally, you have the bias parameters, and

you have one bias parameter, one real number for each filter.

So you're going to have, the bias will have this many variables,

it's just a vector of this dimension.

Although later on we'll see that the code will be

more convenient represented as 1 by 1 by 1 by nc[l]

four dimensional matrix, or four dimensional tensor.

14:16

So I know that was a lot of notation, and

this is the convention I'll use for the most part.

I just want to mention in case you search online and look at open source code.

There isn't a completely universal standard convention about the ordering of

height, width, and channel.

So If you look on source code on GitHub or these open source implementations,

you'll find that some authors use this order instead, where you first put

the channel first, and you sometimes see that ordering of the variables.

And in fact in some common frameworks, actually in multiple common frameworks,

there's actually a variable or a parameter.

Why do you want to list the number of channels first, or

list the number of channels last when indexing into these volumes.

I think both of these conventions work okay, so long as you're consistent.

And unfortunately maybe this is one piece of annotation where

there isn't consensus in the deep learning literature but

i'm going to use this convention for these videos.

15:24

Where we list height and width and then the number of channels last.

So I know there was certainly a lot of new notations you could use, but you're

thinking wow, that's a long notation, how do I need to remember all of these?

Don't worry about it, you don't need to remember all of this notation, and

through this week's exercises you become more familiar with it at that time.

But the key point I hope you take a way from this video,

is just one layer of how convolutional neural network works.

And the computations involved in taking the activations of one layer and

mapping that to the activations of the next layer.

And next, now that you know how one layer of the compositional neural network works,

let's stack a bunch of these together to actually form a deeper compositional

neural network.

Let's go on to the next video to see,