When designing a layer for a ConvNet, you might have to pick,

do you want a 1 by 1 filter,

or 3 by 3, or 5 by 5,

or do you want a pooling layer?

What the inception network does is it says,

why shouldn't you do them all?

And this makes the network architecture more complicated,

but it also works remarkably well.

Let's see how this works.

Let's say for the sake of example that you have inputted a

28 by 28 by 192 dimensional volume.

So what the inception network or what an inception layer says is,

instead of choosing what filter size you want in a Conv layer,

or even whether you want a convolutional layer or a pooling layer,

let's do them all. So you can use a 1 by 1 convolution,

and that will output a 28 by 28 by something.

Let's say 28 by 28 by 64 output,

and you just have a volume there.

But maybe you also want to try a 3 by 3, and that might output a 28 by 28 by 128.

And then what you do is just stack up this second volume next to the first volume.

And to make the dimensions match up,

let's make this a same convolution.

So the output dimension is still 28 by 28,

same as the input dimension in terms of height and width.

But 28 by 28 by in this example 128.

And maybe you might say well I want to hedge my bets.

Maybe a 5 by 5 filter works better.

So let's do that too and have that output a 28 by 28 by 32.

And again you use the same convolution to keep the dimensions the same.

And maybe you don't want a convolutional layer.

Let's apply pooling, and that has some other output and let's stack that up as well.

And here pooling outputs 28 by 28 by 32.

Now in order to make all the dimensions match,

you actually need to use padding for max pooling.

So this is an unusual form of pooling, because if you want

the input to have a height and width of 28 by 28 and have the output

match the dimensions of everything else, also 28 by 28,

then you need to use same padding as well as a stride of one for pooling.

So this detail might seem a bit funny to you now,

but let's keep going.

And we'll make this all work later.
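
To see why same padding with a stride of one keeps the pooling output at 28 by 28, here is a minimal Python sketch of the standard output-size formula, floor((n + 2p - f) / s) + 1. The 3 by 3 pooling window and the function names are assumptions made for illustration; the lecture does not state the window size at this point.

```python
def output_size(n, f, padding, stride):
    """Output height/width for an n x n input, f x f window, given padding and stride."""
    return (n + 2 * padding - f) // stride + 1

def same_padding(f):
    """Padding that preserves size at stride 1 (valid for odd window sizes)."""
    return (f - 1) // 2

pool_window = 3  # assumed window size, not given in the lecture

# Same padding with stride 1: height and width stay at 28.
same_pool = output_size(28, pool_window, same_padding(pool_window), stride=1)

# For contrast, a typical 2 x 2 pooling with stride 2 and no padding halves the size.
plain_pool = output_size(28, 2, padding=0, stride=2)

print(same_pool, plain_pool)  # 28 14
```

Only the first setting produces a volume that can be stacked next to the convolution outputs.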

But with an inception module like this,

you can input some volume and output another.

In this case, I guess if you add up all these numbers,

32 plus 32 plus 128 plus 64,

that's equal to 256.

So you will have one inception module input 28 by 28 by 192,

and output 28 by 28 by 256.
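
The channel arithmetic above can be sketched in a few lines of Python. The branch names and the helper function are hypothetical, chosen only for illustration; the channel counts are the ones quoted in this example.

```python
# Hypothetical branch table: branch name -> number of output channels.
branch_channels = {"conv1x1": 64, "conv3x3": 128, "conv5x5": 32, "maxpool": 32}

def inception_output_shape(height, width, branches):
    """Shape after concatenating size-preserving branches along the channel axis."""
    # Every branch keeps height and width, so only the channel counts add up.
    return (height, width, sum(branches.values()))

output_shape = inception_output_shape(28, 28, branch_channels)
print(output_shape)  # (28, 28, 256)
```

The concatenation only works because every branch preserves the 28 by 28 height and width.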

And this is the heart of the inception network which is due

to Christian Szegedy, Wei Liu,

Yangqing Jia, Pierre Sermanet,

Scott Reed, Dragomir Anguelov, Dumitru Erhan,

Vincent Vanhoucke and Andrew Rabinovich.

And the basic idea is that instead of

you needing to pick one of these filter sizes or pooling you want and committing to that,

you can do them all and just concatenate all the outputs,

and let the network learn whatever parameters it wants to use,

whatever the combinations of these filter sizes it wants.

Now it turns out that there is a problem

with the inception layer as we've described it here,

which is computational cost.

On the next slide,

let's figure out what's the computational cost of this 5 by 5 filter resulting

in this block over here.

So just focusing on the 5 by 5 part on the previous slide,

we had as input a 28 by 28 by 192 block,

and you implement a 5 by 5 same convolution of 32 filters to output 28 by 28 by 32.

On the previous slide I had drawn this as a thin purple slice.

So I'm just going to draw this as a more normal looking blue block here.

So let's look at the computational cost of outputting this 28 by 28 by 32.

So you have 32 filters because the output has 32 channels,

and each filter is going to be 5 by 5 by 192.

And so the output size is 28 by 28 by 32,

and so you need to compute 28 by 28 by 32 numbers.

And for each of them you need to do this many multiplications, right?

5 by 5 by 192.

So the total number of multiplies you need

is the number of multiplies you need to compute each

of the output values times the number of output values you need to compute.

And if you multiply all of these numbers,

this is equal to 120 million.
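
As a quick check of that product, here is a small Python sketch. The function name and its parameters are assumptions for illustration; it counts one f by f by in_channels product per output value, exactly as described above.

```python
def conv_multiplies(out_h, out_w, n_filters, f, in_channels):
    """Multiplications for a convolution: outputs times multiplies per output."""
    outputs = out_h * out_w * n_filters   # number of output values to compute
    per_output = f * f * in_channels      # multiplies needed for each output value
    return outputs * per_output

# 5 by 5 same convolution, 192 input channels, 32 filters, 28 by 28 output.
direct_cost = conv_multiplies(28, 28, 32, 5, 192)
print(f"{direct_cost:,}")  # 120,422,400
```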

And so, while you can do 120 million multiplies on a modern computer,

this is still a pretty expensive operation.

On the next slide you see how using the idea of 1 by 1 convolutions,

which you learnt about in the previous video,

you'll be able to reduce the computational cost by about a factor of 10,

going from about 120 million multiplies to about one tenth of that.

So please remember the number 120 so you can compare it

with what you see on the next slide, 120 million.

Here is an alternative architecture for inputting 28 by 28 by 192,

and outputting 28 by 28 by 32, which is the following.

You are going to input the volume,

use a 1 by 1 convolution to reduce the volume to 16 channels instead of 192 channels,

and then on this much smaller volume,

run your 5 by 5 convolution to give you your final output.

So notice the input and output dimensions are still the same.

You input 28 by 28 by 192 and output 28 by 28 by 32,

same as the previous slide.

But what we've done is we're taking this huge volume we had on the left,

and we shrunk it to this much smaller intermediate volume,

which only has 16 instead of 192 channels.

Sometimes this is called a bottleneck layer, right?

I guess because a bottleneck is usually the smallest part of something, right?

So I guess if you have a glass bottle that looks like this,

then you know this is I guess where the cork goes.

And then the bottleneck is the smallest part of this bottle.

So in the same way, the bottleneck layer is the smallest part of this network.

We shrink the representation before increasing the size again.

Now let's look at the computational costs involved.

To apply this 1 by 1 convolution,

we have 16 filters.

Each of the filters is going to be of dimension 1 by 1 by 192,

this 192 matches that 192.

And so the cost of computing this 28 by 28

by 16 volume is going to be, well,

you need this many outputs,

and for each of them you need to do 192 multiplications.

I could have written 1 times 1 times 192, right?

Which is this. And if you multiply this out,

this is 2.4 million,

it's about 2.4 million.

So that's the cost of this first convolutional layer.

How about the second?

The cost of this second convolutional layer would be, well,

you have this many outputs.

So 28 by 28 by 32.

And then for each of the outputs you have to apply a 5 by 5 by 16 dimensional filter.

And so you multiply by 5 by 5 by 16.

And if you multiply that out, it's equal to about 10.0 million.

And so the total number of multiplications you need to do is the sum of those

which is 12.4 million multiplications.

And you compare this with what we had on the previous slide,

you reduce the computational cost from about 120 million multiplies,

down to about one tenth of that,

to 12.4 million multiplications.
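
The comparison can be reproduced with a short Python sketch. The function and variable names are assumptions for illustration; the shapes are the ones used in this example (28 by 28 by 192 input, a 16-channel bottleneck, 28 by 28 by 32 output).

```python
def conv_multiplies(out_h, out_w, n_filters, f, in_channels):
    """Multiplications for a convolution: outputs times multiplies per output."""
    return (out_h * out_w * n_filters) * (f * f * in_channels)

# Direct 5 by 5 convolution from 192 channels to 32 channels.
direct = conv_multiplies(28, 28, 32, 5, 192)

# Bottleneck version: 1 by 1 down to 16 channels, then 5 by 5 up to 32 channels.
reduce_cost = conv_multiplies(28, 28, 16, 1, 192)
expand_cost = conv_multiplies(28, 28, 32, 5, 16)
bottleneck_total = reduce_cost + expand_cost

print(f"{reduce_cost:,}")       # 2,408,448
print(f"{expand_cost:,}")       # 10,035,200
print(f"{bottleneck_total:,}")  # 12,443,648
print(round(direct / bottleneck_total, 2))  # 9.68
```

The exact ratio is a little under 10, which is where the "about one tenth" figure comes from.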

And the number of additions you need to do is

very similar to the number of multiplications you need to do.

So that's why I'm just counting the number of multiplications.

So to summarize, if you are building a layer

of a neural network and you don't want to have to decide,

do you want a 1 by 1,

or 3 by 3, or 5 by 5, or pooling layer,

the inception module lets you say let's do them all,

and let's concatenate the results.

And then we run into the problem of computational cost.

And what you saw here was how using a 1 by 1 convolution,

you can create this bottleneck layer

thereby reducing the computational cost significantly.

Now you might be wondering,

does shrinking down the representation size so dramatically,

does it hurt the performance of your neural network?

It turns out that so long as you implement this bottleneck layer within reason,

you can shrink down the representation size significantly,

and it doesn't seem to hurt the performance,

but saves you a lot of computation.

So these are the key ideas of the inception module.

Let's put them together and in

the next video show you what the full inception network looks like.