In the last video,

you learned about the sliding windows

object detection algorithm using a convnet but we saw that it was too slow.

In this video, you'll learn how to implement that algorithm convolutionally.

Let's see what this means.

To build up towards the convolutional implementation of sliding windows, let's first see

how you can turn the fully connected layers in a neural network into convolutional layers.

We'll do that first on this slide and then the next slide,

we'll use the ideas from this slide to show you the convolutional implementation.

So let's say that your object detection algorithm inputs 14 by 14 by 3 images.

This is quite small but just for illustrative purposes,

and let's say it then uses 5 by 5 filters,

and let's say it uses 16 of them to map it from 14 by 14 by 3 to 10 by 10 by 16.

And then does a 2 by 2 max pooling to reduce it to 5 by 5 by 16.

Then it has a fully connected layer that connects to 400 units,

then another fully connected layer, and then finally outputs Y using a softmax unit.
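As a quick sanity check on these dimensions, here is a minimal sketch (the helper names are mine, not from the lecture) that traces the spatial size through the layers just described, assuming "valid" convolutions with stride 1 and 2 by 2 max pooling with stride 2:

```python
def conv_out(n, f):
    """Spatial size after a 'valid' f x f convolution with stride 1."""
    return n - f + 1

def pool_out(n, p):
    """Spatial size after p x p max pooling with stride p."""
    return n // p

n = conv_out(14, 5)   # 14x14x3 input, 5x5 conv with 16 filters -> 10x10x16
m = pool_out(n, 2)    # 2x2 max pool -> 5x5x16
```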

In order to make the change we'll need in a second,

I'm going to change this picture a little bit, and instead I'm

going to view Y as four numbers,

corresponding to the class probabilities of

the four classes that the softmax unit classifies amongst.

And the four classes could be pedestrian,

car, motorcycle, and background, or something else.

Now, what I'd like to do is show how

these layers can be turned into convolutional layers.

So, the convnet is drawn the same as before for the first few layers.

And now, one way of implementing this next layer,

the fully connected layer, is as a 5 by

5 filter, and let's use 400 of these 5 by 5 filters.

So if you take a 5 by 5 by 16 image and convolve it with a 5 by 5 filter, remember,

a 5 by 5 filter is implemented as 5 by 5 by

16 because our convention is that the filter looks across all 16 channels.

So this 16 and this 16 must match and so the outputs will be 1 by 1.

And if you have 400 of these 5 by 5 by 16 filters,

then the output dimension is going to be 1 by 1 by 400.

So rather than viewing these 400 as just a set of nodes,

we're going to view this as a 1 by 1 by 400 volume.

Mathematically, this is the same as a fully connected layer

because each of these 400 nodes has a filter of dimension 5 by 5 by 16.

So each of those 400 values is

some arbitrary linear function of these 5 by 5 by 16 activations from the previous layer.
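To see concretely why the two views are mathematically the same, here is a small numpy sketch (the variable names are mine) checking that 400 filters of size 5 by 5 by 16 applied to a 5 by 5 by 16 volume compute exactly the same numbers as flattening the volume and multiplying by a 400-row weight matrix built from the same filters:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((5, 5, 16))        # previous layer's activations
w = rng.standard_normal((400, 5, 5, 16))   # 400 filters, each 5x5x16

# Convolving the full 5x5x16 volume with one 5x5x16 filter yields a single
# number, so 400 filters yield a 1x1x400 volume (here, a length-400 vector).
conv_result = np.einsum('hwc,fhwc->f', a, w)

# The equivalent fully connected layer: flatten the activations, then
# multiply by a (400, 5*5*16) weight matrix made of the same filter values.
fc_result = w.reshape(400, -1) @ a.reshape(-1)
```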

Next, to implement the next convolutional layer,

we're going to implement a 1 by 1 convolution.

If you use 400 of these 1 by 1 filters,

then the next layer will again be 1 by 1 by 400.

So that gives you this next fully connected layer.

And then finally, we're going to have another 1 by 1 filter,

followed by a softmax activation.

So as to give a 1 by 1 by 4 volume

to take the place of the four numbers that the network was outputting.

So this shows how you can take these fully connected layers

and implement them using convolutional layers, so

that these sets of units are instead implemented

as 1 by 1 by 400 and 1 by 1 by 4 volumes.

After this conversion, let's see how you

can have a convolutional implementation of sliding windows object detection.

The presentation on this slide is based on the OverFeat paper,

referenced at the bottom, by Pierre Sermanet,

David Eigen, Xiang Zhang,

Michael Mathieu, Robert Fergus and Yann LeCun.

Let's say that your sliding windows convnet inputs 14 by 14 by 3 images and again,

I'm just using small numbers like the 14 by 14 image

in this slide mainly to make the numbers and illustrations simpler.

So as before, you have a neural network as

follows that eventually outputs a 1 by 1 by 4 volume,

which is the output of your softmax.

Again, to simplify the drawing here,

the 14 by 14 by 3 input is technically a volume, as are the 10 by 10 by 16 and

5 by 5 by 16 volumes that follow.

But to simplify the drawing for this slide,

I'm just going to draw the front face of each volume.

So instead of drawing the 1 by 1 by 400 volume,

I'm just going to draw the 1 by 1 front face of it.

So I've just dropped the third dimension from these drawings, just for this slide.

So let's say that your convnet inputs 14 by 14 images, or 14 by

14 by 3 images, and your test set image is 16 by 16 by 3.

So I've now added that yellow stripe to the border of this image.

In the original sliding windows algorithm,

you might want to input the blue region into

a convnet and run that once to generate a classification, 0 or 1, and then slide it down a bit;

let's say it uses a stride of two pixels, and then you might slide it to the right by

two pixels to input

this green rectangle into the convnet and

rerun the whole convnet to get another label, 0 or 1.

Then you might input

this orange region into the convnet and run it one more time to get another label.

And then do it the fourth and final time with this lower right purple square.

To run sliding windows on this 16 by 16 by 3 image, a pretty small image,

you run this convnet four times in order to get four labels.

But it turns out a lot of this computation

done by these four convnets is highly duplicative.

So what the convolutional implementation of sliding windows does is it allows

these four passes through the convnet to share a lot of computation.

Specifically, here's what you can do.

You can take the convnet and just run it with the same parameters,

the same 16 5 by 5 filters,

on the whole 16 by 16 by 3 image.

Now, you get a 12 by 12 by 16 output volume.

Then do the max pool, same as before.

Now you have a 6 by 6 by 16 volume;

run that through your same 400 5 by 5 filters to get now a 2 by 2 by 400 volume.

So now instead of a 1 by 1 by 400 volume,

we have instead a 2 by 2 by 400 volume.

Run it through a 1 by 1 convolution, which gives

you another 2 by 2 by 400 volume instead of 1 by 1 by 400.

Do that one more time and now you're left with a

2 by 2 by 4 output volume instead of 1 by 1 by 4.

It turns out that this blue 1 by 1 by 4 subset gives

you the result of running the convnet on the upper left 14 by 14 region of the image.

This upper right 1 by 1 by 4 volume gives you the upper right result.

The lower left gives you the results of

implementing the convnet on the lower left 14 by 14 region.

And the lower right 1 by 1 by 4 volume gives you the same result

as running the convnet on the lower right 14 by 14 region.
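This corner-by-corner correspondence can be checked numerically. Below is a sketch with my own naive numpy helpers (linear layers only, no nonlinearities or biases, purely for illustration) showing that the 2 by 2 by 4 output on a 16 by 16 image reproduces the result of running the same convnet on each 14 by 14 crop:

```python
import numpy as np

def conv_valid(x, w):
    """'Valid' convolution (cross-correlation), stride 1.
    x: (H, W, C_in), w: (f, f, C_in, C_out) -> (H-f+1, W-f+1, C_out)."""
    f, c_out = w.shape[0], w.shape[3]
    H, W, _ = x.shape
    out = np.empty((H - f + 1, W - f + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.einsum('hwc,hwco->o', x[i:i+f, j:j+f], w)
    return out

def maxpool2(x):
    """2x2 max pooling with stride 2 (H and W assumed even)."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def net(x, params):
    w1, w2, w3, w4 = params
    a = maxpool2(conv_valid(x, w1))  # 5x5 conv, 16 filters, then 2x2 max pool
    a = conv_valid(a, w2)            # "FC" layer as a 5x5 conv, 400 filters
    a = conv_valid(a, w3)            # "FC" layer as a 1x1 conv, 400 filters
    return conv_valid(a, w4)         # output layer as a 1x1 conv, 4 filters

rng = np.random.default_rng(0)
params = (rng.standard_normal((5, 5, 3, 16)),
          rng.standard_normal((5, 5, 16, 400)),
          rng.standard_normal((1, 1, 400, 400)),
          rng.standard_normal((1, 1, 400, 4)))

img = rng.standard_normal((16, 16, 3))
full = net(img, params)              # 2x2x4: all four window results at once
crop = net(img[:14, :14], params)    # the upper left 14x14 window on its own
```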

And if you step through all the steps of the calculation,

let's look at the green example,

if you had cropped out just this region

and passed it through the convnet on top,

then the first layer's activations would have been exactly this region.

The next layer's activations after max pooling would have been

exactly this region, and then the next layers

would have followed in the same way.

So what this process does,

what this convolutional implementation does is,

instead of forcing you to run forward propagation

on four subsets of the input image independently,

it combines all four into one forward computation and shares

a lot of the computation in the regions of the image that are common,

that is, all four of the 14 by 14 patches we saw here.

Now let's just go through a bigger example.

Let's say you now want to run sliding windows on a 28 by 28 by 3 image.

It turns out that if you run forward prop

the same way, then you end up with an 8 by 8 by 4 output.

This corresponds to running sliding windows with that 14 by 14 region,

first on the upper left region, thus

giving you the output corresponding to the upper left corner.

Then using a stride of two, you shift one window over,

one window over, one window over and so on, across the eight positions.

So that gives you this first row and then as you go down the image as well,

that gives you all of these 8 by 8 by 4 outputs.

Because of the max pooling of two, this corresponds to

running your neural network with a stride of two on the original image.
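As a quick check of this stride-of-two claim (the helper name is mine): the count of 14 by 14 windows at stride 2 in a 28 by 28 image matches the spatial size that falls out of the convolutional arithmetic:

```python
def num_windows(image_size, window, stride):
    """Number of window positions along one spatial dimension."""
    return (image_size - window) // stride + 1

grid = num_windows(28, 14, 2)   # 14x14 windows at stride 2 in a 28x28 image

# The same number falls out of the layer arithmetic:
n = 28 - 5 + 1    # 5x5 conv -> 24
n = n // 2        # 2x2 max pool -> 12
n = n - 5 + 1     # 5x5 "fully connected" conv -> 8 (1x1 convs keep 8x8)
```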

So just to recap,

to implement sliding windows,

previously, what you do is you crop out a region.

Let's say this is 14 by 14

and run that through your convnet and do that for the next region over,

then do that for the next 14 by 14 region,

then the next one, then the next one,

then the next one, then the next one and so on,

until hopefully that one recognizes the car.

But now, instead of doing it sequentially,

with this convolutional implementation that you saw in the previous slide,

you can pass in the entire image,

maybe 28 by 28, and convolutionally make all the predictions at

the same time with one forward pass through this big convnet,

and hopefully have it recognize the position of the car.

So that's how you implement sliding windows

convolutionally and it makes the whole thing much more efficient.

Now, this algorithm still has one weakness,

which is the position of the bounding boxes is not going to be too accurate.

In the next video,

let's see how you can fix that problem.