0:00

In the last video, you learned about the sliding windows object detection algorithm using a convnet, but we saw that it was too slow. In this video, you'll learn how to implement that algorithm convolutionally. Let's see what this means.

To build up toward the convolutional implementation of sliding windows, let's first see how you can turn the fully connected layers in your neural network into convolutional layers. We'll do that first on this slide, and then on the next slide we'll use the ideas from this slide to show you the convolutional implementation.

So let's say that your object detection algorithm inputs 14 by

14 by 3 images. This is quite small, just for illustrative purposes. Let's say it then uses 5 by 5 filters, 16 of them, to map the input from 14 by 14 by 3 to 10 by 10 by 16, then does a 2 by 2 max pooling to reduce it to 5 by 5 by 16, then has a fully connected layer with 400 units, then another fully connected layer, and finally outputs Y using a softmax unit.

In order to make the change we'll need in a second, I'm going to change this picture a little bit: instead, I'm going to view Y as 4 numbers, corresponding to the class probabilities of the 4 classes that the softmax unit is classifying amongst. The 4 classes could be pedestrian, car, motorcycle, and background, or something else.

Now, what I'd like to do is show how

these layers can be turned into convolutional layers. We draw the convnet the same as before for the first few layers. Now, one way of implementing this next layer, the fully connected layer, is as a 5 by 5 convolution; let's use 400 of these 5 by 5 filters. If you take the 5 by 5 by 16 volume and convolve it with a 5 by 5 filter (remember, a 5 by 5 filter is really implemented as 5 by 5 by 16, because our convention is that the filter looks across all 16 channels, so the two 16s must match), then the output will be 1 by 1. And if you have 400 of these 5 by 5 by 16 filters, the output dimension is going to be 1 by 1 by 400. So rather than viewing these 400 units as just a set of nodes, we're going to view them as a 1 by 1 by 400 volume. Mathematically, this is the same as a fully connected layer, because each of these 400 nodes has a filter of dimension 5 by 5 by 16, and so each of those 400 values is some arbitrary linear function of the 5 by 5 by 16 activations from the previous layer.

Next, to implement the following layer, we're going to

implement a 1 by 1 convolution. If you have 400 1 by 1 filters, then the next layer will again be 1 by 1 by 400, and that gives you the equivalent of this next fully connected layer. Then, finally, we have another 1 by 1 filter, followed by a softmax activation, so as to give a 1 by 1 by 4 volume to take the place of the 4 numbers that the network was outputting. So this shows how you can take fully connected layers and implement them using convolutional layers, so that these sets of units are instead implemented as 1 by 1 by 400 and 1 by 1 by 4 volumes.
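None of the code below is from the video, but the claim that the 5 by 5 convolution over a 5 by 5 by 16 volume is mathematically the same as a fully connected layer is easy to check numerically. Here is a small NumPy sketch with random weights (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Previous-layer activations: a 5 x 5 x 16 volume.
a = rng.standard_normal((5, 5, 16))

# 400 filters, each 5 x 5 x 16 (each filter spans all 16 input channels).
W = rng.standard_normal((400, 5, 5, 16))

# Convolutional view: a "valid" convolution of the whole volume with one
# 5 x 5 x 16 filter yields a single number, so 400 filters give 1 x 1 x 400.
conv_out = np.array([np.sum(a * W[k]) for k in range(400)])

# Fully connected view: flatten the volume and multiply by a 400 x 400 matrix
# built from the same weights.
fc_out = W.reshape(400, -1) @ a.reshape(-1)

# The two are identical: each of the 400 outputs is the same arbitrary
# linear function of the 5 x 5 x 16 activations.
assert np.allclose(conv_out, fc_out)
print(conv_out.shape)  # (400,)
```

The only difference from a real layer is the missing bias and nonlinearity, which apply identically in both views.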

Armed with this conversion, let's see how you can have a convolutional implementation of sliding windows object detection. The presentation on this slide is based on the OverFeat paper, referenced at the bottom, by Pierre Sermanet, David Eigen, Xiang Zhang, Michael Mathieu, Rob Fergus, and Yann LeCun.

Let's say that your sliding windows convnet inputs 14 by 14 by 3 images. Again, I'm just using small numbers like a 14 by 14 image on this slide to make the numbers and illustrations simpler. As before, you have a neural network as follows that eventually outputs a 1 by 1 by 4 volume, which is the output of your softmax unit. And to simplify the drawing: the 14 by 14 by 3 input is technically a volume, as are the 10 by 10 by 16 and subsequent layers, but for this slide I'll draw only the front face of these volumes. So instead of drawing, say, the 1 by 1 by 400 volume, I'll draw just the 1 by 1 parts of all of these, dropping the 3D component of the drawings, just for this slide.

So let's say that your convnet inputs 14 by 14 images, or 14

by 14 by 3 images, and your test set image is 16 by 16 by 3, so I've now added that yellow stripe to the border of this image. In the original sliding windows algorithm, you might input the blue region into a convnet and run that once to generate a classification, 0 or 1. Then, using a stride of 2 pixels, you might slide the window to the right by 2 pixels to input this green rectangle into the convnet and rerun the whole convnet to get another label, 0 or 1. Then you might input this orange region into the convnet and run it one more time to get another label, and then do it a fourth and final time with this lower-right purple square. So to run sliding windows on this 16 by 16 by 3 image, this pretty small image, you run the convnet from above 4 times in order to get 4 labels.

But it turns out a lot of the computation done by these 4 convnets is highly duplicated. What the convolutional implementation of sliding windows does is allow these 4 forward passes of the convnet

to share a lot of computation. Specifically, here's what you can do. You can take the convnet and just run it on the whole image with the same parameters, the same 16 5 by 5 filters, and now you get a 12 by 12 by 16 output volume. Then do the max pooling, same as before, and now you have 6 by 6 by 16. Run that through your same 400 5 by 5 filters to get a 2 by 2 by 400 volume: so now, instead of a 1 by 1 by 400 volume, you have a 2 by 2 by 400 volume. Running that through the 1 by 1 filters gives you another 2 by 2 by 400 instead of 1 by 1 by 400. Do that one more time, and now you're left with a 2 by 2 by 4 output volume instead of 1 by 1 by 4.

It turns out that this blue 1 by 1 by 4 subset gives you the result of running the convnet on the upper-left 14 by 14 region of the image; this upper-right 1 by 1 by 4 volume gives you the upper-right result; the lower-left gives you the result of running the convnet on the lower-left 14 by 14 region; and the lower-right 1 by 1 by 4 volume gives you the same result as running the convnet on the lower-right 14 by 14 region. And if you step through all the steps of the

calculation, say for the green example: if you had cropped out just this region and passed it through the convnet on top, then the first layer's activations would have been exactly this region, the next layer's activations after the max pooling would have been exactly this region, and the following layers would have been as well. So what this convolutional implementation does is this: instead of forcing you to run forward propagation on four subsets of the input image independently, it combines all four into one forward computation and shares a lot of the computation in the regions of the image that are common to all four of the 14 by 14 patches we saw here.
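To make that sharing claim concrete, here is a NumPy sketch (not from the video; all names are hypothetical) of this tiny convnet run once on the full 16 by 16 image and again on 14 by 14 crops. The corresponding 1 by 1 by 4 slices of the big output match the per-crop outputs:

```python
import numpy as np

def conv_valid(x, W):
    """Valid convolution: x is (H, W, C), W is (F, k, k, C) -> (H-k+1, W-k+1, F)."""
    k = W.shape[1]
    H, Wd, _ = x.shape
    out = np.empty((H - k + 1, Wd - k + 1, W.shape[0]))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + k, j:j + k, :]
            out[i, j] = np.tensordot(W, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

def maxpool2(x):
    """2 x 2 max pooling with stride 2."""
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

def convnet(x, params):
    """The lecture's layer stack (biases/nonlinearities omitted for brevity)."""
    W1, W2, W3, W4 = params
    a = maxpool2(conv_valid(x, W1))  # 5x5 conv to 16 channels, then pool
    a = conv_valid(a, W2)            # "FC" layer as a 5x5 conv -> 400
    a = conv_valid(a, W3)            # 1x1 conv -> 400
    return conv_valid(a, W4)         # 1x1 conv -> 4 (pre-softmax scores)

rng = np.random.default_rng(0)
params = (rng.standard_normal((16, 5, 5, 3)),
          rng.standard_normal((400, 5, 5, 16)),
          rng.standard_normal((400, 1, 1, 400)),
          rng.standard_normal((4, 1, 1, 400)))

img = rng.standard_normal((16, 16, 3))
big = convnet(img, params)                      # one shared pass: (2, 2, 4)
crop_ul = convnet(img[0:14, 0:14, :], params)   # blue, upper-left window
crop_ur = convnet(img[0:14, 2:16, :], params)   # green, shifted right by 2

assert big.shape == (2, 2, 4)
assert np.allclose(big[0, 0], crop_ul[0, 0])
assert np.allclose(big[0, 1], crop_ur[0, 0])
```

The windows here are offset by 2 pixels, matching the 2 by 2 pooling, which is why the pooled grids of the crops line up exactly with the full image's.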

Now let's just go through a bigger example. Let's say you now want to run sliding windows on a 28 by 28 by 3 image. It turns out that if you run forward prop the same way, you end up with an 8 by 8 by 4 output, and this corresponds to running sliding windows with a 14 by 14 window: the first output corresponds to running the convnet on the upper-left 14 by 14 region, then using a stride of 2 to shift one window over, one window over, and so on, through eight positions, which gives you this first row; and then as you go down the image as well, that gives you all of these 8 by 8 by 4 outputs. And the stride is 2 because, with the max pooling of 2, this corresponds to running your neural network with a stride of 2 on the original image.
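The output sizes quoted above follow from simple arithmetic on the layer shapes. A hypothetical helper (not from the lecture) that traces the spatial size for any square input:

```python
def sliding_windows_output(n):
    """Spatial size of the final k x k x 4 volume for an n x n x 3 input."""
    n = n - 5 + 1   # 5x5 conv, valid padding, stride 1
    n = n // 2      # 2x2 max pooling, stride 2
    n = n - 5 + 1   # 5x5 conv (the "FC" layer) -> 400 channels
                    # the two 1x1 convs leave the spatial size unchanged
    return n

print(sliding_windows_output(14))  # 1 -> 1x1x4, a single window
print(sliding_windows_output(16))  # 2 -> 2x2x4, four windows at stride 2
print(sliding_windows_output(28))  # 8 -> 8x8x4, 64 windows at stride 2
```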

So just to recap: to implement sliding windows, previously what you do is crop out a region, let's say this one is 14 by 14, and run that through your convnet; then do that for the next region over, then the next 14 by 14 region, then the next one, and so on, until hopefully one of them recognizes the car. But now, instead of doing it sequentially, with the convolutional implementation that you saw on the previous slide, you can input the entire image, maybe 28 by 28, and convolutionally make all the predictions at the same time with one forward pass through this big convnet, and have it recognize the position of the car.

So that's how you implement sliding windows convolutionally, and it makes the whole thing much more efficient. Now, this algorithm still has one weakness, which is that the position of the bounding boxes is not going to be too accurate. In the next video, let's see how you can fix that problem.
