If you look at the object detection literature,

there's a set of ideas called region proposals

that's been very influential in computer vision as well.

I wanted to make this video optional because I tend to use

the region proposal set of algorithms a bit less often, but nonetheless,

it has been an influential body of work

and an idea that you might come across in your own work.

Let's take a look. So if you recall the sliding windows idea,

you would take a trained classifier and run it

across all of these different windows and run the detector to see if there's a car,

pedestrian, or maybe a motorcycle.

Now, you could run the algorithm convolutionally,

but one downside of the algorithm is that it just

classifies a lot of regions where there's clearly no object.

So this rectangle down here is pretty much blank.

There's clearly nothing interesting there to classify,

and maybe it was also running on this rectangle,

which looks like there's nothing that interesting there.

So what Ross Girshick, Jeff Donahue, Trevor Darrell,

and Jitendra Malik proposed in the paper,

as cited at the bottom of the slide,

is an algorithm called R-CNN,

which stands for Regions with convolutional networks or regions with CNNs.

And what that does is it tries to pick

just a few regions where it makes sense to run your ConvNet classifier.

So rather than running your sliding windows on every single window,

you instead select just a few windows and

run your ConvNet classifier on just those few windows.

The way that they perform

the region proposals is to run an algorithm called a segmentation algorithm,

that results in this output on the right,

in order to figure out what could be objects.

So, for example, the segmentation algorithm finds a blob over here.

And so you might pick that bounding box and say,

"Let's run a classifier on that blob."

It looks like this little green thing finds a blob there,

so you might also run the classifier on

that rectangle to see if there's something interesting there.

And in this case,

this blue blob, if you run a classifier on that,

hopefully you find the pedestrian,

and if you run it on this light cyan blob,

maybe you'll find a car, maybe not,

I'm not sure. So the details of this,

this is called a segmentation algorithm,

and what you do is you find maybe 2000 blobs and place bounding

boxes around about 2000 blobs and run your classifier on just those 2000 blobs,

and this can be a much smaller number of positions

on which to run your ConvNet classifier,

than if you have to run it at every single position throughout the image.

And this is especially the case if you are running your ConvNet

not just on square-shaped regions but running it on

tall skinny regions to try to find pedestrians, or running it on

wide fat regions to try to find cars, and running it at multiple scales as well.
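
To make the blob-to-bounding-box step concrete, here is a minimal Python sketch. It assumes the segmentation output is available as a 2D grid of integer segment labels, with 0 meaning background. The names `boxes_from_segments` and `sliding_window_count` and the toy grid are hypothetical, made up for illustration, and this is not the specific segmentation method used in the paper.

```python
def boxes_from_segments(label_map):
    """Return {label: (top, left, bottom, right)} for each nonzero segment label."""
    boxes = {}
    for r, row in enumerate(label_map):
        for c, label in enumerate(row):
            if label == 0:  # assumed convention: 0 means background
                continue
            if label not in boxes:
                boxes[label] = (r, c, r, c)
            else:
                top, left, bottom, right = boxes[label]
                boxes[label] = (min(top, r), min(left, c),
                                max(bottom, r), max(right, c))
    return boxes


def sliding_window_count(height, width, window, stride):
    """Number of square window positions a plain sliding-window pass would visit."""
    rows = (height - window) // stride + 1
    cols = (width - window) // stride + 1
    return rows * cols


# Toy 5x6 "segmentation" with two blobs (labels 1 and 2).
toy_segments = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 0],
    [0, 0, 0, 0, 2, 0],
    [0, 0, 0, 0, 2, 0],
]
proposals = boxes_from_segments(toy_segments)
dense_positions = sliding_window_count(5, 6, window=2, stride=1)
```

On this toy grid the classifier would run on 2 proposed boxes, one tall and skinny and one square, rather than on the 20 positions of a 2x2 window at stride 1.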

So that's the R-CNN, or regions with CNNs,

or regions with CNN features, idea.

Now, it turns out the R-CNN algorithm is still quite slow.

So there's been a line of work to explore how to speed up this algorithm.

So the basic R-CNN algorithm will propose regions using

some algorithm and then classify the proposed regions one at a time.

And for each of the regions,

it will output a label.

So is there a car? Is there a pedestrian?

Is there a motorcycle there?

And then it also outputs a bounding box,

so you can get an accurate bounding box if indeed there is an object in that region.

So just to be clear,

the R-CNN algorithm doesn't just trust the bounding box it was given.

It also outputs a bounding box,

b_x, b_y, b_h, b_w,

in order to get a more accurate bounding box than whatever happened

to surround the blob that the image segmentation algorithm gave it.

So it can get pretty accurate bounding boxes.
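
As a rough sketch of that loop, the Python below classifies proposed regions one at a time and returns a label plus a refined box for each. Everything here is an assumption for illustration: `toy_classifier` is a hypothetical stand-in for the ConvNet (it just finds the tight box around nonzero pixels), and (b_x, b_y) is taken to be the box center with b_h and b_w the height and width.

```python
def crop(image, box):
    """Cut out the pixels inside box = (top, left, bottom, right), inclusive."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def toy_classifier(patch):
    """Hypothetical stand-in for the ConvNet.

    Returns (label, (b_x, b_y, b_h, b_w)) in patch coordinates, where the box is
    simply the tight box around nonzero pixels. A real network learns both outputs.
    """
    hits = [(r, c) for r, row in enumerate(patch) for c, v in enumerate(row) if v]
    if not hits:
        return "background", None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    b_h = max(rows) - min(rows) + 1
    b_w = max(cols) - min(cols) + 1
    b_y = min(rows) + b_h / 2  # box center (assumed convention)
    b_x = min(cols) + b_w / 2
    return "object", (b_x, b_y, b_h, b_w)


def rcnn_style_detect(image, proposals, classifier=toy_classifier):
    """Classify proposed regions one at a time; keep label plus refined box."""
    detections = []
    for top, left, bottom, right in proposals:
        label, box = classifier(crop(image, (top, left, bottom, right)))
        if box is None:  # nothing found in this region
            continue
        b_x, b_y, b_h, b_w = box
        # Shift the refined box from patch coordinates back to image coordinates.
        detections.append((label, (b_x + left, b_y + top, b_h, b_w)))
    return detections


toy_image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 9, 0],
    [0, 0, 9, 9, 0],
    [0, 0, 0, 0, 0],
]
# One loose proposal around the object, one proposal over empty background.
detections = rcnn_style_detect(toy_image, [(0, 1, 3, 4), (0, 0, 3, 0)])
```

The first proposal is a loose 4x4 box, but the returned box is the tighter 2x2 one, which is the point of not trusting the proposed box as given. The second proposal yields no detection.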

Now, one downside of the R-CNN algorithm was that it is actually quite slow.

So over the years,

there have been a few improvements to the R-CNN algorithm.

Ross Girshick proposed the Fast R-CNN algorithm,

and it's basically the R-CNN algorithm but with

a convolutional implementation of sliding windows.

So the original implementation would actually classify the regions one at a time.

So Fast R-CNN uses a convolutional implementation of sliding windows,

and this is roughly similar to the idea you saw in the fourth video of this week.

And that speeds up R-CNN quite a bit.
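
One way to picture where the speedup comes from is the Python sketch below: the convolutional computation runs once over the whole image, and each region is then read off the shared feature map instead of being pushed through the network separately. This is only a toy under stated assumptions. `conv_features` is a hypothetical 2x2 sum filter standing in for the learned convolutional layers, and the plain maximum in `region_max_pool` only gestures at the region pooling step of the real algorithm.

```python
def conv_features(image):
    """Hypothetical stand-in for the ConvNet body: a 2x2 sum filter, stride 1."""
    h, w = len(image), len(image[0])
    return [
        [image[r][c] + image[r][c + 1] + image[r + 1][c] + image[r + 1][c + 1]
         for c in range(w - 1)]
        for r in range(h - 1)
    ]


def region_max_pool(feature_map, box):
    """Summarize one region of the shared feature map by its maximum value."""
    top, left, bottom, right = box
    return max(v for row in feature_map[top:bottom + 1]
               for v in row[left:right + 1])


def fast_rcnn_style_features(image, proposals):
    """Run the convolutional layers once, then read each region off the shared map."""
    feature_map = conv_features(image)  # computed a single time for all regions
    return [region_max_pool(feature_map, box) for box in proposals]


toy_image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 9, 0],
    [0, 0, 9, 9, 0],
    [0, 0, 0, 0, 0],
]
# Boxes are in feature-map coordinates (top, left, bottom, right), inclusive.
region_features = fast_rcnn_style_features(toy_image, [(0, 0, 2, 0), (0, 1, 2, 3)])
```

However many regions are proposed, `conv_features` is called only once here, whereas the one-at-a-time approach would repeat the expensive part for every region.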