The basic R-CNN model was rapidly improved.

In this section, I will explain the changes that were introduced to R-CNN,

leading up to the Fast R-CNN model.

The base CNN classifier used in the R-CNN model, due to its fully connected layers,

requires an input image with a fixed resolution of 224 by 224 pixels.

So, after object proposals are extracted,

they must be re-scaled to this fixed resolution,

but such scaling changes object appearance and increases the variability of images.

However, we can adapt any CNN classifier to

various image resolutions by replacing the last pooling layer

with a Spatial Pyramid Pooling, or SPP, layer.

The idea of Spatial Pyramid Pooling is

derived from the traditional Bag of Visual Words algorithm.

The spatial pyramid is constructed on top of the region of interest.

The first level of the pyramid is the region of interest itself.

On the second level, the region is divided into four cells on a two by two grid.

On the third level, the region is divided into 16 cells on a four by four grid.

Max pooling is applied to each cell.

So, if the last convolutional layer has 256 maps,

then pooling in each cell produces one vector of length 256.

The feature vectors for all cells are concatenated,

and then passed as input to the fully connected layer.

Thus, we obtain a fixed-length feature representation

for input images of various resolutions.
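The pyramid construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not the SPP-net implementation: the function name, the use of max pooling per cell, and the level sizes (1, 2, 4) follow the description above, and the input shapes are made up for the example.

```python
import numpy as np

def spp_pool(feature_map, levels=(1, 2, 4)):
    """Spatial Pyramid Pooling over a feature map of any spatial size.

    feature_map: array of shape (C, H, W), e.g. C = 256 maps from the
    last convolutional layer. For each pyramid level n, the map is
    divided into an n x n grid of cells; max pooling over each cell
    yields one C-dimensional vector, and all vectors are concatenated.
    """
    C, H, W = feature_map.shape
    pooled = []
    for n in levels:
        # cell boundaries for an n x n grid (assumes H, W >= max level)
        ys = np.linspace(0, H, n + 1).astype(int)
        xs = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))  # one length-C vector
    return np.concatenate(pooled)

# Two inputs of different spatial sizes yield the same fixed length:
# (1 + 4 + 16) cells * 256 maps = 5376 values.
print(spp_pool(np.random.rand(256, 13, 17)).shape)  # (5376,)
print(spp_pool(np.random.rand(256, 24, 10)).shape)  # (5376,)
```

Whatever the input resolution, the output length depends only on the number of cells and the number of maps, which is exactly what lets the fully connected layers accept it.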

Now, we can apply the convolutional layers of the base CNN only once per input image.

For each window, we apply Spatial Pyramid Pooling,

and then compute the fully connected layers for feature extraction.

Compare this to computing convolutional features for each of the 2,000 object proposals.

This leads to a dramatic increase in processing speed.

In the Fast R-CNN paper,

Girshick proposed the Region of Interest Pooling layer, or RoI pooling layer.

It is a simplified version of the SPP layer:

there is only one pyramid level.
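The single-level idea can be sketched the same way: divide each region of interest into one fixed grid and max-pool every cell. This is an illustrative sketch, not the paper's implementation; the grid size of 7, the RoI coordinate format, and all shapes are assumptions for the example.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=7):
    """Region of Interest pooling: a one-level spatial pyramid.

    feature_map: (C, H, W) convolutional features of the whole image.
    roi: (x0, y0, x1, y1) in feature-map coordinates. The RoI is split
    into an output_size x output_size grid and max pooling is applied
    in each cell, so every RoI, whatever its size, yields a tensor of
    shape (C, output_size, output_size).
    """
    C = feature_map.shape[0]
    x0, y0, x1, y1 = roi
    ys = np.linspace(y0, y1, output_size + 1).astype(int)
    xs = np.linspace(x0, x1, output_size + 1).astype(int)
    out = np.empty((C, output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            # guarantee each cell spans at least one feature-map unit
            y_lo, y_hi = ys[i], max(ys[i + 1], ys[i] + 1)
            x_lo, x_hi = xs[j], max(xs[j + 1], xs[j] + 1)
            out[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out

fm = np.random.rand(256, 38, 50)            # features computed once per image
pooled = roi_pool(fm, roi=(10, 5, 30, 20))  # reused for every proposal
print(pooled.shape)  # (256, 7, 7)
```

Because the image features are computed once and each proposal only runs this cheap pooling step, all 2,000 proposals share one forward pass through the convolutional layers.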

In Fast R-CNN, in addition to Region of Interest Pooling,

two more modifications were introduced.

First, a softmax classifier is used instead of the SVM classifier.

Second, multi-task training is used to train

the classifier and the bounding box regressor simultaneously.

Fast R-CNN works as follows.

The input image and a set of object proposals are supplied to the neural network.

The neural network produces a convolutional feature map.

From the convolutional feature map,

feature vectors are extracted using the Region of Interest Pooling layer.

Then, the feature vectors are fed into a sequence of fully connected layers.

The output of the fully connected layers is branched into a K-way softmax output,

and K by four real-valued bounding box coordinate outputs.

Fast R-CNN can be trained with a multi-task loss.

This multi-task loss is the weighted sum of the classification loss,

and the bounding box regression loss.

For the true class u, log loss is used.

The smooth L1 loss is used for the bounding box regression.
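A minimal sketch of this loss, assuming a single RoI: the log-loss and smooth-L1 terms follow the description above, while the weighting factor `lam`, the background convention (class 0 gets no box loss), and the toy numbers are assumptions for illustration.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 per coordinate: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def multitask_loss(class_probs, true_class, box_pred, box_target, lam=1.0):
    """Weighted sum of log loss for the true class u and the smooth L1
    bounding-box regression loss. The regression term is applied only
    for object classes (u > 0), not for background."""
    cls_loss = -np.log(class_probs[true_class])          # log loss
    box_loss = (smooth_l1(box_pred - box_target).sum()
                if true_class > 0 else 0.0)              # smooth L1
    return cls_loss + lam * box_loss

probs = np.array([0.1, 0.7, 0.2])  # softmax output over background + 2 classes
loss = multitask_loss(probs, true_class=1,
                      box_pred=np.array([0.2, 0.1, 0.4, 0.3]),
                      box_target=np.array([0.0, 0.0, 0.5, 0.5]))
print(round(loss, 4))
```

Because both terms are differentiable in the network outputs, the classifier branch and the regression branch can be trained simultaneously by backpropagation.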

Because there is no separate SVM training,

Fast R-CNN allows end-to-end training,

is much faster, and avoids intensive writes and reads to the hard drive.

It was also empirically demonstrated

that detector precision improves with multi-task learning.

During R-CNN training,

128 Regions of Interest are sampled at random from the training set for each mini-batch.

These examples are most likely to come from different images.

But for Fast R-CNN,

when we use different images in one batch,

the computations become expensive for each window,

because in Fast R-CNN convolutional features are extracted from the whole image,

and the receptive field for Region of Interest Pooling is very large.

In the worst case, it can be the entire image.

So for each example,

we need to compute convolutional features for the entire image.

If each example comes from a different image,

then feature extraction in Fast R-CNN is much slower than in R-CNN.

So the following compromise is made.

First, a small number of images is sampled, for example two,

and many examples are taken from each image, for example 64.

In this case, the feature computations are shared across 64 examples,

and the training of Fast R-CNN becomes much faster compared to the original R-CNN.
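The sampling compromise above can be sketched as follows. This is an illustrative toy, not the training code: the dataset structure (a dict mapping image id to a list of candidate RoIs) and all names are hypothetical.

```python
import random

def sample_minibatch(dataset, n_images=2, rois_per_image=64):
    """Hierarchical sampling as described above: pick a few images,
    then many RoIs from each, so the convolutional features of one
    image are shared by all of its sampled RoIs.

    dataset: dict mapping image id -> list of candidate RoIs
    (a hypothetical structure, for illustration only).
    """
    images = random.sample(list(dataset), n_images)
    batch = []
    for img in images:
        rois = dataset[img]
        picks = random.sample(rois, min(rois_per_image, len(rois)))
        batch.extend((img, roi) for roi in picks)
    return batch

# toy dataset: 10 images with 100 candidate RoIs each
data = {f"img{i}": list(range(100)) for i in range(10)}
mb = sample_minibatch(data)
print(len(mb), len({img for img, _ in mb}))  # 128 RoIs drawn from 2 images
```

With two images per mini-batch, only two whole-image forward passes are needed for all 128 examples, instead of up to 128.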

Training and test times for Fast R-CNN are lower than those of R-CNN and SPP-net.

The accuracy of Fast R-CNN is also higher.

The one remaining weak point of Fast R-CNN

is its dependency on an external hypothesis generation method.

From the table, you can see that most of the test time is

attributed to Selective Search for proposal generation.