Step by step - Understand the architecture of the Region-based Convolutional Neural Network (R-CNN)

9 min read · Sep 5, 2020


This blog explains the evolution of object detection models in simple words and self-explanatory diagrams. It can help anyone who is entering the field of computer vision and data science, or who has taken up a project that requires solving an object detection problem.

We have all heard about Faster R-CNN, and there is a high chance that you found this blog by searching for the keyword “Faster R-CNN”, as it has been among the state-of-the-art models used in many fields since January 2016.

A strong object detection architecture like Faster R-CNN is built upon successful earlier research like R-CNN and Fast R-CNN. To truly enjoy working with these models, troubleshooting them, and pursuing the dream of creating your own model which can one day be called state of the art, my friend, I would always recommend reading the research papers in chronological order.

Therefore I have tried my best to explain all three architectures so that we do not miss out on the basics. One day we will build our own architecture and contribute to the field we are passionate about.

I will also talk about the stories of my struggles while reading these papers and the attitude we must have to read fearlessly. To keep the spirit high, I want to share the quote which kept me motivated throughout reading the papers and finally writing this blog for you all:

“Reading is important, because if you can read, you can learn anything about everything and everything about anything.” - Tomie dePaola

Region-based CNN (R-CNN)

R-CNN (Regions with CNN features) was introduced in the paper Rich feature hierarchies for accurate object detection and semantic segmentation, published in 2014.

How did it all start?

In 2004 and 2005, papers like Distinctive Image Features from Scale-Invariant Keypoints and Histograms of Oriented Gradients for Human Detection described the use of SIFT and HOG features to address visual recognition tasks. But if we look at performance on the canonical visual recognition task, the PASCAL VOC object detection challenge, it is generally acknowledged that progress was slow during 2010–2012, with small gains obtained by building ensemble systems and employing minor variants of successful methods.

Then, in 2012, ImageNet Classification with Deep Convolutional Neural Networks made its mark in the field of CNNs by showing substantially higher image classification accuracy on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Its success resulted from training a large CNN on 1.2 million labelled images, together with a few twists on the CNN (e.g., max(x, 0) rectifying nonlinearities and “dropout” regularization).

The significance of the ImageNet result was vigorously debated during the ILSVRC 2012 workshop. The central issue can be distilled to the following: To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?

The R-CNN paper answers this question by bridging the gap between image classification and object detection. This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features.

A challenge faced in detection is that labelled data is scarce and the amount currently available is insufficient for training a large CNN.

The conventional solution to this problem is to use unsupervised pre-training, followed by supervised fine-tuning. The second principal contribution of this paper is to show that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data is scarce. In the experiments, fine-tuning for detection improved mAP by 8 percentage points. After fine-tuning, R-CNN achieved an mAP of 54% on VOC 2010, compared to 33% for the highly tuned, HOG-based deformable part model (DPM).

The authors of R-CNN also mention the paper DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, which uses the CNN as a black-box feature extractor and yields excellent performance on several recognition tasks including scene classification, fine-grained sub-categorization, and domain adaptation.

As R-CNN operates on regions, it is natural to extend it to the task of semantic segmentation. With minor modifications, the authors also achieved competitive results on the PASCAL VOC segmentation task, with an average segmentation accuracy of 47.9% on the VOC 2011 test set.

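SIFT and HOG are blockwise orientation histograms, a representation that can be loosely associated with the first stage of the primate visual pathway. But we know that recognition occurs several stages downstream, which suggests that there might be hierarchical, multi-stage processes for computing features that are even more informative for visual recognition.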

To achieve this result, R-CNN focused on two problems: localizing objects with a deep network and training a high-capacity CNN model with only a small quantity of annotated detection data. To maintain high spatial resolution, the CNNs used in sliding-window detectors at the time typically had only two convolutional and pooling layers.

Units high up in the R-CNN network, which has five convolutional layers, have very large receptive fields (195 x 195 pixels) and strides (32 x 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.
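The 195 x 195 receptive field and 32 x 32 stride can be reproduced with a little arithmetic. The sketch below is only an illustration: the layer list (kernel sizes and strides) is my assumption of an AlexNet-style stack up to the last pooling layer and is not given in this post, and the function name receptive_field is hypothetical.

```python
# (name, kernel_size, stride) for the conv/pool stack up to the last pooling layer.
# ASSUMPTION: an AlexNet-style configuration; these values are not listed in this post.
LAYERS = [
    ("conv1", 11, 4),
    ("pool1", 3, 2),
    ("conv2", 5, 1),
    ("pool2", 3, 2),
    ("conv3", 3, 1),
    ("conv4", 3, 1),
    ("conv5", 3, 1),
    ("pool5", 3, 2),
]


def receptive_field(layers):
    """Return (receptive field, stride) of one top-level unit, in input pixels."""
    rf, jump = 1, 1  # jump = distance in input pixels between neighbouring units
    for _name, kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) jumps
        jump *= stride             # strides multiply as we go deeper
    return rf, jump


rf, stride = receptive_field(LAYERS)
print(rf, stride)  # 195 32
```

With a unit that sees 195 pixels and moves 32 pixels at a time, a sliding window cannot place a box edge precisely, which is why the authors looked for another way to localize.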

R-CNN solves the CNN localization problem by operating within the “recognition using regions” paradigm, which has been successful for both object detection and semantic segmentation.

How does R-CNN work during training?

The diagram explains the first four steps of R-CNN:
1. Selective search for region proposal
2. Image warping
3. Use of a supervised pre-trained CNN
4. Domain-specific training

Step-1 Region Proposal using the Selective Search Algorithm

R-CNN generates around 2000 category-independent region proposals for the input image. These regions are proposed using the selective search algorithm.

Selective Search is a method for finding a large set of possible object locations in an image, independent of the class of the actual object. It works by clustering image pixels into segments and then performing hierarchical clustering to combine segments from the same object into object proposals.

The images in the first row show the regions proposed by the selective search algorithm as segments, and the images in the second row show the bounding box corresponding to each segment in the first row.

Selective search uses multiple image processing algorithms to output these proposals.

Input - RGB image

Output - Around 2000 proposed regions, covering both background and object classes
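To make the "cluster, then merge hierarchically" idea concrete, here is a toy sketch of the grouping loop. It is heavily simplified and not the real algorithm: the actual selective search starts from a graph-based over-segmentation and combines several similarity measures, while this sketch assumes the initial segments are already given as (box, mean intensity, area) tuples and merges the touching pair with the closest mean intensity. All function names are hypothetical.

```python
from itertools import combinations


def union_box(a, b):
    """Smallest box (x1, y1, x2, y2) that contains both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def touches(a, b):
    """True if the two boxes overlap or share an edge."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def hierarchical_grouping(segments):
    """segments: list of (box, mean_intensity, area). Returns all proposal boxes."""
    regions = list(segments)
    proposals = [box for box, _, _ in regions]  # every initial segment is a proposal
    while len(regions) > 1:
        # Similarity here is just the intensity difference (a simplification).
        pairs = [
            (abs(regions[i][1] - regions[j][1]), i, j)
            for i, j in combinations(range(len(regions)), 2)
            if touches(regions[i][0], regions[j][0])
        ]
        if not pairs:
            break  # nothing left that is adjacent
        _, i, j = min(pairs)
        (box_i, col_i, area_i), (box_j, col_j, area_j) = regions[i], regions[j]
        area = area_i + area_j
        merged = (union_box(box_i, box_j), (col_i * area_i + col_j * area_j) / area, area)
        regions = [r for k, r in enumerate(regions) if k not in (i, j)] + [merged]
        proposals.append(merged[0])  # every merge adds one larger proposal
    return proposals


segments = [
    ((0, 0, 10, 10), 0.90, 100),   # bright patch
    ((10, 0, 20, 10), 0.85, 100),  # similar bright patch next to it
    ((0, 10, 20, 20), 0.20, 200),  # dark strip below
]
for box in hierarchical_grouping(segments):
    print(box)
```

The two bright patches are merged first, and the dark strip joins last, so the proposals range from small parts to the whole image. That mix of scales is what lets the proposals stay independent of the object class.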

Step-2 Formatting the proposed regions

The above image shows the different object proposal transformations: (A) the original object proposal at its actual scale, (B) the tightest square with context, (C) the tightest square without context, (D) the warped proposed region.

These regions are converted to bounding boxes. These bounding boxes carry precise information about the pixel locations of the proposed regions in the image.

Regardless of the size or aspect ratio of the candidate region, the authors warped all pixels in a tight bounding box around it to the required size (227 x 227 pixels). This simple technique (affine image warping) computes a fixed-size CNN input from each region proposal, regardless of the region’s shape.

Input - Around 2000 proposed regions, covering both background and object classes

Output - 2000 warped images of fixed size
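The warping step can be sketched in a few lines. This is a minimal illustration under my own assumptions: the image is a plain list of rows, the box is (x1, y1, x2, y2) with exclusive right and bottom edges, and pixels are picked with nearest-neighbour sampling, which is a simplification and not necessarily the interpolation the authors used. The name warp_region is hypothetical.

```python
def warp_region(image, box, out_size):
    """Stretch the pixels inside `box` to an out_size x out_size grid.

    The aspect ratio is deliberately NOT preserved: a wide box and a tall box
    both come out as the same square, which is what "warping" means here.
    """
    x1, y1, x2, y2 = box
    height, width = y2 - y1, x2 - x1
    warped = []
    for row in range(out_size):
        src_y = y1 + row * height // out_size      # nearest source row
        warped_row = []
        for col in range(out_size):
            src_x = x1 + col * width // out_size   # nearest source column
            warped_row.append(image[src_y][src_x])
        warped.append(warped_row)
    return warped


# A tiny 4 x 6 "image" whose pixel value encodes its position (10 * row + column).
image = [[10 * r + c for c in range(6)] for r in range(4)]
patch = warp_region(image, box=(1, 0, 5, 2), out_size=4)  # a 4-wide, 2-tall box
for row in patch:
    print(row)
```

Here the 4 x 2 region becomes 4 x 4 by repeating each source row twice. In R-CNN the same idea is applied with an output size of 227 x 227.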

Step-3 Supervised pre-training

Back then, domain-specific annotated datasets and computational capabilities were limited, so the authors adopted a method called supervised pre-training: they pre-trained the CNN on a large auxiliary dataset (ILSVRC2012 classification) using image-level annotations only (bounding box labels were not available for this data). This can be thought of as a way to start from pre-trained weights.

Input - A classification dataset whose features are similar to those of the object classes used in domain-specific training, and which has an ample amount of data to train on.

Output - A CNN that, given an image, classifies it according to the object it contains.

Step-4 Domain-Specific training

After supervised pre-training, the next step was domain-specific fine-tuning, where the pre-trained CNN was trained further on the fixed-size warped regions (around 2000 per image).

Each warped region is forward propagated through the CNN. During forward propagation, the convolutional neural network extracts features from each proposed region. So for every input image, 2000 warped images enter the CNN sequentially (one at a time). Now one can imagine how long the network takes to process a single image.

The CNN turns each warped region into a fixed-length (4096-dimensional) feature vector. Each feature vector is then classified using a set of class-specific linear support vector machines (SVMs).

To fine-tune the CNN, they used stochastic gradient descent (SGD) with a starting learning rate of 0.001. During training, they treated all region proposals (warped images) with at least 0.5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives.
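The IoU rule used for fine-tuning labels is easy to write down. The sketch below assumes boxes are (x1, y1, x2, y2) tuples and ground truths are (box, class name) pairs; the function names iou and finetune_label are my own and not from the paper.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def finetune_label(proposal, ground_truths, threshold=0.5):
    """Class of the best-overlapping ground-truth box, or 'background'.

    ground_truths: list of (box, class_name) pairs for one image.
    """
    best_iou, best_class = 0.0, "background"
    for box, class_name in ground_truths:
        overlap = iou(proposal, box)
        if overlap > best_iou:
            best_iou, best_class = overlap, class_name
    return best_class if best_iou >= threshold else "background"


ground_truths = [((0, 0, 10, 10), "cat")]
print(finetune_label((0, 0, 10, 8), ground_truths))   # IoU 0.8 -> cat
print(finetune_label((5, 0, 15, 10), ground_truths))  # IoU 1/3 -> background
```

Note that a proposal covering only part of the cat still counts as a positive here as long as its IoU reaches 0.5. This loose definition gives the CNN many more positive examples to learn from.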

Backpropagation

In each SGD iteration, they uniformly sample (sampling is the process of picking data from a pool) 32 positive windows (over all classes) and 96 background windows to construct a mini-batch of size 128. They biased the sampling towards positive windows because they are extremely rare compared to the background.
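The biased 32 + 96 mini-batch can be sketched as below. The window lists and the function name sample_minibatch are hypothetical, and falling back to sampling with replacement when an image pool has too few windows is my assumption, not something the paper specifies.

```python
import random


def sample_minibatch(positives, backgrounds, n_pos=32, n_bg=96, seed=None):
    """Build one mini-batch of (window, label) pairs: label 1 = positive, 0 = background."""
    rng = random.Random(seed)

    def draw(pool, k):
        if len(pool) >= k:
            return rng.sample(pool, k)   # uniform, without replacement
        return rng.choices(pool, k=k)    # ASSUMPTION: reuse windows if the pool is small

    batch = [(w, 1) for w in draw(positives, n_pos)]
    batch += [(w, 0) for w in draw(backgrounds, n_bg)]
    rng.shuffle(batch)
    return batch


# Positives are rare: here only 40 of 2000 windows overlap an object.
positives = [f"pos_{i}" for i in range(40)]
backgrounds = [f"bg_{i}" for i in range(1960)]

batch = sample_minibatch(positives, backgrounds, seed=0)
print(len(batch), sum(label for _, label in batch))  # 128 32
```

In this example positives are 2% of the windows but 25% of the batch, so the network sees objects far more often than it would with plain uniform sampling.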

Step-5 Refining the CNN's classification results using support vector machines (SVMs) and bounding box regression

Once the domain-specific CNN is trained, it can already classify region proposals. However, the authors found that classifying with the CNN's own output layer did not deliver precision and recall as high as separate classifiers trained on its features. So, to make the model more robust, SVMs were trained. The SVMs were trained only after the domain-specific CNN had been trained to a satisfactory performance.

To train the SVMs, the 2000 regions proposed by selective search were warped and sent through the trained domain-specific CNN, which extracted a feature vector from each warped region. These feature vectors (the output of the CNN) were used as input to train the SVMs. One linear SVM is trained per class. The labels come from the ground-truth annotations rather than from the CNN's predictions: the ground-truth boxes of a class are the positives for that class's SVM, proposals whose IoU overlap with every instance of the class is below 0.3 are the negatives, and proposals in between are ignored. Example: a ground-truth “cat” box is a positive for the cat SVM, while a proposal that overlaps no cat by 0.3 IoU or more is a negative for it.
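In the R-CNN paper the SVM labels are stricter than the fine-tuning labels: only the ground-truth boxes of a class are positives, proposals below 0.3 IoU with every instance of that class are negatives, and everything in between is left out. Here is a small sketch of that rule for a single class. The function names are hypothetical, and for readability it returns boxes, whereas a real pipeline would store the CNN feature vector of each box.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def svm_training_set(proposals, class_gt_boxes, neg_threshold=0.3):
    """Return (box, label) examples for ONE class's SVM (1 = positive, 0 = negative)."""
    examples = [(gt, 1) for gt in class_gt_boxes]  # positives: ground truth only
    for proposal in proposals:
        best = max((iou(proposal, gt) for gt in class_gt_boxes), default=0.0)
        if best < neg_threshold:
            examples.append((proposal, 0))         # clearly not this class
        # Proposals with IoU >= neg_threshold are ignored: neither positive nor negative.
    return examples


cat_boxes = [(0, 0, 10, 10)]
proposals = [
    (0, 0, 10, 8),       # IoU 0.8 with the cat: ignored
    (8, 0, 18, 10),      # IoU about 0.11: negative
    (50, 50, 60, 60),    # no overlap: negative
]
for example in svm_training_set(proposals, cat_boxes):
    print(example)
```

The same function would be called once per class with that class's ground-truth boxes, giving one training set per SVM.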

Alongside the class-specific SVMs, a class-specific bounding box regressor is also trained. The input to the regressor is the feature vectors computed by the domain-specific CNN. The regressor helps predict tighter bounding boxes.

SVM input - Feature vectors extracted by the trained domain-specific CNN, labelled per class.

SVM output - A per-class score for each feature vector
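What the regressor learns to predict is a small set of offsets between a proposal and its ground-truth box. The sketch below shows the target encoding described in the R-CNN paper (centre shifts scaled by the proposal size, and log ratios of width and height) and how predicted offsets are applied back to a proposal. The helper names are hypothetical, boxes are assumed to be (x1, y1, x2, y2), and the regressor itself (which maps CNN features to these offsets) is left out.

```python
import math


def to_center(box):
    """(x1, y1, x2, y2) -> (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2, y1 + h / 2, w, h


def regression_targets(proposal, ground_truth):
    """Offsets (tx, ty, tw, th) the regressor should output for this pair."""
    px, py, pw, ph = to_center(proposal)
    gx, gy, gw, gh = to_center(ground_truth)
    return ((gx - px) / pw, (gy - py) / ph, math.log(gw / pw), math.log(gh / ph))


def apply_offsets(proposal, offsets):
    """Turn a proposal plus predicted offsets into a refined (x1, y1, x2, y2) box."""
    px, py, pw, ph = to_center(proposal)
    tx, ty, tw, th = offsets
    cx, cy = px + pw * tx, py + ph * ty
    w, h = pw * math.exp(tw), ph * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


proposal = (10, 10, 50, 50)
ground_truth = (12, 8, 60, 52)
targets = regression_targets(proposal, ground_truth)
print(targets)
print(apply_offsets(proposal, targets))  # recovers the ground-truth box
```

Scaling by the proposal's width and height keeps the targets comparable across small and large boxes, so one regressor per class can handle objects of any size.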

What happens when testing the R-CNN model?

Step-1 Selective search is run on the test image to extract around 2000 region proposals.

Step-2 Each proposal is warped and forward propagated through the domain-specific CNN to compute its features.

Step-3 Then, for each class, the authors score each extracted feature vector using the SVM trained for that class. Given all scored regions in an image, they apply greedy non-maximum suppression (for each class independently), which rejects a region if it has an intersection-over-union (IoU) overlap with a higher-scoring selected region larger than a learned threshold.
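Greedy non-maximum suppression from Step-3 can be sketched as follows for a single class. The threshold value of 0.3 used here is only a placeholder I picked for the example (the text says the threshold is learned), and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.3):
    """Greedy NMS for ONE class. detections: list of (box, score) pairs."""
    kept = []
    # Visit detections from the highest score to the lowest.
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


cat_detections = [
    ((1, 1, 11, 11), 0.8),    # near-duplicate of the best box
    ((0, 0, 10, 10), 0.9),    # best-scoring cat
    ((20, 20, 30, 30), 0.7),  # a second cat elsewhere
]
for detection in non_max_suppression(cat_detections):
    print(detection)
```

The near-duplicate box is dropped because its IoU with the best box is about 0.68, while the second cat survives because it does not overlap the first at all. Running this per class is what keeps a cat box from suppressing a dog box in the same place.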


I hope I was able to help you understand how R-CNN works. I have put some effort into creating self-explanatory diagrams. I searched for and read about R-CNN in many sources, but always found a very high-level diagram with only brief explanations. Since so much work has built on this model, many people read and write about it, and I have tried my best to lay it out clearly on this page. Admittedly, this model is not used these days, but if you want to build your own model someday, I would highly recommend reading the early research papers. You can read about other models like Fast R-CNN, Faster R-CNN, YOLO v1, v2, v3 and the Single Shot Detector in my next blogs.






Computer Vision contributor. Lead Data Scientist @ Love Data Science.