From: Object detection using convolutional neural networks and transformer-based models: a review
Task | Method | Design | Highlights | Limitations |
---|---|---|---|---|
Image classification | ViT [46] | Encoder-only NLP-style transformer applied to images<br>Image patches are linearly embedded and combined with positional embeddings | Global self-attention applied over image patches<br>Convolution-free network<br>Outperforms ResNet<br>Attains excellent results while requiring substantially fewer computational resources for pre-training (roughly four times more efficient) | Requires large-scale pre-training (~300 M images)<br>Careful transfer learning needed for new tasks<br>Largest model (632 M parameters) needed for SOTA results |
Object detection | R-CNN [1] | Classifies resized and cropped region proposals with a CNN<br>Region-proposal bounding boxes refined by an SVM trained on CNN features | First effective region-based object detection model using CNNs<br>Allows custom region-proposal methods | Slow training and detection<br>Long training time: ~2000 region proposals must be classified per image<br>Selective search is hand-crafted, so the proposal stage involves no learning<br>Generates poor region proposals |
 | Fast R-CNN [2] | Edge boxes used to generate region proposals<br>Processes the entire image in a single pass, unlike R-CNN, which crops and resizes each region proposal | Pools CNN features belonging to each region proposal instead of classifying every region separately<br>More efficient than R-CNN because computation is shared among overlapping regions | Performance is limited by the quality of region-proposal identification<br>Region proposals are still estimated by an external method |
 | Faster R-CNN [3] | Replaces selective search with a separate region proposal network (RPN)<br>Predicted proposals pass through an RoI pooling layer that extracts fixed-length features from every region proposal, used to classify the proposed region and predict bounding-box offsets | Strong run-time performance<br>Improves on its predecessor in both run-time speed and raw accuracy<br>RPN is faster than selective search | RPN training samples all anchors (mini-batch size 256) from a single image<br>Slow network convergence |
 | DETR [28] | Linear projection layer reduces CNN feature dimensionality<br>Encoder-decoder with spatial positional embeddings added at each multi-head self-attention layer<br>Output positional encodings (object queries) are added to the decoder's multi-head self-attention layers<br>Trained with a Hungarian (bipartite-matching) loss | End-to-end training pipeline using a transformer for object detection<br>No manual post-processing stage | Long convergence time<br>Low detection accuracy for small objects |
 | D-DETR [29] | Deformable transformer with deformable attention layers providing sparse priors<br>Multi-scale attention | Better performance on small objects than DETR<br>Faster convergence than DETR | Best reported result (52.3 AP) relies on test-time augmentation |
 | ViT-FRCNN [18] | Transformers encode visual features while an RPN-style head produces detection outputs<br>Adds a detection network to ViT<br>The state of the input class token is passed through an MLP classification head | Large pre-training capacity<br>Fast fine-tuning<br>Reports superior performance on out-of-domain images and better performance on large objects<br>Avoids spurious over-detections<br>Investigates the relationship between self-attention and convolutional layers, and the limitations of CNNs | Long training time on a large-scale dataset (~300 M images)<br>Training from scratch is difficult on smaller datasets<br>GPU memory limitations |
 | YOLO [6] | Predicts bounding boxes directly<br>Estimates class probabilities for each bounding box | Faster than other object-detection algorithms<br>Better prediction accuracy and higher bounding-box IoU | Difficulty detecting small objects<br>Spatial constraints further limit small-object detection |
 | YOLOS [59] | Transformer blocks take the role of the CNN backbone in YOLO-style models<br>Extends ViT with a detector head that maps a generated sequence of detection representations to class and box predictions | 2D object detection accomplished in a pure sequence-to-sequence way with minimal added inductive biases<br>Encouraging performance<br>Significant preliminary outcomes | 150 epochs needed for transfer learning<br>Learns purely from visual representations |
 | Rank-DETR [70] | Rank-based design with prompt engineering<br>Rank-based loss calculation and matching cost for accurate localization ranking | Improves the state of the art<br>ResNet-50, Swin-T, and Swin-L backbones used to enhance localization accuracy<br>Higher AP at higher IoU thresholds | Rank-based design needs further exploration<br>Higher computing time |
Semantic segmentation | Mask R-CNN [8] | Predicts an object mask for each RoI while also recognizing the bounding box<br>Performs both semantic and instance segmentation | Simple to train and outperforms the state of the art<br>Adds only small overhead compared to Faster R-CNN<br>Generalizes easily | Processes still images, so it cannot exploit temporal details of objects<br>Fails to detect low-resolution objects |
 | ViT Segmenter [57] | Encoder projects image patches into a sequence of embeddings, which are then encoded by a transformer<br>Decoder: a mask transformer takes the encoder output and class embeddings and predicts segmentation masks | Transformer captures global context<br>Even a simple point-wise linear decoder applied to the patch encodings gives good results<br>Unified model for semantic, instance, and panoptic segmentation | Computationally expensive<br>Reducing patch size requires computing attention over longer sequences<br>Higher computing time |
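The ViT row's design (split the image into patches, linearly embed them, prepend a class token, add positional embeddings) can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the embedding dimension of 64 is an illustrative choice, and the random weights and positional embeddings stand in for learned parameters.

```python
import numpy as np

def patchify(image, patch):
    """Split an HxWxC image into non-overlapping patch x patch tiles,
    each flattened into a vector of length patch*patch*C."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    return (image[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, C)
            .transpose(0, 2, 1, 3, 4)
            .reshape(rows * cols, patch * patch * C))

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
patch, dim = 16, 64                          # illustrative sizes
tokens = patchify(image, patch)              # (196, 768) flattened patches
W_embed = rng.standard_normal((tokens.shape[1], dim)) * 0.02
x = tokens @ W_embed                         # linear patch embedding -> (196, 64)
cls = np.zeros((1, dim))                     # class token (learnable in ViT, zeros here)
x = np.concatenate([cls, x], axis=0)         # prepend class token -> (197, 64)
x = x + rng.standard_normal(x.shape) * 0.02  # random stand-in for positional embeddings
print(x.shape)  # (197, 64)
```

The resulting sequence of 197 tokens is what the transformer encoder's self-attention layers operate on; the class-token output is what heads like ViT-FRCNN's MLP classifier read.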
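The "fixed-length features extracted from every region proposal by RoI pooling" step in the Fast/Faster R-CNN rows can be sketched like this: the region under a proposal is divided into a fixed grid of bins and max-pooled per bin, so proposals of any size yield the same output shape. A single-channel toy example with assumed sizes (real implementations pool each channel of the feature map, typically into a 7x7 grid):

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool the feature-map region given by roi (x0, y0, x1, y1,
    in feature-map coordinates) into a fixed out_size x out_size grid."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # split the region into a roughly even out_size x out_size grid of bins
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    pooled = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            pooled[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return pooled

fm = np.arange(64, dtype=float).reshape(8, 8)  # toy single-channel feature map
pooled = roi_max_pool(fm, roi=(1, 2, 7, 8))    # 6x6 region -> fixed 2x2 output
print(pooled.shape)  # (2, 2)
```

Because every proposal is reduced to the same fixed-length vector, a single fully connected classification and box-regression head can serve all proposals, which is what makes the shared computation in Fast R-CNN possible.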
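DETR's Hungarian loss first finds a one-to-one matching between predictions and ground-truth objects that minimizes the total matching cost. A brute-force sketch with a made-up cost matrix (practical implementations use the Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`, since brute force is factorial in the number of objects):

```python
from itertools import permutations

def best_matching(cost):
    """Return the assignment of each ground-truth object (row) to a distinct
    prediction (column) with minimal total cost. Brute force over all
    permutations, so only suitable for a handful of objects."""
    n = len(cost)
    best, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best, best_cost = perm, total
    return best, best_cost

# toy example: cost[i][j] = cost of matching ground truth i to prediction j
# (in DETR this combines classification and box-regression terms)
cost = [
    [0.9, 0.1, 0.7],
    [0.2, 0.8, 0.6],
    [0.5, 0.4, 0.05],
]
match, total = best_matching(cost)
print(match, total)  # (1, 0, 2) 0.35
```

Training losses are then computed only between each ground-truth object and its matched prediction, which is why DETR needs no hand-crafted NMS post-processing.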
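Several rows above grade detectors by bounding-box IoU (intersection over union). A minimal sketch of its computation for axis-aligned boxes in (x0, y0, x1, y1) form:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)  # zero if boxes are disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.1429
```

The "more AP under higher IoU" highlight for Rank-DETR refers to average precision computed with a stricter IoU threshold for counting a detection as correct (e.g. AP@0.75 instead of AP@0.5).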