From: Object detection using convolutional neural networks and transformer-based models: a review
Task | Method | Design | Highlights | Limitations |
---|---|---|---|---|
Image classification | ViT [46] | Encoder-only NLP-style transformer applied to images<br>Image patches are linearly embedded and combined with positional embeddings | Global self-attention applied over image patches<br>Convolution-free network<br>Outperforms ResNet<br>Attains excellent results while requiring substantially fewer computational resources for pre-training (roughly four times more efficient) | Requires large-scale pre-training (~300 M images)<br>Careful transfer learning needed for new tasks<br>Largest model (632 M parameters) needed for SOTA results |
Object detection | R-CNN [1] | Classifies resized and cropped region proposals with a CNN<br>Region-proposal bounding boxes refined by an SVM trained on CNN features | First effective region-based object detection model using CNNs<br>Allows custom region-proposal methods | Slow training and detection<br>Long training time: ~2000 region proposals must be classified per image<br>Selective search is hand-crafted, so the proposal stage involves no learning<br>Generates poor region proposals |
 | Fast R-CNN [2] | Edge boxes used to generate region proposals<br>Processes the entire image in a single pass, unlike R-CNN, which crops and resizes each region proposal | Pools CNN features belonging to each region proposal instead of classifying every region separately<br>More efficient than R-CNN because computation is shared among overlapping regions | Performance is limited by the quality of region-proposal identification<br>Region proposals are still estimated by an external method |
 | Faster R-CNN [3] | Replaces selective search with a separate region proposal network (RPN)<br>Predicted proposals pass through an RoI pooling layer that extracts fixed-length features from every region proposal, used to classify the proposed region and predict bounding-box offsets | Strong run-time performance<br>Improves on its predecessor in both run-time speed and raw accuracy<br>RPN is faster than selective search | RPN training samples all anchors (mini-batch size 256) from a single image<br>Slow network convergence |
 | DETR [28] | Linear projection layer reduces CNN feature dimensionality<br>Encoder-decoder with spatial positional embeddings added at each multi-head self-attention layer<br>Output positional encodings (object queries) are added to the decoder's multi-head self-attention layers<br>Trained with a Hungarian (bipartite-matching) loss | End-to-end training pipeline using a transformer for object detection<br>No manual post-processing stage | Long convergence time<br>Low detection accuracy for small objects |
 | D-DETR [29] | Deformable transformer with deformable attention layers providing sparse priors<br>Multi-scale attention | Better performance on small objects than DETR<br>Faster convergence than DETR | Best reported result (52.3 AP) relies on test-time augmentation |
 | ViT-FRCNN [18] | Transformers encode visual features while an RPN-style head produces detection outputs<br>Adds a detection network to ViT<br>The state of the input class token is passed through an MLP classification head | Large pre-training capacity<br>Fast fine-tuning<br>Reports superior performance on out-of-domain images and better performance on large objects<br>Avoids spurious over-detections<br>Investigates the relationship between self-attention and convolutional layers, and the limitations of CNNs | Long training time on a large-scale dataset (~300 M images)<br>Training from scratch is difficult on smaller datasets<br>GPU memory limitations |
 | YOLO [6] | Predicts bounding boxes directly<br>Estimates class probabilities for each bounding box | Faster than other object-detection algorithms<br>Better prediction accuracy and higher bounding-box IoU | Difficulty detecting small objects<br>Spatial constraints further limit small-object detection |
 | YOLOS [59] | Transformer blocks take the role of the CNN backbone in YOLO-style models<br>Extends ViT with a detector head that maps a generated sequence of detection representations to class and box predictions | 2D object detection accomplished in a pure sequence-to-sequence way with minimal added inductive biases<br>Encouraging performance<br>Significant preliminary outcomes | 150 epochs needed for transfer learning<br>Learns purely from visual representations |
 | Rank-DETR [70] | Rank-based design with prompt engineering<br>Rank-based loss calculation and matching cost for accurate localization ranking | Improves the state of the art<br>ResNet-50, Swin-T, and Swin-L backbones used to enhance localization accuracy<br>Higher AP at higher IoU thresholds | Rank-based design needs further exploration<br>Higher computing time |
Semantic segmentation | Mask R-CNN [8] | Predicts an object mask for each RoI while also recognizing the bounding box<br>Performs both semantic and instance segmentation | Simple to train and outperforms the state of the art<br>Adds only small overhead compared to Faster R-CNN<br>Generalizes easily | Processes still images, so it cannot exploit temporal details of objects<br>Fails to detect low-resolution objects |
 | ViT Segmenter [57] | Encoder projects image patches into a sequence of embeddings, which are then encoded by a transformer<br>Decoder: a mask transformer takes the encoder output and class embeddings and predicts segmentation masks | Transformer captures global context<br>Even a simple point-wise linear decoder applied to the patch encodings gives good results<br>Unified model for semantic, instance, and panoptic segmentation | Computationally expensive<br>Reducing patch size requires computing attention over longer sequences<br>Higher computing time |
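The ViT row's design (split the image into patches, linearly embed them, prepend a class token, add positional embeddings) can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: the embedding dimension of 64 is an illustrative choice, and the random weights and positional embeddings stand in for learned parameters.

```python
import numpy as np

def patchify(image, patch):
    """Split an HxWxC image into non-overlapping patch x patch tiles,
    each flattened into a vector of length patch*patch*C."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    return (image[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch, C)
            .transpose(0, 2, 1, 3, 4)
            .reshape(rows * cols, patch * patch * C))

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
patch, dim = 16, 64                          # illustrative sizes
tokens = patchify(image, patch)              # (196, 768) flattened patches
W_embed = rng.standard_normal((tokens.shape[1], dim)) * 0.02
x = tokens @ W_embed                         # linear patch embedding -> (196, 64)
cls = np.zeros((1, dim))                     # class token (learnable in ViT, zeros here)
x = np.concatenate([cls, x], axis=0)         # prepend class token -> (197, 64)
x = x + rng.standard_normal(x.shape) * 0.02  # random stand-in for positional embeddings
print(x.shape)  # (197, 64)
```

The resulting sequence of 197 tokens is what the transformer encoder's self-attention layers operate on; the class-token output is what heads like ViT-FRCNN's MLP classifier read.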
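The "fixed-length features extracted from every region proposal by RoI pooling" step in the Fast/Faster R-CNN rows can be sketched like this: the region under a proposal is divided into a fixed grid of bins and max-pooled per bin, so proposals of any size yield the same output shape. A single-channel toy example with assumed sizes (real implementations pool each channel of the feature map, typically into a 7x7 grid):

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool the feature-map region given by roi (x0, y0, x1, y1,
    in feature-map coordinates) into a fixed out_size x out_size grid."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    # split the region into a roughly even out_size x out_size grid of bins
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    pooled = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            pooled[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return pooled

fm = np.arange(64, dtype=float).reshape(8, 8)  # toy single-channel feature map
pooled = roi_max_pool(fm, roi=(1, 2, 7, 8))    # 6x6 region -> fixed 2x2 output
print(pooled.shape)  # (2, 2)
```

Because every proposal is reduced to the same fixed-length vector, a single fully connected classification and box-regression head can serve all proposals, which is what makes the shared computation in Fast R-CNN possible.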
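DETR's Hungarian loss first finds a one-to-one matching between predictions and ground-truth objects that minimizes the total matching cost. A brute-force sketch with a made-up cost matrix (practical implementations use the Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`, since brute force is factorial in the number of objects):

```python
from itertools import permutations

def best_matching(cost):
    """Return the assignment of each ground-truth object (row) to a distinct
    prediction (column) with minimal total cost. Brute force over all
    permutations, so only suitable for a handful of objects."""
    n = len(cost)
    best, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best, best_cost = perm, total
    return best, best_cost

# toy example: cost[i][j] = cost of matching ground truth i to prediction j
# (in DETR this combines classification and box-regression terms)
cost = [
    [0.9, 0.1, 0.7],
    [0.2, 0.8, 0.6],
    [0.5, 0.4, 0.05],
]
match, total = best_matching(cost)
print(match, total)  # (1, 0, 2) 0.35
```

Training losses are then computed only between each ground-truth object and its matched prediction, which is why DETR needs no hand-crafted NMS post-processing.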
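Several rows above grade detectors by bounding-box IoU (intersection over union). A minimal sketch of its computation for axis-aligned boxes in (x0, y0, x1, y1) form:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)  # zero if boxes are disjoint
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.1429
```

The "more AP under higher IoU" highlight for Rank-DETR refers to average precision computed with a stricter IoU threshold for counting a detection as correct (e.g. AP@0.75 instead of AP@0.5).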