Table 1 Summary of state-of-the-art methods for semantic segmentation, object detection (OD), and image classification

From: Object detection using convolutional neural networks and transformer-based models: a review

| Task | Method | Design | Highlights | Limitations |
|------|--------|--------|------------|-------------|
| Image classification | ViT [46] | Encoder applies an NLP-style transformer to images; image patches are linearly embedded and combined with positional embeddings (patch-embedding sketch after this table) | Transformer (global self-attention) applied to image patches; convolution-free network; outperforms ResNet; attains excellent results while requiring substantially fewer computational resources for pre-training (roughly four times better computational efficiency at comparable accuracy) | Requires training on a large-scale dataset (~300 M images); careful transfer learning (TL) for each new task; a large model (632 M parameters) is needed for SOTA results |
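The ViT row reduces to a patch-embedding step followed by a standard transformer encoder. Below is a minimal PyTorch sketch of that embedding step only, assuming common ViT-Base-like sizes (224-pixel images, 16-pixel patches, 768-dimensional embeddings); it is an illustration, not the exact configuration of [46].

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Linear patch embedding with a class token and positional embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution implements the "linear projection of
        # flattened patches": one kernel application per patch, no overlap.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(
            torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768): one token per patch
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend the class token
        return x + self.pos_embed            # add learned positional embeddings

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)  # torch.Size([2, 197, 768]), ready for the encoder
```

The class-token output of the final encoder layer is what the MLP head classifies, which is also how VITFRCNN (see the object detection table below) reuses ViT.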

| Task | Method | Design | Highlights | Limitations |
|------|--------|--------|------------|-------------|
| Object detection | RCNN [1] | Resized and cropped regions are classified by a CNN; region-proposal bounding boxes (BBs) are refined by an SVM trained on CNN features | First efficient CNN-based object detection model; allows custom region proposals | Slow training and detection; long training time because about 2000 region proposals must be classified per image; selective search is adopted, so there is no learning at the proposal stage; generates poor region proposals |
| | Fast RCNN [2] | Edge boxes are applied to generate region proposals; proposals are cropped and resized as in R-CNN; the entire image is processed in a single forward pass | Whereas R-CNN must classify every region separately, Fast R-CNN pools the features belonging to each region proposal from the shared CNN feature map; works more efficiently than R-CNN because computation is shared across overlapping regions | Performance is limited by region-proposal identification; efficient only when region proposals are already given, so estimating the proposals remains the bottleneck |
| | Faster RCNN [3] | Drops selective search in favor of a separate region proposal network (RPN); an RoI pooling layer extracts a fixed-length feature from every region proposal, which is then used to classify the proposed region and to predict BB offsets (RoI pooling sketch after this table) | Strong run-time performance; improves on its predecessor in both run-time speed and raw accuracy; the RPN is much faster than selective search | RPN training uses a mini-batch of 256 anchors, all extracted from a single image; network convergence is slow |
| | DETR [28] | A linear projection layer reduces the CNN feature dimension; the encoder-decoder adds spatial positional embeddings at every multi-head self-attention layer; learned output positional encodings (object queries) are added to the decoder's multi-head self-attention layers; a Hungarian set-matching loss is used (matching sketch after this table) | End-to-end transformer training pipeline for OD; no manual post-processing stage | Needs more time to converge; low detection accuracy for small objects |
| | D-DETR [29] | Deformable transformer with deformable attention layers exploiting sparse spatial priors; multi-scale attention is applied | Better performance on small objects than DETR; converges faster than DETR | Its SOTA result (52.3 AP) relies on test-time augmentation |
| | VITFRCNN [18] | Transformers encode the visual features while an RPN produces the detection outputs; adds a detection network to ViT; the state of the input class token is output through an MLP classification head | Large pre-training capacity; fast fine-tuning; reports superior performance on out-of-domain images and better performance on large objects; avoids spurious over-detections; investigates the relationship between self-attention and convolutional layers and the limitations of CNNs | Long training time on a large-scale dataset (~300 M images); training from scratch is difficult on smaller datasets; GPU memory limitations |
| | YOLO [6] | Predicts BBs and their class probabilities directly from the full image in a single pass (grid-decoding sketch after this table) | Faster than other OD algorithms; good prediction accuracy and BB IoU | Trouble detecting small objects; spatial constraints on the grid cells further limit small-object detection |
| | YOLOS [59] | Transformer-based counterpart of the CNN-based YOLO; extends ViT with a detector head that maps the generated sequence of detection representations to class and box predictions | Accomplishes 2D OD in a pure sequence-to-sequence manner with minimal additional inductive biases; encouraging performance; significant preliminary outcomes | 150 epochs needed for TL; detection quality depends on the learned visual representations |
| | Rank-DETR [70] | Rank-based design with prompt engineering; a rank-based loss and matching cost produce an accurate ranking by localization accuracy | Improves the SOTA; ResNet-50, Swin-L, and Swin-T backbones are used to enhance localization accuracy; higher AP under higher IoU thresholds | The rank-based design needs further exploration; more computing time |
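To make the RoI pooling step in the Fast/Faster R-CNN rows concrete, the sketch below runs torchvision's built-in `roi_pool` operator; the feature-map size, the 16x stride, and the proposal boxes are illustrative assumptions, not values from [2, 3].

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)   # shared CNN features for one image
# Region proposals in image coordinates (x1, y1, x2, y2). Assuming the
# feature map is the image downscaled 16x, spatial_scale is 1/16.
proposals = [torch.tensor([[ 64.,  64., 192., 192.],
                           [  0., 128., 256., 320.]])]
pooled = roi_pool(feature_map, proposals, output_size=(7, 7),
                  spatial_scale=1.0 / 16)
# One fixed-length (256 x 7 x 7) feature per proposal, regardless of the
# proposal's size: this is what the classification and BB-offset heads consume.
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```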
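DETR's Hungarian loss is computed on top of a bipartite matching between the fixed set of object queries and the ground-truth boxes. The sketch below shows only that matching step, with a toy cost (negative probability of the target class plus an L1 box distance) standing in for DETR's full matching cost.

```python
import torch
from scipy.optimize import linear_sum_assignment

num_queries, num_classes, num_targets = 6, 5, 3    # toy sizes
prob = torch.rand(num_queries, num_classes).softmax(-1)  # per-query class probs
pred_boxes = torch.rand(num_queries, 4)            # (cx, cy, w, h), normalized
tgt_labels = torch.tensor([1, 3, 2])               # ground-truth classes
tgt_boxes = torch.rand(num_targets, 4)             # ground-truth boxes

# Pairwise cost between every query and every ground-truth object.
cost = -prob[:, tgt_labels] + torch.cdist(pred_boxes, tgt_boxes, p=1)
row, col = linear_sum_assignment(cost.numpy())     # Hungarian algorithm
print(list(zip(row, col)))   # each ground truth matched to exactly one query
```

Classification and box losses are then evaluated only on the matched pairs, with unmatched queries trained to predict "no object"; this set formulation is what removes the manual post-processing stage noted in the table.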
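The YOLO row's single-pass design can be read off the shape of its output tensor: an S x S grid where each cell predicts B boxes with confidences plus C class probabilities shared by the cell. The sketch below uses the original paper's S=7, B=2, C=20, with random values standing in for real network output.

```python
import torch

S, B, C = 7, 2, 20                      # grid size, boxes per cell, classes
out = torch.rand(S, S, B * 5 + C)       # stand-in for one image's raw output

boxes = out[..., :B * 5].reshape(S, S, B, 5)  # (x, y, w, h, confidence) per box
class_prob = out[..., B * 5:]                 # (S, S, C) per-cell class probs
# Class-specific confidence for every box, as in YOLO's test-time scoring:
# P(class | object) * P(object).
scores = class_prob.unsqueeze(2) * boxes[..., 4:5]   # (S, S, B, C)
print(scores.shape)  # torch.Size([7, 7, 2, 20]); thresholding and NMS follow
```

Because each cell commits to B boxes and one class distribution, nearby small objects compete for the same cell, which is the spatial constraint listed under YOLO's limitations.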

| Task | Method | Design | Highlights | Limitations |
|------|--------|--------|------------|-------------|
| Semantic segmentation | Mask RCNN [8] | Predicts an object mask for each RoI while also recognizing BBs; performs both semantic and instance segmentation (usage sketch after this table) | Simple to train and outperforms the state of the art; adds only a small overhead compared to Faster R-CNN; generalizes easily | Processes still images only and cannot exploit the temporal details of objects; fails to detect low-resolution objects |
| | ViT Segmenter [57] | Encoder projects image patches into a sequence of embeddings that is then encoded with a transformer; the mask-transformer decoder combines the encoder output with class embeddings and predicts segmentation masks | The transformer captures global context; even a simple point-wise linear decoder applied to the patch encodings gives good results; a unified model for semantic, instance, and panoptic segmentation | Computationally demanding; reducing the patch size requires computing attention over longer sequences; more computing time |
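For the Mask R-CNN row, torchvision's pre-trained implementation (an off-the-shelf stand-in, not the code of [8]) shows the per-instance mask-plus-box output on a still image; `weights="DEFAULT"` assumes torchvision 0.13 or newer.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = torch.rand(3, 480, 640)        # stand-in for a real RGB image tensor
with torch.no_grad():
    pred = model([image])[0]           # one prediction dict per input image
print(pred["boxes"].shape)             # (N, 4): one BB per detected instance
print(pred["masks"].shape)             # (N, 1, 480, 640): one mask per instance
```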