QUICK REVIEW

[Paper Review] ResT: An Efficient Transformer for Visual Recognition

Qinglong Zhang, Yu-Bin Yang | arXiv (Cornell University) | May 28, 2021
Advanced Neural Network Applications | 36 references | 148 citations
TL;DR

ResT introduces a memory-efficient multi-scale Vision Transformer backbone with EMSA attention, flexible spatial positional encoding, and overlapping patch embeddings, achieving strong ImageNet and COCO results.

ABSTRACT

This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to tackle raw images with a fixed resolution, ResT has several advantages: (1) A memory-efficient multi-head self-attention is built, which compresses the memory by a simple depth-wise convolution, and projects the interaction across the attention-heads dimension while keeping the diversity ability of multi-heads; (2) Position encoding is constructed as spatial attention, which is more flexible and can handle input images of arbitrary size without interpolation or fine-tuning; (3) Instead of the straightforward tokenization at the beginning of each stage, we design the patch embedding as a stack of overlapping convolution operations with stride on the 2D-reshaped token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.
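
Point (3) of the abstract, together with the claim about arbitrary input sizes, can be illustrated with the standard convolution output-size formula. The sketch below is only an illustration: the layer settings (a stem of three 3×3 convolutions with strides 2, 1, 2, then one 3×3 stride-2 convolution per later stage, all with padding 1) are assumptions chosen for the example, not values stated in this review. Because the kernel (3) is larger than the stride (2), neighbouring patches overlap.

```python
def conv_out(size, kernel, stride, pad):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * pad - kernel) // stride + 1

# Assumed layer settings as (kernel, stride, padding); hypothetical values
# picked for this sketch, not taken from the review text.
STEM = [(3, 2, 1), (3, 1, 1), (3, 2, 1)]   # overall stride 4
STAGE_EMBED = (3, 2, 1)                     # overlapping: kernel 3 > stride 2

def pyramid_sizes(size, num_stages=4):
    """Feature-map side length after each stage for a square input."""
    for kernel, stride, pad in STEM:
        size = conv_out(size, kernel, stride, pad)
    sizes = [size]
    for _ in range(num_stages - 1):
        size = conv_out(size, *STAGE_EMBED)
        sizes.append(size)
    return sizes

for side in (224, 320, 225):
    print(side, pyramid_sizes(side))
# 224 [56, 28, 14, 7]
# 320 [80, 40, 20, 10]
# 225 [57, 29, 15, 8]
```

Under these assumptions the four stages form a pyramid at strides 4, 8, 16, and 32, and nothing in the arithmetic requires the input side to be a multiple of a fixed patch size.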

Motivation & Objective

  • Develop a general-purpose backbone architecture for image recognition that blends CNN locality with Transformer global reasoning.
  • Reduce memory and computational cost of self-attention while preserving multi-head diversity.
  • Enable flexible input sizes and multi-scale feature maps for dense prediction tasks.
  • Validate ResT on ImageNet-1k classification and downstream tasks like object detection and instance segmentation.
  • Demonstrate that ResT outperforms comparable backbones at similar model sizes.

Proposed method

  • Introduce Efficient Multi-head Self-Attention (EMSA) that uses depth-wise convolution to compress spatial tokens and projects interactions across attention-heads.
  • Replace fixed patch tokenization with overlapping convolution-based patch embedding to build a multi-scale feature pyramid.
  • Define positional encoding as spatial attention (PA) to handle variable input sizes without interpolation or fine-tuning.
  • Incorporate a 1×1 convolution + Instance Normalization within EMSA to restore head diversity and stabilize training.
  • Use stage-wise patch embedding to progressively grow channel dimensions and reduce spatial resolution, forming a ResNet-like multi-stage backbone.
  • Adopt pre-normalization in downstream frameworks and a simple global average pooling classifier for ImageNet-1k evaluation.
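
The EMSA bullets above can be sketched in NumPy as follows. This is a minimal, hedged illustration rather than the authors' implementation: the weights are random, the learnable affine parameters of the normalization layers and the final output projection are omitted, and the kernel size (s + 1), padding (s // 2), and the ordering 1×1 convolution, then softmax, then Instance Normalization are assumptions made for this example. All function names are hypothetical.

```python
import numpy as np

def depthwise_conv2d(x, weight, stride, pad):
    """Depth-wise conv: x is (C, H, W), weight is (C, k, k), one filter per channel."""
    C, H, W = x.shape
    k = weight.shape[-1]
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    Ho = (H + 2 * pad - k) // stride + 1
    Wo = (W + 2 * pad - k) // stride + 1
    out = np.empty((C, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            patch = xp[:, i * stride:i * stride + k, j * stride:j * stride + k]
            out[:, i, j] = (patch * weight).sum(axis=(1, 2))
    return out

def layer_norm(t, eps=1e-5):
    mu = t.mean(axis=-1, keepdims=True)
    var = t.var(axis=-1, keepdims=True)
    return (t - mu) / np.sqrt(var + eps)

def softmax(t, axis=-1):
    e = np.exp(t - t.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def emsa(x, h, w, heads, s, rng):
    """x: (n, d) tokens with n = h * w; s: assumed spatial reduction factor."""
    n, d = x.shape
    dk = d // heads
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    dw = rng.standard_normal((d, s + 1, s + 1)) / (s + 1)       # depth-wise filters
    mix = rng.standard_normal((heads, heads)) / np.sqrt(heads)  # 1x1 conv over heads

    def split_heads(t):  # (tokens, d) -> (heads, tokens, dk)
        return t.reshape(t.shape[0], heads, dk).transpose(1, 0, 2)

    q = split_heads(x @ Wq)
    # Compress the 2D-reshaped token map before computing keys and values.
    x2d = x.T.reshape(d, h, w)
    xr = depthwise_conv2d(x2d, dw, stride=s, pad=s // 2).reshape(d, -1).T
    xr = layer_norm(xr)
    k, v = split_heads(xr @ Wk), split_heads(xr @ Wv)

    attn = q @ k.transpose(0, 2, 1) / np.sqrt(dk)   # (heads, n, n / s^2)
    attn = np.einsum("gh,hnm->gnm", mix, attn)      # interaction across heads
    attn = softmax(attn, axis=-1)
    # Instance Normalization: normalize each head's attention map on its own.
    mu = attn.mean(axis=(1, 2), keepdims=True)
    var = attn.var(axis=(1, 2), keepdims=True)
    attn = (attn - mu) / np.sqrt(var + 1e-5)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out, attn.shape

rng = np.random.default_rng(0)
h = w = 8
x = rng.standard_normal((h * w, 16))
out, attn_shape = emsa(x, h, w, heads=4, s=2, rng=rng)
print(out.shape, attn_shape)  # (64, 16) (4, 64, 16)
```

With s = 2 each head's attention map has 64 × 16 entries instead of 64 × 64, which is where the memory saving comes from; the output still has one row per input token.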

Experimental results

Research questions

  • RQ1: How can self-attention be made memory-efficient for Vision Transformer backbones without sacrificing performance?
  • RQ2: Can spatially conditioned positional encodings enable flexible input sizes and multi-scale representations for dense prediction?
  • RQ3: Do overlapping patch embeddings improve low-level feature capture and overall accuracy compared to standard tokenization?
  • RQ4: What performance gains do ResT backbones provide on ImageNet-1k and COCO object detection/instance segmentation relative to similar-cost backbones?

Key findings

Model        #Params (M)   FLOPs (G)   Throughput (images/s)   Top-1 (%)      Top-5 (%)
ResT-Lite    10.49         1.4         1246                    77.2 (↑7.5)    93.7 (↑4.6)
ResT-Small   13.66         1.9         1043                    79.6 (↑9.9)    94.9 (↑5.8)
ResT-Base    30.28         4.3         673                     81.6 (↑2.6)    95.7 (↑1.3)
ResT-Large   51.63         7.9         429                     83.6 (↑3.3)    96.3 (↑1.1)

  • ResT-Small achieves 79.6% Top-1 accuracy on ImageNet-1k with 1.9G FLOPs and 13.66M parameters.
  • ResT-Large reaches 83.6% Top-1 accuracy with 7.9G FLOPs and 51.63M parameters, outperforming Swin variants of comparable computational cost.
  • On COCO object detection with RetinaNet, ResT-Small improves AP by 3.6 points over PVT-T (40.3 vs 36.7).
  • On COCO object detection with RetinaNet, ResT-Base improves AP by 1.6 points over PVT-S (42.0 vs 40.4).
  • ResT-Large delivers strong gains in Mask RCNN-based instance segmentation (APbox 41.6, APmask 38.7) compared to PVT-S and Swin variants at similar budgets.
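
The stated RetinaNet gains follow directly from the AP values quoted in the findings above; the short check below recomputes them and also relates the ImageNet-1k accuracy of ResT-Small and ResT-Large to their FLOPs. It uses only numbers that appear in this review, and the helper names are hypothetical.

```python
# RetinaNet AP on COCO, as quoted in the key findings.
retinanet_ap = {"ResT-Small": 40.3, "PVT-T": 36.7, "ResT-Base": 42.0, "PVT-S": 40.4}

# ImageNet-1k numbers from the key findings: (params in M, FLOPs in G, Top-1 in %).
imagenet = {
    "ResT-Small": (13.66, 1.9, 79.6),
    "ResT-Large": (51.63, 7.9, 83.6),
}

def ap_gain(model, baseline):
    """AP difference in points, rounded to one decimal place."""
    return round(retinanet_ap[model] - retinanet_ap[baseline], 1)

def top1_gain(model, baseline):
    """Top-1 difference in points and the extra GFLOPs it costs."""
    _, flops_m, top1_m = imagenet[model]
    _, flops_b, top1_b = imagenet[baseline]
    return round(top1_m - top1_b, 1), round(flops_m - flops_b, 1)

print(ap_gain("ResT-Small", "PVT-T"))         # 3.6
print(ap_gain("ResT-Base", "PVT-S"))          # 1.6
print(top1_gain("ResT-Large", "ResT-Small"))  # (4.0, 6.0)
```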


This review was created by AI and reviewed by human editors.