[Paper Review] ResT: An Efficient Transformer for Visual Recognition
ResT introduces a memory-efficient multi-scale Vision Transformer backbone with EMSA attention, flexible spatial positional encoding, and overlapping patch embeddings, achieving strong ImageNet and COCO results.
This paper presents an efficient multi-scale vision Transformer, called ResT, that can serve as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to process raw images at a fixed resolution, ResT has several advantages: (1) a memory-efficient multi-head self-attention is built, which compresses memory with a simple depth-wise convolution and projects the interaction across the attention-head dimension while preserving the diversity of the heads; (2) position encoding is constructed as spatial attention, which is more flexible and can handle input images of arbitrary size without interpolation or fine-tuning; (3) instead of straightforward tokenization at the beginning of each stage, we design the patch embedding as a stack of overlapping strided convolution operations on the 2D-reshaped token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.
Motivation & Objective
- Develop a general-purpose backbone architecture for image recognition that blends CNN locality with Transformer global reasoning.
- Reduce memory and computational cost of self-attention while preserving multi-head diversity.
- Enable flexible input sizes and multi-scale feature maps for dense prediction tasks.
- Validate ResT on ImageNet-1k classification and downstream tasks like object detection and instance segmentation.
- Demonstrate that ResT outperforms comparable backbones at similar model sizes.
Proposed method
- Introduce Efficient Multi-head Self-Attention (EMSA) that uses depth-wise convolution to compress spatial tokens and projects interactions across attention-heads.
- Replace fixed patch tokenization with overlapping convolution-based patch embedding to build a multi-scale feature pyramid.
- Define positional encoding as spatial attention (PA) to handle variable input sizes without interpolation or fine-tuning.
- Incorporate a 1×1 convolution + Instance Normalization within EMSA to restore head diversity and stabilize training.
- Use stage-wise patch embedding to progressively grow channel dimensions and reduce spatial resolution, forming a ResNet-like backbone.
- Adopt pre-normalization in downstream frameworks and a simple global average pooling classifier for ImageNet-1k evaluation.
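The spatial bookkeeping behind these design choices can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function names (`conv_out`, `pyramid`, `attention_entries`) are hypothetical, and the specific settings (a stem that downsamples by 4, 3×3 stride-2 patch embeddings with padding 1, and a depth-wise convolution with kernel s+1, stride s, and padding s//2 for a reduction factor s) are assumptions chosen for illustration.

```python
from typing import List, Tuple


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def pyramid(input_size: int = 224, stem_stride: int = 4, num_stages: int = 4) -> List[int]:
    """Side length of the token map at each stage.

    Assumption: the stem downsamples by `stem_stride`, and every later
    stage starts with an overlapping 3x3, stride-2, padding-1 convolution.
    """
    sizes = [input_size // stem_stride]
    for _ in range(num_stages - 1):
        sizes.append(conv_out(sizes[-1], kernel=3, stride=2, padding=1))
    return sizes


def attention_entries(side: int, reduction: int) -> Tuple[int, int]:
    """Attention-map entries per head: (standard MSA, compressed keys/values).

    Assumption: keys and values are compressed along each axis by a
    depth-wise convolution with kernel reduction+1, stride reduction,
    padding reduction//2; a reduction of 1 leaves the tokens untouched.
    """
    n_query = side * side
    if reduction == 1:
        reduced_side = side
    else:
        reduced_side = conv_out(side, reduction + 1, reduction, reduction // 2)
    n_key = reduced_side * reduced_side
    return n_query * n_query, n_query * n_key


if __name__ == "__main__":
    print("stage side lengths:", pyramid(224))
    full, compressed = attention_entries(side=56, reduction=8)
    print("attention entries per head:", full, "vs", compressed)
```

Under these assumptions, a 224×224 input yields token maps of side 56, 28, 14, and 7, and with s = 8 at the first stage each head's attention map shrinks from 3136×3136 to 3136×49 entries, a 64-fold reduction. The queries keep their full resolution, so the output token map is the same size as the input.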
Experimental results
Research questions
- RQ1: How can self-attention be made memory-efficient for Vision Transformer backbones without sacrificing performance?
- RQ2: Can spatially conditioned positional encodings enable flexible input sizes and multi-scale representations for dense prediction?
- RQ3: Do overlapping patch embeddings improve low-level feature capture and overall accuracy compared to standard tokenization?
- RQ4: What performance gains do ResT backbones provide on ImageNet-1k and COCO object detection/instance segmentation relative to similar-cost backbones?
Key findings
| Model | #Params (M) | FLOPs (G) | Throughput (images/s) | Top-1 (%) | Top-5 (%) |
|---|---|---|---|---|---|
| ResT-Lite | 10.49 | 1.4 | 1246 | 77.2 (↑7.5) | 93.7 (↑4.6) |
| ResT-Small | 13.66 | 1.9 | 1043 | 79.6 (↑9.9) | 94.9 (↑5.8) |
| ResT-Base | 30.28 | 4.3 | 673 | 81.6 (↑2.6) | 95.7 (↑1.3) |
| ResT-Large | 51.63 | 7.9 | 429 | 83.6 (↑3.3) | 96.3 (↑1.1) |
- ResT-Small achieves 79.6% Top-1 accuracy on ImageNet-1k with 1.9G FLOPs and 13.66M parameters.
- ResT-Large reaches 83.6% Top-1 accuracy with 7.9G FLOPs and 51.63M parameters, outperforming Swin variants of similar computational cost.
- On COCO object detection with RetinaNet, ResT-Small improves AP by 3.6 points over PVT-T (40.3 vs 36.7).
- On COCO object detection with RetinaNet, ResT-Base improves AP by 1.6 points over PVT-S (42.0 vs 40.4).
- ResT-Large delivers strong gains in Mask R-CNN-based instance segmentation (AP^box 41.6, AP^mask 38.7) compared with PVT-S and Swin variants at similar budgets.
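The gains quoted in these findings are simple differences and ratios, so they can be re-derived from the reported numbers. The snippet below is a small sanity check that uses only values from the table and bullets above; the helper names (`ap_gain`, `scaling`) are hypothetical.

```python
# RetinaNet AP values copied from the findings above.
retinanet_ap = {
    "ResT-Small": 40.3,
    "PVT-T": 36.7,
    "ResT-Base": 42.0,
    "PVT-S": 40.4,
}

# ImageNet-1k rows copied from the table: (FLOPs in G, Top-1 in %).
imagenet = {
    "ResT-Lite": (1.4, 77.2),
    "ResT-Small": (1.9, 79.6),
    "ResT-Base": (4.3, 81.6),
    "ResT-Large": (7.9, 83.6),
}


def ap_gain(model: str, baseline: str) -> float:
    """AP difference in points, rounded to one decimal place."""
    return round(retinanet_ap[model] - retinanet_ap[baseline], 1)


def scaling(small: str, large: str):
    """Return (Top-1 gain in points, FLOPs ratio) when moving from small to large."""
    flops_s, top1_s = imagenet[small]
    flops_l, top1_l = imagenet[large]
    return round(top1_l - top1_s, 1), round(flops_l / flops_s, 2)


if __name__ == "__main__":
    print("ResT-Small vs PVT-T:", ap_gain("ResT-Small", "PVT-T"))
    print("ResT-Base vs PVT-S:", ap_gain("ResT-Base", "PVT-S"))
    print("Lite -> Large (Top-1 gain, FLOPs ratio):", scaling("ResT-Lite", "ResT-Large"))
```

The differences come out to 3.6 and 1.6 AP points, matching the bullets, and moving from ResT-Lite to ResT-Large buys 6.4 Top-1 points for roughly 5.6 times the FLOPs.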
This review was created by AI and reviewed by human editors.