QUICK REVIEW

[Paper Review] ResT: An Efficient Transformer for Visual Recognition

Qinglong Zhang, Yu-Bin Yang|arXiv (Cornell University)|May 28, 2021

Advanced Neural Network Applications | 36 references | 148 citations

TL;DR

ResT introduces a memory-efficient multi-scale Vision Transformer backbone with EMSA attention, flexible spatial positional encoding, and overlapping patch embeddings, achieving strong ImageNet and COCO results.

ABSTRACT

This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to tackle raw images with a fixed resolution, our ResT has several advantages: (1) A memory-efficient multi-head self-attention is built, which compresses the memory by a simple depth-wise convolution, and projects the interaction across the attention-heads dimension while keeping the diversity ability of multi-heads; (2) Position encoding is constructed as spatial attention, which is more flexible and can tackle input images of arbitrary size without interpolation or fine-tuning; (3) Instead of the straightforward tokenization at the beginning of each stage, we design the patch embedding as a stack of overlapping convolution operations with stride on the 2D-reshaped token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.

Motivation & Objective

Develop a general-purpose backbone architecture for image recognition that blends CNN locality with Transformer global reasoning.
Reduce memory and computational cost of self-attention while preserving multi-head diversity.
Enable flexible input sizes and multi-scale feature maps for dense prediction tasks.
Validate ResT on ImageNet-1k classification and downstream tasks like object detection and instance segmentation.
Demonstrate that ResT outperforms comparable backbones at similar model sizes.

Proposed method

Introduce Efficient Multi-head Self-Attention (EMSA) that uses depth-wise convolution to compress spatial tokens and projects interactions across attention-heads.
Replace fixed patch tokenization with overlapping convolution-based patch embedding to build a multi-scale feature pyramid.
Define positional encoding as spatial attention (PA) to handle variable input sizes without interpolation or fine-tuning.
Incorporate a 1×1 convolution + Instance Normalization within EMSA to restore head diversity and stabilize training.
Use stage-wise patch embedding to progressively grow channel dimensions and reduce spatial resolution, forming a ResNet-like multi-stage backbone.
Adopt pre-normalization in downstream frameworks and a simple global average pooling classifier for ImageNet-1k evaluation.
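
The effect of the stage-wise patch embedding and the EMSA token compression can be illustrated with a little shape arithmetic. The sketch below is not the authors' implementation: the four-stage layout, the channel widths (64, 128, 256, 512), the per-stage reduction factors (8, 4, 2, 1), the 3×3 stride-2 overlapping convolutions, and the depth-wise kernel of size s+1 with padding s//2 are all assumptions chosen for illustration, and the names StageShape, conv_out, and rest_shapes are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StageShape:
    stage: int
    side: int          # height == width of the token map at this stage
    channels: int
    query_tokens: int  # tokens at full stage resolution
    kv_tokens: int     # key/value tokens after depth-wise compression
    full_attn: int     # attention-map entries per head without compression
    emsa_attn: int     # attention-map entries per head with compression


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def rest_shapes(image_side=224,
                channels=(64, 128, 256, 512),   # assumed widths
                reductions=(8, 4, 2, 1)):       # assumed compression factors
    shapes = []
    # Assumed stem: two overlapping 3x3 stride-2 convolutions (overall stride 4).
    side = conv_out(conv_out(image_side, 3, 2, 1), 3, 2, 1)
    for i, (c, s) in enumerate(zip(channels, reductions), start=1):
        if i > 1:
            # Overlapping patch embedding: 3x3 convolution with stride 2.
            side = conv_out(side, 3, 2, 1)
        # Assumed depth-wise compression: kernel s+1, stride s, padding s//2.
        # A factor of 1 means the key/value tokens are left uncompressed.
        kv_side = side if s == 1 else conv_out(side, s + 1, s, s // 2)
        q, kv = side * side, kv_side * kv_side
        shapes.append(StageShape(i, side, c, q, kv, q * q, q * kv))
    return shapes


for s in rest_shapes():
    saving = s.full_attn // s.emsa_attn
    print(f"stage {s.stage}: {s.side}x{s.side}x{s.channels}, "
          f"{s.query_tokens} queries, {s.kv_tokens} keys/values, "
          f"attention map {saving}x smaller")
```

Under these assumptions a 224×224 input yields token maps with side lengths 56, 28, 14, and 7, and the first-stage attention map is 64 times smaller than it would be without compression. Because the shapes follow from convolution arithmetic rather than a fixed patch grid, the same function accepts other input sizes, for example rest_shapes(image_side=256).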

Experimental results

Research questions

RQ1: How can self-attention be made memory-efficient for Vision Transformer backbones without sacrificing performance?
RQ2: Can spatially conditioned positional encodings enable flexible input sizes and multi-scale representations for dense prediction?
RQ3: Do overlapping patch embeddings improve low-level feature capture and overall accuracy compared to standard tokenization?
RQ4: What performance gains do ResT backbones provide on ImageNet-1k and COCO object detection/instance segmentation relative to similar-cost backbones?

Key findings

Model	#Params (M)	FLOPs (G)	Throughput (images/s)	Top-1 (%)	Top-5 (%)
ResT-Lite	10.49	1.4	1246	77.2 (↑7.5)	93.7 (↑4.6)
ResT-Small	13.66	1.9	1043	79.6 (↑9.9)	94.9 (↑5.8)
ResT-Base	30.28	4.3	673	81.6 (↑2.6)	95.7 (↑1.3)
ResT-Large	51.63	7.9	429	83.6 (↑3.3)	96.3 (↑1.1)

ResT-Small achieves 79.6% Top-1 accuracy on ImageNet-1k with 1.9G FLOPs and 13.66M parameters.
ResT-Large reaches 83.6% Top-1 accuracy with 7.9G FLOPs and 51.63M parameters, outperforming Swin variants at a similar compute budget.
On COCO object detection with RetinaNet, ResT-Small improves AP by 3.6 points over PVT-T (40.3 vs 36.7).
On COCO object detection with RetinaNet, ResT-Base improves AP by 1.6 points over PVT-S (42.0 vs 40.4).
ResT-Large delivers strong gains in Mask RCNN-based instance segmentation (APbox 41.6, APmask 38.7) compared to PVT-S and Swin variants at similar budgets.
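
The quoted gains can be rechecked directly from the numbers reported above. The snippet below is only a convenience sketch: it copies the figures from this review and adds no new measurements, and the variable names are hypothetical.

```python
# RetinaNet AP values as quoted in the key findings: (ResT, PVT baseline).
retinanet_ap = {
    ("ResT-Small", "PVT-T"): (40.3, 36.7),
    ("ResT-Base", "PVT-S"): (42.0, 40.4),
}
ap_gain = {pair: round(ours - base, 1)
           for pair, (ours, base) in retinanet_ap.items()}

# ImageNet-1k rows from the table: (model, params in M, FLOPs in G, Top-1 in %).
imagenet = [
    ("ResT-Lite", 10.49, 1.4, 77.2),
    ("ResT-Small", 13.66, 1.9, 79.6),
    ("ResT-Base", 30.28, 4.3, 81.6),
    ("ResT-Large", 51.63, 7.9, 83.6),
]

# For each step up in model size: FLOPs multiplier and Top-1 gain in points.
steps = []
for (a, _, fa, ta), (b, _, fb, tb) in zip(imagenet, imagenet[1:]):
    steps.append((a, b, round(fb / fa, 2), round(tb - ta, 1)))

for pair, gain in ap_gain.items():
    print(f"{pair[0]} vs {pair[1]}: +{gain} AP")
for a, b, flops_ratio, top1_gain in steps:
    print(f"{a} -> {b}: {flops_ratio}x FLOPs, +{top1_gain} Top-1")
```

The AP differences come out to 3.6 and 1.6 points, matching the text, and each step up the model family costs roughly 1.4x to 2.3x more FLOPs for 2.0 to 2.4 additional Top-1 points.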

This review was created by AI and reviewed by human editors.