QUICK REVIEW

[Paper Review] Exploring Plain Vision Transformer Backbones for Object Detection

Yanghao Li, Hanzi Mao|arXiv (Cornell University)|Mar 30, 2022

Advanced Neural Network Applications|42 citations

TL;DR

The paper investigates plain (non-hierarchical) ViT backbones for object detection. With minimal adaptations at fine-tuning time and MAE pre-training, the resulting detector is competitive with hierarchical-backbone methods, reaching 61.3 AP_box on COCO with ViT-H using only ImageNet-1K pre-training.

ABSTRACT

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.

Motivation & Objective

Decouple backbone design from detection-specific modules to enable plain ViT backbones to be fine-tuned for detection.
Show that a simple, non-hierarchical backbone can support multi-scale detection without traditional FPNs.
Demonstrate that minimal adaptations (window attention and a simple feature pyramid) are sufficient for strong performance.
Compare plain-backbone detectors with leading hierarchical backbones (Swin, MViT) under fair conditions.
Highlight the benefits of MAE pre-training for plain ViT backbones in detection tasks.

Proposed method

Use plain ViT backbones (ViT-B/L/H) pre-trained with Masked Autoencoder (MAE) on ImageNet-1K.
Build a simple feature pyramid from only the last, single-scale feature map of the plain backbone, enabling multi-scale detection without the common FPN design or a hierarchical backbone.
Apply window-based self-attention during fine-tuning with a small number of cross-window propagation blocks (global attention or convolutions).
Fine-tune detectors (Mask R-CNN / Cascade Mask R-CNN) on COCO with ImageNet-1K MAE pre-training, using standard detection heads.
Ablate backbone adaptation strategies for accuracy and efficiency: no cross-window propagation, convolutional propagation, global-attention propagation, and different placements of the propagation blocks.
Evaluate across COCO and LVIS datasets, including comparisons with Swin and MViT hierarchical backbones.
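
The simple feature pyramid step can be made concrete with a little shape arithmetic. The sketch below is illustrative only and is not the authors' implementation: the function name `simple_pyramid_shapes`, the 1024x1024 input, the patch size of 16, and the output strides (4, 8, 16, 32) are assumptions chosen for the example. It shows that every pyramid level is derived from the one stride-16 map the plain backbone emits, by upsampling or downsampling that single map.

```python
def simple_pyramid_shapes(image_hw, patch_size=16, strides=(4, 8, 16, 32)):
    """Return {stride: (height, width)} for levels built from one feature map.

    Assumption: the plain backbone emits a single map at stride `patch_size`,
    and each pyramid level is a rescaled copy of that map (no top-down or
    lateral connections as in FPN).
    """
    height, width = image_hw
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    shapes = {}
    for stride in strides:
        if height % stride or width % stride:
            raise ValueError(f"image size must be divisible by stride {stride}")
        # stride < patch_size: upsample the base map; stride > patch_size: downsample it
        shapes[stride] = (height // stride, width // stride)
    return shapes


# Hypothetical 1024x1024 input; the base map is 64x64 at stride 16.
shapes = simple_pyramid_shapes((1024, 1024))
for stride, (h, w) in sorted(shapes.items()):
    print(f"stride {stride:>2}: {h}x{w}")
```

In a real detector the rescaling would be done with learned layers (for example strided or transposed convolutions); only the resulting shapes are modeled here.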

Experimental results

Research questions

RQ1: Can a plain, non-hierarchical ViT backbone be effectively fine-tuned for multi-scale object detection without sacrificing detection accuracy?
RQ2: What minimal adaptations to a plain ViT backbone (feature pyramid, window attention, cross-window propagation) suffice to achieve competitive detection results?

Key findings

Plain backbones with a simple feature pyramid beat the baseline without a pyramid by up to 3.4 AP on COCO.
Window attention with a few cross-window propagation blocks is sufficient for good accuracy on detection tasks.
MAE pre-training on IN-1K yields substantial gains for ViT backbones in detection (e.g., +3.1 AP for ViT-B, +4.6 AP for ViT-L).
ViTDet with MAE pre-training achieves results competitive with hierarchical backbones and, for large models, can outperform some hierarchical methods.
ViT-H with MAE pre-training reaches 61.3 AP_box on COCO using only ImageNet-1K pre-training, showing that a plain backbone can compete with leading hierarchical detectors.
Plain-backbone detectors show favorable scaling and faster wall-clock speed than some hierarchical-backbone methods.
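
To illustrate the window-attention finding, here is a minimal sketch of non-shifted window partitioning plus a placement rule for the few cross-window propagation blocks. It is an illustration under stated assumptions, not the paper's code: the helper names, the 8x8 token grid, the window size of 4, the 12-block depth, and the rule of placing one propagation block at the end of each of four equal subsets are all hypothetical choices for the example.

```python
def window_partition(grid_h, grid_w, window):
    """Group token coordinates of a grid into non-overlapping, non-shifted windows."""
    if grid_h % window or grid_w % window:
        raise ValueError("grid must be divisible by the window size")
    windows = {}
    for row in range(grid_h):
        for col in range(grid_w):
            # Tokens attend only to tokens sharing the same window key.
            key = (row // window, col // window)
            windows.setdefault(key, []).append((row, col))
    return windows


def propagation_block_indices(depth, num_propagation):
    """Index of the last block in each of `num_propagation` equal subsets (assumed rule)."""
    if depth % num_propagation:
        raise ValueError("depth must be divisible by the number of propagation blocks")
    subset = depth // num_propagation
    return [subset * (i + 1) - 1 for i in range(num_propagation)]


windows = window_partition(8, 8, 4)             # hypothetical 8x8 token grid
global_blocks = propagation_block_indices(12, 4)  # hypothetical 12-block backbone

# Token pairs scored per attention layer: full grid vs. within windows only.
global_pairs = (8 * 8) ** 2
window_pairs = sum(len(tokens) ** 2 for tokens in windows.values())
print(len(windows), global_blocks, global_pairs, window_pairs)
```

With these toy sizes, windowed blocks score a quarter of the token pairs that a global block does, which is why only a handful of propagation blocks are needed to mix information across windows at modest cost.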

This review was created by AI and reviewed by human editors.