QUICK REVIEW

[Paper Review] Exploring Plain Vision Transformer Backbones for Object Detection

Yanghao Li, Hanzi Mao|arXiv (Cornell University)|Mar 30, 2022
Advanced Neural Network Applications | 42 citations
TL;DR

The paper investigates plain (non-hierarchical) ViT backbones for object detection and shows that, with minimal adaptations during fine-tuning and MAE pre-training, they achieve competitive results, reaching up to 61.3 AP_box on COCO with ViT-H using only ImageNet-1K pre-training.

ABSTRACT

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.

Motivation & Objective

  • Decouple backbone design from detection-specific modules to enable plain ViT backbones to be fine-tuned for detection.
  • Show that a simple, non-hierarchical backbone can support multi-scale detection without traditional FPNs.
  • Demonstrate that minimal adaptations (window attention and a simple feature pyramid) are sufficient for strong performance.
  • Compare plain-backbone detectors with leading hierarchical backbones (Swin, MViT) under fair conditions.
  • Highlight the benefits of MAE pre-training for plain ViT backbones in detection tasks.

Proposed method

  • Use plain ViT backbones (ViT-B/L/H) pre-trained with Masked Autoencoder (MAE) on ImageNet-1K.
  • Build a simple feature pyramid from the last feature map of the plain backbone, enabling multi-scale detection without a hierarchical backbone or the common FPN design.
  • Apply window-based self-attention during fine-tuning with a small number of cross-window propagation blocks (global attention or convolutions).
  • Fine-tune detectors (Mask R-CNN / Cascade Mask R-CNN) on COCO with ImageNet-1K MAE pre-training, using standard detection heads.
  • Compare backbone adaptation strategies (no propagation, convolutional propagation, global-attention propagation, and different placements of the propagation blocks) in terms of accuracy and efficiency.
  • Evaluate across COCO and LVIS datasets, including comparisons with Swin and MViT hierarchical backbones.
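
The simple feature pyramid described above can be illustrated with a minimal NumPy sketch. It is not the ViTDet implementation: the function name `simple_feature_pyramid`, the scale set, and the channel count are assumptions made for illustration, and nearest-neighbour upsampling and max pooling stand in for whatever learned layers a real detector would use. The point is only that every pyramid level is derived from the single last feature map, with no top-down or lateral connections.

```python
import numpy as np

def simple_feature_pyramid(feat, scales=(4.0, 2.0, 1.0, 0.5)):
    """Build multi-scale maps from ONE (C, H, W) feature map.

    Assumption: nearest-neighbour repetition (upsampling) and max pooling
    (downsampling) are stand-ins for learned layers.
    """
    c, h, w = feat.shape
    levels = []
    for s in scales:
        if s >= 1:
            k = int(s)
            # Repeat every spatial position k times along H and W.
            out = feat.repeat(k, axis=1).repeat(k, axis=2)
        else:
            k = int(round(1 / s))
            # Crop so H and W divide evenly, then take the max over k x k cells.
            hh, ww = (h // k) * k, (w // k) * k
            out = feat[:, :hh, :ww].reshape(c, hh // k, k, ww // k, k).max(axis=(2, 4))
        levels.append(out)
    return levels

# Hypothetical input: a 64 x 64 token map with 8 channels, e.g. the output of
# a patch-size-16 backbone on a 1024 x 1024 image.
feat = np.random.default_rng(0).standard_normal((8, 64, 64)).astype(np.float32)
pyramid = simple_feature_pyramid(feat)
shapes = [p.shape for p in pyramid]
print(shapes)
```

Each level is a function of the same input map only, which is what distinguishes this construction from an FPN that merges features from several backbone stages.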

Experimental results

Research questions

  • RQ1: Can a plain, non-hierarchical ViT backbone be effectively fine-tuned for multi-scale object detection without sacrificing detection performance?
  • RQ2: What minimal adaptations to a plain ViT backbone suffice to achieve competitive detection results (feature pyramid, window attention, cross-window propagation)?

Key findings

  • Plain backbones with a simple feature pyramid beat the baseline without a pyramid by up to 3.4 AP on COCO.
  • Window attention with a few cross-window propagation blocks is sufficient for good accuracy on detection tasks.
  • MAE pre-training on IN-1K yields substantial gains for ViT backbones in detection (e.g., +3.1 AP for ViT-B, +4.6 AP for ViT-L).
  • ViTDet with MAE pre-training can achieve competitive results with hierarchical backbones and, for large models, can outperform some hierarchical methods.
  • ViT-H with MAE pre-training reaches 61.3 AP_box on COCO, showing that a plain backbone can deliver strong detection performance.
  • Plain-backbone detectors show favorable scaling and faster wall-clock performance compared to some hierarchical-backbone methods.
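
The window-attention finding can be made concrete with a small NumPy sketch of non-overlapping (unshifted) window partitioning and one possible placement of the cross-window propagation blocks. The helper names, the window size of 16, and the choice of four evenly spaced propagation blocks are assumptions for illustration, not details taken from this review.

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) token map into (num_windows, window*window, C)."""
    h, w, c = x.shape
    assert h % window == 0 and w % window == 0, "sketch assumes even division"
    x = x.reshape(h // window, window, w // window, window, c)
    x = x.transpose(0, 2, 1, 3, 4)  # group the two window-grid axes together
    return x.reshape(-1, window * window, c)

def window_unpartition(windows, window, h, w):
    """Inverse of window_partition: back to an (H, W, C) token map."""
    c = windows.shape[-1]
    x = windows.reshape(h // window, w // window, window, window, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h, w, c)

def propagation_block_indices(depth, num_subsets=4):
    """Assumed placement: the last block of each evenly sized subset."""
    size = depth // num_subsets
    return [size * (i + 1) - 1 for i in range(num_subsets)]

tokens = np.arange(64 * 64 * 2, dtype=np.float32).reshape(64, 64, 2)
windows = window_partition(tokens, window=16)
restored = window_unpartition(windows, 16, 64, 64)

# Number of query-key pairs per attention layer, global vs. windowed.
global_pairs = (64 * 64) ** 2
window_pairs = windows.shape[0] * (16 * 16) ** 2
print(windows.shape, global_pairs // window_pairs)
print(propagation_block_indices(12), propagation_block_indices(24))
```

Under these assumptions, windowed attention evaluates 16 times fewer query-key pairs than global attention on a 64 x 64 token map, while the few propagation blocks let information cross window boundaries.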


This review was created by AI and reviewed by human editors.