[Paper Review] Exploring Plain Vision Transformer Backbones for Object Detection
The paper investigates plain (non-hierarchical) ViT backbones for object detection. It shows that, with MAE pre-training and only minimal adaptations during fine-tuning, they achieve competitive results, including 61.3 AP_box on COCO with ViT-H using only ImageNet-1K pre-training.
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
Motivation & Objective
- Decouple backbone design from detection-specific modules to enable plain ViT backbones to be fine-tuned for detection.
- Show that a simple, non-hierarchical backbone can support multi-scale detection without traditional FPNs.
- Demonstrate minimal adaptations—window attention and a simple feature pyramid—are sufficient for strong performance.
- Compare plain-backbone detectors with leading hierarchical backbones (Swin, MViT) under fair conditions.
- Highlight the benefits of MAE pre-training for plain ViT backbones in detection tasks.
Proposed method
- Use plain ViT backbones (ViT-B/L/H) pre-trained with Masked Autoencoder (MAE) on ImageNet-1K.
- Build a simple feature pyramid from only the last feature map of the plain backbone, enabling multi-scale detection without a hierarchical backbone or the common FPN design.
- Apply window-based self-attention during fine-tuning with a small number of cross-window propagation blocks (global attention or convolutions).
- Fine-tune detectors (Mask R-CNN / Cascade Mask R-CNN) on COCO with ImageNet-1K MAE pre-training, using standard detection heads.
- Compare backbone adaptation strategies (no propagation, convolutional propagation, global propagation) and different placements of the propagation blocks, in terms of both accuracy and efficiency.
- Evaluate across COCO and LVIS datasets, including comparisons with Swin and MViT hierarchical backbones.
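To make the simple feature pyramid concrete, here is a minimal shape-only sketch, not the authors' implementation. It assumes a patch size of 16 and four pyramid levels at strides 4, 8, 16 and 32, as described in the paper. The function name `simple_pyramid_shapes` and its parameters are hypothetical, and the learned convolution and deconvolution layers that would produce each level are omitted.

```python
from typing import Dict, Sequence, Tuple


def simple_pyramid_shapes(
    image_hw: Tuple[int, int],
    patch_size: int = 16,                       # assumed ViT patch size
    strides: Sequence[int] = (4, 8, 16, 32),    # assumed pyramid strides
) -> Dict[int, Tuple[int, int]]:
    """Return {stride: (height, width)} for every pyramid level.

    All levels are derived from the single last feature map of the plain
    backbone (stride == patch_size); no intermediate backbone stages are used.
    """
    h, w = image_hw
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be a multiple of the patch size")

    # Spatial size of the only feature map the plain backbone produces.
    base_h, base_w = h // patch_size, w // patch_size

    levels = {}
    for stride in strides:
        # stride < patch_size: upsample the last map (e.g. by deconvolution)
        # stride == patch_size: use the last map as-is
        # stride > patch_size: downsample the last map (e.g. strided conv)
        levels[stride] = (base_h * patch_size // stride,
                          base_w * patch_size // stride)
    return levels


if __name__ == "__main__":
    for stride, shape in simple_pyramid_shapes((1024, 1024)).items():
        print(f"stride {stride:>2}: {shape}")
```

For a 1024×1024 input, the single stride-16 map is 64×64, and the sketch lists 256×256, 128×128, 64×64 and 32×32 maps, all computed from that one map rather than from different backbone stages.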
Experimental results
Research questions
- RQ1: Can a plain, non-hierarchical ViT backbone be effectively fine-tuned for multi-scale object detection without sacrificing detection performance?
- RQ2: What minimal adaptations to a plain ViT backbone suffice to achieve competitive detection results (feature pyramid, window attention, cross-window propagation)?
Key findings
- Plain backbones with a simple feature pyramid beat the baseline without a pyramid by up to 3.4 AP on COCO.
- Window attention with a few cross-window propagation blocks is sufficient for good accuracy on detection tasks.
- MAE pre-training on IN-1K yields substantial gains for ViT backbones in detection (e.g., +3.1 AP for ViT-B, +4.6 AP for ViT-L).
- ViTDet with MAE pre-training can achieve competitive results with hierarchical backbones and, for large models, can outperform some hierarchical methods.
- ViT-H with MAE pre-training reaches 61.3 AP_box on COCO using only ImageNet-1K pre-training, competitive with previous leading detectors built on hierarchical backbones.
- Plain-backbone detectors scale favorably and run faster in wall-clock time than some hierarchical-backbone methods.
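The window-attention finding can be illustrated with a rough count of query-key pairs per attention layer. The sketch below is an illustration under stated assumptions, not the paper's code. It assumes a 1024×1024 input with 16×16 patches (a 64×64 token grid), a 14×14 window, padding of the token grid up to a multiple of the window size, and global attention in the last block of each of four equal subsets of blocks. All helper names are hypothetical.

```python
import math
from typing import List, Optional


def global_block_indexes(depth: int, num_subsets: int = 4) -> List[int]:
    """Indexes (0-based) of blocks that use global attention.

    Assumption: the backbone is split evenly into `num_subsets` groups and
    only the last block of each group attends across windows.
    """
    if depth % num_subsets:
        raise ValueError("depth must be divisible by num_subsets")
    step = depth // num_subsets
    return [step * (i + 1) - 1 for i in range(num_subsets)]


def attention_pairs(tokens_h: int, tokens_w: int,
                    window: Optional[int] = None) -> int:
    """Number of query-key pairs in one attention layer."""
    if window is None:
        n = tokens_h * tokens_w
        return n * n                      # every token attends to every token
    # Pad the grid so it tiles exactly into window x window blocks.
    num_windows = math.ceil(tokens_h / window) * math.ceil(tokens_w / window)
    return num_windows * (window * window) ** 2


def backbone_pairs(depth: int, tokens_h: int, tokens_w: int,
                   window: int) -> int:
    """Total pairs for a backbone mixing window and global attention blocks."""
    global_blocks = set(global_block_indexes(depth))
    total = 0
    for block in range(depth):
        use_window = None if block in global_blocks else window
        total += attention_pairs(tokens_h, tokens_w, use_window)
    return total


if __name__ == "__main__":
    depth, grid, window = 12, 64, 14      # ViT-B depth; assumed grid and window
    print("global blocks:", global_block_indexes(depth))
    print("all-global pairs:", depth * attention_pairs(grid, grid))
    print("mixed pairs:     ", backbone_pairs(depth, grid, grid, window))
```

Under these assumptions, a windowed block touches far fewer query-key pairs than a global block, so keeping only four global blocks removes most of the attention cost while still letting information propagate across windows.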
This review was created by AI and reviewed by human editors.