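[Paper Review] Benchmarking Detection Transfer Learning with Vision Transformers
This paper benchmarks five ViT initializations (random, supervised ImageNet, MoCo v3, BEiT, MAE) as backbones for Mask R-CNN on COCO, and shows that masking-based pre-training delivers the strongest transfer gains, with the improvement growing as model size increases.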
Object detection is a central downstream task used to test if pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. These difficulties (e.g., architectural incompatibility, slow training, high memory consumption, unknown training formulae, etc.) have prevented recent studies from benchmarking detection transfer learning with standard ViT models. In this paper, we present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN. These tools facilitate the primary goal of our study: we compare five ViT initializations, including recent state-of-the-art self-supervised learning methods, supervised initialization, and a strong random initialization baseline. Our results show that recent masking-based unsupervised learning methods may, for the first time, provide convincing transfer learning improvements on COCO, increasing box AP up to 4% (absolute) over supervised and prior self-supervised pre-training methods. Moreover, these masking-based initializations scale better, with the improvement growing as model size increases.
Research Motivation and Goals
- Establish a transfer learning evaluation protocol for Vision Transformer backbones in object detection/instance segmentation using COCO and Mask R-CNN.
- Overcome practical challenges to enable ViT backbones with standard detection frameworks.
- Systematically compare multiple initialization methods (random, supervised, MoCo v3, BEiT, MAE) on detection tasks.
Proposed Method
- Adapt ViT backbones to Mask R-CNN with an FPN-compatible multi-scale feature pyramid via four resolution-modifying modules placed across the ViT depth.
- Employ windowed self-attention to reduce memory/time, with four global attention blocks interleaved to preserve cross-window information.
- Upgrade Mask R-CNN components (batch normalization (BN) after convolutions, longer training schedules, and large-scale jitter (LSJ) data augmentation) so that the same recipe supports both from-scratch training and fine-tuning from pre-trained weights.
- Use a consistent training formula (LSJ, AdamW, warmup, drop path) and a hyperparameter tuning protocol focusing on learning rate, weight decay, and drop path.
- Standardize positional information by handling absolute and relative position embeddings to ensure fair comparisons across pre-training methods.
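The attention layout and pyramid strides described in the list above can be sketched with a few lines of arithmetic. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the patch size of 16, window size of 14, 64x64 token grid, and scale factors (4, 2, 1, 0.5) are assumptions that this summary does not state.

```python
import math


def global_block_indices(depth, num_global=4):
    """0-based indices of the blocks that use global attention.

    Assumption: the depth is split into `num_global` equal subsets and the
    last block of each subset attends globally; all other blocks use windows.
    """
    if depth % num_global != 0:
        raise ValueError("depth must be divisible by num_global")
    step = depth // num_global
    return [step * (i + 1) - 1 for i in range(num_global)]


def attention_pairs(grid_h, grid_w, window=None):
    """Count query-key pairs for one self-attention layer on a token grid.

    window=None means global attention. Otherwise the grid is padded up to a
    multiple of `window` and tokens attend only within their own window.
    """
    if window is None:
        n_tokens = grid_h * grid_w
        return n_tokens * n_tokens
    n_windows = math.ceil(grid_h / window) * math.ceil(grid_w / window)
    tokens_per_window = window * window
    return n_windows * tokens_per_window * tokens_per_window


def pyramid_strides(patch_size=16, scale_factors=(4.0, 2.0, 1.0, 0.5)):
    """Output stride of each pyramid level after rescaling the ViT feature map.

    Assumption: each of the four resolution-modifying modules rescales the
    stride-`patch_size` feature map by one of `scale_factors`.
    """
    return [int(patch_size / s) for s in scale_factors]


if __name__ == "__main__":
    print("ViT-B (12 blocks) global blocks:", global_block_indices(12))
    print("ViT-L (24 blocks) global blocks:", global_block_indices(24))
    grid = 64  # assumed: 1024-pixel side / patch size 16
    full = attention_pairs(grid, grid)
    windowed = attention_pairs(grid, grid, window=14)
    print("global pairs:", full, "windowed pairs:", windowed)
    print("pyramid strides:", pyramid_strides())
```

Under these assumed sizes, a windowed block touches far fewer query-key pairs than a global one, which is why only four blocks are left global to carry information across windows.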
Experimental Results
Research Questions
- RQ1: How do different ViT initializations affect COCO object detection and instance segmentation when used as backbones in Mask R-CNN?
- RQ2: Do masking-based pre-training methods (BEiT, MAE) provide transfer learning gains over supervised pre-training and random initialization, and how do these gains scale with model size?
- RQ3: What are the memory/time trade-offs and architectural choices that enable ViT backbones to perform competitively in detection frameworks?
- RQ4: How do positional encoding schemes influence fine-tuning performance across initialization methods?
Key Results
| Initialization | Pre-training data | ViT-B APbox | ViT-L APbox | ViT-B APmask | ViT-L APmask |
|---|---|---|---|---|---|
| supervised | IN1k w/ labels | 47.9 | 49.3 | 42.9 | 43.9 |
| random | none | 48.9 | 50.7 | 43.6 | 44.9 |
| MoCo v3 | IN1k | 47.9 | 49.3 | 42.7 | 44.0 |
| BEiT | IN1k + DALL·E | 49.8 | 53.3 | 44.4 | 47.1 |
| MAE | IN1k | 50.3 | 53.3 | 44.9 | 47.2 |
- Mask R-CNN with ViT backbones trains smoothly across initialization methods and does not require gradient clipping.
- From-scratch training (random initialization) beats supervised IN1k pre-training by 1.0 APbox for ViT-B (48.9 vs. 47.9) and by 1.4 APbox for ViT-L (50.7 vs. 49.3).
- MoCo v3 underperforms random initialization on APbox and matches supervised initialization.
- BEiT and MAE outperform both baselines. Over supervised pre-training the gain reaches 2.4 APbox (ViT-B) and 4.0 APbox (ViT-L); over random initialization it reaches 1.4 APbox (ViT-B) and 2.6 APbox (ViT-L).
- Masking-based pre-training (BEiT, MAE) provides the first convincing COCO transfer gains, and the gains increase with model size, unlike those of supervised or MoCo v3 pre-training.
- Pre-training accelerates convergence on COCO by ~4x compared to random initialization, with masking-based methods offering the largest gains in scaling.
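The APbox differences quoted in this section follow directly from the results table. As a quick check, the sketch below recomputes them; the numbers are copied from the table as given here, and the `AP_BOX` dictionary and `gains_over` helper are hypothetical names introduced only for this illustration.

```python
# APbox values copied from the results table above: (ViT-B, ViT-L).
AP_BOX = {
    "supervised": (47.9, 49.3),
    "random": (48.9, 50.7),
    "MoCo v3": (47.9, 49.3),
    "BEiT": (49.8, 53.3),
    "MAE": (50.3, 53.3),
}


def gains_over(baseline, table=AP_BOX):
    """Return {initialization: (ViT-B delta, ViT-L delta)} relative to `baseline`.

    Deltas are rounded to 0.1 AP, the precision of the table, to hide
    floating-point noise from the subtraction.
    """
    base_b, base_l = table[baseline]
    return {
        name: (round(b - base_b, 1), round(l - base_l, 1))
        for name, (b, l) in table.items()
        if name != baseline
    }


if __name__ == "__main__":
    for baseline in ("supervised", "random"):
        print(f"APbox gain over {baseline} initialization:")
        for name, (d_b, d_l) in gains_over(baseline).items():
            print(f"  {name:<10} ViT-B {d_b:+.1f}   ViT-L {d_l:+.1f}")
```

For the masking-based methods, the ViT-L delta is larger than the ViT-B delta against either baseline, which is the scaling trend the review highlights.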
This review was generated by AI and reviewed by a human editor.