QUICK REVIEW

[Paper Review] Benchmarking Detection Transfer Learning with Vision Transformers

Yanghao Li, Saining Xie|arXiv (Cornell University)|Nov 22, 2021

Advanced Neural Network Applications | References: 30 | Citations: 75

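One-Sentence Summary

This paper benchmarks five ViT initializations (random, supervised ImageNet, MoCo v3, BEiT, MAE) as the backbone of Mask R-CNN on COCO, and shows that masking-based pre-training yields the largest transfer gains and that these gains grow with model size.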

ABSTRACT

Object detection is a central downstream task used to test if pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. These difficulties (e.g., architectural incompatibility, slow training, high memory consumption, unknown training formulae, etc.) have prevented recent studies from benchmarking detection transfer learning with standard ViT models. In this paper, we present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN. These tools facilitate the primary goal of our study: we compare five ViT initializations, including recent state-of-the-art self-supervised learning methods, supervised initialization, and a strong random initialization baseline. Our results show that recent masking-based unsupervised learning methods may, for the first time, provide convincing transfer learning improvements on COCO, increasing box AP up to 4% (absolute) over supervised and prior self-supervised pre-training methods. Moreover, these masking-based initializations scale better, with the improvement growing as model size increases.

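Motivation and Objectives

Establish a transfer learning evaluation protocol for Vision Transformer backbones in object detection and instance segmentation, using COCO and Mask R-CNN.
Overcome the practical obstacles to using ViT backbones in a standard detection framework.
Systematically compare several initialization methods (random, supervised, MoCo v3, BEiT, MAE) on the detection task.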

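Proposed Method

Adapt the ViT backbone to an FPN-compatible multi-scale feature pyramid through four resolution-modifying modules placed at intervals across the depth of the ViT.
Use windowed self-attention to reduce memory and time, and insert four global attention blocks to preserve information flow across windows.
Strengthen the Mask R-CNN components, including batch normalization after convolutions, a longer training schedule, and LSJ (large-scale jitter) data augmentation, so that both training from scratch and fine-tuning from pre-trained weights are supported.
Use a consistent training formula (LSJ, AdamW, warmup, drop path) together with a hyperparameter tuning protocol focused on learning rate, weight decay, and drop path rate.
Handle absolute and relative position embeddings so that position information is standardized and the comparison across pre-training methods is fair.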
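The first two points above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the even spacing of the global attention blocks, the patch size of 16, the scale factors (4, 2, 1, 0.5), and the toy grid size are all assumptions made for illustration.

```python
import numpy as np


def global_block_indices(depth, num_global=4):
    """0-based indices of the blocks that use global attention.

    Assumption: the global blocks are evenly spaced and the last block is global.
    """
    step = depth // num_global
    return [step * (i + 1) - 1 for i in range(num_global)]


def window_partition(tokens, window):
    """Split an (H, W, C) token grid into non-overlapping windows.

    Returns an array of shape (num_windows, window * window, C), so that
    self-attention can be computed independently inside each window.
    """
    h, w, c = tokens.shape
    assert h % window == 0 and w % window == 0, "grid must be divisible by window"
    x = tokens.reshape(h // window, window, w // window, window, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows of windows, cols of windows, window, window, C)
    return x.reshape(-1, window * window, c)


def pyramid_strides(patch_size=16, scale_factors=(4.0, 2.0, 1.0, 0.5)):
    """Output strides of a feature pyramid built by rescaling a single-scale ViT map.

    Assumption: patch size 16 and scale factors (4, 2, 1, 0.5).
    """
    return [int(patch_size / s) for s in scale_factors]


if __name__ == "__main__":
    print(global_block_indices(12))   # a 12-block ViT, hypothetical placement
    grid = np.zeros((8, 8, 3))        # toy 8x8 token grid with 3 channels
    print(window_partition(grid, 4).shape)
    print(pyramid_strides())
```

With window size w on an N-token grid, each token attends to w*w tokens instead of N, which is where the memory and time savings come from; the few global blocks restore communication between windows.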

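Experimental Results

Research Questions

RQ1: How do different ViT initializations affect COCO object detection and instance segmentation when used as the backbone of Mask R-CNN?
RQ2: Do masking-based pre-training methods (BEiT, MAE) provide transfer gains over supervised pre-training and random initialization, and how do those gains scale with model size?
RQ3: Which memory/time trade-offs and architectural choices make ViT backbones competitive in a detection framework?
RQ4: How does the position encoding scheme affect fine-tuning performance across initialization methods?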

Key Findings

Initialization	Pre-training data	ViT-B APbox	ViT-L APbox	ViT-B APmask	ViT-L APmask
supervised	IN1k w/ labels	47.9	49.3	42.9	43.9
random	none	48.9	50.7	43.6	44.9
MoCo v3	IN1k	47.9	49.3	42.7	44.0
BEiT	IN1k + DALL·E	49.8	53.3	44.4	47.1
MAE	IN1k	50.3	53.3	44.9	47.2

Mask R-CNN with ViT backbones trains smoothly across initialization methods and does not require gradient clipping.
From-scratch training reaches up to 1.4 APbox above supervised ImageNet pre-training (ViT-L: 50.7 vs. 49.3); the gap for ViT-B is smaller.
MoCo v3 underperforms random initialization on APbox and matches supervised initialization.
BEiT and MAE outperform supervised pre-training by up to 2.4 APbox (ViT-B) and up to 4.0 APbox (ViT-L), and they also beat random initialization at both model sizes.
Masking-based pre-training (BEiT, MAE) provides the first convincing COCO transfer gains, and the gains increase as model size grows, unlike supervised pre-training or MoCo v3.
Pre-training speeds up convergence on COCO by roughly 4x compared to random initialization.
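The gains quoted in these findings can be checked against the table with a few lines of Python. This is a small sketch: the APbox values are copied as they appear in the table above, and the helper names `gains_over` and `scaling_gain` are hypothetical.

```python
# APbox values copied from the table above: (ViT-B, ViT-L)
AP_BOX = {
    "supervised": (47.9, 49.3),
    "random": (48.9, 50.7),
    "MoCo v3": (47.9, 49.3),
    "BEiT": (49.8, 53.3),
    "MAE": (50.3, 53.3),
}


def gains_over(baseline, table=AP_BOX):
    """APbox difference of every other initialization relative to `baseline`."""
    base_b, base_l = table[baseline]
    return {
        name: (round(b - base_b, 1), round(l - base_l, 1))
        for name, (b, l) in table.items()
        if name != baseline
    }


def scaling_gain(table=AP_BOX):
    """APbox improvement when moving from ViT-B to ViT-L, per initialization."""
    return {name: round(l - b, 1) for name, (b, l) in table.items()}


if __name__ == "__main__":
    for name, (d_b, d_l) in gains_over("supervised").items():
        print(f"{name:10s} vs supervised: ViT-B {d_b:+.1f}, ViT-L {d_l:+.1f}")
    for name, d in scaling_gain().items():
        print(f"{name:10s} ViT-B -> ViT-L: {d:+.1f}")
```

Relative to supervised pre-training, MAE comes out at +2.4 (ViT-B) and +4.0 (ViT-L), matching the figures above. The ViT-B to ViT-L improvement is 3.0 to 3.5 APbox for the masking-based methods and 1.4 for supervised and MoCo v3, which is the scaling trend the findings describe.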


This review was generated by AI and checked by a human editor.