QUICK REVIEW

[Paper Review] ViDT: An Efficient and Effective Fully Transformer-based Object Detector

Hwanjun Song, Sun Deqing|arXiv (Cornell University)|Oct 8, 2021

Advanced Neural Network Applications | References: 20 | Citations: 46

One-sentence summary

ViDT is a fully Transformer-based object detector that reconfigures Swin Transformer with a Reconfigured Attention Module (RAM), uses an encoder-free neck, and adds knowledge distillation by token matching, achieving strong AP and favorable latency on COCO.

ABSTRACT

Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential to boost the detection performance without much increase in computational load. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and achieves 49.2AP owing to its high scalability for large models. We will release the code and trained models at https://github.com/naver-ai/vidt

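Motivation and objectives

Integrate vision transformers and detection transformers to build a fully end-to-end detector that has no heavy neck encoder.
Develop RAM so that ViT-style backbones (e.g., Swin) can serve as standalone detectors with multi-scale features.
Reduce computational overhead with an encoder-free neck while leveraging auxiliary decoding loss and iterative box refinement.
Improve efficiency through knowledge distillation by token matching between large and small ViDT models.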
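
Iterative box refinement and the auxiliary decoding loss named in the objectives above can be illustrated with a minimal sketch. It assumes the formulation commonly used in Deformable DETR-style decoders, where each decoder layer predicts an offset that is added to the previous layer's box in logit (inverse-sigmoid) space; this summary does not spell out ViDT's exact formulation. Real training also involves Hungarian matching plus classification and GIoU terms, which are omitted here. All function names and the `eps` value are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inverse_sigmoid(p, eps=1e-5):
    # Clip to avoid log(0); eps is an illustrative choice.
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def refine_boxes(initial_boxes, layer_deltas):
    """Apply one predicted offset per decoder layer in logit space.

    Boxes are normalized (cx, cy, w, h) in (0, 1). Returns the boxes
    produced after every layer, so each layer can be supervised.
    """
    boxes = initial_boxes
    per_layer = []
    for delta in layer_deltas:
        boxes = sigmoid(inverse_sigmoid(boxes) + delta)
        per_layer.append(boxes)
    return per_layer

def auxiliary_box_loss(per_layer_boxes, target_boxes):
    """Sum a plain L1 box loss over all decoder layers (assumed simplification)."""
    return sum(float(np.abs(b - target_boxes).mean()) for b in per_layer_boxes)

# Toy usage: one query, two decoder layers with made-up offsets.
initial = np.array([[0.5, 0.5, 0.2, 0.2]])
deltas = [np.array([[0.4, -0.2, 0.1, 0.0]]),
          np.array([[0.1, 0.0, 0.0, 0.1]])]
outputs = refine_boxes(initial, deltas)
target = np.array([[0.6, 0.45, 0.22, 0.22]])
print(auxiliary_box_loss(outputs, target))
```

Working in logit space keeps every refined box inside (0, 1) regardless of the offset magnitude, and supervising every layer's output is what makes the loss "auxiliary".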

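Proposed method

Introduce the Reconfigured Attention Module (RAM), which decomposes global attention into PATCH×PATCH, DET×DET, and DET×PATCH attention while reusing Swin's parameters.
Adopt an encoder-free neck consisting of a deformable transformer decoder that fuses multi-scale features without a heavy neck encoder.
Apply auxiliary decoding loss and iterative box refinement to improve training convergence and prediction quality.
Implement knowledge distillation by token matching between teacher and student models to transfer representational knowledge.
Use selective cross-attention, enabling DET×PATCH attention only in the last Swin stage, to reduce its computational cost.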
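
The attention decomposition and the selective cross-attention listed above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: it assumes a single attention head, replaces Swin's windowed PATCH×PATCH attention with plain global attention over the patch tokens, and leaves out residual connections, normalization, positional encodings and the feed-forward block. The name `ram_layer` and its `cross_attention` flag are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def ram_layer(patch, det, w_q, w_k, w_v, cross_attention):
    """One simplified RAM-style layer (illustrative only).

    patch: (P, C) patch tokens, det: (D, C) detection tokens.
    The same projection weights serve both token types, mirroring the
    idea of reusing the backbone's parameters.
    """
    pq, pk, pv = patch @ w_q, patch @ w_k, patch @ w_v
    dq, dk, dv = det @ w_q, det @ w_k, det @ w_v

    # PATCH x PATCH: patch tokens never attend to detection tokens.
    new_patch = attention(pq, pk, pv)

    if cross_attention:
        # DET x DET and DET x PATCH: detection queries see both token types.
        keys = np.concatenate([dk, pk], axis=0)
        values = np.concatenate([dv, pv], axis=0)
    else:
        # DET x DET only (cross-attention disabled in this stage).
        keys, values = dk, dv
    new_det = attention(dq, keys, values)
    return new_patch, new_det

# Toy usage: 16 patch tokens, 4 detection tokens, 8 channels.
rng = np.random.default_rng(0)
patch = rng.normal(size=(16, 8))
det = rng.normal(size=(4, 8))
# Weights scaled so attention scores stay moderate.
w_q, w_k, w_v = (rng.normal(size=(8, 8)) / np.sqrt(8) for _ in range(3))
new_patch, new_det = ram_layer(patch, det, w_q, w_k, w_v, cross_attention=True)
print(new_patch.shape, new_det.shape)
```

In this sketch the DET×PATCH score matrix has D × P entries per layer, so switching it on only in the last stage, where P is smallest, is one way to read the cost reduction described above.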

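Experimental results

Research questions

RQ1: Can a fully Transformer-based detector achieve competitive AP and latency on COCO by reconfiguring attention and removing the neck encoder?
RQ2: Does RAM effectively integrate DETR-style decoding with a Swin-style backbone while preserving scalability and speed?
RQ3: How much do auxiliary decoding loss, iterative box refinement, and token-matching distillation affect detection performance?

Key findings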

Method	Backbone	Epochs	AP	AP50	AP75	APS	APM	APL	Params	FPS (batch=1)
DETR	DeiT-tiny	50	30.0	49.2	30.5	9.9	30.8	50.6	24M	10.9 (13.1)
DETR	DeiT-small	50	32.4	52.5	33.2	11.3	33.5	53.7	39M	7.8 (8.8)
DETR	DeiT-base	50	37.1	59.2	38.4	14.7	39.4	52.9	0.1B	4.3 (4.9)
DETR	Swin-nano	50	27.8	47.5	27.4	9.0	29.2	44.9	24M	24.7 (46.1)
DETR	Swin-tiny	50	34.1	55.1	35.3	12.7	35.9	54.2	45M	19.3 (28.1)
DETR	Swin-small	50	37.6	59.0	39.0	15.9	40.1	58.9	66M	13.5 (17.7)
DETR	Swin-base	50	40.7	62.9	42.7	18.3	44.1	62.4	0.1B	9.7 (12.6)
Deformable DETR	DeiT-tiny	50	40.8	60.1	43.6	21.4	43.4	58.2	18M	12.4 (16.3)
Deformable DETR	DeiT-small	50	43.6	63.7	46.5	23.3	47.1	62.1	35M	8.5 (10.2)
Deformable DETR	DeiT-base	50	46.4	67.3	49.4	26.7	50.1	65.4	0.1B	4.4 (5.3)
Deformable DETR	Swin-nano	50	43.1	61.4	46.3	25.9	45.2	59.4	18M	7.0 (7.8)
Deformable DETR	Swin-tiny	50	47.0	66.8	50.8	28.1	49.8	63.9	39M	6.3 (7.0)
Deformable DETR	Swin-small	50	49.0	68.9	52.9	30.3	52.8	66.6	60M	5.5 (6.1)
Deformable DETR	Swin-base	50	51.4	71.7	56.2	34.5	55.1	67.5	0.1B	4.8 (5.4)
YOLOS	DeiT-tiny	150	30.4	48.6	31.1	12.4	31.8	48.2	6M	28.1 (31.3)
YOLOS	DeiT-small	150	36.1	55.7	37.6	15.6	38.4	55.3	30M	9.3 (11.8)
YOLOS	DeiT-base	150	42.0	62.2	44.5	19.5	45.3	62.1	0.1B	3.9 (5.4)
ViDT (w.o. Neck)	Swin-nano	150	28.7	48.6	28.5	12.3	30.7	44.1	7M	36.5 (64.4)
ViDT (w.o. Neck)	Swin-tiny	150	36.3	56.3	37.8	16.4	39.0	54.3	29M	28.6 (32.1)
ViDT (w.o. Neck)	Swin-small	150	41.6	62.7	43.9	20.1	45.4	59.8	52M	16.8 (18.8)
ViDT (w.o. Neck)	Swin-base	150	43.2	64.2	45.9	21.9	46.9	63.2	91M	11.5 (12.5)
ViDT	Swin-nano	50	40.4	59.6	43.3	23.2	42.5	55.8	16M	20.0 (45.8)
ViDT	Swin-tiny	50	44.8	64.5	48.7	25.9	47.6	62.1	38M	17.2 (26.5)
ViDT	Swin-small	50	47.5	67.7	51.4	29.2	50.7	64.8	61M	12.1 (16.5)
ViDT	Swin-base	50	49.2	69.4	53.1	30.6	52.6	66.9	0.1B	9.0 (11.6)
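
One way to read the AP versus FPS trade-off in the table above is to list the rows that no other row beats on both metrics at once (the Pareto front). The sketch below copies AP and the first FPS value of each cell from the table and ignores the parenthesized FPS value. Note that the rows were trained for different numbers of epochs (50 or 150), so this is only a rough reading aid. The helper `pareto_front` is a hypothetical name.

```python
# (method, backbone, AP, FPS) copied from the table above.
ROWS = [
    ("DETR", "DeiT-tiny", 30.0, 10.9),
    ("DETR", "DeiT-small", 32.4, 7.8),
    ("DETR", "DeiT-base", 37.1, 4.3),
    ("DETR", "Swin-nano", 27.8, 24.7),
    ("DETR", "Swin-tiny", 34.1, 19.3),
    ("DETR", "Swin-small", 37.6, 13.5),
    ("DETR", "Swin-base", 40.7, 9.7),
    ("Deformable DETR", "DeiT-tiny", 40.8, 12.4),
    ("Deformable DETR", "DeiT-small", 43.6, 8.5),
    ("Deformable DETR", "DeiT-base", 46.4, 4.4),
    ("Deformable DETR", "Swin-nano", 43.1, 7.0),
    ("Deformable DETR", "Swin-tiny", 47.0, 6.3),
    ("Deformable DETR", "Swin-small", 49.0, 5.5),
    ("Deformable DETR", "Swin-base", 51.4, 4.8),
    ("YOLOS", "DeiT-tiny", 30.4, 28.1),
    ("YOLOS", "DeiT-small", 36.1, 9.3),
    ("YOLOS", "DeiT-base", 42.0, 3.9),
    ("ViDT (w.o. Neck)", "Swin-nano", 28.7, 36.5),
    ("ViDT (w.o. Neck)", "Swin-tiny", 36.3, 28.6),
    ("ViDT (w.o. Neck)", "Swin-small", 41.6, 16.8),
    ("ViDT (w.o. Neck)", "Swin-base", 43.2, 11.5),
    ("ViDT", "Swin-nano", 40.4, 20.0),
    ("ViDT", "Swin-tiny", 44.8, 17.2),
    ("ViDT", "Swin-small", 47.5, 12.1),
    ("ViDT", "Swin-base", 49.2, 9.0),
]

def pareto_front(rows):
    """Return rows not dominated in (AP, FPS), fastest first."""
    front = []
    for r in rows:
        dominated = any(
            o[2] >= r[2] and o[3] >= r[3] and (o[2] > r[2] or o[3] > r[3])
            for o in rows
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r[3], reverse=True)

front = pareto_front(ROWS)
for method, backbone, ap, fps in front:
    print(f"{method:18s} {backbone:11s} AP={ap:4.1f} FPS={fps:4.1f}")
```

With these numbers, the front consists of the neck-free ViDT with Swin-nano and Swin-tiny, all four ViDT models with the neck, and Deformable DETR with Swin-base, which has the highest AP (51.4) at roughly half the FPS of ViDT Swin-base (4.8 versus 9.0).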

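With RAM and the encoder-free neck, ViDT achieves the best AP/FPS trade-off among fully Transformer-based detectors on COCO.
ViDT scales well to large ViT backbones (e.g., Swin-base), reaching high AP at comparatively low latency (e.g., 49.2 AP with Swin-base at 0.1B parameters).
DET×PATCH cross-attention is most effective when enabled only in the last Swin stage, which balances AP and FPS.
Auxiliary decoding loss and iterative box refinement improve DETR-style detectors and are especially beneficial in combination with the neck decoder, but in the neck-free variant they bring no benefit or even hurt.
Knowledge distillation by token matching (teacher-student ViDT) yields AP gains for small models, and larger teacher models provide clearer benefits.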
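
Token-matching distillation, as described above, can be sketched as a distance between student and teacher tokens paired by index. The sketch assumes that both models emit the same number of patch and detection tokens with the same channel width, and it uses a mean L2 distance per token set. The function name and the `weight` coefficient are illustrative assumptions, not values taken from this summary.

```python
import numpy as np

def token_matching_loss(student_patch, teacher_patch,
                        student_det, teacher_det, weight=1.0):
    """Mean L2 distance between index-matched tokens (assumed form).

    *_patch: (P, C) patch tokens, *_det: (D, C) detection tokens.
    The teacher tokens are treated as fixed targets.
    """
    patch_term = np.linalg.norm(student_patch - teacher_patch, axis=-1).mean()
    det_term = np.linalg.norm(student_det - teacher_det, axis=-1).mean()
    return weight * (patch_term + det_term)

# Toy usage: two patch tokens and one detection token, two channels each.
s_patch = np.zeros((2, 2))
t_patch = np.array([[3.0, 4.0], [0.0, 0.0]])
s_det = np.zeros((1, 2))
t_det = np.zeros((1, 2))
loss = token_matching_loss(s_patch, t_patch, s_det, t_det)
print(loss)  # 2.5: patch distances are 5 and 0, detection distance is 0
```

In training, a term like this would be added to the student's detection loss, so the student is pulled toward the teacher's token representations as well as the ground-truth boxes.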

This review was generated by AI and checked by a human editor.