QUICK REVIEW

[Paper Review] Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity

Byungseok Roh, JaeWoong Shin|arXiv (Cornell University)|Nov 29, 2021

Advanced Neural Network Applications | References: 28 | Citations: 89

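One-Line Summary

Sparse DETR sparsifies encoder tokens using a learnable criterion, which reduces computation and yields a substantial speedup while achieving AP better than or comparable to Deformable DETR, including notable gains on COCO when only 10% of the encoder tokens are used.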

ABSTRACT

DETR is the first end-to-end object detector using a transformer encoder-decoder architecture and demonstrates competitive performance but low computational efficiency on high resolution feature maps. The subsequent work, Deformable DETR, enhances the efficiency of DETR by replacing dense attention with deformable attention, which achieves 10x faster convergence and improved performance. Deformable DETR uses the multiscale feature to ameliorate performance, however, the number of encoder tokens increases by 20x compared to DETR, and the computation cost of the encoder attention remains a bottleneck. In our preliminary experiment, we observe that the detection performance hardly deteriorates even if only a part of the encoder token is updated. Inspired by this observation, we propose Sparse DETR that selectively updates only the tokens expected to be referenced by the decoder, thus help the model effectively detect objects. In addition, we show that applying an auxiliary detection loss on the selected tokens in the encoder improves the performance while minimizing computational overhead. We validate that Sparse DETR achieves better performance than Deformable DETR even with only 10% encoder tokens on the COCO dataset. Albeit only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR. Code is available at https://github.com/kakaobrain/sparse-detr
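
The roughly 20x increase in encoder tokens mentioned in the abstract can be sanity-checked with simple arithmetic. The sketch below is not from the paper: it assumes a single stride-32 feature map for DETR and a four-level pyramid at strides 8, 16, 32, and 64 for Deformable DETR, and it ignores rounding at image borders. The helper name `tokens_per_pixel` is hypothetical.

```python
from fractions import Fraction

def tokens_per_pixel(strides):
    """Encoder tokens per input pixel for feature maps at the given strides.

    A feature map with stride s holds about (H / s) * (W / s) tokens,
    i.e. 1 / s**2 tokens per input pixel (border rounding is ignored).
    """
    return sum(Fraction(1, s * s) for s in strides)

# Assumption: DETR uses one stride-32 feature map.
single_scale = tokens_per_pixel([32])
# Assumption: the multi-scale variant uses strides 8, 16, 32, and 64.
multi_scale = tokens_per_pixel([8, 16, 32, 64])

ratio = multi_scale / single_scale
print(float(ratio))  # 21.25
```

Under these assumptions the multi-scale encoder sees about 21x as many tokens, which is consistent with the "20x" figure quoted above.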

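Motivation and Objectives

Reduce the computational bottleneck of end-to-end DETR-based detectors by sparsifying encoder tokens.
Propose a saliency criterion for selecting salient encoder tokens.
Show that an encoder auxiliary loss stabilizes training and improves accuracy.
Show that, when multi-scale features are used, Sparse DETR improves on Deformable DETR in both efficiency and performance on COCO.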

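Proposed Method

Introduce a learnable, saliency-based token sparsification scheme for the encoder.
Use a scoring network to predict the Decoder cross-Attention Map (DAM) as the saliency signal.
Define the top-rho set of salient tokens to be updated in each encoder layer.
Apply an encoder auxiliary loss to the selected tokens to stabilize training and improve performance.
Adopt top-k decoder queries derived from the encoder output to refine predictions.
Evaluate on COCO 2017 val with Swin-T and ResNet-50 backbones, comparing against DETR, Deformable DETR, PnP-DETR, and Faster R-CNN-FPN.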
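
The token selection step can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the rounding of the kept-token count, and the `update_fn` stand-in for an encoder layer are all assumptions made for illustration. `binarize_dam` shows one plausible way to turn a DAM into 0/1 training targets for a scoring network; the review itself does not specify how that target is built.

```python
def select_salient_tokens(saliency, keeping_ratio):
    """Return the (sorted) indices of the top-rho fraction of tokens by saliency."""
    if not 0.0 < keeping_ratio <= 1.0:
        raise ValueError("keeping_ratio must be in (0, 1]")
    # Assumption: the kept count is rounded to the nearest integer, at least 1.
    num_keep = max(1, round(keeping_ratio * len(saliency)))
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:num_keep])

def binarize_dam(dam, keeping_ratio):
    """Hypothetical target builder: 1 for top-rho tokens by DAM value, else 0."""
    keep = set(select_salient_tokens(dam, keeping_ratio))
    return [1 if i in keep else 0 for i in range(len(dam))]

def sparse_encoder_layer(tokens, saliency, keeping_ratio, update_fn):
    """Update only the selected tokens; all other tokens pass through unchanged.

    update_fn(token, all_tokens) is a stand-in for the real layer: the selected
    token acts as the query and may read from every token.
    """
    selected = select_salient_tokens(saliency, keeping_ratio)
    updated = list(tokens)
    for i in selected:
        updated[i] = update_fn(tokens[i], tokens)
    return updated, selected

# Toy example: 10 scalar "tokens", keep 30% of them.
tokens = [float(i) for i in range(10)]
saliency = [0.1, 0.9, 0.2, 0.8, 0.05, 0.3, 0.7, 0.0, 0.4, 0.6]
updated, selected = sparse_encoder_layer(
    tokens, saliency, 0.3, lambda tok, all_tokens: tok + 100.0
)
print(selected)  # [1, 3, 6]
```

Only the three highest-scoring tokens are rewritten here, so the per-layer update cost scales with the keeping ratio rather than with the full token count. That is the source of the savings the review describes.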

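Experimental Results

Research Questions

RQ1: Can sparsifying encoder tokens in DETR-based detectors reduce computation without sacrificing detection accuracy?
RQ2: Does a saliency criterion based on objectness and decoder cross-attention (DAM) identify which tokens to update better than random selection or objectness alone?
RQ3: Does the encoder auxiliary loss improve convergence under sparsity and enable deeper encoder stacks?
RQ4: With multi-scale features and backbones such as Swin-T, how does Sparse DETR compare with Deformable DETR across sparsity levels?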

Key Findings

Method	Epochs	Keeping ratio (rho)	Top-k & BBR	AP	AP50	AP75	AP_S	AP_M	AP_L	params	FLOPs	FPS
F-RCNN-FPN †	109	N/A		42.0	62.1	45.5	26.6	45.4	53.4	42M	180G	26
DETR †	500	100%		42.0	62.4	44.2	20.5	45.8	61.1	41M	86G	28
DETR-DC5 †	500	100%		43.3	63.1	45.9	22.5	47.3	61.1	41M	187G	12
PnP-DETR ‡	500	33%		41.1	61.5	43.7	20.8	44.6	60.0	-	-	-
PnP-DETR-DC5 ‡	500	33%		42.7	62.8	45.1	22.4	46.2	60.0	-	-	-
Deformable-DETR	50	100%		43.9	62.8	47.8	26.1	47.4	58.0	40M	173G	19.1
Deformable-DETR	50	100%	✓	46.0	65.2	49.8	28.2	49.1	61.0	41M	177G	18.2
Sparse-DETR	50	10%	✓	45.3	65.8	49.3	28.4	48.3	60.1	41M	105G	25.3
Sparse-DETR	50	20%	✓	45.6	65.8	49.6	28.5	48.6	60.4	41M	113G	24.8
Sparse-DETR	50	30%	✓	46.0	65.9	49.7	29.1	49.1	60.6	41M	121G	23.2
Sparse-DETR	50	40%	✓	46.2	66.0	50.3	28.7	49.0	61.4	41M	128G	21.8
Sparse-DETR	50	50%	✓	46.3	66.0	50.1	29.0	49.5	60.8	41M	136G	20.5
Swin-T DETR	500	100%		45.4	66.2	48.1	22.9	49.5	65.9	45M	92G	26.8
Swin-T Deformable-DETR	50	100%		45.7	65.3	49.9	26.9	49.4	61.2	40M	180G	15.9
Swin-T Deformable-DETR	50	100%	✓	48.0	68.0	52.0	30.3	51.4	63.7	41M	185G	15.4
Swin-T Sparse-DETR	50	10%	✓	48.2	69.2	52.3	29.8	51.2	64.5	41M	113G	21.2
Swin-T Sparse-DETR	50	20%	✓	48.8	69.4	53.0	30.4	51.9	64.8	41M	121G	20.0
Swin-T Sparse-DETR	50	30%	✓	49.1	69.5	53.5	31.4	52.5	65.1	41M	129G	18.9
Swin-T Sparse-DETR	50	40%	✓	49.2	69.5	53.5	31.4	52.9	64.8	41M	136G	18.0
Swin-T Sparse-DETR	50	50%	✓	49.3	69.5	53.3	32.0	52.7	64.9	41M	144G	17.2

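Sparse DETR achieves competitive AP while substantially cutting computation: compared with Deformable DETR, FLOPs drop by 38% and FPS rises by 42%.
Even when DAM-based sparsification keeps only 10% of the encoder tokens, Sparse DETR outperforms many baselines and, on the Swin-T backbone, matches Deformable DETR+ (the variant with top-k and BBR; 48.2 vs. 48.0 AP in the table).
DAM-based token selection consistently outperforms Objectness Score (OS) and random sampling across backbones and sparsity levels.
The encoder auxiliary loss enables stable training and better detection performance even with deeper encoders (e.g., 12 layers).
Dynamic sparsification at inference time keeps performance robust across keeping-ratio settings and outperforms similar dynamic strategies in several baselines.
With the Swin-T backbone, Sparse DETR using 10% of the encoder tokens achieves substantial efficiency gains (12-82% token reduction) while maintaining or improving AP, most notably at large object scales.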
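
The efficiency figures can be cross-checked against the table with a short calculation. The sketch below uses only the FLOPs and FPS columns of the rows marked with top-k and BBR; the helper `relative_change` and the dictionary layout are illustrative. The review does not say which pair of rows the headline 38% and 42% figures refer to, so the numbers computed here fall in the same range without matching them exactly.

```python
def relative_change(baseline, value):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0

# (FLOPs in G, FPS) copied from the table rows with top-k & BBR enabled.
table = {
    "ResNet-50": {"deformable": (177, 18.2), "sparse_10": (105, 25.3)},
    "Swin-T": {"deformable": (185, 15.4), "sparse_10": (113, 21.2)},
}

summary = {}
for backbone, rows in table.items():
    base_flops, base_fps = rows["deformable"]
    flops, fps = rows["sparse_10"]
    summary[backbone] = (
        round(relative_change(base_flops, flops), 1),
        round(relative_change(base_fps, fps), 1),
    )
    print(backbone, summary[backbone])
# ResNet-50 (-40.7, 39.0)
# Swin-T (-38.9, 37.7)
```

At a 10% keeping ratio, these rows show roughly 39 to 41% fewer FLOPs and 38 to 39% higher FPS than the corresponding Deformable DETR configurations.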

This review was generated by AI and checked by a human editor.