QUICK REVIEW

[Paper Review] Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity

Byungseok Roh, JaeWoong Shin|arXiv (Cornell University)|Nov 29, 2021

Advanced Neural Network Applications | References: 28 | Cited by: 89

ONE-SENTENCE SUMMARY

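Sparse DETR sparsifies encoder tokens with a learnable criterion to cut computation, delivering markedly higher speed with comparable or better AP. This includes results on COCO where it uses only 10% of the encoder tokens yet improves on Deformable DETR.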

ABSTRACT

DETR is the first end-to-end object detector using a transformer encoder-decoder architecture and demonstrates competitive performance but low computational efficiency on high resolution feature maps. The subsequent work, Deformable DETR, enhances the efficiency of DETR by replacing dense attention with deformable attention, which achieves 10x faster convergence and improved performance. Deformable DETR uses the multiscale feature to ameliorate performance, however, the number of encoder tokens increases by 20x compared to DETR, and the computation cost of the encoder attention remains a bottleneck. In our preliminary experiment, we observe that the detection performance hardly deteriorates even if only a part of the encoder token is updated. Inspired by this observation, we propose Sparse DETR that selectively updates only the tokens expected to be referenced by the decoder, thus help the model effectively detect objects. In addition, we show that applying an auxiliary detection loss on the selected tokens in the encoder improves the performance while minimizing computational overhead. We validate that Sparse DETR achieves better performance than Deformable DETR even with only 10% encoder tokens on the COCO dataset. Albeit only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR. Code is available at https://github.com/kakaobrain/sparse-detr

RESEARCH MOTIVATION AND OBJECTIVES

Reduce the computational bottleneck of end-to-end DETR-based detectors by sparsifying encoder tokens.
Propose learnable criteria to select salient encoder tokens.
Show that auxiliary encoder losses stabilize training and improve accuracy.
Demonstrate improved efficiency and performance over Deformable DETR on COCO with multi-scale features.

PROPOSED METHOD

Introduce a saliency-based, learnable token sparsification scheme for the encoder.
Use a scoring network to predict the Decoder cross-Attention Map (DAM), which serves as the saliency signal.
Define the sparsified token set as the top-rho fraction of tokens by saliency (rho is the keeping ratio) for encoder updates per layer.
Apply an encoder auxiliary loss to selected tokens to stabilize training and boost performance.
Adopt top-k decoder queries derived from encoder outputs to refine predictions.
Evaluate with Swin-T and ResNet-50 backbones on COCO 2017 val, comparing against DETR, Deformable DETR, PnP-DETR, and Faster R-CNN-FPN.
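
The top-rho selection step above can be illustrated with a minimal sketch. It is not the authors' implementation: `select_top_rho`, `sparse_encoder_layer`, and `update_fn` are hypothetical names, tokens are plain floats rather than feature vectors, and `update_fn` stands in for a real encoder layer (attention plus feed-forward). The saliency scores are assumed to be given, as if produced by the scoring network.

```python
def select_top_rho(scores, rho):
    """Return the sorted indices of the top-rho fraction of tokens by score."""
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must be in (0, 1]")
    # Keep at least one token; round() avoids float artifacts in rho * n.
    keep = max(1, round(rho * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def sparse_encoder_layer(tokens, scores, rho, update_fn):
    """Apply update_fn only to the selected tokens; pass the rest through."""
    selected = set(select_top_rho(scores, rho))
    return [update_fn(t) if i in selected else t for i, t in enumerate(tokens)]


# Toy data: 10 scalar "tokens" with assumed saliency scores.
tokens = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.05, 0.4, 0.15, 0.25, 0.35]

kept = select_top_rho(scores, rho=0.2)  # indices of the 2 highest scores
updated = sparse_encoder_layer(tokens, scores, rho=0.2,
                               update_fn=lambda t: t * 10)
print(kept)
print(updated)
```

With rho = 0.2, only the two highest-scoring tokens pass through `update_fn`; the other eight are copied unchanged, which is where the encoder's compute saving would come from.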

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can encoder token sparsification in DETR-based detectors reduce computation without sacrificing detection accuracy?
RQ2: Do saliency criteria based on objectness and decoder cross-attention (DAM) better identify tokens to update than random or objectness alone?
RQ3: Does an encoder auxiliary loss improve convergence and allow deeper encoder stacks under sparsity?
RQ4: How does Sparse DETR perform with multi-scale features (e.g., Swin-T) compared to Deformable DETR under varied sparsity levels?

KEY FINDINGS

Method	Epochs	Keeping ratio (rho)	Top-k & BBR	AP	AP50	AP75	AP_S	AP_M	AP_L	params	FLOPs	FPS
F-RCNN-FPN †	109	N/A		42.0	62.1	45.5	26.6	45.4	53.4	42M	180G	26
DETR †	500	100%		42.0	62.4	44.2	20.5	45.8	61.1	41M	86G	28
DETR-DC5 †	500	100%		43.3	63.1	45.9	22.5	47.3	61.1	41M	187G	12
PnP-DETR ‡	500	33%		41.1	61.5	43.7	20.8	44.6	60.0	-	-	-
PnP-DETR-DC5 ‡	500	33%		42.7	62.8	45.1	22.4	46.2	60.0	-	-	-
Deformable-DETR	50	100%		43.9	62.8	47.8	26.1	47.4	58.0	40M	173G	19.1
Deformable-DETR	50	100%	✓	46.0	65.2	49.8	28.2	49.1	61.0	41M	177G	18.2
Sparse-DETR	50	10%	✓	45.3	65.8	49.3	28.4	48.3	60.1	41M	105G	25.3
Sparse-DETR	50	20%	✓	45.6	65.8	49.6	28.5	48.6	60.4	41M	113G	24.8
Sparse-DETR	50	30%	✓	46.0	65.9	49.7	29.1	49.1	60.6	41M	121G	23.2
Sparse-DETR	50	40%	✓	46.2	66.0	50.3	28.7	49.0	61.4	41M	128G	21.8
Sparse-DETR	50	50%	✓	46.3	66.0	50.1	29.0	49.5	60.8	41M	136G	20.5
Swin-T DETR	500	100%		45.4	66.2	48.1	22.9	49.5	65.9	45M	92G	26.8
Swin-T Deformable-DETR	50	100%		45.7	65.3	49.9	26.9	49.4	61.2	40M	180G	15.9
Swin-T Deformable-DETR	50	100%	✓	48.0	68.0	52.0	30.3	51.4	63.7	41M	185G	15.4
Swin-T Sparse-DETR	50	10%	✓	48.2	69.2	52.3	29.8	51.2	64.5	41M	113G	21.2
Swin-T Sparse-DETR	50	20%	✓	48.8	69.4	53.0	30.4	51.9	64.8	41M	121G	20.0
Swin-T Sparse-DETR	50	30%	✓	49.1	69.5	53.5	31.4	52.5	65.1	41M	129G	18.9
Swin-T Sparse-DETR	50	40%	✓	49.2	69.5	53.5	31.4	52.9	64.8	41M	136G	18.0
Swin-T Sparse-DETR	50	50%	✓	49.3	69.5	53.3	32.0	52.7	64.9	41M	144G	17.2
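
As a rough cross-check, the relative FLOPs and FPS changes can be recomputed from the table rows. The sketch below copies the FLOPs and FPS values of the Deformable-DETR (Top-k & BBR) rows and the 10% Sparse-DETR rows; `relative_change` and the `rows` dictionary are hypothetical names introduced for illustration. The row-level percentages need not match the abstract's headline 38% and 42% exactly, since those may refer to a different configuration or measurement.

```python
def relative_change(baseline, new):
    """Signed change of `new` relative to `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# (FLOPs in G, FPS) copied from the table rows with Top-k & BBR enabled.
rows = {
    "ResNet-50": {"deformable": (177, 18.2), "sparse_10": (105, 25.3)},
    "Swin-T": {"deformable": (185, 15.4), "sparse_10": (113, 21.2)},
}

for backbone, r in rows.items():
    base_flops, base_fps = r["deformable"]
    flops, fps = r["sparse_10"]
    print(f"{backbone}: "
          f"FLOPs {relative_change(base_flops, flops):+.1f}%, "
          f"FPS {relative_change(base_fps, fps):+.1f}%")
```

The same function can be applied to any other pair of rows, for example to see how the saving shrinks as the keeping ratio grows from 10% to 50%.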

Sparse DETR achieves competitive AP with significantly reduced computation, including 38% lower FLOPs and 42% higher FPS versus Deformable DETR.
Using only 10% of the encoder tokens with DAM-based sparsification, Sparse DETR outperforms many baselines and, with a Swin-T backbone, rivals Deformable DETR with top-k and BBR (48.2 vs. 48.0 AP).
DAM-based token selection consistently outperforms Objectness Score (OS) and random sampling across backbones and sparsity levels.
Encoder auxiliary loss enables deeper encoders (e.g., 12 layers) with stable training and improved detection performance.
Dynamic sparsification during inference maintains robust performance across keeping-ratio settings, outperforming similar dynamic strategies in some baselines.
With Swin-T backbone, Sparse DETR at 10% encoder tokens yields substantial efficiency gains (12-82% token-level reduction) while preserving or improving AP, especially at larger object scales.
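
To make the DAM-based criterion in the findings above more concrete, here is a minimal sketch of how a DAM could be turned into a training target for the scoring network. This review does not spell out the aggregation or the loss, so summing attention weights over decoder queries, binarizing at the top-rho fraction, and using binary cross-entropy are all assumptions here, and every function name is hypothetical.

```python
import math


def aggregate_dam(cross_attn):
    """Sum attention weights over decoder queries: one value per token.

    cross_attn[q][j] is the (assumed) weight query q puts on encoder token j.
    """
    num_tokens = len(cross_attn[0])
    return [sum(row[j] for row in cross_attn) for j in range(num_tokens)]


def binarize_top_rho(dam, rho):
    """Label the top-rho fraction of tokens 1 and all others 0."""
    keep = max(1, round(rho * len(dam)))
    order = sorted(range(len(dam)), key=lambda j: dam[j], reverse=True)
    top = set(order[:keep])
    return [1 if j in top else 0 for j in range(len(dam))]


def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy between predicted saliency and 0/1 labels."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(pred)


# Toy example: 3 decoder queries attending over 4 encoder tokens.
cross_attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
dam = aggregate_dam(cross_attn)
labels = binarize_top_rho(dam, rho=0.5)  # tokens 0 and 3 are kept
pred = [0.9, 0.2, 0.1, 0.6]              # assumed scoring-network outputs
loss = bce_loss(pred, labels)
print(labels, round(loss, 4))
```

In this toy case tokens 0 and 3 receive the most decoder attention, so they become the positive labels that the scoring network would be trained to predict.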

This review was generated by AI and reviewed by human editors.