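[Paper Explained] Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity
Sparse DETR sparsifies encoder tokens using a learnable criterion to reduce computation, achieving comparable or even better AP at substantially higher speed. On COCO, it outperforms Deformable DETR while using only 10% of the encoder tokens.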
DETR is the first end-to-end object detector using a transformer encoder-decoder architecture and demonstrates competitive performance but low computational efficiency on high-resolution feature maps. The subsequent work, Deformable DETR, enhances the efficiency of DETR by replacing dense attention with deformable attention, which achieves 10x faster convergence and improved performance. Deformable DETR uses multi-scale features to improve performance; however, the number of encoder tokens increases by 20x compared to DETR, and the computation cost of the encoder attention remains a bottleneck. In our preliminary experiment, we observe that the detection performance hardly deteriorates even if only a subset of the encoder tokens is updated. Inspired by this observation, we propose Sparse DETR, which selectively updates only the tokens expected to be referenced by the decoder, thus helping the model detect objects effectively. In addition, we show that applying an auxiliary detection loss on the selected tokens in the encoder improves the performance while minimizing computational overhead. We validate that Sparse DETR achieves better performance than Deformable DETR even with only 10% of the encoder tokens on the COCO dataset. Although only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR. Code is available at https://github.com/kakaobrain/sparse-detr
Research Motivation and Goals
- Reduce the encoder's computational bottleneck in end-to-end DETR-based detectors by sparsifying encoder tokens.
- Propose learnable criteria to select salient encoder tokens.
- Show that auxiliary encoder losses stabilize training and improve accuracy.
- Demonstrate improved efficiency and performance over Deformable DETR on COCO with multi-scale features.
Proposed Method
- Introduce a saliency-based, learnable token sparsification scheme for the encoder.
- Use a scoring network to predict Decoder cross-Attention Map (DAM) as saliency signals.
- Define a top-rho sparsified token set for encoder updates per layer.
- Apply an encoder auxiliary loss to selected tokens to stabilize training and boost performance.
- Adopt top-k decoder queries derived from encoder outputs to refine predictions.
- Evaluate with Swin-T and ResNet-50 backbones on COCO 2017 val, comparing against DETR, Deformable DETR, PnP-DETR, and Faster R-CNN-FPN.
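
The selection step described in the list above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`decoder_attention_map`, `top_rho_indices`, `binarize_dam`, `sparse_encoder_layer`) are hypothetical, the DAM is assumed to be built by summing cross-attention weights over decoder queries, the binarized top-rho entries are assumed to serve as training targets for the scoring network, and `update_fn` is only a stand-in for a real encoder layer.

```python
def decoder_attention_map(cross_attn):
    """Aggregate decoder cross-attention into one saliency value per encoder token.

    cross_attn[q][t] is the weight decoder query q places on encoder token t.
    Assumption: the DAM is the sum of these weights over all queries.
    """
    num_tokens = len(cross_attn[0])
    return [sum(row[t] for row in cross_attn) for t in range(num_tokens)]


def top_rho_indices(scores, rho):
    """Return the indices of the top-rho fraction of tokens, in ascending order."""
    # Number of tokens to keep: rho * N rounded to the nearest integer, at least 1.
    keep = max(1, round(rho * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def binarize_dam(dam, rho):
    """Turn a DAM into 0/1 targets (assumed training signal for the scoring network)."""
    kept = set(top_rho_indices(dam, rho))
    return [1 if i in kept else 0 for i in range(len(dam))]


def sparse_encoder_layer(tokens, selected, update_fn):
    """Update only the selected tokens; all other tokens pass through unchanged."""
    out = list(tokens)
    for i in selected:
        out[i] = update_fn(tokens[i])  # update_fn: hypothetical stand-in for attention + FFN
    return out


# Toy example: 2 decoder queries attending over 4 encoder tokens.
cross_attn = [[0.1, 0.6, 0.1, 0.2],
              [0.0, 0.5, 0.4, 0.1]]
dam = decoder_attention_map(cross_attn)
selected = top_rho_indices(dam, rho=0.5)
print(selected)  # [1, 2]
print(sparse_encoder_layer([1.0, 2.0, 3.0, 4.0], selected, lambda x: x * 10))
# [1.0, 20.0, 30.0, 4.0]
```

At inference time the saliency scores would come from the scoring network rather than from the decoder, since the decoder has not run yet when the encoder tokens are selected; the sketch reuses `top_rho_indices` for both cases.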
Experimental Results
Research Questions
- RQ1: Can encoder token sparsification in DETR-based detectors reduce computation without sacrificing detection accuracy?
- RQ2: Do saliency criteria based on objectness and decoder cross-attention (DAM) better identify tokens to update than random or objectness alone?
- RQ3: Does an encoder auxiliary loss improve convergence and allow deeper encoder stacks under sparsity?
- RQ4: How does Sparse DETR perform with multi-scale features (e.g., Swin-T) compared to Deformable DETR under varied sparsity levels?
Key Findings
| Method | Epochs | Keeping ratio (rho) | Top-k & BBR | AP | AP50 | AP75 | AP_S | AP_M | AP_L | params | FLOPs | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| F-RCNN-FPN † | 109 | N/A | | 42.0 | 62.1 | 45.5 | 26.6 | 45.4 | 53.4 | 42M | 180G | 26 |
| DETR † | 500 | 100% | | 42.0 | 62.4 | 44.2 | 20.5 | 45.8 | 61.1 | 41M | 86G | 28 |
| DETR-DC5 † | 500 | 100% | | 43.3 | 63.1 | 45.9 | 22.5 | 47.3 | 61.1 | 41M | 187G | 12 |
| PnP-DETR ‡ | 500 | 33% | | 41.1 | 61.5 | 43.7 | 20.8 | 44.6 | 60.0 | - | - | - |
| PnP-DETR-DC5 ‡ | 500 | 33% | | 42.7 | 62.8 | 45.1 | 22.4 | 46.2 | 60.0 | - | - | - |
| Deformable-DETR | 50 | 100% | | 43.9 | 62.8 | 47.8 | 26.1 | 47.4 | 58.0 | 40M | 173G | 19.1 |
| Deformable-DETR | 50 | 100% | ✓ | 46.0 | 65.2 | 49.8 | 28.2 | 49.1 | 61.0 | 41M | 177G | 18.2 |
| Sparse-DETR | 50 | 10% | ✓ | 45.3 | 65.8 | 49.3 | 28.4 | 48.3 | 60.1 | 41M | 105G | 25.3 |
| Sparse-DETR | 50 | 20% | ✓ | 45.6 | 65.8 | 49.6 | 28.5 | 48.6 | 60.4 | 41M | 113G | 24.8 |
| Sparse-DETR | 50 | 30% | ✓ | 46.0 | 65.9 | 49.7 | 29.1 | 49.1 | 60.6 | 41M | 121G | 23.2 |
| Sparse-DETR | 50 | 40% | ✓ | 46.2 | 66.0 | 50.3 | 28.7 | 49.0 | 61.4 | 41M | 128G | 21.8 |
| Sparse-DETR | 50 | 50% | ✓ | 46.3 | 66.0 | 50.1 | 29.0 | 49.5 | 60.8 | 41M | 136G | 20.5 |
| Swin-T DETR | 500 | 100% | | 45.4 | 66.2 | 48.1 | 22.9 | 49.5 | 65.9 | 45M | 92G | 26.8 |
| Swin-T Deformable-DETR | 50 | 100% | | 45.7 | 65.3 | 49.9 | 26.9 | 49.4 | 61.2 | 40M | 180G | 15.9 |
| Swin-T Deformable-DETR | 50 | 100% | ✓ | 48.0 | 68.0 | 52.0 | 30.3 | 51.4 | 63.7 | 41M | 185G | 15.4 |
| Swin-T Sparse-DETR | 50 | 10% | ✓ | 48.2 | 69.2 | 52.3 | 29.8 | 51.2 | 64.5 | 41M | 113G | 21.2 |
| Swin-T Sparse-DETR | 50 | 20% | ✓ | 48.8 | 69.4 | 53.0 | 30.4 | 51.9 | 64.8 | 41M | 121G | 20.0 |
| Swin-T Sparse-DETR | 50 | 30% | ✓ | 49.1 | 69.5 | 53.5 | 31.4 | 52.5 | 65.1 | 41M | 129G | 18.9 |
| Swin-T Sparse-DETR | 50 | 40% | ✓ | 49.2 | 69.5 | 53.5 | 31.4 | 52.9 | 64.8 | 41M | 136G | 18.0 |
| Swin-T Sparse-DETR | 50 | 50% | ✓ | 49.3 | 69.5 | 53.3 | 32.0 | 52.7 | 64.9 | 41M | 144G | 17.2 |
- Sparse DETR achieves competitive AP with significantly reduced computation, including 38% lower FLOPs and 42% higher FPS versus Deformable DETR.
- Using only 10% of the encoder tokens with DAM-based sparsification, Sparse DETR outperforms many baselines and rivals Deformable DETR+ with a Swin-T backbone.
- DAM-based token selection consistently outperforms Objectness Score (OS) and random sampling across backbones and sparsity levels.
- Encoder auxiliary loss enables deeper encoders (e.g., 12 layers) with stable training and improved detection performance.
- Dynamic sparsification during inference maintains robust performance across keeping-ratio settings, outperforming similar dynamic strategies in some baselines.
- With Swin-T backbone, Sparse DETR at 10% encoder tokens yields substantial efficiency gains (12-82% token-level reduction) while preserving or improving AP, especially at larger object scales.
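
The efficiency figures can be checked against the table with a short calculation. The sketch below is only a worked example: the helper name `relative_change` is hypothetical, and the choice of baseline (Deformable-DETR with top-k & BBR versus Sparse-DETR at a 10% keeping ratio) is an assumption. With that pairing the table gives roughly 39-41% lower FLOPs and 38-39% higher FPS, close to but not identical to the 38% and 42% quoted in the abstract, which may rest on a different comparison setup.

```python
def relative_change(baseline, value):
    """Signed change of `value` relative to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# (FLOPs in G, FPS) copied from the table rows with top-k & BBR enabled.
rows = {
    "ResNet-50": {"deformable": (177, 18.2), "sparse_10": (105, 25.3)},
    "Swin-T": {"deformable": (185, 15.4), "sparse_10": (113, 21.2)},
}

for backbone, r in rows.items():
    flops = relative_change(r["deformable"][0], r["sparse_10"][0])
    fps = relative_change(r["deformable"][1], r["sparse_10"][1])
    print(f"{backbone}: FLOPs {flops:+.1f}%, FPS {fps:+.1f}%")
```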
This explainer was generated by AI and reviewed by a human editor.