QUICK REVIEW

[Paper Review] Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

Peize Sun, Rufeng Zhang|arXiv (Cornell University)|Nov 25, 2020

Advanced Neural Network Applications|References: 61|Cited by: 103

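One-Sentence Summary

Sparse R-CNN is a purely sparse object detector that uses a fixed set of learnable proposals and a dynamic instance interactive head, reaching accuracy on COCO on par with well-established detector baselines without dense candidates or NMS post-processing.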

ABSTRACT

We present Sparse R-CNN, a purely sparse method for object detection in images. Existing works on object detection heavily rely on dense object candidates, such as $k$ anchor boxes pre-defined on all grids of image feature map of size $H \times W$. In our method, however, a fixed sparse set of learned object proposals, total length of $N$, are provided to object recognition head to perform classification and location. By eliminating $HWk$ (up to hundreds of thousands) hand-designed object candidates to $N$ (e.g. 100) learnable proposals, Sparse R-CNN completely avoids all efforts related to object candidates design and many-to-one label assignment. More importantly, final predictions are directly output without non-maximum suppression post-procedure. Sparse R-CNN demonstrates accuracy, run-time and training convergence performance on par with the well-established detector baselines on the challenging COCO dataset, e.g., achieving 45.0 AP in standard $3\times$ training schedule and running at 22 fps using ResNet-50 FPN model. We hope our work could inspire re-thinking the convention of dense prior in object detectors. The code is available at: https://github.com/PeizeSun/SparseR-CNN.

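Motivation and Objectives

Challenge the reliance of modern detectors on dense object candidates.
Introduce a fixed, learnable set of proposal boxes and proposal features for end-to-end object detection.
Eliminate the need for NMS by directly predicting the final set of objects.
Demonstrate competitive accuracy and speed on COCO with a simple sparse architecture.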
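
To make the gap between dense and sparse candidates concrete, the short sketch below counts $HWk$ summed over several feature-map levels and compares it with $N = 100$. The image size (800 x 1344), the strides (8 to 128), and $k = 9$ anchors per cell are assumptions chosen for illustration, not settings taken from this review.

```python
# Illustrative count of dense candidates (H * W * k summed over feature maps)
# versus a fixed sparse set of N learnable proposals.
# Image size, strides, and anchors_per_cell are assumed values for illustration.
import math

def dense_candidate_count(image_h, image_w, strides, anchors_per_cell):
    """Sum H * W * k over feature maps, where H and W are the image size
    divided by each stride, rounded up."""
    total = 0
    for stride in strides:
        feat_h = math.ceil(image_h / stride)
        feat_w = math.ceil(image_w / stride)
        total += feat_h * feat_w * anchors_per_cell
    return total

dense = dense_candidate_count(800, 1344, strides=(8, 16, 32, 64, 128), anchors_per_cell=9)
sparse = 100  # N learnable proposals
print(dense, sparse, round(dense / sparse))  # prints: 201600 100 2016
```

Under these assumed settings the dense design scores roughly two thousand times as many candidates as the sparse one, which is in line with the "hundreds of thousands" figure quoted in the abstract.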

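Proposed Method

Replace dense candidate generation ($HWk$ candidates) with a fixed, learnable set of $N$ proposal boxes and $N$ proposal features.
Extract a feature for each learnable proposal box with RoIAlign, then apply a dynamic instance interactive head conditioned on the corresponding proposal feature.
Stack the heads in an iterative architecture in which the refined boxes and features from one stage feed the next, with self-attention modeling the relations between objects.
Train with a set-based loss that uses one-to-one bipartite matching between predictions and ground-truth objects, avoiding many-to-one assignment.
Evaluate with several backbones and run ablations on the effect of the learnable proposals, the iterative structure, and the dynamic head.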
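
The interaction between a proposal feature and its RoI feature can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the proposal feature is mapped by a linear layer to the parameters of two consecutive filters, which are then applied only to that proposal's own RoI feature. Normalization layers, self-attention, and the iterative stages are omitted, the weights are random rather than learned, and the sizes `C`, `D`, `S` and all function names are hypothetical toy choices.

```python
# Minimal sketch of a dynamic instance interaction for ONE proposal.
# Assumptions: toy sizes, random (not learned) weights, no normalization.
import random

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both as lists of rows."""
    return [[sum(row[k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def reshape(flat, rows, cols):
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def dynamic_interaction(proposal_feature, roi_feature, param_generator, out_proj, c, d):
    # The proposal feature generates the parameters of two consecutive filters.
    params = matmul([proposal_feature], param_generator)[0]   # length 2 * C * D
    w1 = reshape(params[:c * d], c, d)                        # C x D
    w2 = reshape(params[c * d:], d, c)                        # D x C
    # The generated filters are applied to this proposal's RoI feature only.
    hidden = relu(matmul(roi_feature, w1))                    # (S*S) x D
    filtered = relu(matmul(hidden, w2))                       # (S*S) x C
    flat = [v for row in filtered for v in row]               # length S*S*C
    return matmul([flat], out_proj)[0]                        # object feature, length C

C, D, S = 4, 2, 2                       # channels, hidden width, RoI grid size
rng = random.Random(0)
param_generator = random_matrix(rng, C, 2 * C * D)
out_proj = random_matrix(rng, S * S * C, C)
roi_feature = random_matrix(rng, S * S, C)       # stands in for a RoIAlign output
proposal_feature = [rng.uniform(-0.5, 0.5) for _ in range(C)]

object_feature = dynamic_interaction(
    proposal_feature, roi_feature, param_generator, out_proj, C, D)
print(len(object_feature))
```

Because the filters are generated per proposal, two proposals looking at the same image region can still produce different object features. In a full detector the resulting object feature would feed the classification and box-regression layers and would be passed on to the next stage.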

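Experimental Results

Research Questions

RQ1: Can object detection work effectively with only a purely sparse set of learnable proposals, without dense priors or NMS?
RQ2: How do the learnable proposal boxes and proposal features affect detection accuracy across objects and scales?
RQ3: Compared with conventional dense detectors and DETR-style methods, how do iterative refinement and the dynamic head affect convergence speed and final performance?
RQ4: How robust is the method to backbone choice and training settings on COCO?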

Key Findings

Method	Feature/Setting	Epochs	AP	AP50	AP75	APs	APm	APl	FPS
RetinaNet-R50	FPN, 36 epochs	36	38.7	58.0	41.5	23.3	42.3	50.3	24
RetinaNet-R101	FPN, 36 epochs	36	40.4	60.2	43.2	24.0	44.3	52.2	18
Faster R-CNN-R50	FPN, 36 epochs	36	40.2	61.0	43.8	24.2	43.5	52.0	26
Faster R-CNN-R101	FPN, 36 epochs	36	42.0	62.5	45.9	25.2	45.6	54.6	20
Cascade R-CNN-R50	FPN, 36 epochs	36	44.3	62.2	48.0	26.6	47.7	57.7	19
DETR-R50	Encoder, 500 epochs	500	42.0	62.4	44.2	20.5	45.8	61.1	28
DETR-R101	Encoder, 500 epochs	500	43.5	63.8	46.4	21.9	48.0	61.8	20
DETR-DC5-R50	Encoder, 500 epochs	500	43.3	63.1	45.9	22.5	47.3	61.1	12
DETR-DC5-R101	Encoder, 500 epochs	500	44.9	64.7	47.7	23.7	49.5	62.3	10
Deformable DETR-R50	DeformEncoder, 50 epochs	50	43.8	62.6	47.7	26.4	47.1	58.0	19
Sparse R-CNN-R50	FPN, 36 epochs	36	42.8	61.2	45.7	26.7	44.6	57.6	23
Sparse R-CNN-R101	FPN, 36 epochs	36	44.1	62.1	47.2	26.1	46.3	59.7	19
Sparse R-CNN*-R50	FPN, 36 epochs	36	45.0	63.4	48.2	26.9	47.2	59.5	22
Sparse R-CNN*-R101	FPN, 36 epochs	36	46.4	64.6	49.5	28.3	48.3	61.6	18

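Sparse R-CNN achieves competitive AP on COCO (for example, 45.0 AP with 300 proposals on ResNet-50 FPN trained for 36 epochs) while avoiding dense candidate generation and NMS.
With ResNet-50 FPN, Sparse R-CNN reaches 42.8 AP using 100 proposals and 45.0 AP using 300 proposals plus data augmentation (the rows marked * in the table); with ResNet-101 under the latter setting it reports 46.4 AP.
The dynamic instance interactive head, conditioned on the proposal features, gives a clear accuracy gain over a static head and over other attention mechanisms.
Training converges faster than DETR (36 epochs versus 500 in the table), and inference speed is competitive (on ResNet-50 FPN, 23 FPS with 100 proposals and 22 FPS with 300).
The set-based loss with one-to-one bipartite matching replaces conventional many-to-one assignment and supports end-to-end training without post-processing.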
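
The one-to-one matching behind the set-based loss can be illustrated with a toy example. The sketch below is a simplification under stated assumptions: the pair cost combines only a class-probability term and an L1 box distance with unit weights (the actual loss terms and weights are not given in this review), and the optimal assignment is found by brute force over permutations rather than by the Hungarian algorithm. All names and numbers are hypothetical.

```python
# Toy one-to-one (bipartite) matching between predictions and ground truth.
# Assumptions: simplified cost (class probability + L1 box distance),
# unit weights, brute-force search that is only practical for tiny inputs.
from itertools import permutations

def pair_cost(pred, target, w_cls=1.0, w_l1=1.0):
    """Cost of assigning one prediction to one ground-truth object.
    Lower is better: high probability for the true class, small box distance."""
    class_cost = -pred["probs"][target["label"]]
    l1_cost = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return w_cls * class_cost + w_l1 * l1_cost

def match_one_to_one(preds, targets):
    """Return sorted (pred_index, target_index) pairs with minimal total cost.
    Requires len(preds) >= len(targets); each target gets exactly one prediction."""
    best_cost, best_pairs = float("inf"), []
    for pred_ids in permutations(range(len(preds)), len(targets)):
        cost = sum(pair_cost(preds[p], targets[t]) for t, p in enumerate(pred_ids))
        if cost < best_cost:
            best_cost = cost
            best_pairs = [(p, t) for t, p in enumerate(pred_ids)]
    return sorted(best_pairs)

preds = [
    {"box": (0.10, 0.10, 0.40, 0.40), "probs": [0.9, 0.1]},
    {"box": (0.12, 0.11, 0.41, 0.42), "probs": [0.8, 0.2]},  # near-duplicate of pred 0
    {"box": (0.60, 0.55, 0.90, 0.95), "probs": [0.2, 0.7]},
]
targets = [
    {"box": (0.10, 0.10, 0.40, 0.40), "label": 0},
    {"box": (0.60, 0.60, 0.90, 0.90), "label": 1},
]

matches = match_one_to_one(preds, targets)
unmatched = sorted(set(range(len(preds))) - {p for p, _ in matches})
print(matches, unmatched)  # prints: [(0, 0), (2, 1)] [1]
```

The near-duplicate prediction is left unmatched and would be trained toward "no object", which is the mechanism that lets a detector trained this way drop NMS at inference time.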


This review was generated by AI and checked by a human editor.