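[Paper Summary] Toward Transformer-Based Object Detection
ViT-FRCNN shows that pairing a Vision Transformer backbone with a Faster R-CNN-style detector yields competitive results on COCO detection and better out-of-domain generalization, highlighting the benefit of large-scale pretraining for detection.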
Transformers have become the dominant model in natural language processing, owing to their ability to pretrain on massive amounts of data, then transfer to smaller, more specific tasks via fine-tuning. The Vision Transformer was the first major attempt to apply a pure transformer model directly to images as input, demonstrating that as compared to convolutional networks, transformer-based architectures can achieve competitive results on benchmark classification tasks. However, the computational complexity of the attention operator means that we are limited to low-resolution inputs. For more complex tasks such as detection or segmentation, maintaining a high input resolution is crucial to ensure that models can properly identify and reflect fine details in their output. This naturally raises the question of whether or not transformer-based architectures such as the Vision Transformer are capable of performing tasks other than classification. In this paper, we determine that Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results. The model that we propose, ViT-FRCNN, demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance. We also investigate improvements over a standard detection backbone, including superior performance on out-of-domain images, better performance on large objects, and a lessened reliance on non-maximum suppression. We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
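To make the resolution constraint in the abstract concrete, the sketch below counts patch tokens and the token pairs that one self-attention layer compares. The function names and the resolutions are illustrative assumptions, not settings taken from the paper.

```python
def num_patch_tokens(height, width, patch_size):
    """Number of non-overlapping patch tokens for an image (class token excluded)."""
    return (height // patch_size) * (width // patch_size)


def attention_pairs(num_tokens):
    """Token pairs compared by one self-attention layer (quadratic in token count)."""
    return num_tokens * num_tokens


# Illustrative resolutions only; these are assumptions, not the paper's settings.
for side in (224, 448, 896):
    for patch in (32, 16):
        n = num_patch_tokens(side, side, patch)
        print(f"{side}x{side}, patch {patch}: {n} tokens, {attention_pairs(n)} pairs")
```

Halving the patch size, or doubling the side length, quadruples the token count and multiplies the number of compared pairs by sixteen, which is why high-resolution detection inputs are costly for a plain transformer backbone.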
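Motivation and Goals
- Show that a Vision Transformer backbone can be paired with a detection head to perform object detection.
- Evaluate how a transformer backbone affects detection performance and generalization on COCO.
- Study the pretraining strategies and architectural adjustments that affect transfer to detection.
- Analyze how spatial resolution, intermediate encoder features, and residual connections affect detection quality.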
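Proposed Method
- Reuse the Vision Transformer backbone by interpreting the final transformer outputs as a spatial feature map for detection.
- Use a Faster R-CNN-style detector with an RPN and an RoI head to predict class labels and bounding boxes.
- Fine-tune the whole ViT-FRCNN model end to end on high-resolution inputs to preserve the detail of small objects.
- Interpolate the position embeddings to handle varying input sizes and aspect ratios during training and inference.
- Study architectural variants, including using intermediate encoder outputs and adding residual blocks that connect the encoder to the detector.
- Pretrain the backbone on large-scale image datasets (ImageNet-21k, Annotations-1.3B, Open Images) and explore curriculum-style pretraining.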
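Two of the steps above, reading the patch tokens as a spatial feature map and interpolating the position embeddings to a new grid size, can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names are hypothetical, and bilinear resampling with aligned corners is an assumption, since the summary does not say which interpolation mode is used.

```python
def tokens_to_feature_map(tokens, grid_h, grid_w):
    """Reshape a row-major list of patch tokens into a grid_h x grid_w grid.

    `tokens` holds one vector per patch, with any class token already removed.
    """
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count does not match the patch grid")
    return [tokens[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


def resize_grid_bilinear(grid, new_h, new_w):
    """Bilinearly resample a grid of vectors, e.g. position embeddings.

    Assumption: corner cells of the old and new grids are aligned.
    """
    old_h, old_w = len(grid), len(grid[0])
    dim = len(grid[0][0])
    out = []
    for i in range(new_h):
        # Source row coordinate for output row i.
        y = i * (old_h - 1) / (new_h - 1) if new_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, old_h - 1)
        wy = y - y0
        row = []
        for j in range(new_w):
            # Source column coordinate for output column j.
            x = j * (old_w - 1) / (new_w - 1) if new_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, old_w - 1)
            wx = x - x0
            vec = []
            for d in range(dim):
                top = grid[y0][x0][d] * (1 - wx) + grid[y0][x1][d] * wx
                bottom = grid[y1][x0][d] * (1 - wx) + grid[y1][x1][d] * wx
                vec.append(top * (1 - wy) + bottom * wy)
            row.append(vec)
        out.append(row)
    return out


# Toy example: a 2x2 grid of 1-dimensional embeddings resized to 3x3.
pos = tokens_to_feature_map([[0.0], [1.0], [2.0], [3.0]], 2, 2)
resized = resize_grid_bilinear(pos, 3, 3)
```

In a real model the same resizing would be applied to a learned embedding table, and the reshaped encoder output would be handed to the RPN and RoI head as a feature map.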
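Experimental Results
Research Questions
- RQ1: Can a pure transformer backbone combined with a conventional detection head achieve competitive object detection performance?
- RQ2: How do input spatial resolution and feature-map preparation affect detection accuracy, especially for small objects?
- RQ3: How do large-scale pretraining and curriculum-style pretraining affect transfer to detection?
- RQ4: With a ViT backbone, do intermediate encoder features and architectural connections improve detector performance?
- RQ5: How well does ViT-FRCNN generalize to out-of-domain data compared with CNN-based detectors?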
Key Findings
| Model | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|
| ResNet50-FRCNN-FPN | 36.0 | 57.7 | 38.4 | 20.8 | 40.0 | 46.2 |
| ResNet101-FRCNN-FPN | 38.8 | 59.9 | 42.0 | 22.2 | 43.0 | 50.9 |
| ViT-B/32*-FRCNN | 30.9 | 50.5 | 31.7 | 9.7 | 33.7 | 51.5 |
| ViT-B/32-FRCNN | 29.3 | 48.9 | 30.1 | 9.0 | 31.8 | 48.8 |
| ViT-B/32-FRCNN stride=0.5 | 34.5 | 53.4 | 36.8 | 15.6 | 36.9 | 52.3 |
| ViT-B/16-FRCNN | 36.6 | 56.3 | 39.3 | 17.4 | 40.0 | 55.5 |
| ViT-B/16*-FRCNN | 37.8 | 57.4 | 40.1 | 17.8 | 41.4 | 57.3 |
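- ViT-FRCNN is competitive with the ResNet-FRCNN-FPN baselines on COCO AP when it uses the smaller 16x16 patch size; the 32x32 variants trail both baselines in overall AP.
- Reducing the patch size from 32x32 to 16x16 brings a substantial AP gain, especially for small objects (AP_S).
- Adding intermediate encoder outputs and residual blocks improves AP, with diminishing returns beyond a certain number of blocks.
- ViT-FRCNN generalizes better to the out-of-domain dataset (ObjectNet-D) and benefits from larger-scale pretraining, including Open Images V6, with AP gains of roughly 2–3 points in some settings.
- The transformer-based detector produces fewer false detections, especially when NMS is relaxed, suggesting that it suppresses spurious boxes better.
- Curriculum pretraining on Open Images V6 gives additional AP gains over ImageNet-21k pretraining, particularly for small and medium objects.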
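The patch-size and object-size comparisons above can be checked directly against the results table. The short sketch below copies a few rows from that table and computes the differences; the dictionary layout and the `gain` helper are illustrative choices, not part of the paper.

```python
# AP values copied from the results table above (COCO, percentage points).
results = {
    "ResNet101-FRCNN-FPN": {"AP": 38.8, "AP_S": 22.2, "AP_L": 50.9},
    "ViT-B/32-FRCNN": {"AP": 29.3, "AP_S": 9.0, "AP_L": 48.8},
    "ViT-B/16-FRCNN": {"AP": 36.6, "AP_S": 17.4, "AP_L": 55.5},
}


def gain(metric, model, reference):
    """Difference in `metric` between `model` and `reference`, rounded to 0.1."""
    return round(results[model][metric] - results[reference][metric], 1)


# Effect of shrinking the patch size from 32x32 to 16x16.
print("AP gain:", gain("AP", "ViT-B/16-FRCNN", "ViT-B/32-FRCNN"))
print("AP_S gain:", gain("AP_S", "ViT-B/16-FRCNN", "ViT-B/32-FRCNN"))

# ViT-B/16 against the stronger ResNet baseline, by object size.
print("AP_S vs ResNet101:", gain("AP_S", "ViT-B/16-FRCNN", "ResNet101-FRCNN-FPN"))
print("AP_L vs ResNet101:", gain("AP_L", "ViT-B/16-FRCNN", "ResNet101-FRCNN-FPN"))
```

On these rows the 16x16 model gains 7.3 AP and 8.4 AP_S over the 32x32 model, and against ResNet101-FRCNN-FPN it is 4.8 points lower on small objects but 4.6 points higher on large ones, in line with the large-object advantage mentioned in the abstract.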
This summary was generated by AI and reviewed by a human editor.