QUICK REVIEW

[Paper Digest] ViDT: An Efficient and Effective Fully Transformer-based Object Detector

Hwanjun Song, Sun Deqing|arXiv (Cornell University)|Oct 8, 2021

Advanced Neural Network Applications|References: 20|Cited by: 46

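One-sentence summary

ViDT is a fully transformer-based object detector that reconfigures the Swin Transformer with a Reconfigured Attention Module (RAM), uses an encoder-free neck, and applies knowledge distillation via token matching, achieving competitive AP with favorable latency on COCO.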

ABSTRACT

Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential to boost the detection performance without much increase in computational load. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing fully transformer-based object detectors, and achieves 49.2AP owing to its high scalability for large models. We will release the code and trained models at https://github.com/naver-ai/vidt

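Research motivation and goals

Integrate vision and detection transformers into a fully transformer-based, end-to-end detector that avoids a heavy neck encoder.
Develop RAM so that ViT and ViT-like backbones (such as Swin) can serve as standalone detectors with multi-scale features.
Reduce computational overhead by using an encoder-free neck together with auxiliary decoding loss and iterative box refinement.
Improve efficiency through token-matching knowledge distillation, which transfers representational knowledge from a large model to a small ViDT model.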

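Proposed method

Introduce the Reconfigured Attention Module (RAM), which decomposes global attention into PATCH×PATCH, DET×DET, and DET×PATCH attention while reusing Swin parameters.
Adopt an encoder-free neck made up of deformable transformer decoder layers, fusing multi-scale features without a heavy neck encoder.
Apply auxiliary decoding loss and iterative box refinement to improve training convergence and prediction quality.
Implement token-matching knowledge distillation between teacher and student ViDT models to transfer representational knowledge.
Reduce the cost of DET×PATCH attention by activating the cross-attention selectively, only at the last Swin stage.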
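The attention decomposition listed above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes identity query/key/value projections, a single head, and non-overlapping 1-D windows as a stand-in for Swin's 2-D shifted windows. The names `attend`, `ram_layer`, `window`, and `cross` are hypothetical. Letting the DET queries attend over the concatenation of DET and PATCH tokens in one softmax is also a simplifying assumption.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        dim = len(values[0])
        outputs.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return outputs

def ram_layer(patch, det, window=2, cross=True):
    """One simplified RAM-style layer (hypothetical helper, identity projections).

    PATCH x PATCH: local attention inside each window of patch tokens.
    DET x DET:     global self-attention among detection tokens.
    DET x PATCH:   detection tokens also read patch tokens when cross=True.
    """
    new_patch = []
    for start in range(0, len(patch), window):
        block = patch[start:start + window]
        new_patch.extend(attend(block, block, block))
    # Patch tokens never attend to detection tokens in this decomposition.
    context = det + patch if cross else det
    new_det = attend(det, context, context)
    return new_patch, new_det
```

Passing `cross=False` for early stages and `cross=True` for the last one mirrors the selective activation described in the last bullet: the DET×PATCH term, whose cost grows with the number of patch tokens, is paid only where the patch sequence is shortest.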

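Experimental results

Research questions

RQ1: Can a fully transformer-based detector achieve a competitive AP/latency trade-off on COCO by reconfiguring attention and removing the neck encoder?
RQ2: Can RAM effectively integrate DETR-style decoding with a Swin-style backbone while preserving scalability and speed?
RQ3: How do auxiliary decoding loss, iterative box refinement, and token-matching distillation affect detection performance?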
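To make the two decoder-side techniques named in RQ3 concrete, here is a minimal sketch, assuming the refinement rule commonly used in deformable DETR-style decoders: each layer predicts an offset that is added to the previous box in inverse-sigmoid space. The per-layer L1 loss is only a stand-in for a full detection loss (Hungarian matching, classification, and GIoU terms are omitted), and all function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def inverse_sigmoid(p, eps=1e-5):
    """Logit of p, clamped away from 0 and 1 for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def refine_box(box, delta):
    """Iterative box refinement: shift a normalized box in logit space."""
    return [sigmoid(inverse_sigmoid(b) + d) for b, d in zip(box, delta)]

def decode_with_aux_loss(initial_box, deltas_per_layer, target_box):
    """Refine a box layer by layer and sum a loss over every layer's output.

    Supervising all intermediate outputs, not just the final one, is the
    idea behind auxiliary decoding loss. L1 is an illustrative placeholder.
    """
    box = initial_box
    boxes, total_loss = [], 0.0
    for delta in deltas_per_layer:
        box = refine_box(box, delta)
        boxes.append(box)
        total_loss += sum(abs(b - t) for b, t in zip(box, target_box))
    return boxes, total_loss
```

For example, `decode_with_aux_loss([0.5, 0.5, 0.2, 0.2], [[0.1, 0.0, 0.0, 0.0]] * 3, target)` returns one refined box per decoder layer plus the summed loss.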

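Key findings

On COCO, ViDT with RAM and an encoder-free neck achieves the best AP and FPS trade-off among fully transformer-based detectors.
ViDT scales well to large ViT backbones such as Swin-base, reaching high AP at relatively low latency (e.g., 49.2 AP with Swin-base at 0.1B parameters).
DET×PATCH cross-attention is most effective when activated only at the last Swin stage, balancing AP against FPS.
Auxiliary decoding loss and iterative box refinement improve DETR-style detectors and help most when combined with a neck decoder; in neck-free variants the gain is small or even negative.
Token-matching knowledge distillation (teacher-student ViDT) improves the AP of smaller models, and a larger teacher yields a clearer gain.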
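The token-matching distillation in the last finding can be sketched as a distance between corresponding student and teacher tokens. This is a hedged illustration, not the paper's exact formula: it assumes that student and teacher emit the same number of PATCH and DET tokens with the same dimension, that the per-token distance is the L2 norm, and that a single scalar `weight` scales the term. The names `mean_l2` and `token_matching_loss` are hypothetical.

```python
import math

def mean_l2(student_tokens, teacher_tokens):
    """Average L2 distance between index-aligned token vectors."""
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher must have the same token count")
    distances = [
        math.sqrt(sum((s - t) ** 2 for s, t in zip(s_vec, t_vec)))
        for s_vec, t_vec in zip(student_tokens, teacher_tokens)
    ]
    return sum(distances) / len(distances)

def token_matching_loss(patch_s, det_s, patch_t, det_t, weight=1.0):
    """Distillation term: match student PATCH and DET tokens to the teacher's.

    The teacher tokens are treated as fixed targets; in a real training
    loop no gradient would flow into the teacher.
    """
    return weight * (mean_l2(patch_s, patch_t) + mean_l2(det_s, det_t))
```

In training, this term would be added to the regular detection loss of the student, so the student is pulled toward the teacher's token representations while still fitting the ground-truth boxes.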

This digest was generated by AI and reviewed by human editors.