QUICK REVIEW

[Paper Review] Rethinking and Improving Relative Position Encoding for Vision Transformer

Kan Wu, Houwen Peng|arXiv (Cornell University)|Jul 29, 2021

Advanced Image and Video Retrieval Techniques|References: 25|Cited by: 24

One-Sentence Summary

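This paper proposes image relative position encoding (iRPE), a set of relative position encoding methods designed for 2D images that model directional relative distance and the interaction between queries and relative position embeddings in self-attention. Without tuning any extra hyperparameters, iRPE improves top-1 accuracy on ImageNet by up to 1.5% and mAP on COCO by up to 1.3%. The results indicate that relative position encoding can replace absolute position encoding in image classification, whereas absolute position encoding remains important in object detection.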

ABSTRACT

Relative position encoding (RPE) is important for transformer to capture sequence ordering of input tokens. General efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and even remains controversial, e.g., whether relative position encoding can work equally well as absolute position? In order to clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE). Our methods consider directional relative distance modeling as well as the interactions between queries and relative position embeddings in self-attention mechanism. The proposed iRPE methods are simple and lightweight. They can be easily plugged into transformer blocks. Experiments demonstrate that solely due to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc) and 1.3% (mAP) stable improvements over their original versions on ImageNet and COCO respectively, without tuning any extra hyperparameters such as learning rate and weight decay. Our ablation and analysis also yield interesting findings, some of which run counter to previous understanding. Code and models are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.

Motivation and Objectives

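Resolve the controversy over whether relative position encoding (RPE) works as well as absolute position encoding in vision transformers.
Design RPE methods tailored to 2D image data, addressing the limitations of methods extended from 1D RPE.
Improve how vision transformers model spatial inductive bias by introducing directional relative distance and query-RPE interaction.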
Provide a lightweight, plug-and-play solution that improves performance while keeping computational cost low.
Clarify, through empirical study, the role of relative position encoding in image classification and object detection.

Proposed Method

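Proposes four relative position encoding methods designed for 2D images, collectively called iRPE; the directed variants explicitly model horizontal and vertical relative distance.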
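Introduces a contextual product mechanism that combines query features with relative position embeddings when computing attention weights, strengthening interaction modeling.
Shares the RPE table across attention heads to reduce the parameter count while maintaining performance.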
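Reduces the computational complexity from O(n²d) to O(nkd), where k ≪ n (n tokens, d channels, k relative position buckets), making the method scalable to high-resolution inputs.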
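Integrates the iRPE module directly into standard transformer blocks, keeping it compatible and easy to adopt.
Implements relative position embeddings as learnable lookup tables that assign a dedicated position vector to each relative offset along the height and width directions.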
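
The lookup table, the query-RPE product, and the complexity reduction listed above can be sketched together in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the plain clipping of offsets to max_offset, and the single-head setup are hypothetical choices made here for brevity. The attention logit assumed is (q_i · k_j + q_i · r_b(i,j)) / sqrt(d), where b(i,j) is the bucket of the signed (dy, dx) offset from patch i to patch j.

```python
import numpy as np

def directional_bucket_index(height, width, max_offset):
    """Return an (n, n) table of bucket ids and the bucket count k.

    Each signed (dy, dx) offset between two patches gets its own bucket,
    so left/right and up/down are distinguished. Clipping offsets to
    [-max_offset, max_offset] is a simplifying assumption of this sketch.
    """
    ys, xs = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)        # (n, 2)
    rel = coords[None, :, :] - coords[:, None, :]              # rel[i, j] = coords[j] - coords[i]
    rel = np.clip(rel, -max_offset, max_offset) + max_offset   # shift into [0, 2 * max_offset]
    side = 2 * max_offset + 1
    return rel[..., 0] * side + rel[..., 1], side * side

def query_rpe_naive(q, table, idx):
    """Gather one embedding per (query, key) pair, then dot with q: O(n^2 d)."""
    return np.einsum("id,ijd->ij", q, table[idx])

def query_rpe_indexed(q, table, idx):
    """Dot q with all k bucket embeddings (O(n k d)), then look up per pair (O(n^2))."""
    per_bucket = q @ table.T                                   # (n, k)
    return np.take_along_axis(per_bucket, idx, axis=1)         # (n, n)

# Hypothetical toy setup: a 4 x 4 patch grid and one attention head of width 8.
rng = np.random.default_rng(0)
height, width, dim = 4, 4, 8
idx, num_buckets = directional_bucket_index(height, width, max_offset=2)
q = rng.normal(size=(height * width, dim))
k = rng.normal(size=(height * width, dim))
table = rng.normal(size=(num_buckets, dim))   # stands in for the learnable lookup table

logits = (q @ k.T + query_rpe_indexed(q, table, idx)) / np.sqrt(dim)
weights = np.exp(logits - logits.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)                  # row-wise softmax
```

Both helper functions return the same (n, n) matrix; the indexed version avoids materializing the (n, n, d) tensor. In this toy grid the bucket count (25) exceeds the token count (16), so the saving only appears at higher resolutions, where n grows with the image while k stays fixed.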

Experimental Results

Research Questions

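RQ1: Can relative position encoding effectively replace absolute position encoding when vision transformers are used for image classification?
RQ2: Why does relative position encoding perform inconsistently across vision tasks such as classification and object detection?
RQ3: How does directional relative distance modeling affect attention patterns in 2D vision transformers?
RQ4: What is the computational impact of applying RPE to high-resolution vision tasks such as object detection and semantic segmentation?
RQ5: How does query-RPE interaction affect the model's ability to capture local and global spatial dependencies?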

Key Findings

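The proposed iRPE methods improve the top-1 accuracy of DeiT-S on ImageNet by up to 1.5% and the mAP of DETR-ResNet50 on COCO by up to 1.3%, without any hyperparameter tuning.
Relative position encoding can fully replace absolute position encoding in image classification, achieving better or comparable performance.
In object detection, absolute position encoding remains necessary because it provides a key inductive bias for precise object localization.
Directional relative distance modeling markedly improves attention patterns, especially in shallow layers, where the model attends more to neighboring image patches.
The contextual product mechanism with query-RPE interaction strengthens the model's ability to capture local spatial structure, mimicking the inductive bias of convolutional neural networks.
Sharing RPE across attention heads performs on par with the unshared version while using substantially fewer parameters; the accuracy drop is negligible.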
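
The parameter saving from sharing can be made concrete with a small count. The numbers below are illustrative assumptions (6 heads of width 64, as in DeiT-S, and a hypothetical 49 buckets per table), not figures reported in the paper, and the helper name is invented for this sketch.

```python
def rpe_table_params(num_buckets, head_dim, num_heads, shared):
    """Count learnable scalars in the relative position table(s) of one attention layer."""
    per_table = num_buckets * head_dim
    # A shared table is reused by every head; unshared keeps one table per head.
    return per_table if shared else per_table * num_heads

shared_count = rpe_table_params(num_buckets=49, head_dim=64, num_heads=6, shared=True)
unshared_count = rpe_table_params(num_buckets=49, head_dim=64, num_heads=6, shared=False)
print(shared_count, unshared_count)  # 3136 18816
```

Under these assumptions the shared table is smaller by a factor equal to the number of heads.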

This review was generated by AI and reviewed by a human editor.