QUICK REVIEW

[Paper Review] Conditional Positional Encodings for Vision Transformers

Xiangxiang Chu, Zhi Tian | arXiv (Cornell University) | Feb 22, 2021

Topic: Advanced Neural Network Applications | References: 28 | Cited by: 404

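One-Sentence Summary

This paper proposes conditional positional encodings (CPE), produced by a Position Encoding Generator (PEG) and conditioned on the local neighborhood of each input token. This lets CPVT generalize to input sequences longer than those seen in training, and it improves translation equivalence and overall accuracy over fixed or learnable absolute encodings.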

ABSTRACT

We propose a conditional positional encoding (CPE) scheme for vision Transformers. Unlike previous fixed or learnable positional encodings, which are pre-defined and independent of input tokens, CPE is dynamically generated and conditioned on the local neighborhood of the input tokens. As a result, CPE can easily generalize to the input sequences that are longer than what the model has ever seen during training. Besides, CPE can keep the desired translation-invariance in the image classification task, resulting in improved performance. We implement CPE with a simple Position Encoding Generator (PEG) to get seamlessly incorporated into the current Transformer framework. Built on PEG, we present Conditional Position encoding Vision Transformer (CPVT). We demonstrate that CPVT has visually similar attention maps compared to those with learned positional encodings and delivers outperforming results. Our code is available at https://github.com/Meituan-AutoML/CPVT .

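Motivation and Objectives

Identify and address the limitations of fixed or learnable absolute positional encodings in vision Transformers.
Propose a dynamic, input-conditioned positional encoding scheme (CPE), implemented with PEG.
Build the Conditional Position encoding Vision Transformer (CPVT) and demonstrate improved performance and generalization.
Show that CPE preserves translation equivalence and extends to higher input resolutions and downstream tasks.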

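Proposed Method

Introduce a Position Encoding Generator (PEG) conditioned on the local 2-D neighborhood of the input tokens.
Implement PEG as a 2-D convolution with kernel size k and appropriate padding, producing encodings E of shape B×H×W×C.
Integrate CPE into vision Transformers following the ViT/DeiT design to form CPVT, in CPVT-Ti, CPVT-S, and CPVT-B variants.
Explore CPVT-GAP, which replaces the class token with global average pooling (GAP) for translation-invariant classification.
Evaluate generalization to higher resolutions and compare against learnable absolute encodings and relative encodings.
Show that PEG adds little parameter and FLOP overhead while achieving higher accuracy.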
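
The PEG steps listed above (reshape the tokens to a 2-D grid, convolve, flatten back) can be sketched in plain Python for a single image. This is an illustrative sketch, not the authors' implementation, which is in the repository linked in the abstract. The function names `peg` and `global_average_pool` are hypothetical, and three details are assumptions that the summary does not state: the convolution is depth-wise (one k×k filter per channel), the padding is (k-1)/2 zeros on each side, and the encoding is added back onto the tokens.

```python
from typing import List


def peg(tokens: List[List[float]], h: int, w: int,
        kernels: List[List[List[float]]]) -> List[List[float]]:
    """Add a conditional positional encoding to N = h*w tokens.

    tokens:  N x C token features in row-major grid order (class token excluded).
    kernels: C x k x k filters, one per channel (assumed depth-wise, k odd).
    """
    c = len(tokens[0])
    assert len(tokens) == h * w, "token count must match the h x w grid"
    k = len(kernels[0])
    pad = (k - 1) // 2  # assumed zero padding that preserves the h x w size

    out = []
    for i in range(h):
        for j in range(w):
            encoded = []
            for ch in range(c):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        y, x = i + di - pad, j + dj - pad
                        # Positions outside the grid contribute zero.
                        if 0 <= y < h and 0 <= x < w:
                            acc += kernels[ch][di][dj] * tokens[y * w + x][ch]
                # Assumption: the encoding is added to the token it was computed for.
                encoded.append(tokens[i * w + j][ch] + acc)
            out.append(encoded)
    return out


def global_average_pool(tokens: List[List[float]]) -> List[float]:
    """Average all tokens per channel, as CPVT-GAP does in place of a class token."""
    n = len(tokens)
    return [sum(t[ch] for t in tokens) / n for ch in range(len(tokens[0]))]


# The same filter runs on a 2x2 grid and a 3x3 grid without any resizing step.
ones_kernel = [[[1.0] * 3 for _ in range(3)]]  # C = 1, k = 3
small = peg([[1.0]] * 4, 2, 2, ones_kernel)
large = peg([[1.0]] * 9, 3, 3, ones_kernel)
print(large)  # corners 5.0, edges 7.0, center 10.0
```

Because the filter depends only on k and the channel count, nothing in the sketch is tied to the token count, which is consistent with the claim that CPE extends to longer sequences. In the 3×3 example the identical input tokens receive different values at corners, edges, and center purely because of the zero padding, so under this assumption the borders give the encoding a positional reference.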

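Experimental Results

Research Questions

RQ1: Do conditional positional encodings, conditioned on local neighborhoods, improve vision Transformer performance over fixed or learnable absolute encodings?
RQ2: Can CPVT models generalize to longer input sequences while maintaining translation equivalence?
RQ3: How do CPVT and PEG perform across model sizes and resolutions, including with GAP versus a class token?
RQ4: What parameter and compute overhead does PEG add relative to standard positional encodings?
RQ5: Does CPVT improve pyramid-style Transformer architectures and downstream tasks such as segmentation and detection?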

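Key Findings

CPVT exceeds prior vision Transformers that use fixed or learnable absolute positional encodings in ImageNet top-1 accuracy.
PEG adds very few parameters (for example, only 1,728 for CPVT-Ti with k=3 and l=1), and its effect on FLOPs is negligible.
CPVT generalizes directly to higher input resolutions (for example, evaluating CPVT-Ti at 384×384 raises top-1 from 73.4% at 224×224 to 74.2%).
CPVT-GAP improves performance further and reaches the best results among the vision Transformers in the authors' experiments (for example, CPVT-Ti-GAP reaches 74.9% top-1).
Placing PEG in the early encoder blocks gives strong performance; the commonly used placement at positions 0–5 gives the best results.
With PEG, CPVT shows the benefits of translation equivalence and also improves pyramid architectures such as PVT and Swin.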
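
The 1,728-parameter figure quoted above can be reproduced with a short calculation. The sketch below assumes that l is the number of convolution layers in PEG, that the convolution is depth-wise, that bias terms are not counted, and that the Ti variant has an embedding width of d = 192 as in DeiT-Tiny; none of these is stated in the summary. `peg_param_count` is a hypothetical helper name.

```python
def peg_param_count(d: int, k: int, l: int = 1, depthwise: bool = True) -> int:
    """Weights in l PEG layers of a k x k convolution over d channels (bias omitted)."""
    # Depth-wise: one k x k filter per channel. Dense: d filters spanning all d channels.
    per_layer = d * k * k if depthwise else d * d * k * k
    return l * per_layer


print(peg_param_count(d=192, k=3, l=1))                   # 1728
print(peg_param_count(d=192, k=3, l=1, depthwise=False))  # 331776
```

Under these assumptions the depth-wise count matches the reported 1,728, while a dense convolution of the same size would need 192 times as many weights.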


This review was generated by AI and checked by human editors.