QUICK REVIEW

[Paper Review] Refiner: Refining Self-attention for Vision Transformers

Daquan Zhou, Yujun Shi|arXiv (Cornell University)|Jun 7, 2021

Advanced Neural Network Applications | References: 55 | Cited by: 41

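ONE-SENTENCE SUMMARY

Refiner refines the self-attention maps of Vision Transformers directly, using attention expansion and distributed local attention, which improves data efficiency and reaches state-of-the-art accuracy with fewer than 100M parameters.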

ABSTRACT

Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs. Yet, they generally require much more data for model pre-training. Most recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion that projects the multi-head attention maps to a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated with self-attention. Extensive experiments demonstrate that refiner works surprisingly well. Significantly, it enables ViTs to achieve 86% top-1 classification accuracy on ImageNet with only 81M parameters.
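
The equivalence claimed in the abstract (convolving an attention map amounts to local aggregation of features followed by global aggregation) can be sketched with a short derivation. The notation here is an assumption for illustration and is not taken from the paper: A is one head's N x N attention map, v_j is the value vector of token j, w_{a,b} is a learnable k x k kernel with offsets a, b in {-r, ..., r}, and boundary effects are ignored.

```latex
% Assumed notation: A = attention map, v_j = value of token j,
% w_{a,b} = learnable kernel, o_i = output for query token i.
\begin{align}
\tilde{A}_{i,j} &= \sum_{a=-r}^{r} \sum_{b=-r}^{r} w_{a,b}\, A_{i+a,\,j+b} \\
o_i &= \sum_{j} \tilde{A}_{i,j}\, v_j
     = \sum_{a} \sum_{b} w_{a,b} \sum_{j} A_{i+a,\,j+b}\, v_j \\
    &= \sum_{a} \sum_{j'} A_{i+a,\,j'}
       \Big( \sum_{b} w_{a,b}\, v_{j'-b} \Big),
       \qquad j' = j + b
\end{align}
```

The inner sum over b is a kernel-weighted aggregation of neighboring value vectors (the local step), and the outer sums aggregate those locally mixed values over all tokens using attention weights (the global step).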

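RESEARCH MOTIVATION AND GOALS

Close the data-efficiency gap of Vision Transformers by refining the self-attention mechanism itself, rather than only changing the architecture or training tricks.
Increase the diversity of attention maps and introduce local patterns to counter over-smoothing in deep ViTs.
Propose a simple, drop-in module (refiner) that can replace the vanilla self-attention in a ViT block.
Demonstrate improvements on ImageNet and show generalization to NLP (GLUE) tasks.
Provide insight into how global attention and local context interact during token aggregation.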

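PROPOSED METHOD

Introduce attention expansion: project the multi-head attention maps into a higher-dimensional space, which effectively increases the number of attention maps without reducing the embedding dimension.
Use a linear projection W_A to expand A into H' attention maps, where H' > H; the expanded maps are used for aggregation and are later reduced back to H with a 1x1 projection.
Apply a head-wise spatial convolution to the expanded attention maps to strengthen local patterns, yielding a distributed local attention (DLA) mechanism.
Show that DLA combines global context modeling with local pattern enrichment, which mitigates over-smoothing and makes tokens more distinguishable.
Replace the vanilla self-attention block with the Refiner module to obtain Refined-ViT, a direct enhancement of the ViT block.
Show that reducing the number of attention maps again after DLA keeps computational cost under control while preserving accuracy.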
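
The steps listed above (expand, convolve per head, reduce) can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mix_heads`, `conv2d_same`, `refine`) and parameter names (`w_expand`, `kernels`, `w_reduce`) are hypothetical, the refinement is assumed to act on post-softmax maps without renormalizing rows, and the convolution is assumed to be a zero-padded 3x3 cross-correlation.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_maps(q, k):
    """Per-head softmax(Q K^T / sqrt(d)); q and k are [H][N][d] nested lists."""
    maps = []
    for qh, kh in zip(q, k):
        d = len(qh[0])
        maps.append([
            softmax([sum(x * y for x, y in zip(qi, kj)) / math.sqrt(d) for kj in kh])
            for qi in qh
        ])
    return maps  # shape [H][N][N]


def mix_heads(maps, weight):
    """Linear map across the head axis: out[o] = sum_h weight[o][h] * maps[h]."""
    n = len(maps[0])
    return [
        [[sum(w * m[i][j] for w, m in zip(row_w, maps)) for j in range(n)]
         for i in range(n)]
        for row_w in weight
    ]


def conv2d_same(amap, kernel):
    """Zero-padded k x k cross-correlation over one N x N attention map."""
    n, r = len(amap), len(kernel) // 2
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for a in range(-r, r + 1):
                for b in range(-r, r + 1):
                    if 0 <= i + a < n and 0 <= j + b < n:
                        acc += kernel[a + r][b + r] * amap[i + a][j + b]
            out[i][j] = acc
    return out


def refine(maps, w_expand, kernels, w_reduce):
    """Expand H -> H' maps, convolve each with its own kernel, reduce H' -> H."""
    expanded = mix_heads(maps, w_expand)  # [H'][N][N], assumed role of W_A
    local = [conv2d_same(m, kern) for m, kern in zip(expanded, kernels)]
    return mix_heads(local, w_reduce)     # [H][N][N], assumed 1x1 reduction


def rand_tensor(*shape):
    """Nested lists of Gaussian noise, standing in for learned parameters."""
    if len(shape) == 1:
        return [random.gauss(0, 1) for _ in range(shape[0])]
    return [rand_tensor(*shape[1:]) for _ in range(shape[0])]


random.seed(0)
H, H_EXP, N, D = 2, 4, 5, 3  # toy sizes: 2 heads expanded to 4, 5 tokens
maps = attention_maps(rand_tensor(H, N, D), rand_tensor(H, N, D))
refined = refine(maps, rand_tensor(H_EXP, H), rand_tensor(H_EXP, 3, 3),
                 rand_tensor(H, H_EXP))
print(len(refined), len(refined[0]), len(refined[0][0]))  # 2 5 5
```

The refined maps would then multiply the value vectors as in ordinary self-attention. With identity head-mixing weights and a kernel that is 1 at the center and 0 elsewhere, `refine` returns the input maps unchanged, which is a convenient sanity check.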

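EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Does refining the self-attention maps through expansion and local-pattern enhancement improve the data efficiency and accuracy of ViTs?
RQ2: Across different ViT architectures, does distributed local attention yield larger gains than standard self-attention?
RQ3: How does expansion followed by reduction affect model performance and convergence speed?
RQ4: Do Refiner's gains transfer to NLP Transformers (such as BERT) and to other vision-language or NLP benchmarks?

KEY FINDINGS

Under the same training recipe, Refiner raises the ImageNet top-1 accuracy of ViT-Base by 1.7% with negligible memory overhead.
Attention expansion alone raises top-1 from 82.3% to 83.0% as the expansion ratio increases from 1 to 6, and it also converges faster.
Distributed Local Attention (DLA) consistently improves top-1 by 1.2% to 1.7% across ViT variants, with only a small increase in model size.
Refined-ViT-S reaches 83.6% top-1 on ImageNet (25M parameters), 3.7% higher than DeiT-S under the same settings.
Refined-ViT-M reaches 85.6% top-1 at 384 input resolution (384-dim, 55M parameters), 0.2% above CaiT-S36 with less computation; Refined-ViT-448 reaches 86% with fewer than 100M parameters, a new SOTA among such models.
Applying RFC (receptive field calibration) to several SOTA models further raises ImageNet top-1 by about 0.11% without fine-tuning, and Refiner's gains also extend to NLP (GLUE) tasks, where the average score is about 1% above a strong baseline.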


This review was generated by AI and reviewed by a human editor.