QUICK REVIEW

[Paper Review] Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet

Luke Melas-Kyriazi|arXiv (Cornell University)|May 6, 2021

Advanced Neural Network Applications | References: 11 | Cited by: 80

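One-Sentence Summary

The paper replaces the attention layer in a Vision Transformer with a feed-forward layer applied over the patch dimension and finds that a feed-forward-only model still achieves strong ImageNet top-1 accuracy, suggesting that attention may not be necessary for competitive performance.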

ABSTRACT

The strong performance of vision transformers on image classification and other vision tasks is often attributed to the design of their multi-head attention layers. However, the extent to which attention is responsible for this strong performance remains unclear. In this short report, we ask: is the attention layer even necessary? Specifically, we replace the attention layer in a vision transformer with a feed-forward layer applied over the patch dimension. The resulting architecture is simply a series of feed-forward layers applied over the patch and feature dimensions in an alternating fashion. In experiments on ImageNet, this architecture performs surprisingly well: a ViT/DeiT-base-sized model obtains 74.9% top-1 accuracy, compared to 77.9% and 79.9% for ViT and DeiT respectively. These results indicate that aspects of vision transformers other than attention, such as the patch embedding, may be more responsible for their strong performance than previously thought. We hope these results prompt the community to spend more time trying to understand why our current models are as effective as they are.

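Motivation and Objectives

Investigate whether attention is essential to Vision Transformer performance on ImageNet.
Evaluate how a feed-forward-only architecture compares with the attention-based ViT/DeiT.
Understand which components contribute most to the strong performance of vision transformers.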

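Proposed Method

Replace the attention layer in ViT with a feed-forward layer applied over the patch dimension.
For a fair comparison, keep the rest of the architecture and the training setup identical to the ViT/DeiT baselines.
Train the tiny, base, and large ViT/DeiT configurations at 224px resolution.
Compare each FF-only network with its attention-based counterpart at the same model size.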
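
The replacement described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the helper names (`init_mlp`, `mlp`, `ff_only_block`, `num_patches`), the pre-norm residual layout, the tanh approximation of GELU, the 4x hidden expansion, the toy feature width of 32, and the omission of a class token are all assumptions made for illustration. The point it shows is that transposing the patch and feature axes lets an ordinary feed-forward layer mix information across patches, which is the role attention plays in a standard ViT block.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (an assumption; any nonlinearity would do here).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # Normalize over the last axis; learned scale/shift omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def init_mlp(rng, dim, hidden):
    # Two-layer MLP parameters mapping dim -> hidden -> dim.
    return {
        "w1": rng.normal(0.0, 0.02, size=(dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.02, size=(hidden, dim)),
        "b2": np.zeros(dim),
    }

def mlp(params, x):
    # Applies the MLP along the last axis of x.
    return gelu(x @ params["w1"] + params["b1"]) @ params["w2"] + params["b2"]

def ff_only_block(x, feature_mlp, patch_mlp):
    # x has shape (batch, n_patches, dim).
    x = x + mlp(feature_mlp, layer_norm(x))   # feed-forward over the feature dimension
    x = np.swapaxes(x, -2, -1)                # -> (batch, dim, n_patches)
    x = x + mlp(patch_mlp, layer_norm(x))     # feed-forward over the patch dimension
    return np.swapaxes(x, -2, -1)             # -> (batch, n_patches, dim)

def num_patches(image_size, patch_size):
    # Number of non-overlapping square patches in a square image.
    return (image_size // patch_size) ** 2

rng = np.random.default_rng(0)
n = num_patches(224, 16)   # 14 * 14 patches at 224px with P=16
dim = 32                   # toy feature width, far smaller than real ViT sizes
x = rng.normal(size=(2, n, dim))
feature_mlp = init_mlp(rng, dim, 4 * dim)
patch_mlp = init_mlp(rng, n, 4 * n)
y = ff_only_block(x, feature_mlp, patch_mlp)
print(y.shape)
```

Because the patch MLP has a weight matrix sized by the number of patches, a block built this way is tied to a fixed input resolution, unlike attention, which handles variable sequence lengths.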

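Experimental Results

Research Questions

RQ1: How does removing attention and using feed-forward layers over patches affect ImageNet top-1 accuracy?
RQ2: Which components (patch embedding, training augmentation) drive the strong performance of vision transformers?
RQ3: Can a feed-forward-only architecture achieve competitive results at standard ViT/DeiT sizes?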

Key Findings

Model	Params	ImageNet Top-1 (%)
Tiny (P=16) ViT	-	-
Tiny (P=16) DeiT	5.7M	72.2
Tiny (P=16) FF Only	7.7M	61.4
Base (P=16) ViT	86M	77.9
Base (P=16) DeiT	86M	79.9
Base (P=16) FF Only	62M	74.9
Large (P=32) ViT	306M	71.2
Large (P=32) DeiT	-	-
Large (P=32) FF Only	206M	71.4

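FF-only models achieve strong accuracy: the base-sized FF-only model reaches 74.9% top-1 on ImageNet.
Attention-free models generally trail the attention-based models, yet remain surprisingly strong across model sizes.
The base-sized FF-only model is markedly more accurate than the tiny one (74.9% vs. 61.4%) but trails the attention-based ViT and DeiT (77.9% and 79.9%).
In the setting studied, the large FF-only model (71.4%) performs worse than the base-sized models.
An attention-only model (tested at the tiny size) performs poorly in this setting, highlighting the limited benefit of attention without the feed-forward component.
The training setup and the patch embedding, not attention alone, contribute to the observed performance.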
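
To make the comparison concrete, the following short Python sketch computes the accuracy gap between each attention-based baseline and the FF-only model of the same size. The numbers are copied from the table above; the dictionary layout and the helper name `gaps_to_ff_only` are illustrative choices, and `None` marks entries for which the table gives no value.

```python
# Top-1 accuracies (%) copied from the results table; None = no value in the table.
results = {
    "Tiny":  {"ViT": None, "DeiT": 72.2, "FF Only": 61.4},
    "Base":  {"ViT": 77.9, "DeiT": 79.9, "FF Only": 74.9},
    "Large": {"ViT": 71.2, "DeiT": None, "FF Only": 71.4},
}

def gaps_to_ff_only(row):
    """Return baseline-minus-FF-only differences in percentage points."""
    ff = row["FF Only"]
    return {
        name: round(acc - ff, 1)
        for name, acc in row.items()
        if name != "FF Only" and acc is not None
    }

gaps = {size: gaps_to_ff_only(row) for size, row in results.items()}
for size, gap in gaps.items():
    print(size, gap)
```

A positive value means the attention-based model is more accurate. In this table the gap is widest at the tiny size and narrows at the base size, while the large FF-only model is marginally ahead of the large ViT.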

This review was generated by AI and reviewed by a human editor.