QUICK REVIEW

[Paper Digest] Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth

Yihe Dong, Jean-Baptiste Cordonnier|arXiv (Cornell University)|Mar 5, 2021

Neural Networks and Applications | References: 36 | Cited by: 71

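One-Sentence Summary

Pure self-attention networks (SANs) converge doubly exponentially with depth to a rank-1 output unless skip connections or MLPs counteract the collapse; the authors introduce a path decomposition to analyze SANs and validate the findings on standard Transformer architectures.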

ABSTRACT

Attention-based architectures have become ubiquitous in machine learning, yet our understanding of the reasons for their effectiveness remains limited. This work proposes a new way to understand self-attention networks: we show that their output can be decomposed into a sum of smaller terms, each involving the operation of a sequence of attention heads across layers. Using this decomposition, we prove that self-attention possesses a strong inductive bias towards "token uniformity". Specifically, without skip connections or multi-layer perceptrons (MLPs), the output converges doubly exponentially to a rank-1 matrix. On the other hand, skip connections and MLPs stop the output from degeneration. Our experiments verify the identified convergence phenomena on different variants of standard transformer architectures.

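Motivation and Goals

Explain why attention-based architectures are effective, going beyond observations of their empirical performance.
Characterize the rank-convergence behavior of deep self-attention networks.
Identify the architectural mechanisms in Transformers that mitigate rank convergence.
Provide a path-based decomposition for analyzing SANs and validate it experimentally on widely used models.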

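Proposed Method

Decompose the SAN output into a sum over paths, where each path is a sequence of attention heads, one per layer it passes through.
Prove that a pure SAN converges to a rank-1 output whose rows are all identical, and quantify the rate: doubly exponential in depth (a cubic convergence rate, shown first for single-head paths).
Extend the analysis to architectural variants (skip connections, MLPs, layer normalization) and re-derive the convergence bounds to study the forces that counteract collapse.
Use the path decomposition to model a SAN as an ensemble of shallow networks and analyze their rank behavior.
Empirically verify the rank-convergence phenomenon on BERT, ALBERT, and XLNet, and visualize the effect of paths.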
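The path decomposition described above can be checked numerically on a toy model. The following is a minimal sketch, not the authors' code: it assumes a bias-free pure SAN with 2 layers and 2 heads, merges the query/key weights into one matrix `w_qk` and the value/output weights into one matrix `w_v` per head, and uses random weights. All names and sizes are illustrative assumptions. The attention matrices are recorded during a normal forward pass (they depend on the input), and the output is then rebuilt as a sum with one term per path.

```python
import itertools
import numpy as np

def softmax_rows(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def attention_matrix(x, w_qk):
    # Row-stochastic attention matrix for one head (hypothetical merged W_QK).
    d = x.shape[1]
    return softmax_rows(x @ w_qk @ x.T / np.sqrt(d))

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads, n_layers = 5, 8, 2, 2  # toy sizes (assumption)
x0 = rng.standard_normal((n_tokens, d_model))
w_qk = rng.standard_normal((n_layers, n_heads, d_model, d_model)) / np.sqrt(d_model)
w_v = rng.standard_normal((n_layers, n_heads, d_model, d_model)) / np.sqrt(d_model)

# Ordinary forward pass of the pure SAN, recording each head's attention matrix.
x = x0
attn = []
for l in range(n_layers):
    p_l = [attention_matrix(x, w_qk[l, h]) for h in range(n_heads)]
    attn.append(p_l)
    x = sum(p_l[h] @ x @ w_v[l, h] for h in range(n_heads))
san_output = x

# Rebuild the output as a sum over paths: one head choice per layer.
paths = list(itertools.product(range(n_heads), repeat=n_layers))
path_sum = np.zeros_like(x0)
for path in paths:
    term = x0
    for l, h in enumerate(path):
        term = attn[l][h] @ term @ w_v[l, h]
    path_sum += term

max_gap = float(np.abs(san_output - path_sum).max())
print(f"{len(paths)} paths, max abs difference = {max_gap:.2e}")
```

Each term in the sum is a product of attention matrices applied to the input, which is what lets the analysis treat every path as a single-head network.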

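Experimental Results

Research Questions

RQ1: Does pure self-attention lead to rank convergence as depth increases?
RQ2: How do skip connections and/or MLP blocks affect rank convergence in Transformer architectures?
RQ3: What role does path length play in contributing to the network's expressive power?
RQ4: Can the path-based decomposition explain the observed inductive bias of SANs?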
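To make the notion of path length in RQ3 concrete: once skip connections are present, a path can either pass through one of the heads at a layer or skip that layer, so a network with L layers and H heads has (H + 1)^L paths, of which C(L, k) * H^k have length k. The sketch below only counts paths under that assumption; the 12-layer, 12-head setting is chosen for illustration, and the counts say nothing by themselves about how much each path contributes.

```python
from math import comb

def path_counts(n_layers, n_heads):
    """Number of paths of each length k = 0..n_layers when every layer
    offers n_heads attention heads plus one skip connection."""
    return [comb(n_layers, k) * n_heads**k for k in range(n_layers + 1)]

counts = path_counts(n_layers=12, n_heads=12)  # illustrative configuration
total = sum(counts)

for k, c in enumerate(counts):
    print(f"length {k:2d}: {c:>16d} paths ({c / total:.2e} of all paths)")
```

Paths of length k behave like a depth-k pure SAN, which is why the decomposition reads a network with skip connections as an ensemble of shallower networks.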

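Key Findings

Without skip connections or MLPs, a SAN converges doubly exponentially with depth (a cubic rate along each path) to a rank-1 output whose rows are all identical.
Skip connections diversify the set of paths and substantially mitigate rank convergence, preserving a non-trivial residual.
MLPs slow the convergence to rank-1 by increasing the Lipschitz constant, which sets up a tug-of-war with self-attention.
Layer normalization does not mitigate rank convergence.
Experiments on BERT, ALBERT, and XLNet confirm rank convergence for pure SANs and show the mitigating effect of skip connections; the path-length analysis indicates that short paths carry most of the expressive power.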
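The contrast between a pure SAN and one with skip connections can be illustrated on random weights. This is a toy sketch under several assumptions that are not taken from the paper's experiments: a single head, merged query/key weights `w_qk`, an orthogonal value matrix `w_v` (chosen only to keep scales stable), and the Frobenius-norm distance to the mean row as a simple proxy for the residual res(X) = X - 1x^T.

```python
import numpy as np

def softmax_rows(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def relative_residual(x):
    # ||X - 1 x_bar^T||_F / ||X||_F, where x_bar is the mean row of X.
    # A value near 0 means all rows are (almost) identical, i.e. rank 1.
    res = x - x.mean(axis=0, keepdims=True)
    return float(np.linalg.norm(res) / np.linalg.norm(x))

def run(x, layers, skip):
    # Apply single-head attention layers, optionally with a skip connection,
    # and record the relative residual after every layer.
    d = x.shape[1]
    history = [relative_residual(x)]
    for w_qk, w_v in layers:
        p = softmax_rows(x @ w_qk @ x.T / np.sqrt(d))
        update = p @ x @ w_v
        x = x + update if skip else update
        history.append(relative_residual(x))
    return history

rng = np.random.default_rng(0)
n_tokens, d_model, depth = 8, 16, 12  # toy sizes (assumption)
x0 = rng.standard_normal((n_tokens, d_model))
layers = []
for _ in range(depth):
    w_qk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    # Orthogonal value matrix from a QR factorization (illustrative choice).
    w_v, _ = np.linalg.qr(rng.standard_normal((d_model, d_model)))
    layers.append((w_qk, w_v))

pure = run(x0, layers, skip=False)
with_skip = run(x0, layers, skip=True)

for l, (a, b) in enumerate(zip(pure, with_skip)):
    print(f"layer {l:2d}: pure SAN = {a:.3e}   with skip = {b:.3e}")
```

In this setup the pure SAN's relative residual should fall toward zero within a few layers, while the skip-connection variant keeps a residual of much larger magnitude. Random weights only illustrate the mechanism; they do not reproduce the paper's measurements on BERT, ALBERT, or XLNet.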

This digest was generated by AI and reviewed by human editors.