QUICK REVIEW

[Paper Review] Representational Strengths and Limitations of Transformers

Clayton Sanford, Daniel Hsu|arXiv (Cornell University)|Jun 5, 2023

Stochastic Gradient Optimization Techniques | Cited by 12

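One-Sentence Summary

The paper analyzes both the strengths and the limitations of the representational power of self-attention in transformers. It shows that the sparse averaging task requires the embedding dimension m to grow with the sparsity q, and that a triple detection task (Match3) remains hard for standard multi-head attention unless higher-order or structured variants are used.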

ABSTRACT

Attention layers, as commonly used in transformers, form the backbone of modern deep learning, yet there is no mathematical description of their benefits and deficiencies as compared with other architectures. In this work we establish both positive and negative results on the representation power of attention layers, with a focus on intrinsic complexity parameters such as width, depth, and embedding dimension. On the positive side, we present a sparse averaging task, where recurrent networks and feedforward networks all have complexity scaling polynomially in the input size, whereas transformers scale merely logarithmically in the input size; furthermore, we use the same construction to show the necessity and role of a large embedding dimension in a transformer. On the negative side, we present a triple detection task, where attention layers in turn have complexity scaling linearly in the input size; as this scenario seems rare in practice, we also present natural variants that can be efficiently solved by attention layers. The proof techniques emphasize the value of communication complexity in the analysis of transformers and related models, and the role of sparse averaging as a prototypical attention task, which even finds use in the analysis of triple detection.

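Motivation and Objectives

Study the representational power of attention layers as a function of width, depth, and embedding dimension.
Identify tasks that expose the strengths of self-attention (sparse averaging) and its limitations (pairwise versus triple-wise interactions).
Develop formal benchmark tasks (q-SA, Match2, Match3) that characterize the expressive power of transformers.
Derive upper and lower bounds for attention models using communication complexity and geometric constructions.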

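Proposed Approach

Define q-sparse averaging (q-SA) to capture interaction patterns between input elements and to analyze the required embedding dimension.
Prove upper bounds showing that an attention unit with embedding dimension m ≳ q can approximate q-SA (under both finite and infinite precision).
Establish lower bounds via a reduction from set disjointness, showing that no small attention architecture can approximate q-SA when mp is too small.
Contrast pairwise (Match2) and triple-wise (Match3) detection tasks to assess how well standard self-attention represents higher-order interactions.
Prove that a single self-attention unit can compute Match2 efficiently, whereas a single layer of multi-head attention cannot compute Match3 efficiently unless the embedding dimension or the number of heads grows polynomially.
Discuss third-order attention as a way to compute Match3 efficiently, and conjecture that deeper transformers remain limited when no hints are provided.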
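
To make the q-SA task concrete, here is a minimal Python sketch. It assumes the task takes, for each position i, a vector z[i] and a set y[i] of q indices, and asks for the mean of z[j] over j in y[i]; this summary does not spell out that form, so treat it as an assumption. The attention unit shown uses a one-hot embedding of dimension N, which is far larger than the m ≳ q regime the paper proves; it only illustrates why softmax attention suits sparse averaging. The names `one_hot_attention` and `beta` are hypothetical.

```python
import math
import random

def q_sparse_average(z, y):
    """Reference q-SA: output i is the mean of z[j] over the indices j in y[i]."""
    d = len(z[0])
    out = []
    for idx_set in y:
        q = len(idx_set)
        out.append([sum(z[j][k] for j in idx_set) / q for k in range(d)])
    return out

def one_hot_attention(z, y, beta=50.0):
    """One softmax attention unit with an illustrative N-dimensional embedding.

    Query i is beta times the indicator vector of y[i]; key j is the one-hot
    vector e_j. The score <query_i, key_j> is therefore beta if j is in y[i]
    and 0 otherwise. beta is a hypothetical inverse temperature.
    """
    n, d = len(z), len(z[0])
    out = []
    for idx_set in y:
        scores = [beta if j in idx_set else 0.0 for j in range(n)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # stable softmax numerators
        total = sum(weights)
        out.append([sum(weights[j] * z[j][k] for j in range(n)) / total
                    for k in range(d)])
    return out

def max_abs_error(a, b):
    """Largest coordinate-wise gap between two lists of vectors."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))

random.seed(0)
N, q, d_prime = 16, 3, 4
z = [[random.uniform(-1, 1) for _ in range(d_prime)] for _ in range(N)]
y = [set(random.sample(range(N), q)) for _ in range(N)]
err = max_abs_error(q_sparse_average(z, y), one_hot_attention(z, y))
```

With a large `beta`, the softmax weights concentrate almost uniformly on the q selected positions, so `err` is negligible. The paper's contribution is showing that a much smaller embedding than this one-hot choice suffices.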

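Results

Research Questions

RQ1: Can a self-attention unit with bounded embedding dimension approximate q-sparse averaging, and how must the embedding dimension m scale with q and N?
RQ2: Can standard transformer architectures efficiently represent pairwise interactions (Match2) and triple-wise interactions (Match3), and what resources do they need?
RQ3: What are the limits of multi-head attention for triple detection, and can higher-order attention circumvent them?
RQ4: How do lower bounds from communication complexity expose the representational limits of transformers?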
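
For readers who want the detection tasks behind RQ2 and RQ3 in executable form, the sketch below gives brute-force reference versions. It assumes the modular-sum form of the tasks: each input is an integer, and position i outputs 1 if some pair (Match2) or triple (Match3) that includes x[i] sums to 0 modulo M. That form, and the choice to let an index pair with itself, are assumptions here, since this summary does not state the definitions.

```python
def match2(x, M):
    """Assumed Match2: output i is 1 iff some j has x[i] + x[j] == 0 (mod M)."""
    return [int(any((xi + xj) % M == 0 for xj in x)) for xi in x]

def match3(x, M):
    """Assumed Match3: output i is 1 iff some j1, j2 have
    x[i] + x[j1] + x[j2] == 0 (mod M)."""
    return [
        int(any((xi + a + b) % M == 0 for a in x for b in x))
        for xi in x
    ]

example = [1, 9, 4, 5]
pairs = match2(example, 10)    # [1, 1, 0, 1]
triples = match3(example, 10)  # [1, 0, 1, 1]
```

The brute-force Match2 inspects N candidates per position, which matches the pairwise scores a single attention unit already computes. Match3 inspects N² candidate pairs per position, which hints at why a layer built from pairwise scores struggles with it.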

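Key Findings

A self-attention unit can ε-approximate q-SA with embedding dimension m ≳ d' + q log N under finite precision, and with m ≳ d' + q under infinite precision.
Any fully connected neural network that fits q-SA needs a first hidden layer of width Ω(Nd).
Any RNN that approximates q-SA needs a hidden state of Ω(N) bits.
A single self-attention unit can compute Match2 efficiently, whereas a single layer of multi-head attention cannot compute Match3 efficiently unless mp, m, or the number of heads H is large (poly(N)).
Under locality or embedding-structure assumptions, transformers can efficiently compute modified variants of Match3, and a generalized third-order attention can compute Match3 itself efficiently.
Empirical evidence (Appendix D) suggests that attention can learn q-SA from far fewer samples than RNNs or MLPs.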
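
The Match2 finding can be illustrated with a small sketch of one softmax attention unit. It is an illustrative construction under stated assumptions, not the paper's proof: it assumes Match2 asks whether x[i] + x[j] ≡ 0 (mod M) for some j, with M ≥ 2 and integer inputs, and it adds a hypothetical sentinel token whose score acts as a threshold. Queries and keys use a cosine/sine embedding plus one extra coordinate for the sentinel, so the embedding dimension is 3.

```python
import math
import random

def match2_brute_force(x, M):
    """Assumed Match2: output i is 1 iff some j has x[i] + x[j] == 0 (mod M)."""
    return [int(any((xi + xj) % M == 0 for xj in x)) for xi in x]

def match2_attention(x, M):
    """Single softmax attention unit for the assumed Match2 (requires M >= 2).

    With t = 2*pi*x/M, query_i = beta*(cos t_i, sin t_i) and
    key_j = (cos t_j, -sin t_j), so score_ij = beta*cos(2*pi*(x_i + x_j)/M).
    The score equals beta exactly when x_i + x_j == 0 (mod M) and is at most
    beta*cos(2*pi/M) otherwise. A sentinel token with value 0 gets a score
    halfway between those two levels; real tokens carry value 1.
    """
    n = len(x)
    gap = (1.0 - math.cos(2.0 * math.pi / M)) / 2.0
    beta = 2.0 * math.log(4.0 * n) / gap  # sharp enough to separate the cases
    sentinel_score = beta * (1.0 - gap)
    angles = [2.0 * math.pi * xi / M for xi in x]
    keys = [(math.cos(t), -math.sin(t)) for t in angles]
    out = []
    for t in angles:
        q0, q1 = beta * math.cos(t), beta * math.sin(t)
        scores = [q0 * k0 + q1 * k1 for k0, k1 in keys]
        top = max(max(scores), sentinel_score)
        real_mass = sum(math.exp(s - top) for s in scores)
        sentinel_mass = math.exp(sentinel_score - top)
        # Attention output = softmax mass on real tokens (value 1 each).
        attn_out = real_mass / (real_mass + sentinel_mass)
        out.append(int(attn_out > 0.5))
    return out

print(match2_attention([1, 9, 4, 5], 10))  # [1, 1, 0, 1]
```

When a match exists, the matching key outweighs the sentinel and the output is close to 1; otherwise the sentinel dominates and the output is close to 0. No analogous pairwise score separates triples, which is consistent with the Match3 lower bound reported above.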

This review was generated by AI and checked by human editors.