QUICK REVIEW

[Paper review] Are Transformers universal approximators of sequence-to-sequence functions?

Chulhee Yun, Srinadh Bhojanapalli|arXiv (Cornell University)|Dec 20, 2019

Neural Networks and Applications|References: 25|Cited by: 74

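One-sentence summary

The paper proves that Transformers are universal approximators of continuous, permutation-equivariant sequence-to-sequence functions with compact support and that, with trainable positional encodings, they can approximate any continuous sequence-to-sequence function on a compact domain. It also clarifies the distinct roles of the self-attention and feed-forward layers and explores simpler architectures for contextual mapping.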

ABSTRACT

Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other simpler alternatives to self-attention layers and empirically evaluate them.

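Motivation and objectives

Give a formal account of the expressive power of Transformer networks for sequence-to-sequence mappings.
Characterize the class of functions that Transformers can universally approximate under the permutation-equivariance constraint.
Prove that positional encodings remove the permutation-equivariance restriction, extending universality to arbitrary continuous sequence-to-sequence functions on a compact domain.
Formalize contextual mappings and prove that self-attention can implement them.
Propose alternative architectures that implement contextual mappings and evaluate them empirically.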

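Proposed approach

Define the function class F_PE: continuous, permutation-equivariant sequence-to-sequence functions with compact support.
Prove Theorem 2: Transformers of fixed width (h=2 attention heads, head size m=1, feed-forward hidden width r=4) can universally approximate any f ∈ F_PE.
Introduce trainable positional encodings and prove Theorem 3: Transformers with positional encodings can universally approximate any f ∈ F_CD (continuous functions on a compact domain).
Formalize contextual mappings and prove that self-attention layers can implement them (Lemma 6).
Give a three-step proof sketch of universality: (i) approximate the continuous target function by a piecewise-constant function, (ii) approximate that function with a modified Transformer, (iii) approximate the modified Transformer with the standard architecture.
Discuss the distinct roles that self-attention (contextual mapping) and the feed-forward layers (value mapping) play in the universal-approximation argument.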
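The permutation equivariance behind F_PE, and the way positional encodings break it, can be checked numerically. The following is a minimal NumPy sketch, not the authors' code. It assumes tokens are stored as the columns of a d×n matrix, uses a residual attention layer followed by a residual ReLU feed-forward layer with no layer normalization, and draws random weights; the names `init_params`, `transformer_block`, and `with_pe` are hypothetical.

```python
import numpy as np

def softmax_cols(a):
    """Column-wise softmax, shifted for numerical stability."""
    a = a - a.max(axis=0, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=0, keepdims=True)

def init_params(d, h=2, m=1, r=4, seed=0):
    """Random weights for one block: h heads of size m, hidden width r."""
    rng = np.random.default_rng(seed)
    return {
        "W_Q": rng.normal(size=(h, m, d)),
        "W_K": rng.normal(size=(h, m, d)),
        "W_V": rng.normal(size=(h, m, d)),
        "W_O": rng.normal(size=(h, d, m)),
        "W_1": rng.normal(size=(r, d)),
        "b_1": rng.normal(size=(r, 1)),
        "W_2": rng.normal(size=(d, r)),
        "b_2": rng.normal(size=(d, 1)),
    }

def transformer_block(X, p):
    """Apply one block to X of shape (d, n); each column is a token."""
    attn = X.copy()
    for W_Q, W_K, W_V, W_O in zip(p["W_Q"], p["W_K"], p["W_V"], p["W_O"]):
        scores = (W_K @ X).T @ (W_Q @ X)  # (n, n); column j scores query token j
        attn = attn + W_O @ (W_V @ X) @ softmax_cols(scores)
    hidden = np.maximum(p["W_1"] @ attn + p["b_1"], 0.0)  # token-wise ReLU layer
    return attn + p["W_2"] @ hidden + p["b_2"]

rng = np.random.default_rng(1)
d, n = 3, 5
params = init_params(d)
X = rng.normal(size=(d, n))
perm = np.array([2, 0, 4, 1, 3])  # a non-trivial reordering of the tokens

# Without positional encodings: permuting the input permutes the output.
equivariant = np.allclose(transformer_block(X[:, perm], params),
                          transformer_block(X, params)[:, perm])

# With an additive positional encoding E (random here, trainable in the paper),
# the same check is expected to fail for a generic E.
E = rng.normal(size=(d, n))
with_pe = lambda Z: transformer_block(Z + E, params)
still_equivariant = np.allclose(with_pe(X[:, perm]), with_pe(X)[:, perm])

print("equivariant without PE:", equivariant)
print("equivariant with PE:   ", still_equivariant)
```

The first check holds for any choice of weights because every operation either acts on each token separately or sums over all tokens symmetrically. The second check illustrates why Theorem 3 needs positional encodings to reach functions outside F_PE.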

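Experimental results

Research questions

RQ1: Given that parameters are shared across tokens, which classes of sequence-to-sequence functions can Transformer networks represent?
RQ2: Can Transformers universally approximate continuous permutation-equivariant sequence-to-sequence functions, and can positional encodings extend this ability to arbitrary continuous sequence-to-sequence functions on a compact domain?
RQ3: What role do contextual mappings play in achieving universal approximation, and are there alternative architectures that can implement them?
RQ4: How do the self-attention and feed-forward components jointly contribute to approximation power, and can simpler layers replace self-attention without losing universality?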
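For reference, the notion of a contextual mapping that RQ3 relies on can be written out as follows. This is a paraphrase of the paper's definition as we understand it, with tokens stored as the columns of a d×n matrix; the symbols are assumptions of this sketch and should be checked against the paper.

```latex
Let $\mathbb{L} \subset \mathbb{R}^{d \times n}$ be a finite set of input sequences.
A map $q : \mathbb{L} \to \mathbb{R}^{1 \times n}$ is a \emph{contextual mapping} if
\begin{enumerate}
  \item for every $L \in \mathbb{L}$, the $n$ entries of $q(L)$ are pairwise distinct, and
  \item for every $L, L' \in \mathbb{L}$ with $L \neq L'$, every entry of $q(L)$
        differs from every entry of $q(L')$.
\end{enumerate}
```

Informally, q assigns each token an identifier that is unique to both the token and the sequence it appears in, so a token-wise feed-forward layer can then map each identifier to the desired output value.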

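Key findings

Transformer blocks are permutation equivariant and share parameters across tokens, yet they can approximate any continuous permutation-equivariant sequence-to-sequence function with compact support (Theorem 2).
With trainable positional encodings, Transformers can universally approximate any continuous sequence-to-sequence function on a compact domain (Theorem 3).
Self-attention layers can implement contextual mappings, so each token's output can depend on the entire input context (Lemma 6 and the related discussion).
Feed-forward layers act on each token independently, mapping the contextual representations to the desired output values; combined with contextual mappings, this yields universal approximation (via the accompanying chain of propositions and lemmas).
A three-step proof shows how to approximate an arbitrary target function by a piecewise-constant function, then by a modified Transformer, and finally by a standard Transformer (Section 3 and the appendix).
The authors explore alternative contextual-mapping architectures (for example, bilinear projections and separable convolutions) and report empirical improvements when these are combined with Transformers.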
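The first step of the three-step proof, replacing a continuous function by a piecewise-constant one on a grid, can be illustrated with a toy example. This is a minimal sketch under stated assumptions: the target `f` is a hypothetical permutation-equivariant function chosen for illustration, inputs lie in [0, 1], and the error is a maximum over random samples rather than the integral distance used in the paper.

```python
import numpy as np

def f(X):
    """Toy continuous, permutation-equivariant map on (d, n) inputs (hypothetical)."""
    return np.sin(X + X.mean(axis=1, keepdims=True))

def quantize(X, delta):
    """Snap every entry down to the grid {0, delta, 2*delta, ...}."""
    return np.floor(X / delta) * delta

def piecewise_constant(X, delta):
    """Approximation of f that is constant on each grid cube of side delta."""
    return f(quantize(X, delta))

def max_error(delta, samples):
    """Largest entrywise gap between f and its approximation over the samples."""
    return max(np.abs(f(X) - piecewise_constant(X, delta)).max() for X in samples)

rng = np.random.default_rng(0)
samples = [rng.uniform(0.0, 1.0, size=(2, 4)) for _ in range(200)]
errors = {delta: max_error(delta, samples) for delta in (0.1, 0.01)}
for delta, err in errors.items():
    print(f"delta={delta}: max sampled error = {err:.4f}")
```

Because this `f` changes by at most the change in its argument, the error at grid width delta stays below 2*delta, so shrinking delta makes the approximation as accurate as desired. The remaining two steps of the proof then show that Transformers can reproduce such grid-based functions.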


This review was generated by AI and reviewed by a human editor.