[Paper Review] Are Transformers universal approximators of sequence-to-sequence functions?
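This paper proves that Transformers are universal approximators of continuous permutation-equivariant sequence-to-sequence functions with compact support, and shows that with trainable positional encodings they can approximate any continuous sequence-to-sequence function on a compact domain. It clarifies the distinct roles of the self-attention and feed-forward layers and explores simpler architectures that implement contextual mappings.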
Despite the widespread adoption of Transformer models for NLP tasks, the expressive power of these models is not well-understood. In this paper, we establish that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models. Furthermore, using positional encodings, we circumvent the restriction of permutation equivariance, and show that Transformer models can universally approximate arbitrary continuous sequence-to-sequence functions on a compact domain. Interestingly, our proof techniques clearly highlight the different roles of the self-attention and the feed-forward layers in Transformers. In particular, we prove that fixed width self-attention layers can compute contextual mappings of the input sequences, playing a key role in the universal approximation property of Transformers. Based on this insight from our analysis, we consider other simpler alternatives to self-attention layers and empirically evaluate them.
Research Motivation and Objectives
- Provide a formal understanding of the expressive power of Transformer networks for sequence-to-sequence mappings.
- Characterize the class of functions that Transformers can universally approximate under permutation-equivariance.
- Show how positional encodings remove permutation constraints and extend universality to arbitrary continuous seq-to-seq functions on compact domains.
- Formalize contextual mappings and demonstrate that self-attention can implement them.
- Evaluate alternative architectures that implement contextual mappings and assess empirical performance.
Proposed Method
- Define the function class F_PE of continuous permutation-equivariant sequence-to-sequence functions with compact support.
- Prove Theorem 2: Transformers of fixed width (h=2 attention heads, head size m=1, feed-forward hidden size r=4) and arbitrary depth universally approximate any f in F_PE.
- Introduce trainable positional encodings and prove Theorem 3: Transformers with positional encodings universally approximate any f in F_CD (continuous functions on a compact domain).
- Formalize contextual mappings and prove that self-attention layers can implement them (Lemma 6).
- Present a three-step proof outline for universality: (i) approximate continuous functions by piecewise-constant ones, (ii) approximate these with modified Transformers, (iii) approximate the modified Transformers with the standard architecture.
- Discuss the distinct roles of self-attention (contextual mapping) and feed-forward layers (value mapping) in the universal approximation argument.
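The objects named in the list above can be written out as a short math sketch. The notation below is a paraphrase and may not match the paper symbol for symbol: X is a d x n input whose columns are tokens, P is an n x n permutation matrix, d_p is the distance used to measure approximation error, T^{2,1,4} is the class of Transformers with the fixed width from Theorem 2, and q is a contextual mapping on a finite set of inputs.

```latex
% Permutation equivariance: reordering the input tokens reorders the outputs the same way.
f(XP) = f(X)\,P
  \quad \text{for every permutation matrix } P \in \{0,1\}^{n \times n}.

% Distance between two sequence-to-sequence functions, for 1 \le p < \infty.
d_p(f_1, f_2) = \left( \int \lVert f_1(X) - f_2(X) \rVert_p^p \, dX \right)^{1/p}

% Shape of the universality claim (Theorem 2): fixed width, depth as large as needed.
\forall f \in \mathcal{F}_{PE},\ \forall \epsilon > 0,\
  \exists g \in \mathcal{T}^{2,1,4} \ \text{such that}\ d_p(f, g) \le \epsilon.

% Contextual mapping q : \mathbb{L} \to \mathbb{R}^{1 \times n}
% on a finite set \mathbb{L} \subset \mathbb{R}^{d \times n}:
\text{(1) for every } L \in \mathbb{L}, \text{ the } n \text{ entries of } q(L) \text{ are distinct;}
\qquad
\text{(2) for } L \neq L', \text{ all entries of } q(L) \text{ and } q(L') \text{ are distinct.}
```

Read this way, a contextual mapping assigns each token a value that identifies both the token and the whole sequence it appears in, which is what lets the token-wise feed-forward layers then map each such value to the desired output.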
Experimental Results
Research Questions
- RQ1: What class of sequence-to-sequence functions can Transformer networks represent given parameter sharing across tokens?
- RQ2: Do Transformers universally approximate continuous permutation-equivariant sequence-to-sequence functions, and can positional encodings extend this to arbitrary continuous seq-to-seq functions on compact domains?
- RQ3: What is the role of contextual mappings in enabling universal approximation, and can alternative architectures realize these mappings?
- RQ4: How do self-attention and feed-forward components contribute to the approximation power, and can simpler layers substitute for self-attention without losing universality?
Key Findings
- Transformer blocks are permutation-equivariant, and despite sharing parameters across all tokens, fixed-width Transformers can approximate any continuous permutation-equivariant sequence-to-sequence function with compact support (Theorem 2).
- With trainable positional encodings, Transformers can universally approximate any continuous sequence-to-sequence function on a compact domain (Theorem 3).
- Self-attention layers can implement contextual mappings, enabling token-wise outputs that depend on full input context (Lemma 6 and related discussion).
- Feed-forward layers, operating token-wise, map contextual representations to the desired output values, enabling universal approximation when combined with contextual mappings (established through a chain of lemmas in the proof).
- A three-step proof shows how to approximate arbitrary functions by piecewise-constant surrogates, then via modified Transformers, and finally via standard Transformers (Section 3 and Appendix).
- The authors explore alternative contextual-mapping architectures (e.g., bi-linear projections, separable convolutions) and report empirical improvements when combining them with Transformers.
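The first two findings can be checked numerically on a toy example. The following is a minimal NumPy sketch, not the paper's construction: it assumes a single-head attention layer with a column-wise softmax followed by a token-wise ReLU feed-forward layer, both with residual connections. All names (`transformer_block`, `softmax_columns`, the weight shapes, the random encoding `E`) are hypothetical choices made for illustration.

```python
import numpy as np

def softmax_columns(scores):
    # Column-wise softmax, shifted by the column max for numerical stability.
    shifted = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=0, keepdims=True)

def transformer_block(X, params):
    # X has shape (d, n): each column is one token.
    W_q, W_k, W_v, W_o, W_1, b_1, W_2, b_2 = params
    # Single-head self-attention with a residual connection.
    scores = (W_k @ X).T @ (W_q @ X)              # (n, n)
    attn = X + W_o @ (W_v @ X) @ softmax_columns(scores)
    # Token-wise feed-forward layer with a residual connection.
    hidden = np.maximum(W_1 @ attn + b_1, 0.0)    # ReLU
    return attn + W_2 @ hidden + b_2

rng = np.random.default_rng(0)
d, n, m, r = 4, 5, 2, 8  # toy sizes, not the (h=2, m=1, r=4) of Theorem 2
params = (
    rng.normal(size=(m, d)),  # W_q
    rng.normal(size=(m, d)),  # W_k
    rng.normal(size=(m, d)),  # W_v
    rng.normal(size=(d, m)),  # W_o
    rng.normal(size=(r, d)),  # W_1
    rng.normal(size=(r, 1)),  # b_1
    rng.normal(size=(d, r)),  # W_2
    rng.normal(size=(d, 1)),  # b_2
)
X = rng.normal(size=(d, n))

# Cyclic shift of the token order; column j of X @ P is column perm[j] of X.
perm = np.roll(np.arange(n), 1)
P = np.eye(n)[:, perm]

# Without positional encodings: f(XP) equals f(X)P.
equivariant = np.allclose(
    transformer_block(X @ P, params),
    transformer_block(X, params) @ P,
)

# With a position-dependent encoding E added to the input, the equality breaks.
E = rng.normal(size=(d, n))
with_pe_equivariant = np.allclose(
    transformer_block(X @ P + E, params),
    transformer_block(X + E, params) @ P,
)

print("equivariant without positional encoding:", equivariant)
print("equivariant with positional encoding:", with_pe_equivariant)
```

The encoding `E` here is random rather than trained; it only illustrates why adding a position-dependent term removes the permutation-equivariance restriction that Theorem 3 needs to avoid.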
This review was generated by AI and checked by a human editor.