QUICK REVIEW

[Paper Review] Random Feature Attention

Hao Peng, Nikolaos Pappas|arXiv (Cornell University)|Mar 3, 2021

Topic Modeling | References: 70 | Citations: 122

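One-line summary

RFA replaces softmax attention with a random-feature approximation that runs in linear time and space, adds an optional gate for recency bias, matches or outperforms strong transformer baselines, and speeds up decoding in machine translation.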

ABSTRACT

Transformers are state-of-the-art models for a variety of sequence modeling tasks. At their core is an attention function which models pairwise interactions between the inputs at every timestep. While attention is powerful, it does not scale efficiently to long sequences due to its quadratic time and space complexity in the sequence length. We propose RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, and explore its application in transformers. RFA can be used as a drop-in replacement for conventional softmax attention and offers a straightforward way of learning with recency bias through an optional gating mechanism. Experiments on language modeling and machine translation demonstrate that RFA achieves similar or better performance compared to strong transformer baselines. In the machine translation experiment, RFA decodes twice as fast as a vanilla transformer. Compared to existing efficient transformer variants, RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets. Our analysis shows that RFA's efficiency gains are especially notable on long sequences, suggesting that RFA will be particularly useful in tasks that require working with large inputs, fast decoding speed, or low memory footprints.

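Motivation and objectives

Softmax attention costs quadratic time and space in the sequence length, which motivates a scalable attention mechanism for long sequences.
Propose Random Feature Attention (RFA) as a linear time and space alternative to softmax attention.
Incorporate an optional gating mechanism for learning with recency bias.
Demonstrate RFA's effectiveness on language modeling, machine translation, and long text classification.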

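Proposed method

Derive an unbiased kernel approximation of exp(q·k/σ^2) using a random feature map φ, and use it to approximate softmax attention.
Rewrite attention as φ(q)ᵀ S / (φ(q)·z), where S and z accumulate φ(k)⊗v and φ(k) over the sequence, which makes the computation linear in time.
Introduce RFA-Gate, whose recurrent gate g_t smoothly decays the history and thereby encodes a recency bias.
RFA serves as a drop-in replacement for softmax attention with a minimal parameter increase (≈0.1%).
Consider Gaussian and arc-cosine random feature maps for φ, together with normalizing the norms of q and k.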
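
The first two points above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. It assumes the trigonometric (sin/cos) feature map, projection vectors drawn from N(0, I/σ²) so that the estimate targets exp(q·k/σ²), and unit-norm queries and keys so that the norm-dependent factors cancel between numerator and denominator. All function names (`sample_projections`, `phi`, `summarize`, `rfa_attend`, `softmax_attend`) are hypothetical.

```python
import math
import random


def l2_normalize(x):
    """Scale a vector to unit length (assumed preprocessing for q and k)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def sample_projections(d, num_features, sigma, rng):
    """Draw random vectors w_i ~ N(0, I / sigma^2).

    Assumption: with this scaling, phi(x) . phi(y) is an unbiased estimate
    of exp(-|x - y|^2 / (2 sigma^2)).
    """
    return [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)]
            for _ in range(num_features)]


def phi(x, W):
    """Trigonometric random feature map: sqrt(1/D) [sin(Wx), cos(Wx)]."""
    proj = [sum(wj * xj for wj, xj in zip(w, x)) for w in W]
    scale = 1.0 / math.sqrt(len(W))
    return [scale * math.sin(p) for p in proj] + [scale * math.cos(p) for p in proj]


def summarize(keys, values, W):
    """One pass over the sequence: S = sum phi(k_i) (x) v_i, z = sum phi(k_i)."""
    feat_dim, d_v = 2 * len(W), len(values[0])
    S = [[0.0] * d_v for _ in range(feat_dim)]
    z = [0.0] * feat_dim
    for k, v in zip(keys, values):
        fk = phi(k, W)
        for a in range(feat_dim):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    return S, z


def rfa_attend(q, S, z, W):
    """Output phi(q)^T S / (phi(q) . z); the cost does not depend on length."""
    fq = phi(q, W)
    denom = sum(a * b for a, b in zip(fq, z))
    return [sum(fq[a] * S[a][b] for a in range(len(fq))) / denom
            for b in range(len(S[0]))]


def softmax_attend(q, keys, values, sigma):
    """Exact reference: weights proportional to exp(q . k / sigma^2)."""
    w = [math.exp(sum(a * b for a, b in zip(q, k)) / sigma ** 2) for k in keys]
    total = sum(w)
    return [sum(wi * v[b] for wi, v in zip(w, values)) / total
            for b in range(len(values[0]))]


rng = random.Random(0)
d, n, sigma = 4, 8, 1.0
keys = [l2_normalize([rng.gauss(0, 1) for _ in range(d)]) for _ in range(n)]
values = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n)]
query = l2_normalize([rng.gauss(0, 1) for _ in range(d)])
W = sample_projections(d, 4000, sigma, rng)

S, z = summarize(keys, values, W)  # built once, reusable for every query
approx = rfa_attend(query, S, z, W)
exact = softmax_attend(query, keys, values, sigma)
print("RFA:    ", approx)
print("softmax:", exact)
```

Because `S` and `z` are built once and shared by all queries, the total cost grows linearly with the number of keys rather than with the number of query-key pairs. The approximation is stochastic, so the two printed vectors should be close but not identical, and the gap shrinks as the number of random features grows.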

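Experimental results

Research questions

RQ1: Can attention be approximated so that it scales linearly with sequence length without sacrificing performance?
RQ2: Does random-feature-based attention (RFA) match or outperform standard softmax attention on language modeling, translation, and long-sequence classification?
RQ3: Does RFA's gating mechanism capture recency bias and improve performance on tasks that require locality?
RQ4: How large are RFA's speed and memory advantages over a vanilla transformer during decoding and on long inputs?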
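
To make the memory question in RQ4 concrete, the following sketch shows why a causal decoder built on this formulation can keep a fixed-size state instead of a cache that grows with every generated token. It is an illustration under assumptions, not the paper's code: σ is fixed to 1, the sin/cos feature map is used, and `RFADecoderState` and `attend_over_prefix` are hypothetical names.

```python
import math
import random


def phi(x, W):
    """Trigonometric random feature map: sqrt(1/D) [sin(Wx), cos(Wx)]."""
    proj = [sum(wj * xj for wj, xj in zip(w, x)) for w in W]
    scale = 1.0 / math.sqrt(len(W))
    return [scale * math.sin(p) for p in proj] + [scale * math.cos(p) for p in proj]


class RFADecoderState:
    """Fixed-size running summary (S, z) of all keys and values seen so far."""

    def __init__(self, feat_dim, d_v):
        self.S = [[0.0] * d_v for _ in range(feat_dim)]
        self.z = [0.0] * feat_dim

    def step(self, fq, fk, v):
        """Absorb the current (key, value) pair, then attend with the query."""
        for a, fka in enumerate(fk):
            self.z[a] += fka
            for b, vb in enumerate(v):
                self.S[a][b] += fka * vb
        denom = sum(x * y for x, y in zip(fq, self.z))
        return [sum(fq[a] * self.S[a][b] for a in range(len(fq))) / denom
                for b in range(len(v))]


def attend_over_prefix(fq, prefix_fk, prefix_v):
    """Reference: recompute from the whole prefix, as a growing cache would."""
    weights = [sum(x * y for x, y in zip(fq, fk)) for fk in prefix_fk]
    total = sum(weights)
    return [sum(w * v[b] for w, v in zip(weights, prefix_v)) / total
            for b in range(len(prefix_v[0]))]


rng = random.Random(0)
d, num_features, d_v, steps = 4, 64, 3, 6
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(num_features)]
state = RFADecoderState(2 * num_features, d_v)
seen_fk, seen_v, max_gap = [], [], 0.0
for t in range(steps):
    q = [rng.uniform(-0.3, 0.3) for _ in range(d)]
    k = [rng.uniform(-0.3, 0.3) for _ in range(d)]
    v = [rng.uniform(-1.0, 1.0) for _ in range(d_v)]
    fq, fk = phi(q, W), phi(k, W)
    seen_fk.append(fk)  # kept only for the reference computation
    seen_v.append(v)
    out = state.step(fq, fk, v)
    ref = attend_over_prefix(fq, seen_fk, seen_v)
    max_gap = max(max_gap, max(abs(x - y) for x, y in zip(out, ref)))
print("largest difference from prefix recomputation:", max_gap)
```

The incremental state and the full recomputation agree up to floating-point error, but the state's size depends only on the feature dimension and the value dimension, whereas the reference lists grow by one entry per step.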

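Key findings

On WikiText-103, RFA reaches perplexity comparable to or better than the base transformer, and gating brings a notable improvement.
On the machine translation benchmarks, all RFA variants decode at least about 1.8× faster than the Base transformer, with BLEU scores on par with Base.
On long text classification tasks, RFA is competitive in accuracy and offers speed and memory advantages over several efficient transformer variants.
RFA substantially speeds up decoding (up to 12× at an output length of 2048) while reducing memory usage on long sequences.
In the language modeling experiments, the Gaussian feature map generally trains more stably and performs better than the arc-cosine one.
The gated variant (RFA-Gate) is especially beneficial for language modeling on WikiText-103.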
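
As a rough illustration of how a gate can encode recency bias, the sketch below decays the running summary by a scalar gate g_t before adding the current position. The update form, S_t = g_t S_{t-1} + (1 - g_t) φ(k_t)⊗v_t and z_t = g_t z_{t-1} + (1 - g_t) φ(k_t), with g_t a sigmoid of the current input, is assumed here, and the review itself does not spell out the exact parameterization. The names `gate`, `gated_update`, `read`, `w_g`, and `b_g` are hypothetical, and the feature vectors are toy values rather than outputs of a real feature map.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(x_t, w_g, b_g):
    """Scalar gate g_t in (0, 1) computed from the current input."""
    return sigmoid(sum(w * x for w, x in zip(w_g, x_t)) + b_g)


def gated_update(S, z, g, fk, v):
    """Decay the history by g, then add the current position with weight 1 - g."""
    new_S = [[g * S[a][b] + (1.0 - g) * fk[a] * v[b] for b in range(len(v))]
             for a in range(len(fk))]
    new_z = [g * z[a] + (1.0 - g) * fk[a] for a in range(len(fk))]
    return new_S, new_z


def read(fq, S, z):
    """Attention output phi(q)^T S / (phi(q) . z)."""
    denom = sum(a * b for a, b in zip(fq, z))
    return [sum(fq[a] * S[a][b] for a in range(len(fq))) / denom
            for b in range(len(S[0]))]


x_inputs = [[0.0, 0.0]] * 3   # hypothetical inputs; a zero input gives g = 0.5
w_g, b_g = [0.3, -0.2], 0.0   # hypothetical gate parameters
feature = [1.0, 1.0]          # identical toy features, so only the gate matters
values = [[1.0], [2.0], [4.0]]

S = [[0.0] for _ in feature]
z = [0.0 for _ in feature]
for x_t, v in zip(x_inputs, values):
    g = gate(x_t, w_g, b_g)
    S, z = gated_update(S, z, g, feature, v)

gated_out = read(feature, S, z)
plain_mean = sum(v[0] for v in values) / len(values)
print("gated output:", gated_out, "ungated mean:", plain_mean)
```

With identical features, ungated attention would return the plain mean of the values (7/3), whereas the gated recurrence weights the three positions by 0.125, 0.25, and 0.5 and returns 3.0, which leans toward the most recent value.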

This review was generated by AI and checked by a human editor.