QUICK REVIEW

[Paper Review] Learning to Encode Position for Transformer with Continuous Dynamical Model

Xuanqing Liu, Hsiang‐Fu Yu|arXiv (Cornell University)|Mar 12, 2020

Natural Language Processing Techniques | Cited by 56

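One-Sentence Summary

This paper introduces FLOATER, a flow-based continuous dynamical position encoder for Transformers. It enables inductive, data-driven, and parameter-efficient position encoding and improves results on machine translation, language understanding, and question answering tasks.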

ABSTRACT

We introduce a new way of learning to encode position information for non-recurrent models, such as Transformer models. Unlike RNN and LSTM, which contain inductive bias by loading the input tokens sequentially, non-recurrent models are less sensitive to position. The main reason is that position information among input units is not inherently encoded, i.e., the models are permutation equivalent; this problem justifies why all of the existing models are accompanied by a sinusoidal encoding/embedding layer at the input. However, this solution has clear limitations: the sinusoidal encoding is not flexible enough as it is manually designed and does not contain any learnable parameters, whereas the position embedding restricts the maximum length of input sequences. It is thus desirable to design a new position layer that contains learnable parameters to adjust to different datasets and different architectures. At the same time, we would also like the encodings to extrapolate in accordance with the variable length of inputs. In our proposed solution, we borrow from the recent Neural ODE approach, which may be viewed as a versatile continuous version of a ResNet. This model is capable of modeling many kinds of dynamical systems. We model the evolution of encoded results along position index by such a dynamical system, thereby overcoming the above limitations of existing methods. We evaluate our new position layers on a variety of neural machine translation and language understanding tasks, the experimental results show consistent improvements over the baselines.

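Motivation and Objectives

Motivate the need for learnable, inductive position encodings in non-recurrent Transformers.
Propose FLOATER, a continuous dynamical system that generates position encodings.
Ensure that FLOATER is data-driven, parameter-efficient, and compatible with standard Transformer architectures.
Demonstrate FLOATER's improvements on machine translation, language understanding, and QA benchmarks.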

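Proposed Method

Model the position encoding p(t) as a continuous dynamical system driven by a neural network h(t, p(t); θ_h).
Discretize by evaluating the system at increasing times t_i with a constant step Δt, which yields a position vector p(i) for each token.
Share the dynamics h(·) across Transformer blocks to reduce parameters, while allowing a different initial value p(0) for each block.
For compatibility, show that FLOATER reduces to the original sinusoidal encoding when h(·)=0.
Optionally inject the dynamical encoding into every Transformer block to improve performance.
Provide a warm-start strategy: initialize FLOATER from a pretrained Transformer and fine-tune.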
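
The discretization step above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the forward-Euler solver, the tiny MLP used for h, the helper names (`make_dynamics`, `integrate_positions`, `sinusoidal_encoding`), and the choice to add the integrated result to a sinusoidal table as an offset are all assumptions made for illustration. It integrates dp/dt = h(t, p(t); θ_h) with a constant Δt and shows that, under these assumptions, a zero h with a zero initial state leaves the sinusoidal table unchanged.

```python
import numpy as np


def sinusoidal_encoding(num_positions, dim):
    """Fixed sinusoidal table of shape (num_positions, dim); dim must be even."""
    positions = np.arange(num_positions)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    table = np.zeros((num_positions, dim))
    table[:, 0::2] = np.sin(positions * freqs)
    table[:, 1::2] = np.cos(positions * freqs)
    return table


def make_dynamics(dim, hidden, rng):
    """Hypothetical h(t, p; theta_h): a small two-layer MLP on the vector [t, p]."""
    w1 = rng.normal(scale=0.1, size=(dim + 1, hidden))
    w2 = rng.normal(scale=0.1, size=(hidden, dim))

    def h(t, p):
        return np.tanh(np.concatenate(([t], p)) @ w1) @ w2

    return h


def integrate_positions(h, p0, num_positions, delta_t=0.1, substeps=4):
    """Forward-Euler approximation of dp/dt = h(t, p), sampled at t_i = i * delta_t."""
    p = np.array(p0, dtype=float)
    samples = [p.copy()]
    step = delta_t / substeps
    t = 0.0
    for _ in range(num_positions - 1):
        for _ in range(substeps):
            p = p + step * h(t, p)
            t += step
        samples.append(p.copy())
    return np.stack(samples)


rng = np.random.default_rng(0)
dim = 8
h = make_dynamics(dim, hidden=16, rng=rng)

# One vector per token position, produced by the same shared dynamics.
bias = integrate_positions(h, np.zeros(dim), num_positions=12)

# Assumed additive combination with the fixed sinusoidal table.
encoding = sinusoidal_encoding(12, dim) + bias

# With h = 0 and a zero initial state the offset stays zero,
# so the sum is exactly the sinusoidal table.
zero_bias = integrate_positions(lambda t, p: np.zeros_like(p), np.zeros(dim), 12)
```

Because the vectors come from integrating forward in t, asking for more positions simply continues the same trajectory. A real system would likely use an adaptive ODE solver and train θ_h by backpropagation, both of which this sketch omits.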

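Experimental Results

Research Questions

RQ1: Can a continuous dynamical system for position encoding offer an inductive, data-driven, and parameter-efficient improvement over fixed sinusoidal encodings and layer-wise embeddings?
RQ2: How does FLOATER perform relative to baselines on neural machine translation, language understanding, and question answering tasks?
RQ3: What is the effect of applying FLOATER to all blocks versus only the input block?
RQ4: How compatible is FLOATER with pretrained Transformer models, and how does warm-start training affect performance?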

Key Findings

Model	BLEU (↑)	#Parameters (↓)
FLOATER	28.57	526.3K
1-layer RNN + scalar	27.99	263.2K
2-layer RNN + scalar	28.16	526.3K
1-layer RNN + vector	27.99	1,050.0K

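FLOATER consistently outperforms the baselines on MT, GLUE, RACE, and SQuAD tasks.
Using FLOATER in every Transformer block performs better than applying it to the input block only.
When h(·)=0, FLOATER reduces to the sinusoidal encoding, which keeps it compatible with the vanilla Transformer and enables warm-starting from pretrained models.
On WMT En-De, FLOATER reaches 28.57 BLEU with 526.3K parameters, outperforming several RNN-based encoders with a range of parameter budgets.
FLOATER behaves inductively: it performs well on sequences longer than those seen during training, particularly in MT.
Training FLOATER incurs overhead, but warm-starting and parameter sharing keep it modest (about 20-30%). Inference-time overhead is avoided by storing the position biases.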
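
The last point, avoiding inference overhead by storing position biases, can be sketched as a simple cache. This review does not describe how the paper stores them, so the `CachedPositionBias` class and the `solve_stub` function below are hypothetical: the stub stands in for whatever routine solves the dynamics once training is finished.

```python
import numpy as np


def solve_stub(num_positions, dim=4):
    """Placeholder for an expensive solver call; returns a (num_positions, dim) table."""
    return np.cumsum(np.full((num_positions, dim), 0.1), axis=0)


class CachedPositionBias:
    """Hypothetical helper: compute the bias table once, then serve slices of it."""

    def __init__(self, compute_fn, max_len):
        self._compute_fn = compute_fn
        self._table = compute_fn(max_len)
        self.num_computations = 1

    def get(self, seq_len):
        if seq_len > self._table.shape[0]:
            # Longer than anything cached so far: rebuild for the new length.
            self._table = self._compute_fn(seq_len)
            self.num_computations += 1
        return self._table[:seq_len]


cache = CachedPositionBias(solve_stub, max_len=16)
short_bias = cache.get(8)     # served from the stored table
longer_bias = cache.get(32)   # triggers a single recomputation
```

With this design the solver runs only when a request exceeds the cached length, so typical inference calls reduce to an array slice.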


This review was generated by AI and checked by a human editor.