QUICK REVIEW

[Paper Review] Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

Yiping Lu, Zhuohan Li|arXiv (Cornell University)|Jun 6, 2019

Topic Modeling | References: 46 | Citations: 116

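One-sentence summary

This paper reinterprets the Transformer as a numerical ODE solver for a convection-diffusion equation in a multi-particle dynamic system, then proposes the Macaron Net, built on Strang-Marchuk splitting, which outperforms the standard Transformer.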

ABSTRACT

The Transformer architecture is widely used in natural language processing. Despite its success, the design principle of the Transformer remains elusive. In this paper, we provide a novel perspective towards understanding the architecture: we show that the Transformer can be mathematically interpreted as a numerical Ordinary Differential Equation (ODE) solver for a convection-diffusion equation in a multi-particle dynamic system. In particular, how words in a sentence are abstracted into contexts by passing through the layers of the Transformer can be interpreted as approximating multiple particles' movement in the space using the Lie-Trotter splitting scheme and the Euler's method. Given this ODE's perspective, the rich literature of numerical analysis can be brought to guide us in designing effective structures beyond the Transformer. As an example, we propose to replace the Lie-Trotter splitting scheme by the Strang-Marchuk splitting scheme, a scheme that is more commonly used and with much lower local truncation errors. The Strang-Marchuk splitting scheme suggests that the self-attention and position-wise feed-forward network (FFN) sub-layers should not be treated equally. Instead, in each layer, two position-wise FFN sub-layers should be used, and the self-attention sub-layer is placed in between. This leads to a brand new architecture. Such an FFN-attention-FFN layer is "Macaron-like", and thus we call the network with this new architecture the Macaron Net. Through extensive experiments, we show that the Macaron Net is superior to the Transformer on both supervised and unsupervised learning tasks. The reproducible codes and pretrained models can be found at https://github.com/zhuohan123/macaron-net

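Motivation and objectives

Provide a new interpretation of the Transformer through multi-particle dynamic systems (MPDS) and ordinary differential equation (ODE) theory.
Use numerical analysis (Lie-Trotter versus Strang-Marchuk splitting) to design a more accurate neural architecture.
Demonstrate that the Macaron Net outperforms the standard Transformer on both supervised and unsupervised NLP tasks.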

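Proposed method

Model a Transformer layer as one step of a numerical ODE solver for an MPDS, with the FFN sub-layer as the convection term and the self-attention sub-layer as the diffusion term.
Use Lie-Trotter splitting to separate the two terms and Euler's method to discretize each one, so that stacking layers corresponds to stepping forward in time.
Replace Lie-Trotter with Strang-Marchuk splitting to obtain the Macaron architecture, whose layers have three sub-layers (half-step FFN, attention, half-step FFN).
Define the Macaron layer as FFN-attention-FFN, with a half-step residual connection on each FFN sub-layer and a full-step residual connection on the attention sub-layer.
Build the Macaron Net by stacking Macaron layers, with a parameter count comparable to the Transformer baseline.
Evaluate empirically on machine translation (IWSLT14 De-En, WMT14 En-De) and on BERT-style unsupervised pretraining evaluated on GLUE.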
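
The difference between the two layer layouts described above can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: the helper names (`attention`, `ffn`, `residual`, `transformer_layer`, `macaron_layer`) are hypothetical, the attention uses identity projections with a single head, the FFN is a fixed elementwise `tanh`, and layer normalization and learned weights are omitted. Only the ordering of sub-layers and the 0.5 scaling on the FFN residuals reflect the method summarized here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(xs):
    """Toy single-head self-attention with identity projections (assumption)."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return out

def ffn(xs):
    """Toy position-wise FFN: the same elementwise map at every position (assumption)."""
    return [[math.tanh(v) for v in x] for x in xs]

def residual(xs, fx, scale=1.0):
    """Residual update x + scale * f(x), applied to every position."""
    return [[a + scale * b for a, b in zip(x, f)] for x, f in zip(xs, fx)]

def transformer_layer(xs):
    """Attention then FFN, both with full-step residuals (Lie-Trotter style)."""
    xs = residual(xs, attention(xs))
    return residual(xs, ffn(xs))

def macaron_layer(xs):
    """Half-step FFN, full-step attention, half-step FFN (Strang-Marchuk style)."""
    xs = residual(xs, ffn(xs), 0.5)
    xs = residual(xs, attention(xs))
    return residual(xs, ffn(xs), 0.5)

# Three "particles" (token vectors) in a 2-dimensional space.
tokens = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]]
out_transformer = transformer_layer(tokens)
out_macaron = macaron_layer(tokens)
print(out_transformer)
print(out_macaron)
```

In a real model the two FFN sub-layers of a Macaron layer would each have their own learned parameters; the toy version reuses one fixed function only to keep the sketch short.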

Figure 1 : Physical interpretation of Transformer.

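Experimental results

Research questions

RQ1: Can the Transformer be understood as a numerical ODE solver for a convection-diffusion MPDS?
RQ2: Does adopting the Strang-Marchuk splitting scheme in a neural architecture improve accuracy and performance over Lie-Trotter?
RQ3: Under the same parameter budget, does the Macaron layer (FFN-attention-FFN) deliver better NLP performance?
RQ4: How does the Macaron Net compare with the Transformer on supervised translation and unsupervised pretraining tasks?
RQ5: What empirical benefits arise from building ODE-based design principles more deeply into attention-based NLP models?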

Key findings

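Model	IWSLT14 De-En (small) BLEU	WMT14 En-De (base) BLEU	WMT14 En-De (big) BLEU
Transformer	34.4	27.3	28.4
(other baseline, label not preserved)	/	28.4	28.9
(other baseline, label not preserved)	/	26.8	29.2
(other baseline, label not preserved)	/	28.9	/
(other baseline, label not preserved)	/	/	29.3
(other baseline, label not preserved)	35.2	/	29.7
Macaron Net	35.4	28.9	30.2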
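
The BLEU gains implied by the table can be checked with simple arithmetic. The snippet below is only a convenience sketch: it takes the first data row as the Transformer baseline and the last row as the Macaron Net, consistent with the scores quoted in the findings, and the dictionary keys are hypothetical names chosen for this example.

```python
# BLEU scores copied from the first (Transformer) and last (Macaron Net) table rows.
bleu = {
    "Transformer": {"iwslt14_small": 34.4, "wmt14_base": 27.3, "wmt14_big": 28.4},
    "Macaron Net": {"iwslt14_small": 35.4, "wmt14_base": 28.9, "wmt14_big": 30.2},
}

# Gain of the Macaron Net over the Transformer, rounded to one decimal place
# to hide floating-point noise from the subtraction.
gains = {
    setting: round(bleu["Macaron Net"][setting] - bleu["Transformer"][setting], 1)
    for setting in bleu["Transformer"]
}
print(gains)
```

The resulting differences are 1.0 BLEU on IWSLT14 De-En (small), 1.6 on WMT14 En-De (base), and 1.8 on WMT14 En-De (big).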

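The Macaron Net achieves higher BLEU than the Transformer on IWSLT14 De-En (35.4 versus 34.4 for the small model) and on WMT14 En-De (28.9 for base, 30.2 for big).
According to Table 1, Macaron Net big outperforms Transformer big by 1.8 BLEU points on WMT14 En-De.
On GLUE, Macaron Net base outperforms BERT base and all other baselines, achieving a higher overall GLUE score.
In supervised MT, Macaron small outperforms Transformer small by 1.0 BLEU on IWSLT14 De-En; on WMT14 En-De, Macaron base outperforms Transformer base by 1.6 BLEU points.
Unsupervised pretraining with the Macaron Net improves downstream task performance over the baseline BERT/Transformer configuration.
Theoretical analysis predicts that Strang-Marchuk splitting reduces the local truncation error from O(γ^2) to O(γ^3), which motivates the Macaron layer design.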
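
The O(γ^2) versus O(γ^3) claim about the two splitting schemes can be illustrated numerically on a toy linear system. The sketch below is an assumption-laden example, not the paper's analysis: it uses two hand-picked non-commuting 2x2 matrices `A` and `B` whose flows have closed forms, solves each sub-flow exactly (so only the splitting error is measured, without any Euler discretization error), and estimates the local error order by halving the step size. Loosely, `A` stands in for the FFN (convection) term and `B` for the attention (diffusion) term.

```python
import math

def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def flow_a(t):
    """exp(tA) for A = [[0, 1], [0, 0]] (exact, since A is nilpotent)."""
    return [[1.0, t], [0.0, 1.0]]

def flow_b(t):
    """exp(tB) for B = [[0, 0], [1, 0]] (exact, since B is nilpotent)."""
    return [[1.0, 0.0], [t, 1.0]]

def flow_exact(t):
    """exp(t(A + B)), where A + B = [[0, 1], [1, 0]]."""
    return [[math.cosh(t), math.sinh(t)], [math.sinh(t), math.cosh(t)]]

def lie_trotter(t):
    """One Lie-Trotter step: full step of A, then full step of B."""
    return matmul(flow_b(t), flow_a(t))

def strang(t):
    """One Strang-Marchuk step: half step of A, full step of B, half step of A."""
    return matmul(flow_a(t / 2), matmul(flow_b(t), flow_a(t / 2)))

def max_error(scheme, t):
    """Largest entrywise deviation of one splitting step from the exact flow."""
    approx, exact = scheme(t), flow_exact(t)
    return max(abs(approx[i][j] - exact[i][j]) for i in range(2) for j in range(2))

def observed_order(scheme, t=0.1):
    """Estimate p in error ~ t**p by comparing step sizes t and t / 2."""
    return math.log2(max_error(scheme, t) / max_error(scheme, t / 2))

order_lie = observed_order(lie_trotter)
order_strang = observed_order(strang)
print(round(order_lie, 2), round(order_strang, 2))
```

On this toy problem the estimated local orders come out close to 2 for Lie-Trotter and close to 3 for Strang-Marchuk, matching the predicted rates. The half-full-half pattern in `strang` is the same pattern as the FFN-attention-FFN ordering of a Macaron layer.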

Figure 2 : The Transformer and our Macaron architectures.


This review was generated by AI and checked by a human editor.