QUICK REVIEW

[Paper Review] R-Transformer: Recurrent Neural Network Enhanced Transformer

Zhiwei Wang, Yao Ma | arXiv (Cornell University) | Jul 12, 2019

Neural Networks and Applications | References: 32 | Citations: 89

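One-line summary

R-Transformer combines a LocalRNN for local structure with multi-head attention for global dependencies, achieves strong performance without position embeddings, and outperforms Transformer and TCN on several sequence modeling tasks.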

ABSTRACT

Recurrent Neural Networks have long been the dominating choice for sequence modeling. However, it severely suffers from two issues: impotent in capturing very long-term dependencies and unable to parallelize the sequential computation procedure. Therefore, many non-recurrent sequence models that are built on convolution and attention operations have been proposed recently. Notably, models with multi-head attention such as Transformer have demonstrated extreme effectiveness in capturing long-term dependencies in a variety of sequence modeling tasks. Despite their success, however, these models lack necessary components to model local structures in sequences and heavily rely on position embeddings that have limited effects and require a considerable amount of design efforts. In this paper, we propose the R-Transformer which enjoys the advantages of both RNNs and the multi-head attention mechanism while avoids their respective drawbacks. The proposed model can effectively capture both local structures and global long-term dependencies in sequences without any use of position embeddings. We evaluate R-Transformer through extensive experiments with data from a wide range of domains and the empirical results show that R-Transformer outperforms the state-of-the-art methods by a large margin in most of the tasks. We have made the code publicly available at https://github.com/DSE-MSU/R-transformer.

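Motivation and objectives

The work is motivated by improving sequence modeling through capturing local structure while retaining long-range dependencies.
It proposes a hybrid architecture that combines LocalRNN with multi-head attention.
It shows that the model works without position embeddings and outperforms baselines across a range of domains.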

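Proposed method

Introduces LocalRNN, which processes the local window ending at each position and produces a per-position representation that encodes local sequential information.
Applies a pooling sublayer based on multi-head attention to capture global long-term dependencies between positions.
Applies residual connections and layer normalization around the LocalRNN, attention, and feedforward sublayers to form each R-Transformer layer.
Builds each layer from three sublayers: LocalRNN (local), Multi-Head Attention (global), and Position-wise Feedforward; parameters are shared across positions, which allows parallel computation.
Compares R-Transformer with RNN, TCN, and Transformer baselines on multiple datasets to evaluate the performance gains.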
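
The LocalRNN idea described above (one short recurrent pass over the window ending at each position) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names, the scalar inputs, the zero left-padding, and the fixed weights `w_in` and `w_rec` are all assumptions made for illustration. In the actual model the inputs are vectors and the recurrent cell is learned.

```python
import math


def local_windows(seq, window, pad=0.0):
    """For each position t, return the `window` items ending at t.

    The sequence is left-padded (an assumption of this sketch) so that
    early positions also get a full-length window.
    """
    padded = [pad] * (window - 1) + list(seq)
    return [padded[t:t + window] for t in range(len(seq))]


def tiny_rnn_last_state(xs, w_in=0.5, w_rec=0.5):
    """Run a scalar tanh RNN over xs and return its final hidden state.

    w_in and w_rec are arbitrary illustrative values, not trained weights.
    """
    h = 0.0
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
    return h


def local_rnn(seq, window):
    """Apply the same small RNN independently to every local window.

    Each window is processed on its own, so the per-position calls do not
    depend on each other and could be batched or run in parallel.
    """
    return [tiny_rnn_last_state(w) for w in local_windows(seq, window)]


if __name__ == "__main__":
    seq = [1.0, 2.0, 3.0, 4.0]
    # Windows: [[0.0, 0.0, 1.0], [0.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
    print(local_windows(seq, window=3))
    print(local_rnn(seq, window=3))
```

Because each window only looks backwards from its own position, the output at position t does not change when later inputs change, and the order of items inside the window carries the local positional information that the review says replaces position embeddings.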

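Experimental results

Research questions

RQ1: Can LocalRNN effectively encode local sequential structure and thereby strengthen a global attention-based model?
RQ2: Does removing position embeddings degrade performance, or can LocalRNN and attention compensate?
RQ3: How does R-Transformer compare with RNN, TCN, and Transformer on tasks with different balances of locality and long-range dependency?
RQ4: Can training and inference be parallelized as efficiently as in non-recurrent architectures?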

Key findings

Model	# of layers / hidden size	MNIST test accuracy (%)
R-Transformer	8 / 32	99.1
Transformer	8 / 32	98.2
TCN	8 / 25	99.0
GRU	-	96.2
LSTM	1 / 130	87.2
RNN	-	21.5

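On pixel-by-pixel MNIST, R-Transformer achieves higher test accuracy than Transformer and TCN (99.1% versus 98.2% for Transformer and 99.0% for TCN).
On polyphonic music modeling with the Nottingham dataset, R-Transformer reaches an NLL of 2.37 (lower is better), ahead of LSTM (3.29), GRU (3.46), TCN (3.07), and Transformer (3.34).
On character-level language modeling on Penn Treebank, R-Transformer reaches an NLL of 1.24, ahead of Transformer (1.45) and on par with or better than the RNN-based baselines.
On word-level PTB language modeling, R-Transformer reaches a perplexity of 84.38 (lower is better), ahead of Transformer (122.37) and the other baselines (RNN, GRU, LSTM, TCN).
Across tasks, R-Transformer consistently outperforms TCN and Transformer, using LocalRNN for locality and multi-head attention for long-range dependencies.
The model can be implemented with full parallelism across sequence positions and does not depend on position embeddings.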
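
The findings above report some tasks as NLL and the word-level PTB task as perplexity. The two are related by a standard identity: perplexity is the exponential of the average per-token negative log-likelihood. The sketch below assumes NLL is measured in nats and averaged per token; the review does not state the log base or the normalization used for each task, so the character-level and music NLL figures should not be converted with it without checking the paper. The function names are illustrative.

```python
import math


def perplexity_from_nll(nll_nats):
    """Perplexity for an average per-token NLL given in nats (assumed)."""
    return math.exp(nll_nats)


def nll_from_perplexity(perplexity):
    """Average per-token NLL in nats implied by a perplexity value."""
    return math.log(perplexity)


if __name__ == "__main__":
    # Word-level PTB perplexities quoted in the findings above.
    for name, ppl in [("R-Transformer", 84.38), ("Transformer", 122.37)]:
        print(name, round(nll_from_perplexity(ppl), 3))
```

Converting both word-level figures to the same log scale makes the gap easier to compare with the NLL gaps reported for the other tasks, subject to the assumptions stated above.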

This review was generated by AI and checked by a human editor.