QUICK REVIEW

[Paper Review] Stabilizing Transformers for Reinforcement Learning

Emilio Parisotto, Hao Song|arXiv (Cornell University)|Oct 13, 2019

Reinforcement Learning in Robotics | References: 39 | Citations: 131

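One-line summary

This paper introduces the Gated Transformer-XL (GTrXL), a transformer architecture with gating and reordered layers that stabilizes training and improves performance in memory-based reinforcement learning. It outperforms LSTMs and an external-memory architecture on DMLab-30 and on scalable memory tasks.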

ABSTRACT

Owing to their ability to both effectively integrate information over long time horizons and scale to massive amounts of data, self-attention architectures have recently shown breakthrough success in natural language processing (NLP), achieving state-of-the-art results in domains such as language modeling and machine translation. Harnessing the transformer's ability to process long time horizons of information could provide a similar performance boost in partially observable reinforcement learning (RL) domains, but the large-scale transformers used in NLP have yet to be successfully applied to the RL setting. In this work we demonstrate that the standard transformer architecture is difficult to optimize, which was previously observed in the supervised learning setting but becomes especially pronounced with RL objectives. We propose architectural modifications that substantially improve the stability and learning speed of the original Transformer and XL variant. The proposed architecture, the Gated Transformer-XL (GTrXL), surpasses LSTMs on challenging memory environments and achieves state-of-the-art results on the multi-task DMLab-30 benchmark suite, exceeding the performance of an external memory architecture. We show that the GTrXL, trained using the same losses, has stability and performance that consistently matches or exceeds a competitive LSTM baseline, including on more reactive tasks where memory is less critical. GTrXL offers an easy-to-train, simple-to-implement but substantially more expressive architectural alternative to the standard multi-layer LSTM ubiquitously used for RL agents in partially observable environments.

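Motivation and objectives

Motivate the use of transformers for long-horizon, partially observable RL problems.
Identify the training instability of the standard transformer in the RL setting.
Propose architectural modifications (identity map reordering and gating) that stabilize learning.
Demonstrate that GTrXL outperforms LSTMs and external memory on memory-based benchmarks.
Show that GTrXL is robust to seeds and hyperparameters while maintaining competitive performance.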

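Proposed method

Adopt the Transformer-XL architecture with relative position encodings as the RL memory.
Apply identity map reordering, giving TrXL-I: LayerNorm is placed only on the input stream of each submodule.
Introduce a gating mechanism that replaces the residual connections around the MHA and MLP submodules, giving GTrXL.
Explore GRU-type gating (GTrXL GRU), the strongest variant, along with several gating ablations (Input, Output, Highway, SigTanh).
Initialize the gate bias to encourage an approximate identity mapping, which bootstraps learning of a Markovian policy.
Train with V-MPO (an on-policy variant of MPO) to evaluate learning stability and performance across domains.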
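
The reordering and gating steps above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the gate follows the usual GRU-style update as recalled from the paper, the weight initialization, the `toy_submodule` stand-in for MHA or MLP, and all function and class names are assumptions made for illustration, and other details of the real layers are left out.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


class GRUGate:
    """GRU-type gate g(x, y) used in place of the residual sum x + y."""

    def __init__(self, dim, bias=2.0, scale=0.1, seed=0):
        rng = random.Random(seed)
        # Six hypothetical weight matrices, drawn from a small Gaussian.
        self.w = {
            name: [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
            for name in ("Wr", "Ur", "Wz", "Uz", "Wg", "Ug")
        }
        self.bias = bias  # bias > 0 keeps the update gate z mostly closed

    def __call__(self, x, y):
        w = self.w
        r = [sigmoid(a + b) for a, b in zip(matvec(w["Wr"], y), matvec(w["Ur"], x))]
        z = [sigmoid(a + b - self.bias)
             for a, b in zip(matvec(w["Wz"], y), matvec(w["Uz"], x))]
        rx = [ri * xi for ri, xi in zip(r, x)]
        h = [math.tanh(a + b) for a, b in zip(matvec(w["Wg"], y), matvec(w["Ug"], rx))]
        # z near 0 passes x through unchanged, i.e. an approximate identity map.
        return [(1.0 - zi) * xi + zi * hi for zi, xi, hi in zip(z, x, h)]


def gated_sublayer(x, submodule, gate):
    """LayerNorm on the submodule input only, then gate instead of x + y."""
    y = submodule(layer_norm(x))
    return gate(x, y)


x = [1.0, -0.5, 0.25, 2.0]
toy_submodule = lambda v: [2.0 * a for a in v]  # stand-in for MHA or MLP
for bias in (0.0, 2.0, 10.0):
    out = gated_sublayer(x, toy_submodule, GRUGate(dim=4, bias=bias))
    # Largest deviation of the layer output from its input x.
    print(bias, max(abs(o - a) for o, a in zip(out, x)))
```

The printed deviation shrinks as the bias grows, which is the sense in which the bias initialization pushes each layer toward an identity map at the start of training: the untouched input stream `x` reaches the output almost unchanged, so the agent can first learn a reactive policy and only later rely on the memory path.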

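Experimental results

Research questions

RQ1: Can transformers be stabilized enough to serve as the memory architecture of an RL agent?
RQ2: Which architectural changes (layer normalization placement and gating) improve the RL training stability of transformers?
RQ3: How well does GTrXL perform compared with LSTMs and external memory on memory-demanding RL benchmarks?
RQ4: How robust is GTrXL to hyperparameters, seeds, and different memory horizons?
RQ5: Can GTrXL maintain better performance than traditional architectures as the memory horizon grows?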

Key findings

Model	Mean human-normalized score	Mean human-normalized score (capped at 100)
LSTM	99.3 ± 1.0	84.0 ± 0.4
TrXL	5.0 ± 0.2	5.0 ± 0.2
TrXL-I	107.0 ± 1.2	87.4 ± 0.3
MERLIN@100B	115.2	89.4
GTrXL (GRU)	117.6 ± 0.3	89.1 ± 0.2
GTrXL (Input)	51.2 ± 13.2	47.6 ± 12.1
GTrXL (Output)	112.8 ± 0.8	87.8 ± 0.3
GTrXL (Highway)	90.9 ± 12.9	75.2 ± 10.4
GTrXL (SigTanh)	101.0 ± 1.3	83.9 ± 0.7
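
The two score columns differ only in whether each level's score is clipped before averaging. The sketch below shows that arithmetic under the usual DMLab-30 convention, which this review does not spell out and is therefore an assumption: a level's score is the agent's return expressed as a percentage of the gap between random and human play, and the capped variant clips each level at 100. The function names and the raw returns are made up for illustration.

```python
def human_normalized(agent, random_agent, human):
    """Score on one level as a percentage of the human-minus-random gap."""
    return 100.0 * (agent - random_agent) / (human - random_agent)


def mean_scores(levels):
    """Return (mean, mean with each level capped at 100)."""
    scores = [human_normalized(*level) for level in levels]
    mean = sum(scores) / len(scores)
    capped = sum(min(s, 100.0) for s in scores) / len(scores)
    return mean, capped


# Made-up (agent, random, human) raw returns for two levels.
levels = [(30.0, 10.0, 20.0), (5.0, 0.0, 10.0)]
print(mean_scores(levels))  # (125.0, 75.0)
```

Capping stops a few superhuman levels from dominating the average. Under this convention the capped mean can never exceed the uncapped one, which is consistent with every row of the table.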

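GTrXL with GRU gating substantially outperforms the 3-layer LSTM baseline on the memory-based environments of DMLab-30.
GTrXL (GRU) achieves state-of-the-art results on the multi-task DMLab-30 benchmark and exceeds the external-memory architecture MERLIN on the uncapped score (117.6 vs. 115.2); the 100-capped scores are nearly equal (89.1 vs. 89.4).
On the Numpad task, GTrXL scales better than LSTMs with the memory horizon and keeps its advantage as memory demands grow.
Among the gating variants, GRU gating gives the best stability and learning speed and the strongest results across tasks.
GTrXL matches or exceeds the LSTM on reactive tasks where memory is less critical, which suggests it is broadly applicable as a replacement for LSTMs as RL memory.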

This review was generated by AI and checked by a human editor.