QUICK REVIEW

[Paper Review] Structured State Space Models for In-Context Reinforcement Learning

Chris Xiaoxuan Lu, Yannick Schroecker|arXiv (Cornell University)|Mar 7, 2023

Reinforcement Learning in Robotics | Cited by 11

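Summary in Brief

This paper adapts the S5 structured state space model to reinforcement learning by allowing the hidden state to be reset within a trajectory. The resulting model runs faster and outperforms RNNs on memory-based and meta-reinforcement-learning tasks, including generalization to out-of-distribution tasks.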

ABSTRACT

Structured state space sequence (S4) models have recently achieved state-of-the-art performance on long-range sequence modeling tasks. These models also have fast inference speeds and parallelisable training, making them potentially useful in many reinforcement learning settings. We propose a modification to a variant of S4 that enables us to initialise and reset the hidden state in parallel, allowing us to tackle reinforcement learning tasks. We show that our modified architecture runs asymptotically faster than Transformers in sequence length and performs better than RNN's on a simple memory-based task. We evaluate our modified architecture on a set of partially-observable environments and find that, in practice, our model outperforms RNN's while also running over five times faster. Then, by leveraging the model's ability to handle long-range sequences, we achieve strong performance on a challenging meta-learning task in which the agent is given a randomly-sampled continuous control environment, combined with a randomly-sampled linear projection of the environment's observations and actions. Furthermore, we show the resulting model can adapt to out-of-distribution held-out tasks. Overall, the results presented in this paper show that structured state space models are fast and performant for in-context reinforcement learning tasks. We provide code at https://github.com/luchris429/popjaxrl.

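Motivation and Objectives

Motivate and enable the effective use of structured state space models (S4/S5) in reinforcement learning.
Address the challenges that episode boundaries and variable-length rollouts pose for on-policy RL training.
Show that a resettable S5 variant matches or exceeds RNN performance while running faster.
Demonstrate generalization to long-horizon, partially observable, meta-learning RL tasks built with random projections.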

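Proposed Method

Modify S5 so that the hidden state can be initialised and reset in parallel within a trajectory.
Introduce a resettable associative operator ⊕ that handles done flags while preserving associativity.
Create a broad meta-learning task distribution by applying random linear projections to observations and actions.
Reimplement the POPGym environments in JAX to speed up end-to-end evaluation.
Evaluate on the bsuite memory-length task, the POPGym suite, and multi-environment meta-RL with out-of-distribution generalization.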
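
The reset idea in the first two items can be illustrated with a small sketch. This is not the paper's implementation: it assumes a scalar hidden state with a fixed decay `a`, a zero initial state, and hypothetical names (`combine`, `parallel_style_scan`, and so on). A pure-Python doubling scan stands in for a real parallel scan. Each element stores the affine map x -> a * x + b for a segment of the trajectory, plus a flag recording whether the segment contains a reset. Because the combine rule is associative, the scan may group elements in any order and still match the step-by-step recurrence.

```python
from typing import List, Tuple

# (a, b, done) encodes the map x -> a * x + b over a segment of timesteps.
# done=True means the segment contains a reset, so the incoming state is
# ignored and the segment's output is simply b. (Illustrative encoding only.)
Element = Tuple[float, float, bool]


def combine(left: Element, right: Element) -> Element:
    """Compose an earlier segment (left) with a later segment (right)."""
    a_l, b_l, d_l = left
    a_r, b_r, d_r = right
    if d_r:
        # The later segment resets the state: nothing before it matters.
        return (a_r, b_r, True)
    # Ordinary composition of affine maps; keep the earlier reset flag.
    return (a_r * a_l, a_r * b_l + b_r, d_l)


def parallel_style_scan(elems: List[Element]) -> List[Element]:
    """Inclusive prefix scan by recursive doubling.

    Every pass could run in parallel across positions; it is only valid
    because `combine` is associative.
    """
    out = list(elems)
    offset = 1
    while offset < len(out):
        out = [
            out[i] if i < offset else combine(out[i - offset], out[i])
            for i in range(len(out))
        ]
        offset *= 2
    return out


def sequential_states(a: float, inputs: List[float], dones: List[bool]) -> List[float]:
    """Reference recurrence: zero the state whenever a done flag is set."""
    x = 0.0
    states = []
    for u, d in zip(inputs, dones):
        if d:
            x = 0.0  # a new episode starts at this step
        x = a * x + u
        states.append(x)
    return states


def scanned_states(a: float, inputs: List[float], dones: List[bool]) -> List[float]:
    """Same states via the scan; with a zero initial state, x_t is the b term."""
    elems = [(a, u, d) for u, d in zip(inputs, dones)]
    return [b for _, b, _ in parallel_style_scan(elems)]
```

With `a = 0.5`, inputs 1 through 5, and a done flag on the third step, both `sequential_states` and `scanned_states` give the states 1.0, 2.5, 3.0, 5.5, 7.75: the third state ignores everything before the reset.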

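Experimental Results

Research Questions

RQ1: Can the S5 architecture be reset within a training sequence so that it handles episode boundaries in RL rollouts?
RQ2: Does the resettable S5 provide practical speedups and performance gains over LSTMs and Transformers on memory-based and meta-RL tasks?
RQ3: When exposed to random projections of observations and actions, does S5 generalize to long-horizon, partially observable, and unseen out-of-distribution tasks?
RQ4: How does S5 perform in a multi-environment meta-learning setting that uses random projections of the state and action spaces?
RQ5: Is in-context adaptation achievable with S5 on randomly projected RL tasks?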

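Key Findings

With resets, S5 is asymptotically faster than Transformers in sequence length and outperforms RNNs on memory-based tasks.
On the bsuite memory-length task, S5 achieves higher scores and runs almost twice as fast as the baseline RNN approach.
On POPGym, S5 outperforms GRUs, runs more than six times faster, and solves the Repeat Previous Hard task.
In a long-context meta-RL setting with randomly sampled environments and projections, S5 achieves higher returns than an LSTM and shows partial transfer to out-of-distribution tasks without fine-tuning.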
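
The random-projection setting in the last finding can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the `ProjectedTask` class, the Gaussian sampling, the 1/sqrt(n) scaling, and all dimensions are hypothetical choices. The idea shown is that each task samples one fixed matrix for observations and one for actions, so the agent sees a fixed-size interface and has to infer from context how it maps onto the underlying environment.

```python
import random
from typing import List

Matrix = List[List[float]]


def sample_projection(out_dim: int, in_dim: int, rng: random.Random) -> Matrix:
    """Sample an out_dim x in_dim Gaussian matrix (the scaling is an assumption)."""
    scale = 1.0 / in_dim ** 0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix: Matrix, vec: List[float]) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class ProjectedTask:
    """Hypothetical wrapper that hides an environment's native layout.

    The two matrices are sampled once per task and then held fixed, so the
    agent can only work out the mapping by interacting with the task.
    """

    def __init__(self, env_obs_dim: int, env_act_dim: int,
                 agent_obs_dim: int, agent_act_dim: int, seed: int) -> None:
        rng = random.Random(seed)
        # Environment observation -> fixed-size vector shown to the agent.
        self.obs_proj = sample_projection(agent_obs_dim, env_obs_dim, rng)
        # Fixed-size agent output -> action in the environment's own space.
        self.act_proj = sample_projection(env_act_dim, agent_act_dim, rng)

    def observe(self, env_obs: List[float]) -> List[float]:
        return project(self.obs_proj, env_obs)

    def act(self, agent_action: List[float]) -> List[float]:
        return project(self.act_proj, agent_action)
```

For example, `ProjectedTask(3, 2, 8, 4, seed=0)` turns a 3-dimensional observation into an 8-dimensional agent input and a 4-dimensional agent output into a 2-dimensional environment action. A different seed gives a different task over the same underlying environment.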

This review was written by AI and checked by a human editor.