QUICK REVIEW

[Paper Review] A Minimalist Approach to Offline Reinforcement Learning

Scott Fujimoto, Shixiang Gu|arXiv (Cornell University)|Jun 12, 2021

Reinforcement Learning in Robotics | References: 56 | Citations: 164

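One-Sentence Summary

TD3+BC adds a single behavior cloning term to TD3 and normalizes the data, matching the state-of-the-art performance of more complex offline RL methods with far less complexity and computation.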

ABSTRACT

Offline reinforcement learning (RL) defines the task of learning from a fixed batch of data. Due to errors in value estimation from out-of-distribution actions, most offline RL algorithms take the approach of constraining or regularizing the policy with the actions contained in the dataset. Built on pre-existing RL algorithms, modifications to make an RL algorithm work offline comes at the cost of additional complexity. Offline RL algorithms introduce new hyperparameters and often leverage secondary components such as generative models, while adjusting the underlying RL algorithm. In this paper we aim to make a deep RL algorithm work while making minimal changes. We find that we can match the performance of state-of-the-art offline RL algorithms by simply adding a behavior cloning term to the policy update of an online RL algorithm and normalizing the data. The resulting algorithm is a simple to implement and tune baseline, while more than halving the overall run time by removing the additional computational overhead of previous methods.

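Motivation and Objectives

Motivate a minimalist offline RL approach that reduces implementation and hyperparameter overhead.
Investigate whether simple modifications to an online algorithm work well offline without additional components.
Show that data normalization and a BC term can stabilize and improve offline learning.
Provide a highly reproducible baseline that matches state-of-the-art performance on standard benchmarks.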

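Proposed Method

Start from TD3 and add a behavior cloning (BC) regularization term to the policy update.
Normalize the state features of the dataset to mean 0 and variance 1.
Introduce a lambda scaling that controls the balance between BC and Q-learning (QL): lambda = alpha / ((1/N) * sum_i |Q(s_i, a_i)|), estimated per mini-batch.
Use a single hyperparameter, alpha (default 2.5), to control the strength of the regularization.
Keep the changes minimal: only a few lines of code are modified on top of the underlying TD3 update.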
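
The normalization and lambda-scaling steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `normalize_states` and `td3_bc_actor_loss`, the `eps` stabilizer, and the use of a mean squared error for the BC term are choices made here for illustration, and the critic outputs are passed in as a plain array rather than computed by a network.

```python
import numpy as np


def normalize_states(states, eps=1e-3):
    """Standardize each state feature over the dataset (mean 0, variance ~1).

    `eps` is an assumed small constant that avoids division by zero
    for features that never change.
    """
    mean = states.mean(axis=0, keepdims=True)
    std = states.std(axis=0, keepdims=True) + eps
    return (states - mean) / std, mean, std


def td3_bc_actor_loss(q_values, policy_actions, dataset_actions, alpha=2.5):
    """Scalar actor loss for one mini-batch: -lambda * mean(Q) + BC error.

    q_values:        critic estimates for the mini-batch, shape (N,)
    policy_actions:  actions proposed by the policy, shape (N, action_dim)
    dataset_actions: actions stored in the dataset, shape (N, action_dim)
    """
    # lambda = alpha / ((1/N) * sum_i |Q_i|), estimated on this mini-batch
    lam = alpha / np.abs(q_values).mean()
    # Behavior cloning term: mean squared distance to the dataset actions
    bc_term = np.mean((policy_actions - dataset_actions) ** 2)
    return -lam * q_values.mean() + bc_term, lam
```

Dividing by the mean absolute Q-value keeps the RL term on a scale comparable to the BC term regardless of reward magnitude, which is why a single alpha can be shared across tasks. In a gradient-based implementation, lam would presumably be treated as a constant so that no gradient flows through the scaling factor.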

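Experimental Results

Research Questions

RQ1: Can a deep RL algorithm be made to work offline with only minimal changes to the underlying online algorithm?
RQ2: Can simple BC regularization and data normalization alone match state-of-the-art offline RL methods on standard benchmarks?
RQ3: How do normalization and the BC term affect stability and performance in offline RL?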

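Key Findings

TD3+BC achieves performance comparable to Fisher-BRC on the D4RL MuJoCo benchmark.
TD3+BC requires significantly less computation time than CQL and Fisher-BRC (roughly less than half of the total training time).
State normalization provides non-trivial gains in stability and performance in offline RL.
A single hyperparameter (alpha) governs the balance between RL and imitation learning, and it is robust across tasks in many settings.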

This review was generated by AI and checked by a human editor.