QUICK REVIEW

[Paper Review] Conservative Q-Learning for Offline Reinforcement Learning

Aviral Kumar, Aurick Zhou|arXiv (Cornell University)|Jun 8, 2020

Reinforcement Learning in Robotics|References 60|Citations 535

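One-line summary

Conservative Q-Learning (CQL) learns a conservative Q-function for offline RL that guarantees a lower bound on the policy value, which curbs overestimation and improves performance on both discrete and continuous tasks.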

ABSTRACT

Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.

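Motivation and objectives

Positions offline reinforcement learning as a data-efficient alternative to online interaction in RL.
Addresses overestimation and distributional shift when learning from a fixed dataset.
Proposes a conservative Q-function framework that provides a lower bound on the policy value.
Demonstrates robustness and practical applicability through minimal code changes and strong empirical results.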

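Proposed method

Introduces Conservative Q-Learning (CQL) as a regularized Q-function objective that minimizes Q-values under a state-action distribution consistent with the data.
Derives theoretical guarantees that the learned Q-function lower-bounds both the true Q-function and the policy value.
Provides two instantiations within a unified optimization framework, CQL(H) and CQL(R), optionally incorporating KL-based regularization.
Integrates CQL on top of SAC or QR-DQN with minimal implementation effort, in roughly 20 lines of code.
Provides safety and guarantee results: conservative policy improvement and a gap-expanding backup mitigate out-of-distribution (OOD) actions.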
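For reference, the regularized objective described above is commonly written, for the CQL(H) variant, in the form below. The notation is an assumption of this summary rather than something stated in the review text: \mathcal{D} is the offline dataset, \hat{\pi}_\beta the empirical behavior policy, \alpha the weight of the conservative term, and \hat{\mathcal{B}}^{\pi_k} the empirical Bellman operator for the current policy. The paper remains the authority on the exact statement.

```latex
\min_{Q} \; \alpha \, \mathbb{E}_{s \sim \mathcal{D}} \left[ \log \sum_{a} \exp\big(Q(s, a)\big) - \mathbb{E}_{a \sim \hat{\pi}_\beta(a \mid s)} \big[ Q(s, a) \big] \right] + \frac{1}{2} \, \mathbb{E}_{s, a, s' \sim \mathcal{D}} \left[ \Big( Q(s, a) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^{k}(s, a) \Big)^{2} \right]
```

The first bracket pushes Q-values down across all actions (through the log-sum-exp) while pushing them up on actions that appear in the dataset. The second term is the standard Bellman error.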

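Experimental results

Research questions

RQ1: Does a conservative Q-function yield a reliable lower bound on the policy value in offline RL?
RQ2: Can CQL provide safe, performance-improving policy updates without explicitly modeling the behavior policy?
RQ3: How does CQL perform across continuous and discrete domains, including complex and multi-modal datasets?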

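Key findings

On several benchmark tasks, CQL attains 2-5 times higher final return than prior offline RL methods.
CQL often outperforms simple behavioral cloning on realistic datasets.
The method is robust to Q-function estimation error and supports both Q-learning and actor-critic implementations.
CQL can be implemented as a small code addition on top of existing online RL algorithms, using a simple regularization term.
The empirical results cover high-dimensional visual inputs and multi-modal data distributions, indicating broad applicability.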
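To give a sense of how small the added regularization term can be, here is a minimal sketch of a CQL(H)-style loss for discrete actions in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the helper names `logsumexp` and `cql_h_loss`, the list-based batch layout, and the default `alpha=1.0` are all hypothetical, and the Bellman targets are assumed to be precomputed constants.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def cql_h_loss(q_values, actions, td_targets, alpha=1.0):
    """Hypothetical helper: CQL(H)-style loss for a batch with discrete actions.

    q_values:   list of per-state lists, q_values[i][a] = Q(s_i, a)
    actions:    dataset action index a_i observed in state s_i
    td_targets: precomputed Bellman targets, treated as constants
    alpha:      weight of the conservative regularizer (assumed default)
    """
    n = len(q_values)
    penalty = 0.0
    bellman = 0.0
    for q_s, a, y in zip(q_values, actions, td_targets):
        # Conservative term: soft maximum over all actions minus the
        # Q-value of the action that actually appears in the dataset.
        penalty += logsumexp(q_s) - q_s[a]
        # Standard squared Bellman error on the dataset action.
        bellman += 0.5 * (q_s[a] - y) ** 2
    return alpha * penalty / n + bellman / n


if __name__ == "__main__":
    # One state, two actions; only action 0 appears in the dataset.
    low = cql_h_loss([[0.0, -3.0]], [0], [0.0])
    high = cql_h_loss([[0.0, 3.0]], [0], [0.0])
    print(low < high)  # a large Q-value on the unseen action is penalized
```

In a deep RL codebase the same two lines inside the loop would typically operate on batched tensors, with gradients flowing through `q_values` only. The per-sample penalty is never negative, because the log-sum-exp is at least as large as any single Q-value.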


This review was generated by AI and checked by a human editor.