QUICK REVIEW

[Paper Review] Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration

Desik Rengarajan, Gargi Vaidya|arXiv (Cornell University)|Feb 9, 2022

Reinforcement Learning in Robotics|Cited by 26

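One-Line Summary

LOGO uses offline demonstration data to guide online TRPO learning, achieving near-optimal performance in sparse-reward reinforcement learning and robust task completion under incomplete observations.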

ABSTRACT

A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is completed partially or fully. However, the lack of carefully designed, fine grain feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame. This is because of the large number of exploration actions that the policy has to perform before it gets any useful feedback that it can learn from. In this work, we address this challenging problem by developing an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy for faster and efficient online RL in such sparse reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step by using the offline demonstration data. The key idea is that by obtaining guidance from - not imitating - the offline data, LOGO orients its policy in the manner of the sub-optimal policy, while yet being able to learn beyond and approach optimality. We provide a theoretical analysis of our algorithm, and provide a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete observation setting, where the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored state. Further, we demonstrate the value of our approach via implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance.

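Motivation and Objectives

Address the difficulty of learning under sparse reward signals in RL.
Use offline demonstration data from a sub-optimal policy to bootstrap and guide online learning.
Develop a two-step LOGO framework that combines policy improvement with demonstration-guided policy selection.
Provide theoretical guarantees on performance improvement and extend the method to the incomplete observation setting.
Demonstrate effectiveness on MuJoCo benchmarks and in real-robot experiments (TurtleBot).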

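Proposed Method

Use TRPO as the policy improvement step to generate a candidate policy.
Add a policy guidance step that searches, within a trust region around the candidate policy, for a policy close to the offline behavior policy.
Introduce a surrogate objective that approximates the policy-dependent KL divergence using samples from the intermediate policy.
Derive an extension of the performance difference lemma to policy-dependent rewards to support the surrogate objective.
Provide implementable updates based on a Taylor expansion, yielding two TRPO-like updates.
Extend LOGO to incomplete observations by training a discriminator that projects states and estimates the policy-dependent reward from partial data.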
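
The policy guidance step can be illustrated with a single-state toy problem in which both action distributions are known exactly. The sketch below is not the paper's implementation, which relies on sample-based, TRPO-like updates. It assumes a discrete action space, a KL trust region of the form KL(pi || pi_half) <= delta, and hypothetical names (`kl`, `guidance_step`, `pi_half`, `pi_b`). In this simplified setting, the policy closest to the behavior policy inside the trust region lies on the geometric interpolation path between the two distributions, so a bisection over the interpolation weight is enough.

```python
import numpy as np


def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p) - np.log(q))))


def guidance_step(pi_half, pi_b, delta, iters=60):
    """Toy guidance step (assumed form, single state).

    Returns the distribution closest to the behavior policy pi_b in
    KL(pi || pi_b) subject to KL(pi || pi_half) <= delta, searched along
    the geometric path pi_t proportional to pi_half^(1-t) * pi_b^t.
    """
    pi_half = np.asarray(pi_half, dtype=float)
    pi_b = np.asarray(pi_b, dtype=float)

    def interpolate(t):
        logits = (1.0 - t) * np.log(pi_half) + t * np.log(pi_b)
        weights = np.exp(logits - logits.max())  # stabilised softmax
        return weights / weights.sum()

    # If the behavior policy itself is inside the trust region, take it.
    if kl(interpolate(1.0), pi_half) <= delta:
        return interpolate(1.0)

    # KL(pi_t || pi_half) grows with t, so bisect for the largest feasible t.
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(interpolate(mid), pi_half) <= delta:
            lo = mid
        else:
            hi = mid
    return interpolate(lo)


# Candidate policy from the improvement step (made-up numbers).
pi_half = np.array([0.7, 0.2, 0.1])
# Sub-optimal behavior policy behind the demonstrations (made-up numbers).
pi_b = np.array([0.1, 0.3, 0.6])

pi_next = guidance_step(pi_half, pi_b, delta=0.01)
print("KL to behavior policy before:", kl(pi_half, pi_b))
print("KL to behavior policy after: ", kl(pi_next, pi_b))
print("KL to candidate policy:      ", kl(pi_next, pi_half))
```

A small `delta` keeps the guided policy close to the candidate from the improvement step, while a large `delta` lets it collapse onto the behavior policy. This mirrors the idea of taking guidance from the demonstrations without imitating them.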

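Experimental Results

Research Questions

RQ1: Can LOGO, using offline demonstrations, improve on pure TRPO in sparse-reward settings?
RQ2: How does guidance from a sub-optimal behavior policy affect exploration and sample efficiency?
RQ3: What theoretical guarantees hold for the performance improvement in each learning episode?
RQ4: Does LOGO remain effective when extended to the incomplete state observation setting?
RQ5: Do the results carry over from MuJoCo benchmarks to real-world robotics tasks (Gazebo/TurtleBot)?

Key Findings

In sparse-reward environments, LOGO learns faster than the TRPO baseline and imitation learning approaches, and achieves near-optimal performance.
The two-step LOGO method (policy improvement plus policy guidance) comes with formal performance guarantees and accelerates early learning through guidance from the behavior policy.
Despite using only sparse rewards, LOGO can match the performance of an optimal algorithm trained with dense rewards on standard MuJoCo benchmarks.
The extension to the incomplete observation setting, which uses a discriminator-based surrogate reward, maintains strong performance.
LOGO demonstrates effective waypoint tracking and obstacle avoidance on a TurtleBot in Gazebo and in real-world experiments.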

This review was generated by AI and checked by a human editor.