QUICK REVIEW

[Paper Review] Search on the Replay Buffer: Bridging Planning and Reinforcement Learning

Benjamin Eysenbach, Ruslan Salakhutdinov|arXiv (Cornell University)|Jun 12, 2019

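Reinforcement Learning in Robotics | Cited by 70

One-line summary

SoRB combines graph-search planning with goal-conditioned reinforcement learning: it builds a distance graph over the states in the replay buffer and plans a shortest path to reach distant goals.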

ABSTRACT

The history of learning for control has been an exciting back and forth between two broad classes of algorithms: planning and reinforcement learning. Planning algorithms effectively reason over long horizons, but assume access to a local policy and distance metric over collision-free paths. Reinforcement learning excels at learning policies and the relative values of states, but fails to plan over long horizons. Despite the successes of each method in various domains, tasks that require reasoning over long horizons with limited feedback and high-dimensional observations remain exceedingly challenging for both planning and reinforcement learning algorithms. Frustratingly, these sorts of tasks are potentially the most useful, as they are simple to design (a human only need to provide an example goal state) and avoid reward shaping, which can bias the agent towards finding a sub-optimal solution. We introduce a general control algorithm that combines the strengths of planning and reinforcement learning to effectively solve these tasks. Our aim is to decompose the task of reaching a distant goal state into a sequence of easier tasks, each of which corresponds to reaching a subgoal. Planning algorithms can automatically find these waypoints, but only if provided with suitable abstractions of the environment -- namely, a graph consisting of nodes and edges. Our main insight is that this graph can be constructed via reinforcement learning, where a goal-conditioned value function provides edge weights, and nodes are taken to be previously seen observations in a replay buffer. Using graph search over our replay buffer, we can automatically generate this sequence of subgoals, even in image-based environments. Our algorithm, search on the replay buffer (SoRB), enables agents to solve sparse reward tasks over one hundred steps, and generalizes substantially better than standard RL algorithms.

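Motivation and Objectives

Address long-horizon, sparse-reward tasks with high-dimensional observations by decomposing them into subgoals that planning finds automatically from past observations.
Use a goal-conditioned RL policy to reach each subgoal, and use the replay buffer as a nonparametric state graph for planning.
Obtain robust distance estimates to guide graph search by using distributional RL and ensembles.
Show improved performance over standard RL on long-horizon navigation tasks, and demonstrate generalization to unseen environments.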

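Proposed Method

Learn a goal-conditioned policy and its Q-function with an off-policy RL algorithm that uses goal relabeling and distributional RL.
Define the shortest-path distance d_sp(s, s_g), and relate V(s, s_g) and Q(s, a, s_g) to the negative shortest-path distance.
Build a graph over replay buffer observations whose edge weights equal the estimated distances, capped at MaxDist.
Use Dijkstra's algorithm to find the shortest path between the start and the goal on the replay buffer graph.
At execution time, plan a sequence of waypoints along the path, and condition the policy on the next waypoint, or directly on the goal if the goal is closer.
Strengthen distance estimation with distributional RL (bins representing distance in steps) and ensembles for uncertainty.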
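The planning steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`aggregate_ensemble`, `build_graph`, `dijkstra`, `next_subgoal`) are hypothetical, and the learned networks are replaced by plain arrays. Two assumptions are worth flagging: the ensemble is aggregated pessimistically by taking the maximum predicted distance, and the MaxDist cap is read as "drop edges longer than `max_dist`".

```python
import heapq
import numpy as np

def aggregate_ensemble(bin_probs, bin_steps):
    """Collapse an ensemble of distributional predictions into one distance matrix.

    bin_probs: (n_models, n, n, n_bins) probabilities over distance bins.
    bin_steps: (n_bins,) number of steps each bin stands for (assumed layout).
    """
    expected = bin_probs @ bin_steps   # expected distance per model: (n_models, n, n)
    return expected.max(axis=0)        # assumption: pessimistic max over the ensemble

def build_graph(dist, max_dist):
    """Adjacency list over buffer states; edges longer than max_dist are dropped."""
    n = dist.shape[0]
    graph = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(n):
            if i != j and dist[i, j] <= max_dist:
                graph[i].append((j, float(dist[i, j])))
    return graph

def dijkstra(graph, source, target):
    """Return (cost, path) of the shortest path, or (inf, []) if unreachable."""
    best = {source: 0.0}
    parent = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            path = [node]
            while node in parent:          # walk back to the source
                node = parent[node]
                path.append(node)
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue                       # stale heap entry
        for nxt, weight in graph[node]:
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                parent[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    return float("inf"), []

def next_subgoal(path, dist, current, goal):
    """Condition on the first waypoint, or on the goal if it is estimated closer."""
    if len(path) < 2:
        return goal
    waypoint = path[1]
    return goal if dist[current, goal] <= dist[current, waypoint] else waypoint

# Toy example: four states on a line, with distance |i - j| between states i and j.
positions = np.arange(4)
toy_dist = np.abs(positions[:, None] - positions[None, :]).astype(float)
toy_graph = build_graph(toy_dist, max_dist=1.5)
toy_cost, toy_path = dijkstra(toy_graph, source=0, target=3)
```

In the toy example only neighbouring states are connected, so the planner has to chain three short hops to reach the far end. That is the intended effect of the cap: long-range distance estimates are the least reliable, so the search is restricted to edges the value function is likely to estimate well.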

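Experimental Results

Research Questions

RQ1: Can planning over the replay buffer via graph search find a sequence of subgoals that reaches distant goals in high-dimensional environments?
RQ2: Do distance estimates learned with distributional RL and ensembles give SoRB reliable guidance for planning?
RQ3: Does SoRB improve success rates on long-horizon, sparse-reward tasks and generalize better to unseen environments than standard goal-conditioned RL?
RQ4: On image-based navigation tasks, how does SoRB compare with Semi-Parametric Topological Memory (SPTM) and other baselines?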

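Key Findings

SoRB makes it possible to solve long-horizon, sparse-reward tasks of over 100 steps, and generalizes better than standard RL methods.
Graph search over the replay buffer, guided by distances derived from the goal-conditioned value function, produces effective waypoint sequences for navigation in image-based domains.
Distributional RL and ensembles substantially improve the robustness of distance estimation and planning, especially for distant goals.
In visual navigation, SoRB substantially outperforms baselines including SPTM, C51, VIN, and HER, most notably as the distance to the goal increases.
SoRB generalizes to new SUNCG houses and maintains a high success rate as goal distance increases, whereas pure goal-conditioned RL struggles.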


This review was generated by AI and checked by a human editor.