QUICK REVIEW

[Paper Review] COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning

Avi Singh, Albert S. Yu|arXiv (Cornell University)|Oct 27, 2020

Robot Manipulation and Learning|References: 42|Citations: 38

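One-Line Summary

COG uses offline reinforcement learning to combine task-specific data with a large body of unlabeled prior data, yielding policies that chain previously learned behaviors to solve new multi-stage tasks from unseen initial conditions.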

ABSTRACT

Reinforcement learning has been applied to a wide variety of robotics problems, but most of such applications involve collecting data from scratch for each new task. Since the amount of robot data we can collect for any single task is limited by time and cost considerations, the learned behavior is typically narrow: the policy can only execute the task in a handful of scenarios that it was trained on. What if there was a way to incorporate a large amount of prior data, either from previously solved tasks or from unsupervised or undirected environment interaction, to extend and generalize learned behaviors? While most prior work on extending robotic skills using pre-collected data focuses on building explicit hierarchies or skill decompositions, we show in this paper that we can reuse prior data to extend new skills simply through dynamic programming. We show that even when the prior data does not actually succeed at solving the new task, it can still be utilized for learning a better policy, by providing the agent with a broader understanding of the mechanics of its environment. We demonstrate the effectiveness of our approach by chaining together several behaviors seen in prior datasets for solving a new task, with our hardest experimental setting involving composing four robotic skills in a row: picking, placing, drawer opening, and grasping, where a +1/0 sparse reward is provided only on task completion. We train our policies in an end-to-end fashion, mapping high-dimensional image observations to low-level robot control commands, and present results in both simulated and real world domains. Additional materials and source code can be found on our project website: https://sites.google.com/view/cog-rl

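Motivation and Objectives

Motivate why prior, task-agnostic data can extend policy generalization in robotics.
Propose a simple, data-driven method that stitches behaviors together with offline reinforcement learning, without an explicit hierarchy.
Demonstrate that prior data can support learning new multi-stage tasks from unseen initial conditions.
Show end-to-end learning from visual observations to low-level control using offline data and sparse rewards.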

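Proposed Method

Extend conservative Q-learning (CQL) so that offline RL uses both prior data and task-specific data.
Initialize the replay buffer with prior data labeled with zero reward, then train on a mixture of prior data and task data.
Rely on Q-learning dynamics to propagate value from rewarded task trajectories into the regions covered by the prior data.
After offline training, optionally fine-tune the offline policy with a limited amount of online interaction.
Train end-to-end convolutional networks (ConvNets) that map 48×48 or 64×64 images and robot state to continuous 6-DoF actions and discrete gripper control.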
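The data-handling steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Transition` record, `relabel_zero_reward`, `build_buffer`, and `cql_penalty` are hypothetical names, observations are reduced to integers instead of images, and the regularizer is the discrete-action form of the CQL penalty (log-sum-exp of Q over actions minus Q at the dataset action), whereas the paper works with continuous actions.

```python
import math
import random
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class Transition:
    obs: int        # stand-in for an image plus robot state (assumption)
    action: int     # discrete action index, for illustration only
    reward: float
    next_obs: int
    done: bool


def relabel_zero_reward(prior: Sequence[Transition]) -> List[Transition]:
    """Prior data carries no labels for the new task, so set every reward to 0."""
    return [replace(t, reward=0.0) for t in prior]


def build_buffer(prior: Sequence[Transition],
                 task: Sequence[Transition]) -> List[Transition]:
    """Initialize the buffer with zero-reward prior data, then add task data."""
    return relabel_zero_reward(prior) + list(task)


def cql_penalty(q_values: Sequence[float], data_action: int) -> float:
    """Discrete-action CQL regularizer: logsumexp_a Q(s, a) - Q(s, a_data)."""
    m = max(q_values)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(q - m) for q in q_values))
    return lse - q_values[data_action]


# Prior data may come with a reward from some other task; it is discarded.
prior = [Transition(obs=0, action=1, reward=5.0, next_obs=1, done=False)]
task = [Transition(obs=1, action=0, reward=1.0, next_obs=2, done=True)]

buffer = build_buffer(prior, task)
batch = random.Random(0).sample(buffer, k=2)  # mixed prior/task minibatch
penalty = cql_penalty([0.0, 0.0], data_action=0)
```

The penalty is always non-negative and shrinks as Q at the dataset action dominates the other actions, which is what discourages the learned Q-function from overvaluing actions that never appear in the data.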

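Experimental Results

Research Questions

RQ1: Can model-free offline RL leverage task-agnostic prior data to learn new skills?
RQ2: Can the policy combine behaviors seen in prior data to solve new tasks from new initial conditions?
RQ3: How does offline RL with prior data compare to behavior cloning (BC) baselines when prior data is incorporated?
RQ4: After offline learning with prior data, is online fine-tuning necessary or beneficial?
RQ5: How well does the approach generalize beyond simulation to real-world robot settings?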

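Key Findings

COG solves multi-stage tasks by combining drawer opening and closing, grasping, and obstacle removal, even though the data never contains the complete sequence.
COG outperforms the behavior cloning (BC) baselines, SAC, and ablations under novel initial conditions in simulation.
Online fine-tuning further raises the drawer-task success rate to above 90%, even with a relatively modest amount of additional data.
In real-world experiments, COG achieved 7/8 successes when starting with the drawer closed, outperforming the BC oracle baseline.
BC-init fails on unseen initial conditions, which underscores the value of integrating offline data over pretraining alone.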
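The first finding, that behaviors can be composed without any complete trajectory in the data, follows from how Bellman backups pass value across datasets. The toy example below is a sketch under strong assumptions: a hypothetical five-state chain, tabular Q iteration instead of CQL with image inputs, and made-up transitions. The prior data covers only the early states with zero reward, the task data covers only the late states with a sparse +1 on completion, and no single trajectory runs from state 0 to the goal.

```python
GAMMA = 0.9

# Each tuple is (state, action, reward, next_state, done); all values are hypothetical.
prior_data = [(0, 0, 0.0, 1, False), (1, 0, 0.0, 2, False)]  # e.g. opening a drawer
task_data = [(2, 0, 0.0, 3, False), (3, 0, 1.0, 4, True)]    # e.g. grasp, +1 at the end


def q_iteration(dataset, n_states=5, n_actions=2, sweeps=50):
    """Tabular Q iteration that backs up only transitions present in the dataset."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(sweeps):
        for s, a, r, s2, done in dataset:
            # Bellman target: no bootstrapping past a terminal transition.
            q[s][a] = r if done else r + GAMMA * max(q[s2])
    return q


q_task_only = q_iteration(task_data)
q_merged = q_iteration(prior_data + task_data)

print(round(q_task_only[0][0], 3))  # 0.0: state 0 never appears in the task data
print(round(q_merged[0][0], 3))     # 0.729: value reaches state 0 through prior data
```

With task data alone, the Q-function says nothing about states the task demonstrations never visit. Once the zero-reward prior transitions are merged in, the reward at the end of the task data is backed up through them, so a greedy policy has a reason to execute the prior behavior first. In this tabular toy, actions absent from the data simply keep their initial value of 0; with function approximation they can be overestimated, which is the problem the conservative penalty in CQL is meant to address.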


This review was generated by AI and checked by a human editor.