QUICK REVIEW

[Paper Review] Learning from Demonstrations via Capability-Aware Goal Sampling

Ye Duan, Yuning Wang|arXiv (Cornell University)|Jan 13, 2026

Reinforcement Learning in Robotics | Citations: 0

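One-line summary

Cago introduces capability-aware goal sampling to guide learning from demonstrations, forming an adaptive curriculum that improves sample efficiency and final performance on long-horizon, sparse-reward tasks. It uses demonstration-aligned Go-Explore rollouts, a BC Explorer, and a world-model-based imagination loop.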

ABSTRACT

Despite its promise, imitation learning often fails in long-horizon environments where perfect replication of demonstrations is unrealistic and small errors can accumulate catastrophically. We introduce Cago (Capability-Aware Goal Sampling), a novel learning-from-demonstrations method that mitigates the brittle dependence on expert trajectories for direct imitation. Unlike prior methods that rely on demonstrations only for policy initialization or reward shaping, Cago dynamically tracks the agent's competence along expert trajectories and uses this signal to select intermediate steps--goals that are just beyond the agent's current reach--to guide learning. This results in an adaptive curriculum that enables steady progress toward solving the full task. Empirical results demonstrate that Cago significantly improves sample efficiency and final performance across a range of sparse-reward, goal-conditioned tasks, consistently outperforming existing learning-from-demonstrations baselines.

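Motivation and objectives

Motivate learning from demonstrations in long-horizon, sparse-reward tasks where exact imitation is unrealistic.
Propose a framework that uses demonstrations to support goal-directed learning rather than as targets for direct imitation.
Develop a capability-aware mechanism that samples intermediate goals located at the boundary of the agent's current capability.
Introduce a goal predictor that automates goal estimation at test time, enabling generalization beyond the demonstration endpoints.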

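Proposed method

Represent demonstrations as structured roadmaps and track the agent's ability to reach each stage of every demonstration.
Maintain a visit dictionary, Dict_visit, that records which demonstration observations the agent has come close to.
Sample an intermediate goal g from the capability-aware region G_cap, which is centered on the agent's current capability along the demonstration.
Train the goal-conditioned policy pi^G to reach the sampled goal with a two-phase Go-Explore rollout: the Go phase heads to g, and the Explore phase continues with the BC Explorer.
Combine an imagined-rollout loop around the demonstration region with a Dreamer-style world model, augmenting the training data with imagined trajectories guided by the temporal distance reward D_t(s,g).
Introduce a goal predictor P_phi that infers a plausible goal from the current observation at test time, enabling generalization without access to the true final goal.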
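
The visit tracking and capability-aware sampling steps above can be illustrated with a minimal sketch. This is not the authors' implementation: the Euclidean closeness test, the `radius` threshold, the `window` half-width of G_cap, and all function names are assumptions made for illustration. It only shows the idea of recording which demonstration steps the agent has come close to and then sampling a goal from a window centered on the furthest step reached.

```python
import random
from collections import defaultdict


def update_visits(dict_visit, demo_id, demo_obs, agent_obs, radius=0.1):
    """Count a visit for every demo step whose observation lies within
    `radius` (Euclidean distance, an assumed closeness test) of agent_obs."""
    for t, obs in enumerate(demo_obs):
        dist = sum((a - b) ** 2 for a, b in zip(obs, agent_obs)) ** 0.5
        if dist <= radius:
            dict_visit[demo_id][t] += 1


def capability_index(dict_visit, demo_id, demo_len, min_visits=1):
    """Furthest demo step visited at least `min_visits` times (0 if none)."""
    reached = [t for t in range(demo_len) if dict_visit[demo_id][t] >= min_visits]
    return max(reached, default=0)


def sample_goal(dict_visit, demo_id, demo_obs, window=2, rng=random):
    """Sample a goal from an assumed G_cap: the demo steps within `window`
    of the capability index, clipped to the demonstration's length."""
    c = capability_index(dict_visit, demo_id, len(demo_obs))
    lo = max(0, c - window)
    hi = min(len(demo_obs) - 1, c + window)
    t = rng.randint(lo, hi)  # inclusive on both ends
    return t, demo_obs[t]


# Toy usage: a 6-step demonstration along the x-axis.
dict_visit = defaultdict(lambda: defaultdict(int))
demo = [(float(i), 0.0) for i in range(6)]
update_visits(dict_visit, "demo0", demo, agent_obs=(1.0, 0.05))
step, goal = sample_goal(dict_visit, "demo0", demo, rng=random.Random(0))
```

In this toy setup the agent has only come close to step 1, so goals are drawn from steps 0 to 3. As the agent reaches later steps, the window slides forward along the demonstration, which is the curriculum effect the review describes.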

Figure 1: Illustration of the Cago. Left: Directly setting the final goal as the agent’s target often leads to failure, as the current policy $\pi^{G}$ may not yet be capable of reaching it. The shaded region illustrates the set of states currently reachable under $\pi^{G}$ . Attempting to reach $g_

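Experimental results

Research questions

RQ1: Does Cago outperform existing imitation learning baselines that use demonstrations in different ways?
RQ2: Does capability-aware goal sampling align with the agent's learning progress and improve training efficiency?
RQ3: How essential are capability-aware goal sampling and the BC-Explorer component to Cago's performance?

Key findings

Cago consistently outperforms the baselines in both final performance and learning efficiency on very hard MetaWorld tasks.
On Adroit tasks, Cago reaches higher final performance after extended training and surpasses comparable Dreamer-based approaches.
On ManiSkill tasks, Cago is shown to be the only method that achieves high success with the given demonstrations.
In ablations, removing either capability-aware goal sampling or the BC-Explorer causes a substantial drop in performance, underscoring their importance.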

Figure 2: The workflow of the goal predictor $\mathcal{P}_{\phi}$ .

This review was generated by AI and checked by a human editor.