QUICK REVIEW

[Paper Review] Learning from Demonstrations via Capability-Aware Goal Sampling

Ye Duan, Yuning Wang|arXiv (Cornell University)|2026. 01. 13.

Reinforcement Learning in Robotics|Citations: 0

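One-Line Summary

Cago introduces capability-aware goal sampling to guide learning from demonstrations: it forms intermediate goals at the boundary of the agent's current capability, improving sample efficiency and final performance on long-horizon, sparse-reward tasks. It uses a demonstration-aligned Go-Explore scheme with a BC explorer and a world-model-based imagination loop.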

ABSTRACT

Despite its promise, imitation learning often fails in long-horizon environments where perfect replication of demonstrations is unrealistic and small errors can accumulate catastrophically. We introduce Cago (Capability-Aware Goal Sampling), a novel learning-from-demonstrations method that mitigates the brittle dependence on expert trajectories for direct imitation. Unlike prior methods that rely on demonstrations only for policy initialization or reward shaping, Cago dynamically tracks the agent's competence along expert trajectories and uses this signal to select intermediate steps--goals that are just beyond the agent's current reach--to guide learning. This results in an adaptive curriculum that enables steady progress toward solving the full task. Empirical results demonstrate that Cago significantly improves sample efficiency and final performance across a range of sparse-reward, goal-conditioned tasks, consistently outperforming existing learning-from-demonstrations baselines.

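Research Motivation and Goals

Address the failure of imitation learning on long-horizon, sparse-reward tasks where exact replication of demonstrations is unrealistic.
Propose a framework that uses demonstrations as scaffolding for goal-directed learning rather than as targets for direct imitation.
Develop a capability-aware mechanism that samples intermediate goals lying at the boundary of the agent's current capability.
Enable automatic goal inference at test time through a goal predictor, so that the agent can generalize beyond the demonstrated endpoints.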

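Proposed Method

Represent the demonstrations as a structured roadmap and track how far along each demonstration the agent is able to reach.
Maintain a visit dictionary, Dict_visit, to monitor which demonstration observations the agent has come close to.
Sample an intermediate goal g from a capability-aware region G_cap centered on the agent's current capability along the demonstration.
Train a goal-conditioned policy pi^G to reach the sampled goals, using a two-phase Go-Explore rollout (a Go phase toward g, followed by an Explore phase with a BC Explorer).
Introduce a Dreamer-style imagination rollout loop that augments the training data with imagined trajectories around the demonstration region, guided by a temporal-distance reward D_t(s,g).
Introduce a goal predictor P_phi that infers a plausible goal from the current observation at test time, enabling generalization without access to the true final goal.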
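
The visit-tracking and goal-sampling steps in the list above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the review does not give the exact definition of Dict_visit or G_cap, so the class name `CapabilityAwareGoalSampler`, the parameters `reach_threshold` and `window`, the Euclidean distance test, and the choice of "furthest demonstration step visited at least once" as the capability estimate are all assumptions made for illustration.

```python
import random


class CapabilityAwareGoalSampler:
    """Hypothetical sketch of visit tracking plus capability-aware goal sampling."""

    def __init__(self, demos, reach_threshold=0.1, window=3, seed=0):
        # demos: list of demonstrations, each a list of observation tuples.
        self.demos = demos
        self.reach_threshold = reach_threshold  # assumed "close enough" radius
        self.window = window                    # assumed half-width of G_cap
        self.rng = random.Random(seed)
        # dict_visit[i][t] counts how often the agent came close to
        # observation t of demonstration i.
        self.dict_visit = {i: [0] * len(d) for i, d in enumerate(demos)}

    @staticmethod
    def _distance(a, b):
        # Euclidean distance between two observations (an assumption).
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def update(self, demo_idx, trajectory):
        """Record which demonstration observations a rollout came close to."""
        demo = self.demos[demo_idx]
        for obs in trajectory:
            for t, demo_obs in enumerate(demo):
                if self._distance(obs, demo_obs) <= self.reach_threshold:
                    self.dict_visit[demo_idx][t] += 1

    def capability(self, demo_idx):
        """Furthest demonstration step visited at least once (0 if none)."""
        visits = self.dict_visit[demo_idx]
        reached = [t for t, count in enumerate(visits) if count > 0]
        return max(reached) if reached else 0

    def sample_goal(self, demo_idx):
        """Sample a goal uniformly from a window around the capability index."""
        demo = self.demos[demo_idx]
        c = self.capability(demo_idx)
        lo = max(0, c - self.window)
        hi = min(len(demo) - 1, c + self.window)
        t = self.rng.randint(lo, hi)  # randint is inclusive on both ends
        return t, demo[t]


# Usage: one 10-step demonstration along the x-axis.
demo = [(float(t), 0.0) for t in range(10)]
sampler = CapabilityAwareGoalSampler([demo], reach_threshold=0.25, window=2)
sampler.update(0, [(0.0, 0.0), (1.1, 0.0), (2.9, 0.1)])
step, goal = sampler.sample_goal(0)
```

In this toy setup the rollout comes close to demonstration steps 0, 1, and 3, so the capability estimate is step 3 and goals are drawn from steps 1 through 5. As the agent reaches later steps, the window moves forward along the demonstration, which is the curriculum effect the method description refers to.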

Figure 1: Illustration of Cago. Left: Directly setting the final goal as the agent’s target often leads to failure, as the current policy $\pi^{G}$ may not yet be capable of reaching it. The shaded region illustrates the set of states currently reachable under $\pi^{G}$. Attempting to reach $g_

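Experimental Results

Research Questions

RQ1: Does Cago outperform existing imitation learning baselines that leverage demonstrations in alternative ways?
RQ2: Does capability-aware goal sampling stay aligned with the agent's evolving learning progress and thereby improve learning efficiency?
RQ3: How essential are capability-aware goal sampling and the BC-Explorer component to Cago's performance?

Key Results

On very hard MetaWorld tasks, Cago consistently outperforms the baselines in both final performance and learning efficiency.
On Adroit tasks, Cago reaches higher final performance after extended training, surpassing comparable Dreamer-based methods.
On ManiSkill tasks, Cago is the only method that achieves a high success rate with the given demonstrations.
In ablation experiments, removing either capability-aware goal sampling or the BC-Explorer causes a large drop in performance, underscoring the importance of both components.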

Figure 2: The workflow of the goal predictor $\mathcal{P}_{\phi}$.

This review was generated by AI and reviewed by a human editor.