QUICK REVIEW

[Paper Review] Automatic Curriculum Learning through Value Disagreement

Yunzhi Zhang, Pieter Abbeel | arXiv (Cornell University) | 2020-06-17

Reinforcement Learning in Robotics | References: 40 | Citations: 34

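One-line summary

This paper introduces Value Disagreement Sampling (VDS), which automatically curates a curriculum for goal-conditioned reinforcement learning by using an ensemble of value functions to sample goals at the learning frontier.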

ABSTRACT

Continually solving new, unsolved tasks is the key to learning diverse behaviors. Through reinforcement learning (RL), we have made massive strides towards solving tasks that have a single goal. However, in the multi-task domain, where an agent needs to reach multiple goals, the choice of training goals can largely affect sample efficiency. When biological agents learn, there is often an organized and meaningful order to which learning happens. Inspired by this, we propose setting up an automatic curriculum for goals that the agent needs to solve. Our key insight is that if we can sample goals at the frontier of the set of goals that an agent is able to reach, it will provide a significantly stronger learning signal compared to randomly sampled goals. To operationalize this idea, we introduce a goal proposal module that prioritizes goals that maximize the epistemic uncertainty of the Q-function of the policy. This simple technique samples goals that are neither too hard nor too easy for the agent to solve, hence enabling continual improvement. We evaluate our method across 13 multi-goal robotic tasks and 5 navigation tasks, and demonstrate performance gains over current state-of-the-art methods.

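Research motivation and goals

Motivate automatic curriculum learning as a way to improve sample efficiency in multi-goal RL.
Use frontier goal sampling to provide an informative learning signal.
Develop a goal proposal module that uses epistemic uncertainty estimated from an ensemble of value functions.
Demonstrate the method's effectiveness on a variety of robotic and navigation tasks.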

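Proposed method

Define a Goal Proposal Module that samples goals from a policy-dependent distribution C^π.
Estimate the epistemic uncertainty of the goal-conditioned Q-function with an ensemble of K Q-functions.
Compute a sampling distribution from the ensemble uncertainty and sample goals from it.
Collect trajectories with the sampled goals and update both the policy and the Q-functions with standard RL updates.
Integrate with Hindsight Experience Replay (HER) to handle sparse rewards; DDPG is evaluated as the base RL algorithm.
Algorithm 1 summarizes curriculum generation via Value Disagreement Sampling (VDS).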
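
The goal proposal step described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes that disagreement is measured as the standard deviation of the K ensemble predictions and that sampling probabilities are proportional to that disagreement; the summary does not state either choice. The toy Q-functions, the policy, the 1-D goal space, and every function name below are hypothetical and exist only for illustration.

```python
import numpy as np


def make_toy_q(threshold, sharpness=0.1):
    """Hypothetical ensemble member: predicts a high value when the goal
    lies closer to the state than `threshold`, and a low value otherwise."""
    def q(state, action, goal):
        distance = abs(goal - state)
        return 1.0 / (1.0 + np.exp((distance - threshold) / sharpness))
    return q


def value_disagreement(q_ensemble, policy, state, goals):
    """Standard deviation across ensemble members for each candidate goal
    (assumed disagreement measure)."""
    # q_values[k, i] = Q_k(state, policy(state, g_i), g_i)
    q_values = np.array(
        [[q(state, policy(state, g), g) for g in goals] for q in q_ensemble]
    )
    return q_values.std(axis=0)


def goal_distribution(disagreement):
    """Normalize disagreement scores into sampling probabilities
    (assumed proportional rule; uniform if all members agree)."""
    total = disagreement.sum()
    if total <= 0.0:
        return np.full(disagreement.shape, 1.0 / disagreement.size)
    return disagreement / total


# Toy setup: K = 3 members whose notion of "reachable" differs slightly.
rng = np.random.default_rng(0)
q_ensemble = [make_toy_q(c) for c in (0.9, 1.0, 1.1)]
policy = lambda state, goal: 0.0          # placeholder action, unused by the toy Q
goals = np.linspace(0.0, 2.0, 21)         # candidate goals on a line

disagreement = value_disagreement(q_ensemble, policy, 0.0, goals)
probs = goal_distribution(disagreement)
sampled = rng.choice(goals, size=5, p=probs)  # goals for the next rollouts

print("most likely goal:", goals[np.argmax(probs)])
print("sampled goals:", sampled)
```

In this toy setup the members agree on very near goals (all predict success) and on very far goals (all predict failure), so most of the probability mass falls on goals at intermediate distance, which is the frontier behavior the summary describes.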

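Experimental results

Research questions

RQ1: Does Value Disagreement Sampling (VDS) improve sample efficiency over baseline goal-conditioned RL methods?
RQ2: Do the goals sampled by VDS represent informative, frontier-style challenges that improve learning?
RQ3: How robust is VDS to design choices such as the sampling function, the ensemble size, and the combination with HER?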

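Key results

VDS improves sample efficiency on 18 sparse-reward tasks spanning manipulation and navigation domains.
VDS tends to sample goals at the learning frontier, shifting to harder goals as the policy becomes more proficient.
VDS outperforms baselines such as HER, GoalGAN, and other curricula in most of the evaluated environments.
The benefit of VDS persists across different sampling functions and ensemble sizes, and grows when VDS is combined with HER.
Combining VDS with HER yields the best performance in the reported experiments.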

This review was generated by AI and reviewed by a human editor.