QUICK REVIEW

[Paper Review] Learning Multi-Level Hierarchies with Hindsight

Andrew Levy, George Konidaris | arXiv (Cornell University) | 2017-12-04

Reinforcement Learning in Robotics | Citations: 75

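One-Line Summary

This paper introduces Hierarchical Actor-Critic (HAC), a hierarchical reinforcement learning framework that learns multiple levels of policies in parallel. It uses hindsight action and hindsight goal transitions to overcome non-stationarity and sparse rewards, which enables efficient learning in continuous state and action spaces.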

ABSTRACT

Hierarchical agents have the potential to solve sequential decision making tasks with greater sample efficiency than their non-hierarchical counterparts because hierarchical agents can break down tasks into sets of subtasks that only require short sequences of decisions. In order to realize this potential of faster learning, hierarchical agents need to be able to learn their multiple levels of policies in parallel so these simpler subproblems can be solved simultaneously. Yet, learning multiple levels of policies in parallel is hard because it is inherently unstable: changes in a policy at one level of the hierarchy may cause changes in the transition and reward functions at higher levels in the hierarchy, making it difficult to jointly learn multiple levels of policies. In this paper, we introduce a new Hierarchical Reinforcement Learning (HRL) framework, Hierarchical Actor-Critic (HAC), that can overcome the instability issues that arise when agents try to jointly learn multiple levels of policies. The main idea behind HAC is to train each level of the hierarchy independently of the lower levels by training each level as if the lower level policies are already optimal. We demonstrate experimentally in both grid world and simulated robotics domains that our approach can significantly accelerate learning relative to other non-hierarchical and hierarchical methods. Indeed, our framework is the first to successfully learn 3-level hierarchies in parallel in tasks with continuous state and action spaces.

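Motivation and Objectives

Motivate the use of hierarchies to accelerate learning in sequential decision-making tasks.
Develop a framework that learns multiple levels of policies in parallel despite non-stationary transitions.
Propose mechanisms (hindsight action/goal transitions and subgoal testing) that enable stable parallel learning under sparse rewards.
Demonstrate scalability to 2- and 3-level hierarchies in grid world and continuous robotics domains.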

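Proposed Method

Propose Hierarchical Actor-Critic (HAC), which transforms a single UMDP into multiple nested UMDPs, one per level of the hierarchy.
Use goal-conditioned policies in which each level outputs subgoals for the level below, and the bottom level outputs primitive actions.
Apply nested transition functions, so that a transition at a higher level depends on the entire hierarchy of policies below it.
Introduce hindsight action transitions, which simulate an optimal lower-level hierarchy, to stabilize learning across levels.
Introduce hindsight goal transitions, which extend Hindsight Experience Replay to the hierarchical setting, to handle sparse rewards.
Add subgoal testing transitions, which check whether a proposed subgoal is achievable by the current lower-level policies, and balance the learning signal.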
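
The three transition types listed above can be illustrated with a minimal Python sketch for a single subgoal-proposing level. It is not the authors' implementation. The `Transition` container, the helper names, the `GAMMA` value, the exact-match tolerance, the sparse -1/0 reward convention, the choice of the final reached state as the relabeled goal, and the `-horizon` penalty are all assumptions made for illustration; the summary above does not specify them.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple

State = Tuple[float, ...]
GAMMA = 0.98  # assumed discount factor (hypothetical value)


@dataclass(frozen=True)
class Transition:
    state: State
    action: State       # at a subgoal level, an action is a target state
    reward: float
    next_state: State
    goal: State
    discount: float


def reached(state: State, goal: State, tol: float = 1e-6) -> bool:
    """True if every coordinate of `state` is within `tol` of `goal`."""
    return all(abs(s - g) <= tol for s, g in zip(state, goal))


def sparse_reward(next_state: State, goal: State) -> Tuple[float, float]:
    """Assumed sparse convention: (0, 0) when the goal is reached, else (-1, GAMMA)."""
    return (0.0, 0.0) if reached(next_state, goal) else (-1.0, GAMMA)


def hindsight_action_transition(state: State, proposed_subgoal: State,
                                next_state: State, goal: State) -> Transition:
    """Record the state actually reached as the action when the subgoal was missed.

    This makes the stored transition look as if the lower levels had
    achieved whatever subgoal was 'proposed'.
    """
    action = proposed_subgoal if reached(next_state, proposed_subgoal) else next_state
    reward, discount = sparse_reward(next_state, goal)
    return Transition(state, action, reward, next_state, goal, discount)


def hindsight_goal_transitions(episode: List[Transition]) -> List[Transition]:
    """Relabel the goal of each transition with the final state reached (HER-style)."""
    new_goal = episode[-1].next_state
    relabeled = []
    for t in episode:
        reward, discount = sparse_reward(t.next_state, new_goal)
        relabeled.append(replace(t, goal=new_goal, reward=reward, discount=discount))
    return relabeled


def subgoal_testing_transition(state: State, proposed_subgoal: State,
                               next_state: State, goal: State,
                               horizon: int) -> Optional[Transition]:
    """Penalize a tested subgoal that the lower levels failed to reach.

    Returns None when the subgoal was reached, since no penalty applies.
    """
    if reached(next_state, proposed_subgoal):
        return None
    return Transition(state, proposed_subgoal, -float(horizon), next_state, goal, 0.0)
```

In this sketch, a level that proposes subgoal `(5.0,)` from state `(0.0,)` and ends up at `(2.0,)` would store `(2.0,)` as its action in the hindsight action transition, while a subgoal testing transition for the same step would keep `(5.0,)` as the action and attach the penalty.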

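Experimental Results

Research Questions

RQ1: Can HAC learn multiple levels of policies in parallel in both discrete and continuous domains?
RQ2: Does HAC enable parallel learning of 3-level hierarchies, and how does this compare with 2-level and flat baselines?
RQ3: Do hindsight action/goal transitions and subgoal testing transitions mitigate non-stationarity and improve learning efficiency?
RQ4: How does HAC perform relative to HIRO on continuous robotics tasks?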

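Key Findings

HAC significantly outperformed flat agents across discrete and continuous tasks.
3-level hierarchies learned in parallel outperformed 2-level hierarchies, which in turn outperformed flat learning.
In the experiments, HAC outperformed HIRO on three simulated robotics tasks.
Hindsight action and goal transitions, together with subgoal testing, enable stable parallel learning and mitigate the problems caused by non-stationary transitions.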


This review was generated by AI and reviewed by a human editor.