QUICK REVIEW

[Paper Review] Self-Consistent Trajectory Autoencoder: Hierarchical Reinforcement Learning with Trajectory Embeddings

John D. Co-Reyes, YuXuan Liu|arXiv (Cornell University)|2018. 06. 07.

Reinforcement Learning in Robotics | References: 28 | Citations: 67

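One-Line Summary

SeCTAR uses a trajectory-level VAE with a state decoder and a latent-conditioned policy decoder to learn a continuous latent space of trajectories, which enables model-based planning in that latent space for long-horizon, sparse-reward tasks.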

ABSTRACT

In this work, we take a representation learning perspective on hierarchical reinforcement learning, where the problem of learning lower layers in a hierarchy is transformed into the problem of learning trajectory-level generative models. We show that we can learn continuous latent representations of trajectories, which are effective in solving temporally extended and multi-stage problems. Our proposed model, SeCTAR, draws inspiration from variational autoencoders, and learns latent representations of trajectories. A key component of this method is to learn both a latent-conditioned policy and a latent-conditioned model which are consistent with each other. Given the same latent, the policy generates a trajectory which should match the trajectory predicted by the model. This model provides a built-in prediction mechanism, by predicting the outcome of closed loop policy behavior. We propose a novel algorithm for performing hierarchical RL with this model, combining model-based planning in the learned latent space with an unsupervised exploration objective. We show that our model is effective at reasoning over long horizons with sparse rewards for several simulated tasks, outperforming standard reinforcement learning methods and prior methods for hierarchical reasoning, model-based planning, and exploration.

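Research Motivation and Goals

Argues that hierarchical RL needs representation learning that models trajectories rather than primitive actions.
Proposes a continuous latent space of skills that supports temporally extended, reusable behaviors.
Develops a two-headed decoder framework (a state decoder and a policy decoder) that enforces consistency and enables planning.
Integrates model-based planning in the latent space with an unsupervised exploration objective to handle sparse rewards.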

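Proposed Method

Extends the variational autoencoder framing to trajectories with a trajectory encoder q_phi(z|tau).
Uses a state decoder p_theta_SD(tau|z) to generate a trajectory from the latent variable z.
Introduces a policy decoder p_theta_PD(a|s,z) that is executed in the environment to realize the latent trajectory.
Enforces consistency between the two decoders by minimizing KL(p_theta_PD(tau|z) || p_theta_SD(tau|z)) while maximizing the ELBO.
Models state trajectories with a recurrent network and trains the policy decoder as a feedforward network.
Plans in the latent space with model predictive control, treating the state decoder as a predictive model of closed-loop behavior.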
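
As a rough sketch, the two training signals listed above (the ELBO and the decoder-consistency KL) can be combined into a single objective. The form below is an assumption made for illustration: the prior p(z) is taken to be a fixed distribution such as a standard Gaussian, both terms are weighted equally, and z in the consistency term is a latent sampled from the encoder. The paper's exact objective and weighting may differ.

```latex
\max_{\theta_{SD},\,\theta_{PD},\,\phi}\;
\underbrace{\mathbb{E}_{q_\phi(z\mid\tau)}\!\left[\log p_{\theta_{SD}}(\tau\mid z)\right]
- D_{\mathrm{KL}}\!\left(q_\phi(z\mid\tau)\,\|\,p(z)\right)}_{\text{ELBO}}
\;-\;
\underbrace{D_{\mathrm{KL}}\!\left(p_{\theta_{PD}}(\tau\mid z)\,\|\,p_{\theta_{SD}}(\tau\mid z)\right)}_{\text{decoder consistency}}
```

Here p_theta_PD(tau|z) denotes the distribution over trajectories obtained by running the policy decoder p_theta_PD(a|s,z) in the environment, so driving the last term toward zero makes the state decoder's prediction match what the policy actually does for the same z.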

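Experimental Results

Research Questions

RQ1: Can a continuous latent space of trajectories be learned without hand-specified subgoals or discrete skills?
RQ2: Does jointly training a trajectory-level VAE and a latent-conditioned policy enable reliable planning over long horizons?
RQ3: Does model-based planning in the latent space, aided by an entropy-based exploration objective, improve performance on sparse-reward tasks?
RQ4: Does the state decoder provide meaningful outcome predictions for high-level latent actions?
RQ5: How does SeCTAR compare with existing model-free, model-based, and hierarchical RL methods on long-horizon tasks?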

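Key Results

SeCTAR enables planning over extended trajectories and outperforms several baselines on long-horizon, sparse-reward tasks.
The latent-space MPC planner uses the state decoder as a trajectory predictor to select the latent actions that maximize reward.
Joint training yields a consistent state decoder and policy decoder, enabling closed-loop planning and better exploration.
Unsupervised exploration guided by the entropy of the marginal trajectory distribution improves state-space coverage and exploration quality.
Interpolation in the latent space produces coherent trajectories, suggesting a meaningful and generalizable latent representation.
SeCTAR achieves higher performance and sample efficiency than TRPO, A3C, VIME, FeUdal Networks, and option-critic on the tested tasks.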
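
The latent-space MPC planner mentioned above can be illustrated with a minimal random-shooting sketch: sample candidate latents, decode each into a predicted trajectory, score it with the reward, and keep the best. Everything in this block is a hypothetical stand-in rather than the paper's implementation: `toy_state_decoder` replaces the learned state decoder with a straight-line rollout, `sparse_reward` is an invented goal-reaching reward, and the distance-based tie-break is an assumption added so the toy example behaves sensibly when no candidate reaches the goal.

```python
import math
import random


def toy_state_decoder(z, start, horizon=10):
    """Hypothetical stand-in for the learned state decoder p_SD(tau|z).

    The 2-D latent z is read as a constant velocity, so the predicted
    trajectory is a straight line starting from `start`.
    """
    x, y = start
    traj = []
    for _ in range(horizon):
        x, y = x + z[0], y + z[1]
        traj.append((x, y))
    return traj


def sparse_reward(state, goal, radius=0.5):
    """Sparse reward: 1 inside a small ball around the goal, else 0."""
    return 1.0 if math.dist(state, goal) <= radius else 0.0


def plan_latent(start, goal, decoder, n_candidates=500, rng=None):
    """Random-shooting MPC over latents.

    Samples z ~ N(0, I), decodes each z into a predicted trajectory, and
    returns the z with the highest predicted return. Ties (common with a
    sparse reward) are broken by final distance to the goal, which is an
    assumption of this sketch.
    """
    rng = rng or random.Random(0)
    best_z, best_score = None, None
    for _ in range(n_candidates):
        z = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        traj = decoder(z, start)
        predicted_return = sum(sparse_reward(s, goal) for s in traj)
        score = (predicted_return, -math.dist(traj[-1], goal))
        if best_score is None or score > best_score:
            best_z, best_score = z, score
    return best_z, best_score


if __name__ == "__main__":
    z, (ret, neg_dist) = plan_latent((0.0, 0.0), (5.0, 5.0), toy_state_decoder)
    print("chosen latent:", z, "predicted return:", ret)
```

In a full MPC loop the chosen latent would be handed to the policy decoder, executed in the environment for one segment, and planning would then repeat from the newly observed state. The planner never queries the environment while scoring candidates, which is why a state decoder that stays consistent with the policy decoder matters.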


This review was generated by AI and reviewed by a human editor.