QUICK REVIEW

[Paper Review] Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration

Desik Rengarajan, Gargi Vaidya | arXiv (Cornell University) | 2022. 02. 09.

Reinforcement Learning in Robotics | Citations: 26

One-Line Summary

LOGO uses offline demonstration data to guide online TRPO learning in sparse-reward RL, achieving near-optimal performance and robust task completion even under incomplete observations.

ABSTRACT

A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is completed partially or fully. However, the lack of carefully designed, fine grain feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame. This is because of the large number of exploration actions that the policy has to perform before it gets any useful feedback that it can learn from. In this work, we address this challenging problem by developing an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy for faster and efficient online RL in such sparse reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step by using the offline demonstration data. The key idea is that by obtaining guidance from - not imitating - the offline data, LOGO orients its policy in the manner of the sub-optimal policy, while yet being able to learn beyond and approach optimality. We provide a theoretical analysis of our algorithm, and provide a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete observation setting, where the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored state. Further, we demonstrate the value of our approach via implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance.

Motivation and Objectives

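Address the challenge of learning under sparse reward signals in RL.
Use offline demonstration data from a sub-optimal policy to bootstrap and guide online learning.
Develop the two-step LOGO framework, which combines policy improvement with a demonstration-based policy guidance step.
Provide theoretical guarantees on performance improvement and extend the method to the incomplete observation setting.
Demonstrate effectiveness on MuJoCo benchmarks and in real robot experiments (TurtleBot).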

Proposed Method

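Uses TRPO in the policy improvement step to generate a candidate policy.
Adds a policy guidance step that searches, within a trust region around the candidate policy, for a policy close to the offline behavior policy.
Introduces a surrogate objective that approximates the policy-dependent KL divergence using samples from the intermediate policy.
Extends the performance difference lemma to policy-dependent rewards, which gives the surrogate objective its theoretical support.
Derives two implementable TRPO-like updates based on Taylor series approximations.
Extends LOGO to incomplete observations by projecting states and training a discriminator to estimate the policy-dependent reward from partial data.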
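The policy guidance step described above can be illustrated with a toy sketch for a single state with a few discrete actions. This is not the paper's update, which the review describes as a TRPO-like step derived from Taylor approximations. Here the search is simplified, as an assumption for illustration, to a grid over mixtures of the candidate policy and the behavior policy. The names `kl`, `guidance_step`, `pi_half`, `pi_b`, and `delta` are hypothetical.

```python
import numpy as np


def kl(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions over the same actions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def guidance_step(pi_half, pi_b, delta, n_grid=1001):
    """Toy policy-guidance step for one state (illustrative assumption).

    Searches mixtures pi = (1 - a) * pi_half + a * pi_b for a in [0, 1] and
    returns the mixture with the smallest KL(pi || pi_b) among those that
    satisfy the trust-region constraint KL(pi || pi_half) <= delta.
    """
    pi_half = np.asarray(pi_half, dtype=float)
    pi_b = np.asarray(pi_b, dtype=float)
    best = pi_half                 # a = 0 is always feasible
    best_kl = kl(pi_half, pi_b)
    for a in np.linspace(0.0, 1.0, n_grid):
        pi = (1.0 - a) * pi_half + a * pi_b
        if kl(pi, pi_half) <= delta and kl(pi, pi_b) < best_kl:
            best, best_kl = pi, kl(pi, pi_b)
    return best


# Candidate policy from the improvement step and a sub-optimal behavior policy.
pi_half = np.array([0.7, 0.2, 0.1])
pi_b = np.array([0.1, 0.2, 0.7])

pi_next = guidance_step(pi_half, pi_b, delta=0.01)
print("guided policy:", pi_next)
print("KL to candidate:", kl(pi_next, pi_half))
print("KL to behavior :", kl(pi_next, pi_b))
```

A small `delta` keeps the guided policy near the candidate, so the behavior policy only nudges it. A large `delta` lets the result collapse onto the behavior policy, which would amount to imitation rather than guidance.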

Experimental Results

Research Questions

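RQ1: Can LOGO use offline demonstrations to improve on plain TRPO in sparse-reward environments?
RQ2: How does guidance from a sub-optimal behavior policy affect exploration and sample efficiency?
RQ3: What theoretical guarantees exist on the performance improvement per learning episode?
RQ4: Can LOGO be extended to the incomplete state observation setting while remaining effective?
RQ5: Do the results on MuJoCo benchmarks carry over to robot tasks such as Gazebo/TurtleBot?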

Key Findings

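LOGO learns quickly and reaches near-optimal performance in sparse-reward environments, outperforming baseline TRPO and imitation learning approaches.
The two-step LOGO procedure (policy improvement + policy guidance) provides formal performance guarantees and accelerates early learning through behavior-policy guidance.
Despite sparse rewards, LOGO can approach the performance of algorithms optimized with dense rewards on standard MuJoCo benchmarks.
It maintains strong performance when extended to the incomplete observation setting with a discriminator-based method that uses a surrogate objective for the policy-dependent reward.
LOGO performs waypoint tracking and obstacle avoidance effectively with a TurtleBot in Gazebo, and also performs well in real-world experiments.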

This review was generated by AI and reviewed by a human editor.