QUICK REVIEW

[Paper Review] AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

Ashvin Nair, Abhishek Gupta|arXiv (Cornell University)|2020. 06. 16.

Reinforcement Learning in Robotics|References 60|Citations 71

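One-line summary

AWAC is an off-policy actor-critic algorithm that learns from offline datasets and then fine-tunes efficiently online using an implicit constraint on the actor, enabling rapid skill acquisition from demonstrations or suboptimal data.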

ABSTRACT

Reinforcement learning (RL) provides an appealing formalism for learning control policies from experience. However, the classic active formulation of RL necessitates a lengthy active exploration process for each behavior, making it difficult to apply in real-world settings such as robotic control. If we can instead allow RL algorithms to effectively use previously collected data to aid the online learning process, such applications could be made substantially more practical: the prior data would provide a starting point that mitigates challenges due to exploration and sample complexity, while the online training enables the agent to perfect the desired skill. Such prior data could either constitute expert demonstrations or sub-optimal prior data that illustrates potentially useful transitions. While a number of prior methods have either used optimal demonstrations to bootstrap RL, or have used sub-optimal data to train purely offline, it remains exceptionally difficult to train a policy with offline data and actually continue to improve it further with online RL. In this paper we analyze why this problem is so challenging, and propose an algorithm that combines sample efficient dynamic programming with maximum likelihood policy updates, providing a simple and effective framework that is able to leverage large amounts of offline data and then quickly perform online fine-tuning of RL policies. We show that our method, advantage weighted actor critic (AWAC), enables rapid learning of skills with a combination of prior demonstration data and online experience. We demonstrate these benefits on simulated and real-world robotics domains, including dexterous manipulation with a real multi-fingered hand, drawer opening with a robotic arm, and rotating a valve. Our results show that incorporating prior data can reduce the time required to learn a range of robotic skills to practical time-scales.

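Research motivation and goals

Make RL practical for real-world robotics by efficiently using large offline datasets to pre-train policies.
Develop a simple, data-efficient algorithm that combines offline pre-training with online fine-tuning without requiring an explicit model of the behavior policy.
Demonstrate that incorporating prior data shortens online learning time across a range of robotic tasks.
Evaluate robustness to suboptimal offline data and demonstrate real-world applicability.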

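Proposed method

Estimate Q^π(s, a) with an off-policy critic trained by TD bootstrapping.
Improve the policy by maximizing the advantage A^{π_k}(s,a) under a KL-style implicit constraint, without an explicit behavior model.
Derive the closed-form non-parametric actor solution π*(a|s) ∝ π_β(a|s) exp(A^{π_k}(s,a)/λ), then project it onto a parametric policy by minimizing a forward KL divergence.
Parameterize the actor and critic with neural networks; update the actor by advantage-weighted maximum likelihood (Eq. 13), a supervised-learning-like objective whose advantages come from the learned critic.
Use a replay buffer β that holds both offline and online data; after the offline phase, online data is added to the buffer and is initially scarce relative to the offline data.
Compare against AWR and ABM/MPO-style methods to show the benefit of TD bootstrapping and of not requiring an explicit behavior model.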
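
The steps above can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a single state with three discrete actions, a known behavior distribution, and fixed advantages, so that the weighted maximum-likelihood fit can be checked against the closed-form solution π* ∝ π_β exp(A/λ). All function names, the `max_weight` clip, and the toy numbers are hypothetical choices made for this illustration.

```python
import numpy as np


def td_target(reward, next_q, done, gamma=0.99):
    """One-step bootstrapped critic target: r + gamma * Q(s', a') if not terminal."""
    return reward + gamma * (1.0 - done) * next_q


def awac_weights(q_sa, v_s, lam=1.0, max_weight=20.0):
    """Per-sample actor weights exp(A / lambda) with A = Q(s, a) - V(s).

    v_s is assumed to be a baseline such as the policy's expected Q-value.
    The clip at max_weight is an assumption added here for numerical safety.
    """
    advantage = q_sa - v_s
    return np.minimum(np.exp(advantage / lam), max_weight)


def weighted_nll(log_prob, weights):
    """Advantage-weighted negative log-likelihood of buffer actions."""
    return -np.mean(weights * log_prob)


def softmax(logits):
    shifted = logits - logits.max()  # subtract max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


def fit_policy(behavior, advantage, lam=1.0, steps=2000, lr=0.5):
    """Minimize the weighted NLL for one state by gradient descent on logits.

    Uses the exact expectation over the behavior distribution instead of
    sampled actions, so the result is deterministic.
    """
    coeff = behavior * np.exp(advantage / lam)  # weight mass per action
    logits = np.zeros_like(coeff)
    for _ in range(steps):
        pi = softmax(logits)
        grad = pi * coeff.sum() - coeff  # gradient of the loss w.r.t. logits
        logits -= lr * grad
    return softmax(logits)


# Toy check: the fitted softmax policy should match pi* ∝ pi_beta * exp(A / lambda).
behavior = np.array([0.5, 0.3, 0.2])
advantage = np.array([-1.0, 0.0, 1.0])
fitted = fit_policy(behavior, advantage)
closed_form = behavior * np.exp(advantage)
closed_form /= closed_form.sum()
print(np.round(fitted, 4), np.round(closed_form, 4))
```

In this toy setting the policy is fit using only actions weighted by the behavior distribution and the advantages. No density model of π_β is ever evaluated, which is the sense in which the constraint is implicit.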

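Experimental results

Research questions

RQ1: Can AWAC effectively combine offline pre-training with online fine-tuning to learn complex robotic control tasks?
RQ2: How does suboptimal or random offline data, compared with demonstrations, affect AWAC's performance?
RQ3: Does avoiding explicit behavior modeling improve the efficiency and stability of online fine-tuning?
RQ4: How does AWAC perform against existing offline and online RL methods on high-dimensional, sparse-reward robotic tasks?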

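Key results

AWAC learns quickly across a range of robotic tasks from offline data plus online fine-tuning, and extends to dexterous manipulation and real-world experiments.
AWAC fine-tunes more efficiently than purely offline or purely online baselines and solves challenging tasks even with limited online data (e.g., 120K timesteps on the pen task).
The method can use demonstrations, suboptimal data, or random exploration data without algorithmic changes, and still reduces the amount of online data needed.
By avoiding explicit behavior policy modeling, AWAC is less conservative than prior offline RL approaches and refines policies online more effectively.
TD bootstrapping for the critic and the implicit constraint on the actor are the key design choices; variants lacking either perform worse.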


This review was generated by AI and reviewed by a human editor.