QUICK REVIEW

[Paper Review] Go-Explore: a New Approach for Hard-Exploration Problems

Adrien Ecoffet, Joost Huizinga|arXiv (Cornell University)|2019. 01. 30.

Reinforcement Learning in Robotics | References: 99 | Citations: 225

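One-Line Summary

Go-Explore is a two-phase algorithm: it remembers promising states, returns to them without exploring, explores from there, and finally robustifies the resulting solutions via imitation learning. With this approach it achieves superhuman performance on hard-exploration Atari games.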

ABSTRACT

A grand challenge in reinforcement learning is intelligent exploration, especially when rewards are sparse or deceptive. Two Atari games serve as benchmarks for such hard-exploration domains: Montezuma's Revenge and Pitfall. On both games, current RL algorithms perform poorly, even those with intrinsic motivation, which is the dominant method to improve performance on hard-exploration domains. To address this shortfall, we introduce a new algorithm called Go-Explore. It exploits the following principles: (1) remember previously visited states, (2) first return to a promising state (without exploration), then explore from it, and (3) solve simulated environments through any available means (including by introducing determinism), then robustify via imitation learning. The combined effect of these principles is a dramatic performance improvement on hard-exploration problems. On Montezuma's Revenge, Go-Explore scores a mean of over 43k points, almost 4 times the previous state of the art. Go-Explore can also harness human-provided domain knowledge and, when augmented with it, scores a mean of over 650k points on Montezuma's Revenge. Its max performance of nearly 18 million surpasses the human world record, meeting even the strictest definition of "superhuman" performance. On Pitfall, Go-Explore with domain knowledge is the first algorithm to score above zero. Its mean score of almost 60k points exceeds expert human performance. Because Go-Explore produces high-performing demonstrations automatically and cheaply, it also outperforms imitation learning work where humans provide solution demonstrations. Go-Explore opens up many new research directions into improving it and weaving its insights into current RL algorithms. It may also enable progress on previously unsolvable hard-exploration problems in many domains, especially those that harness a simulator during training (e.g. robotics).

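Motivation and Objectives

Address the exploration problem when rewards are sparse or deceptive.
Improve performance on hard-exploration Atari benchmarks without relying heavily on intrinsic motivation.
Develop a two-phase framework: Phase 1 explores using an archive of promising states, and Phase 2 robustifies the result via imitation learning.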

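Proposed Method

Maintain an archive of promising states (cells) together with how to reach each one.
At each iteration, select a cell, return to it deterministically, then explore from it with stochastic actions.
Update the archive whenever a new cell is discovered or a better path to a known cell is found; each entry stores the trajectory, state, score, and trajectory length.
Two cell representations are used: one without domain knowledge (a downsampled 11x8 grayscale image) and one with domain knowledge (e.g., agent position, room, keys).
Phase 2 robustifies the Phase 1 trajectories via imitation learning (the Backward Algorithm): the agent starts near the end of a trajectory, the starting point is moved backward toward the beginning, and training with PPO continues until the agent matches or exceeds the original score.
The method separates deterministic training (Phase 1) from stochastic evaluation: determinism makes resetting to saved states possible in Phase 1, and stochasticity is added in Phase 2 for robustness.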
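
The Phase 1 loop described above (select a cell, return to it, explore, update the archive) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `ChainEnv` is a hypothetical toy environment (a line with a single reward at the far end) standing in for an Atari emulator, `cell_of` is a coarse discretization standing in for the downsampled-image cell, and the visit-count selection weight is a simplified stand-in for the paper's cell-selection heuristic. All names are illustrative.

```python
import random
from dataclasses import dataclass


@dataclass
class Entry:
    state: tuple        # saved simulator state, used to return to the cell
    trajectory: list    # full action sequence from reset to this state
    score: float        # cumulative reward along the trajectory
    visits: int = 0     # how many times this cell was chosen to explore from


class ChainEnv:
    """Hypothetical deterministic toy task: one sparse reward at the far end."""

    def __init__(self, length=30):
        self.length = length
        self.reset()

    def reset(self):
        self.pos, self.collected = 0, False
        return self.pos

    def get_state(self):
        return (self.pos, self.collected)

    def set_state(self, state):
        self.pos, self.collected = state

    def step(self, action):  # action is -1 or +1
        self.pos = max(0, min(self.length, self.pos + action))
        reward = 0.0
        if self.pos == self.length and not self.collected:
            self.collected, reward = True, 1.0
        return self.pos, reward


def cell_of(obs):
    # Assumption: group 3 adjacent positions into one cell (a coarse "downsampling").
    return obs // 3


def is_better(score, length, entry):
    # Higher score wins; on a tie, the shorter trajectory wins.
    return score > entry.score or (score == entry.score and length < len(entry.trajectory))


def go_explore_phase1(env, iterations=2000, explore_steps=10, seed=0):
    rng = random.Random(seed)
    start = env.reset()
    archive = {cell_of(start): Entry(env.get_state(), [], 0.0)}
    for _ in range(iterations):
        # Select: favor cells that were chosen less often (simplified heuristic).
        cells = list(archive)
        weights = [1.0 / (1 + archive[c].visits) ** 0.5 for c in cells]
        entry = archive[rng.choices(cells, weights=weights)[0]]
        entry.visits += 1
        # Go: restore the saved state directly, with no exploration on the way.
        env.set_state(entry.state)
        trajectory, score = list(entry.trajectory), entry.score
        # Explore: take random actions and record every new or improved cell.
        for _ in range(explore_steps):
            action = rng.choice([-1, 1])
            obs, reward = env.step(action)
            trajectory.append(action)
            score += reward
            cell = cell_of(obs)
            known = archive.get(cell)
            if known is None or is_better(score, len(trajectory), known):
                visits = known.visits if known else 0
                archive[cell] = Entry(env.get_state(), list(trajectory), score, visits)
    return archive
```

Because the toy environment is deterministic, replaying the stored `trajectory` of any archive entry from `reset()` reproduces its `state` and `score`. That replayable trajectory is the kind of demonstration Phase 2 would then robustify.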

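Experimental Results

Research Questions

RQ1: Can an explicit archive of previously visited states, combined with returning to them before exploring, improve performance on hard-exploration tasks?
RQ2: Can deterministic exploration followed by robustification via imitation learning yield scalable improvements under sparse and deceptive rewards?
RQ3: How much does domain knowledge in the cell representation accelerate discovery and improve performance?
RQ4: How does Go-Explore address the detachment and derailment problems that arise in conventional intrinsic-motivation approaches?
RQ5: How large are the performance gains on Montezuma's Revenge and Pitfall with and without domain knowledge?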

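Key Results

On Montezuma's Revenge, Go-Explore without domain knowledge scores a mean of over 43,000 points, almost 4 times the previous state of the art.
With easily provided domain knowledge, Go-Explore scores a mean of over 650,000 points and a maximum of nearly 18 million, surpassing the human world record.
On Pitfall, Go-Explore with domain knowledge scores a mean of 59,494 points and a maximum of 107,363, close to the game's maximum possible score.
Without domain knowledge, the mean score on Montezuma's Revenge is 43,763, still well above prior work.
Because Go-Explore automatically generates high-performing demonstrations suitable for learning from demonstrations, it outperforms prior imitation learning results that relied on human demonstrations.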

This review was generated by AI and reviewed by a human editor.