QUICK REVIEW

[Paper Review] #Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning

Haoran Tang, Rein Houthooft | arXiv (Cornell University) | 2016-11-15

Reinforcement Learning in Robotics | References: 35 | Citations: 343

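One-Line Summary

This paper extends count-based exploration to high-dimensional state spaces using hashing and reports near state-of-the-art results on continuous control and Atari benchmarks.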

ABSTRACT

Count-based exploration algorithms are known to perform near-optimally when used in conjunction with tabular reinforcement learning (RL) methods for solving small discrete Markov decision processes (MDPs). It is generally thought that count-based methods cannot be applied in high-dimensional state spaces, since most states will only occur once. Recent deep RL exploration strategies are able to deal with high-dimensional continuous state spaces through complex heuristics, often relying on optimism in the face of uncertainty or intrinsic motivation. In this work, we describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks. States are mapped to hash codes, which allows to count their occurrences with a hash table. These counts are then used to compute a reward bonus according to the classic count-based exploration theory. We find that simple hash functions can achieve surprisingly good results on many challenging tasks. Furthermore, we show that a domain-dependent learned hash code may further improve these results. Detailed analysis reveals important aspects of a good hash function: 1) having appropriate granularity and 2) encoding information relevant to solving the MDP. This exploration strategy achieves near state-of-the-art performance on both continuous control tasks and Atari 2600 games, hence providing a simple yet powerful baseline for solving MDPs that require considerable exploration.

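Research Motivation and Goals

Address the need for robust exploration in deep RL with high-dimensional state spaces.
Generalize count-based exploration to continuous and complex state spaces using hashing.
Evaluate simple hash-based exploration on challenging deep RL benchmarks.
Analyze what makes hash-based exploration effective (granularity, encoding of task-relevant information).
Provide a fast, flexible baseline compatible with general deep RL algorithms.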

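Proposed Method

Discretize the state space with a hash function φ, count the occurrences n(φ(s)) of each hash code, and add a bonus of β / sqrt(n(φ(s))) to the reward.
Use SimHash as a practical, scalable hashing method for continuous states (Algorithm 1).
To explore learned hashing, train an autoencoder to produce binary codes and apply SimHash to those codes to obtain φ(s) (Algorithm 2).
Optionally apply domain-knowledge preprocessing (static features such as BASS) or learned representations to improve hash quality.
Evaluate with TRPO on the rllab continuous control benchmarks and Atari 2600 games, comparing against baselines and existing exploration methods.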
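
The counting step can be illustrated with a minimal Python sketch. It assumes SimHash is computed as the sign pattern of a fixed Gaussian random projection of the state and that the bonus is β / sqrt(n(φ(s))) as stated in the method summary. The class name `SimHashCounter`, its parameters (`k`, `beta`, `seed`), and the default values are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np
from collections import defaultdict


class SimHashCounter:
    """Illustrative count-based exploration bonus using SimHash codes."""

    def __init__(self, state_dim, k=16, beta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        # Fixed random projection matrix with i.i.d. standard Gaussian entries.
        # k is the number of hash bits (assumed name); it sets the granularity.
        self.A = rng.standard_normal((k, state_dim))
        self.beta = beta
        # Hash table mapping each code phi(s) to its visit count n(phi(s)).
        self.counts = defaultdict(int)

    def hash_code(self, state):
        # phi(s): sign pattern of the k random projections, packed as bytes
        # so it can be used as a dictionary key.
        bits = (self.A @ np.asarray(state, dtype=float)) >= 0
        return bits.astype(np.uint8).tobytes()

    def bonus(self, state):
        # Increment n(phi(s)), then return beta / sqrt(n(phi(s))).
        key = self.hash_code(state)
        self.counts[key] += 1
        return self.beta / np.sqrt(self.counts[key])


if __name__ == "__main__":
    counter = SimHashCounter(state_dim=4, k=8, beta=1.0)
    s = np.array([0.1, -0.2, 0.3, 0.4])
    print(counter.bonus(s))  # first visit to this hash code
    print(counter.bonus(s))  # second visit: the bonus shrinks
```

In this sketch the bonus would be added to the environment reward before the policy update. A larger `k` yields finer-grained codes, so fewer states share a count, while a smaller `k` merges more states into the same bucket.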

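Experimental Results

Research Questions

RQ1: Does hash-based count exploration improve performance on continuous control and Atari benchmarks?
RQ2: How do static and learned hash codes affect exploration performance on image-based observations?
RQ3: Which properties of a hash function (granularity, encoding of relevant information) contribute to exploration?
RQ4: How does hash-based exploration compare with state-of-the-art deep RL exploration methods?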

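Key Findings

Hash-based count exploration reaches near state-of-the-art performance on several benchmarks.
Static SimHash and learned hashing (AE-SimHash) can outperform baseline TRPO on many Atari games and continuous control tasks.
Domain-dependent preprocessing (BASS) or learned hash codes can yield substantial gains, particularly in games such as Montezuma's Revenge and Venture.
A simple hashing approach can provide a strong exploration signal without complex intrinsic-motivation schemes.
The approach is fast, flexible, and complementary to existing deep RL algorithms.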


This review was generated by AI and reviewed by a human editor.