QUICK REVIEW

[Paper Review] Never Give Up: Learning Directed Exploration Strategies

Adrià Puigdomènech Badia, Pablo Sprechmann|arXiv (Cornell University)|Feb 14, 2020

Reinforcement Learning in Robotics|38 references|80 citations

TL;DR

This paper introduces NGU, a reinforcement learning agent that learns a family of directed exploration policies using episodic and life-long novelty, trained with UVFA, achieving strong Atari results including non-zero rewards on Pitfall! without demonstrations.

ABSTRACT

We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies. We construct an episodic memory-based intrinsic reward using k-nearest neighbors over the agent's recent experience to train the directed exploratory policies, thereby encouraging the agent to repeatedly revisit all states in its environment. A self-supervised inverse dynamics model is used to train the embeddings of the nearest neighbour lookup, biasing the novelty signal towards what the agent can control. We employ the framework of Universal Value Function Approximators (UVFA) to simultaneously learn many directed exploration policies with the same neural network, with different trade-offs between exploration and exploitation. By using the same neural network for different degrees of exploration/exploitation, transfer is demonstrated from predominantly exploratory policies yielding effective exploitative policies. The proposed method can be incorporated to run with modern distributed RL agents that collect large amounts of experience from many actors running in parallel on separate environment instances. Our method doubles the performance of the base agent in all hard exploration in the Atari-57 suite while maintaining a very high score across the remaining games, obtaining a median human normalised score of 1344.0%. Notably, the proposed method is the first algorithm to achieve non-zero rewards (with a mean score of 8,400) in the game of Pitfall! without using demonstrations or hand-crafted features.

Motivation & Objective

Enable robust exploration in deep RL by learning directed exploration strategies.
Develop an intrinsic reward that combines episodic and life-long novelty to sustain exploration.
Share a single neural network across multiple exploration-exploitation trade-offs via UVFA.
Demonstrate scalability in distributed RL settings with large actor pools.

Proposed method

Compute an intrinsic reward r^i_t combining episodic novelty (via k-nearest neighbors in an episodic memory of controllable states) and life-long novelty (via Random Network Distillation).
Learn a controllable state embedding f(x) with a self-supervised inverse dynamics objective to bias novelty toward controllable aspects of the environment.
Use a UVFA Q(x,a,β) to learn a family of policies with different exploration weights β, enabling a spectrum from pure exploration to exploitation.
Train with a distributed, off-policy approach (R2D2) using transformed Retrace double Q-learning loss and prioritized replay.
Feed the agent the previous action, the previous rewards, and the current β (identifying which policy in the family is active) as additional inputs on each forward pass.
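
The reward construction above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the inverse-distance kernel, the constants (k, eps, c, max_scale), and the function names are assumptions made for the example, `mean_sq_dist` stands in for a running average of neighbour distances that the caller would maintain, and `alpha` stands in for a normalized RND prediction error.

```python
import math

def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def episodic_reward(embedding, memory, k=10, mean_sq_dist=1.0, eps=1e-3, c=1e-3):
    """Episodic novelty: high when the k nearest stored embeddings are far away.

    Assumed form: an inverse-distance kernel summed over the k nearest
    neighbours in the episodic memory. All constants are illustrative.
    """
    nearest = sorted(squared_distance(embedding, m) for m in memory)[:k]
    similarity = sum(eps / (d / mean_sq_dist + eps) for d in nearest)
    return 1.0 / (math.sqrt(similarity) + c)

def intrinsic_reward(r_episodic, alpha, max_scale=5.0):
    """Scale episodic novelty by a life-long modulator clipped to [1, max_scale]."""
    return r_episodic * min(max(alpha, 1.0), max_scale)

def augmented_reward(r_extrinsic, r_intrinsic, beta):
    """Reward seen by the policy with exploration weight beta (beta = 0 is pure exploitation)."""
    return r_extrinsic + beta * r_intrinsic

# Hypothetical 2-D embeddings: the memory holds three visits to the same state.
memory = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
familiar = episodic_reward([0.0, 0.0], memory)   # revisiting a stored state
novel = episodic_reward([10.0, 10.0], memory)    # far from everything stored
```

Here `novel` comes out larger than `familiar`, which is the behaviour the episodic term is meant to produce. Because the memory is reset each episode while the life-long modulator is not, the product keeps rewarding revisits across episodes only where long-term novelty remains.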

Experimental results

Research questions

RQ1: Can a single neural network support multiple directed exploration policies with varying exploration/exploitation trade-offs?
RQ2: Does combining episodic and life-long novelty produce durable exploration that persists across episodes and environments?
RQ3: Can such exploration-driven policies improve performance on hard exploration games like Pitfall! without demonstrations?
RQ4: How does NGU scale in distributed RL settings with many actors collecting experience in parallel?

Key findings

NGU achieves higher performance than strong Atari baselines on hard exploration games, with a median human-normalized score of 1344.0% across Atari-57.
NGU enables non-zero rewards in Pitfall! (mean score around 8,400) without demonstrations or hand-crafted features.
Increasing the number of mixtures N and using RND for life-long novelty improves performance on hard exploration games.
The approach yields competitive or superior results on several dense-reward Atari games, though some configurations (e.g., NGU with N>1 on certain games) underperform the best baselines.
In the Atari-57 comparison, NGU's median human-normalized score is listed as 1354.4%, against 95% for Nature DQN and 1920.6% for R2D2, and NGU maintains strong performance on most games.
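
For readers unfamiliar with the metric, the sketch below shows how a median human-normalized score is conventionally computed: each game's score is rescaled so that a random policy maps to 0% and the human reference maps to 100%, and the median is taken over games. The game names and scores are hypothetical placeholders, not values from the paper.

```python
from statistics import median

def human_normalized(agent, random_score, human):
    """Percentage score: 0% matches a random policy, 100% matches the human reference."""
    return 100.0 * (agent - random_score) / (human - random_score)

# Hypothetical (agent, random, human) raw scores for three made-up games.
scores = {
    "game_a": (500.0, 100.0, 300.0),
    "game_b": (50.0, 0.0, 100.0),
    "game_c": (1000.0, 0.0, 100.0),
}

normalized = {game: human_normalized(*s) for game, s in scores.items()}
median_hns = median(normalized.values())
```

Because the median ignores how far the best games sit above the middle, a few very high scores (like `game_c` here) do not move it, which is why it is usually reported alongside per-game results on the hard exploration subset.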

This review was created by AI and reviewed by human editors.