QUICK REVIEW

[Paper Review] Memory-based control with recurrent neural networks

Nicolas Heess, Jonathan J. Hunt|arXiv (Cornell University)|Dec 14, 2015

Reinforcement Learning in Robotics|33 references|219 citations

TL;DR

This paper proposes Recurrent Deterministic Policy Gradient (RDPG) and Recurrent Stochastic Value Gradient (RSVG(0)), which extend the model-free DPG and SVG(0) algorithms with recurrent neural networks (RNNs) trained via backpropagation through time. The approach solves a range of partially observed control problems, including sensor noise integration, system identification, long-term memory tasks, and a simplified Morris water maze, and it can also learn directly from pixels. The results indicate that RNNs enable effective memory-based control in continuous control domains without requiring explicit belief states or hand-designed observation stacks.

ABSTRACT

Partially observed control problems are a challenging aspect of reinforcement learning. We extend two related, model-free algorithms for continuous control -- deterministic policy gradient and stochastic value gradient -- to solve partially observed domains using recurrent neural networks trained with backpropagation through time. We demonstrate that this approach, coupled with long-short term memory is able to solve a variety of physical control problems exhibiting an assortment of memory requirements. These include the short-term integration of information from noisy sensors and the identification of system parameters, as well as long-term memory problems that require preserving information over many time steps. We also demonstrate success on a combined exploration and memory problem in the form of a simplified version of the well-known Morris water maze task. Finally, we show that our approach can deal with high-dimensional observations by learning directly from pixels. We find that recurrent deterministic and stochastic policies are able to learn similarly good solutions to these tasks, including the water maze where the agent must learn effective search strategies.

Motivation & Objective

Address the challenge of partially observed control in continuous control domains where full state observability is absent.
Enable effective learning of memory-intensive policies in environments requiring short-term integration of noisy sensor data or long-term retention of information over many timesteps.
Demonstrate that model-free deep reinforcement learning with RNNs can solve complex memory-based tasks, including the Morris water maze, directly from pixel observations.
Investigate whether stochastic or deterministic policies perform better in partially observed settings when augmented with recurrent memory.
Explore the feasibility of end-to-end learning from high-dimensional observations such as raw pixels, without relying on handcrafted observation stacks or state representations.

Proposed method

Extend the Deterministic Policy Gradient (DPG) and Stochastic Value Gradient (SVG(0)) algorithms to use recurrent neural networks (RNNs) as the policy and value function approximators.
Train the RNN components using backpropagation through time (BPTT) to optimize policy and value function parameters based on temporal difference errors and policy gradients.
Integrate long short-term memory (LSTM) units into the RNN architecture to improve learning of long-term dependencies and mitigate vanishing gradient problems.
Use an actor-critic architecture with separate networks: the actor outputs actions based on its recurrent hidden state, and the critic estimates the Q-value of history-action pairs, since the full state is not observed.
Apply the policy gradient update rule to the RNN parameters via the chain rule, enabling end-to-end training of the policy network with memory capacity.
Enable direct control from high-dimensional observations by combining convolutional neural networks (CNNs) with RNNs to extract visual features and maintain temporal memory.
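
The core idea of the method list above, a policy that acts on its recurrent hidden state and not on the current observation alone, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a plain tanh RNN cell instead of an LSTM, random untrained weights, and no critic or BPTT update. The names `RecurrentActor`, `act_on_history`, and all dimensions are hypothetical, chosen only for illustration.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.5):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class RecurrentActor:
    """Toy recurrent policy (assumed form, not from the paper):
    h_t = tanh(W_o o_t + W_h h_{t-1}),  a_t = tanh(W_a h_t).
    """

    def __init__(self, obs_dim, hidden_dim, act_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.w_obs = make_matrix(hidden_dim, obs_dim, rng)
        self.w_hid = make_matrix(hidden_dim, hidden_dim, rng)
        self.w_act = make_matrix(act_dim, hidden_dim, rng)

    def initial_state(self):
        """Hidden state at the start of an episode."""
        return [0.0] * self.hidden_dim

    def step(self, obs, hidden):
        """Consume one observation and return (action, new hidden state)."""
        pre = [a + b for a, b in zip(matvec(self.w_obs, obs),
                                     matvec(self.w_hid, hidden))]
        new_hidden = [math.tanh(x) for x in pre]
        action = [math.tanh(x) for x in matvec(self.w_act, new_hidden)]
        return action, new_hidden


def act_on_history(actor, observations):
    """Roll the actor over an observation history and return the last action."""
    hidden = actor.initial_state()
    action = None
    for obs in observations:
        action, hidden = actor.step(obs, hidden)
    return action


actor = RecurrentActor(obs_dim=1, hidden_dim=4, act_dim=1)

# Two histories that end in the same observation but differ earlier.
history_a = [[0.9], [0.1], [0.0]]
history_b = [[-0.9], [0.1], [0.0]]
action_a = act_on_history(actor, history_a)
action_b = act_on_history(actor, history_b)
print(action_a, action_b)
```

Both histories end in the same observation, so a memoryless policy would output the same action for each. The recurrent actor's final actions differ because its hidden state carries information from earlier timesteps, which is the property the training procedure relies on.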

Experimental results

Research questions

RQ1: Can recurrent neural networks effectively encode and utilize long-term memory in partially observed continuous control tasks?
RQ2: Does the integration of RNNs into model-free policy gradient algorithms like DPG and SVG(0) enable robust learning in environments with noisy or incomplete observations?
RQ3: How do deterministic and stochastic recurrent policies compare in performance on memory-intensive control tasks such as the water maze?
RQ4: Can RDPG and RSVG(0) learn effective control policies directly from raw pixel inputs without observation stacking or hand-designed state representations?
RQ5: To what extent can RNN-based policies solve complex memory problems such as system identification and long-horizon planning in physical control domains?

Key findings

RDPG and RSVG(0) successfully solve a variety of partially observed control problems, including pendulum swing-up without velocity feedback, cart-pole swing-up with unknown pole length, and long-term memory tasks requiring delayed action execution.
The agents learned to integrate noisy sensor inputs over time, demonstrating effective short-term memory for state estimation in tasks like the pendulum and cart-pole.
In the simplified Morris water maze, recurrent agents significantly reduced the time to reach the hidden platform on subsequent attempts, indicating successful long-term memory of the platform’s location.
RDPG achieved strong performance on vision-based tasks, learning to infer velocity from a sequence of single-frame images and to remember target positions in a disappearing target reaching task.
The performance of stochastic and deterministic recurrent policies was comparable across tasks, challenging the assumption that stochastic policies are inherently superior in partially observed settings.
The approach enabled direct control from high-dimensional pixel observations, showing that RNNs can learn to maintain relevant information across timesteps without explicit observation stacking.
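
To make concrete why integrating noisy observations over time helps state estimation, here is a small hand-written sketch. It is only an illustration of the principle: the agents in the paper learn their own integration inside the recurrent network, whereas this example uses a fixed running mean. The quantities `true_value` and `noise_std` and the function `running_mean_estimates` are hypothetical.

```python
import random


def running_mean_estimates(observations):
    """Return the running mean after each observation (a simple form of memory)."""
    estimates = []
    total = 0.0
    for count, obs in enumerate(observations, start=1):
        total += obs
        estimates.append(total / count)
    return estimates


rng = random.Random(0)
true_value = 1.0   # hypothetical constant quantity being sensed
noise_std = 0.5    # hypothetical sensor noise level
observations = [true_value + rng.gauss(0.0, noise_std) for _ in range(200)]

estimates = running_mean_estimates(observations)

# Mean absolute error of the raw readings versus the memory-based estimate,
# measured over the second half of the sequence.
half = len(observations) // 2
raw_error = sum(abs(o - true_value) for o in observations[half:]) / half
filtered_error = sum(abs(e - true_value) for e in estimates[half:]) / half
print(raw_error, filtered_error)
```

An estimator that only sees the latest reading is stuck with the full sensor noise, while one that accumulates past readings has a much smaller error. A recurrent policy can in principle learn a comparable accumulation in its hidden state.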

This review was created by AI and reviewed by human editors.