QUICK REVIEW

[Paper Review] Learning continuous control policies by stochastic value gradients

Nicolas Heess, Greg Wayne | arXiv (Cornell University) | Dec 7, 2015
Reinforcement Learning in Robotics | 31 references | 286 citations
TL;DR

This paper introduces a unified framework for learning continuous control policies via stochastic value gradients, treating stochasticity in the Bellman equation as a deterministic function of exogenous noise. This allows backpropagation through learned models, value functions, and policies. The resulting algorithms are applied to a toy stochastic control problem and to several physics-based control problems in simulation, where the SVG(1) variant shows that dynamics models, value functions, and policies can be learned jointly.

ABSTRACT

We present a unified framework for learning continuous control policies using backpropagation. It supports stochastic control by treating stochasticity in the Bellman equation as a deterministic function of exogenous noise. The product is a spectrum of general policy gradient algorithms that range from model-free methods with value functions to model-based methods without value functions. We use learned models but only require observations from the environment instead of observations from model-predicted trajectories, minimizing the impact of compounded model errors. We apply these algorithms first to a toy stochastic control problem and then to several physics-based control problems in simulation. One of these variants, SVG(1), shows the effectiveness of learning models, value functions, and policies simultaneously in continuous domains.
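
The phrase "a deterministic function of exogenous noise" can be made concrete with a short derivation. The sketch below uses notation assumed for this review: the action is a = π(s, η; θ), the next state is s' = f(s, a, ξ), the noise variables η and ξ are drawn from fixed distributions ρ(η) and ρ(ξ), subscripts denote partial derivatives, and V' is the value at the next time step. Differentiating the Bellman equation with the chain rule then gives gradients with respect to the state and the policy parameters:

```latex
% Assumed notation: a = \pi(s, \eta; \theta), \quad s' = f(s, a, \xi),
% \eta \sim \rho(\eta), \quad \xi \sim \rho(\xi); subscripts are partial derivatives.
\begin{align}
V(s) &= \mathbb{E}_{\rho(\eta)}\Big[ r(s, a)
        + \gamma\, \mathbb{E}_{\rho(\xi)}\big[ V'(s') \big] \Big], \\
V_s &= \mathbb{E}_{\rho(\eta)}\Big[ r_s + r_a \pi_s
        + \gamma\, \mathbb{E}_{\rho(\xi)}\big[ V'_{s'} (f_s + f_a \pi_s) \big] \Big], \\
V_\theta &= \mathbb{E}_{\rho(\eta)}\Big[ r_a \pi_\theta
        + \gamma\, \mathbb{E}_{\rho(\xi)}\big[ V'_{s'} f_a \pi_\theta + V'_\theta \big] \Big].
\end{align}
```

Because the expectations are taken over noise distributions that do not depend on θ, the derivative can be moved inside them and estimated from samples.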

Motivation & Objective

  • To unify model-free and model-based reinforcement learning in continuous control via a single differentiable framework.
  • To address the compounding error issue in model-based RL by learning from real environment observations instead of model-predicted trajectories.
  • To enable end-to-end backpropagation through stochastic policies, value functions, and learned dynamics models.
  • To develop a scalable and effective algorithm for continuous control that combines the benefits of value-based and model-based methods.

Proposed method

  • Treat stochasticity in the Bellman equation as a deterministic function of exogenous noise, enabling backpropagation through stochastic policies.
  • Use a learned environment model to predict state transitions, but train using real observations rather than model-generated trajectories.
  • Formulate a stochastic value gradient that allows joint optimization of policy, value function, and model parameters via backpropagation.
  • Apply the framework to both model-free and model-based settings, with a unified algorithmic structure.
  • Apply the reparameterization trick to sampled actions: each action is written as a deterministic function of the state, the policy parameters, and a noise sample, so gradients can flow through the sample.
  • Introduce SVG(1), a variant that jointly learns dynamics models, value functions, and policies in a single end-to-end training process.
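
The reparameterized, one-step gradient described in the list above can be illustrated with a small sketch. It is not the authors' implementation: the linear policy, additive-noise model, quadratic reward, and quadratic critic are all hypothetical stand-ins chosen so the derivatives can be written by hand instead of obtained by backpropagation, and the model and critic are treated as known rather than learned.

```python
import random

GAMMA = 0.9      # discount factor (assumed)
SIGMA_PI = 0.3   # policy noise scale (assumed)
SIGMA_F = 0.1    # transition noise scale (assumed)


def policy(s, eta, theta):
    """Reparameterized policy: deterministic in (s, theta) given the noise eta."""
    return theta * s + SIGMA_PI * eta


def model(s, a, xi):
    """Hypothetical dynamics model: deterministic in (s, a) given the noise xi."""
    return s + a + SIGMA_F * xi


def pathwise_gradient(s, theta, n_samples, rng):
    """Monte Carlo estimate of d/dtheta E[r(s, a) + GAMMA * V(s')].

    Toy choices: r(s, a) = -a**2 and V(s') = -s'**2, so the partial
    derivatives below are exact and hand-written.
    """
    total = 0.0
    for _ in range(n_samples):
        eta = rng.gauss(0.0, 1.0)
        xi = rng.gauss(0.0, 1.0)
        a = policy(s, eta, theta)
        s_next = model(s, a, xi)
        r_a = -2.0 * a            # dr/da
        v_s_next = -2.0 * s_next  # dV/ds'
        f_a = 1.0                 # df/da
        pi_theta = s              # dpi/dtheta
        total += (r_a + GAMMA * v_s_next * f_a) * pi_theta
    return total / n_samples


def analytic_gradient(s, theta):
    """Closed-form gradient of the same objective, for comparison."""
    return -2.0 * theta * s * s - 2.0 * GAMMA * s * s * (1.0 + theta)


rng = random.Random(0)
estimate = pathwise_gradient(s=1.0, theta=0.2, n_samples=50_000, rng=rng)
exact = analytic_gradient(s=1.0, theta=0.2)
print(f"pathwise estimate: {estimate:.3f}, analytic: {exact:.3f}")
```

The sample average should land close to the closed-form value because the noise enters only as an input to deterministic functions, so each sample contributes an ordinary chain-rule derivative.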

Experimental results

Research questions

  • RQ1: Can a unified framework effectively combine model-based and model-free reinforcement learning in continuous control?
  • RQ2: How can stochasticity in policies be handled efficiently within a differentiable reinforcement learning framework?
  • RQ3: Can joint learning of dynamics models, value functions, and policies reduce the impact of model error in continuous control?
  • RQ4: What performance gains are achievable by end-to-end training of all components via backpropagation?
  • RQ5: How does the method compare to existing model-free and model-based approaches in complex control tasks?

Key findings

  • The proposed framework enables end-to-end training of policies, value functions, and dynamics models using backpropagation.
  • By using real environment observations instead of model-predicted trajectories, the method minimizes the compounding effect of model errors.
  • SVG(1), a variant of the framework, achieves strong performance in continuous control tasks, demonstrating the effectiveness of joint learning.
  • The method successfully supports both model-free and model-based learning within a single unified algorithmic structure.
  • The approach is demonstrated on a toy stochastic control problem and on several physics-based control problems in simulation.
  • The framework enables differentiable treatment of stochastic policies through exogenous noise, facilitating gradient-based optimization.
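
One way to read the point about using real observations instead of model-predicted trajectories is that the noise in the learned model can be inferred from each observed transition, so that replaying the model reproduces the state the environment actually visited. The sketch below assumes an additive-Gaussian-noise model with a linear mean; the weights, noise scale, and function names are hypothetical and chosen only for illustration.

```python
# Hypothetical learned model: s' = W_S * s + W_A * a + SIGMA * xi
# (additive noise is an assumption of this sketch).
W_S, W_A, SIGMA = 0.9, 0.5, 0.2


def model_mean(s, a):
    """Mean next state predicted by the (imperfect) model."""
    return W_S * s + W_A * a


def infer_noise(s, a, s_next):
    """Noise value that makes the model agree with the observed transition."""
    return (s_next - model_mean(s, a)) / SIGMA


def model_step(s, a, xi):
    """Model transition as a deterministic function of (s, a) and noise xi."""
    return model_mean(s, a) + SIGMA * xi


# One transition observed from the real environment (made-up numbers).
s, a, s_next_observed = 1.0, -0.4, 0.83

xi_hat = infer_noise(s, a, s_next_observed)
s_next_replayed = model_step(s, a, xi_hat)

print(f"model mean prediction: {model_mean(s, a):.3f}")
print(f"inferred noise: {xi_hat:.3f}")
print(f"replayed next state: {s_next_replayed:.3f}")
```

With the inferred noise plugged in, the model is only needed for its derivatives along the observed trajectory, which is why its prediction errors do not accumulate over a rollout.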

This review was created by AI and reviewed by human editors.