QUICK REVIEW

[Paper Review] Continuous Deep Q-Learning with Model-based Acceleration

Shixiang Gu, Timothy Lillicrap|arXiv (Cornell University)|Mar 2, 2016

Reinforcement Learning in Robotics | 39 references | 336 citations

TL;DR

The paper derives a continuous variant of Q-learning based on Normalized Advantage Functions (NAF) for efficient off-policy learning in continuous action spaces, and improves its sample efficiency with imagination rollouts generated from locally fitted linear dynamics models.

ABSTRACT

Model-free reinforcement learning has been successfully applied to a range of challenging problems, and has recently been extended to handle large neural network policies and value functions. However, the sample complexity of model-free algorithms, particularly when using high-dimensional function approximators, tends to limit their applicability to physical systems. In this paper, we explore algorithms and representations to reduce the sample complexity of deep reinforcement learning for continuous control tasks. We propose two complementary techniques for improving the efficiency of such algorithms. First, we derive a continuous variant of the Q-learning algorithm, which we call normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods. NAF representation allows us to apply Q-learning with experience replay to continuous tasks, and substantially improves performance on a set of simulated robotic control tasks. To further improve the efficiency of our approach, we explore the use of learned models for accelerating model-free reinforcement learning. We show that iteratively refitted local linear models are especially effective for this, and demonstrate substantially faster learning on domains where such models are applicable.

Motivation & Objective

Reduce the sample complexity of deep reinforcement learning for continuous control tasks.
Develop a Q-learning variant suitable for continuous actions that avoids the complexity of training separate actor and critic networks.
Investigate model-based acceleration techniques that preserve model-free benefits.
Evaluate the proposed methods on simulated robotic control benchmarks.

Proposed method

Propose a continuous Q-learning variant (NAF) that decomposes Q(x,u) into V(x) + A(x,u) with A being quadratic in (u - mu(x)).
Parameterize the Q-function so that the maximizing action mu(x) is analytically obtainable.
Use a deep network to output V, mu, and a positive-definite matrix P(x) defining A via A(x,u) = -1/2 (u - mu(x))^T P(x) (u - mu(x)).
Train with standard deep Q-learning tooling: experience replay, target networks, and Bellman backups.
Introduce imagination rollouts: augment real experiences with synthetic on-policy rollouts from a learned local linear dynamics model to speed learning (Dyna-like).
Fit the dynamics locally as time-varying linear models and use short rollouts around sampled states to generate additional training data.
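
The NAF decomposition listed above can be illustrated with a minimal NumPy sketch. It assumes the network outputs for a single state are already available as plain arrays; the names `naf_q_value`, `td_target`, and `l_entries` are hypothetical. Building P(x) as L L^T from a lower-triangular L with an exponentiated diagonal is one way to guarantee positive definiteness and is an assumption here, since this review does not state how P(x) is constructed.

```python
import numpy as np


def naf_q_value(v, mu, l_entries, u):
    """Return (Q, A, P) for one state from the network outputs v, mu, l_entries."""
    d = mu.shape[0]
    # Fill a lower-triangular matrix L from the flat output vector.
    L = np.zeros((d, d))
    L[np.tril_indices(d)] = l_entries
    # Assumption: exponentiate the diagonal so it is strictly positive,
    # which makes P = L L^T symmetric positive definite.
    diag = np.diag_indices(d)
    L[diag] = np.exp(L[diag])
    P = L @ L.T
    diff = u - mu
    advantage = -0.5 * diff @ P @ diff  # never positive, zero at u == mu
    return v + advantage, advantage, P


def td_target(reward, gamma, v_next_target):
    """Bellman target: max_u Q(x', u) = V(x') because A(x', mu(x')) = 0."""
    return reward + gamma * v_next_target


# Hypothetical network outputs for a single state with a 2-D action.
v, mu = 1.0, np.array([0.5, -0.5])
l_entries = np.array([0.0, 0.3, 0.0])  # d * (d + 1) / 2 = 3 entries
q_greedy, _, P = naf_q_value(v, mu, l_entries, mu)
q_other, _, _ = naf_q_value(v, mu, l_entries, np.array([1.0, 0.0]))
print(q_greedy, q_other)
```

Because the advantage term is a negative-definite quadratic centered on mu(x), the greedy action needs no inner optimization, and the Bellman backup only requires the V output of the target network.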

Experimental results

Research questions

RQ1: Does the normalized advantage function (NAF) provide sample-efficient Q-learning in continuous action spaces compared to actor-critic methods like DDPG?
RQ2: Can model-based imagination rollouts using locally fitted dynamics meaningfully accelerate model-free Q-learning without compromising final performance?
RQ3: What is the impact of using true vs. learned dynamics on the benefits of imagination rollouts?
RQ4: How do off-policy planning signals (e.g., iLQG trajectories) compare to on-policy imagination rollouts for accelerating learning?
RQ5: What are the limitations and sensitivity of the imagination rollout approach to imperfect dynamics models?

Key findings

Domain	Random agent reward	DDPG reward	DDPG episodes	NAF reward	NAF episodes
Cartpole	-2.1	-0.601	420	-0.604	190
Reacher	-2.3	-0.509	1370	-0.331	1260
Peg	-11	-0.950	690	-0.438	130
Gripper	-29	1.03	2420	1.81	1920
GripperM	-90	-20.2	1350	-12.4	730
Canada2d	-12	-4.64	1040	-4.21	900
Cheetah	-0.3	8.23	1590	7.91	2390
Swimmer6	-325	-174	220	-172	190
Ant	-4.8	-2.54	2450	-2.58	1350
Walker2d	0.3	2.96	850	1.85	1530

NAF outperforms DDPG on most manipulation tasks in the table above (for example, Peg: reward -0.438 vs. -0.950, in 130 vs. 690 episodes), with faster convergence and more precise behavior at target states.
On locomotion tasks, NAF and DDPG have more comparable performance, with NAF sometimes slightly better or worse depending on the domain.
Imagination rollouts with iteratively fitted time-varying linear dynamics substantially improve data efficiency (2–5x) for manipulation tasks like reacher and gripper.
Using true dynamics for imagination rollouts yields strong gains, whereas learned neural network dynamics can negate benefits; locally fitted linear models are preferred.
Off-policy iLQG exploration provides limited or inconsistent improvements over imagination rollouts alone; on-policy imagination rollouts are consistently beneficial.
Imagination rollouts are most beneficial in early learning; the benefits may wane as the Q-function becomes more accurate, which supports a hybrid scheme that finishes training with purely model-free updates.
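
The locally fitted, time-varying linear dynamics and the short on-policy imagination rollouts referred to in these findings can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it fits each time step by ordinary least squares and ignores the noise model, and the helper names (`fit_time_varying_linear`, `imagination_rollout`), the toy linear system, and the stand-in policy and reward are all hypothetical.

```python
import numpy as np


def fit_time_varying_linear(X, U):
    """Fit x_{t+1} ~= F_t @ [x_t; u_t] + f_t separately for every time step t.

    X: states, shape (N, T + 1, dx); U: actions, shape (N, T, du),
    collected from N recent rollouts.
    """
    N, T, _ = U.shape
    models = []
    for t in range(T):
        # Append a constant 1 so the bias f_t is fitted together with F_t.
        inputs = np.hstack([X[:, t], U[:, t], np.ones((N, 1))])
        W, *_ = np.linalg.lstsq(inputs, X[:, t + 1], rcond=None)
        models.append((W[:-1].T, W[-1]))  # (F_t, f_t)
    return models


def imagination_rollout(models, x, t_start, policy, reward_fn, length):
    """Roll the current policy through the fitted model; return synthetic transitions."""
    transitions = []
    for t in range(t_start, min(t_start + length, len(models))):
        u = policy(x)
        F, f = models[t]
        x_next = F @ np.concatenate([x, u]) + f
        transitions.append((x, u, reward_fn(x, u), x_next))
        x = x_next
    return transitions


# Toy data from a known linear system (illustration only).
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
c = np.array([0.0, -0.01])
N, T = 20, 5
X = np.zeros((N, T + 1, 2))
U = rng.normal(size=(N, T, 1))
X[:, 0] = rng.normal(size=(N, 2))
for t in range(T):
    X[:, t + 1] = X[:, t] @ A.T + U[:, t] @ B.T + c

models = fit_time_varying_linear(X, U)
synthetic = imagination_rollout(
    models, X[0, 1], t_start=1,
    policy=lambda x: -0.5 * x[:1],  # hypothetical stand-in for mu(x)
    reward_fn=lambda x, u: -float(x @ x + u @ u),
    length=3,
)
print(len(synthetic))
```

In a Dyna-like setup, the synthetic transitions would be added to a replay buffer alongside real experience. Refitting the models from recent rollouts keeps them accurate only near the states the current policy visits, which is why short rollouts are used.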

This review was created by AI and reviewed by human editors.