[Paper Review] Continuous Deep Q-Learning with Model-based Acceleration
The paper derives continuous Q-learning with Normalized Advantage Functions (NAF) for efficient off-policy learning in continuous action spaces, and augments it with imagination rollouts from locally fitted linear dynamics to improve sample efficiency.
Model-free reinforcement learning has been successfully applied to a range of challenging problems, and has recently been extended to handle large neural network policies and value functions. However, the sample complexity of model-free algorithms, particularly when using high-dimensional function approximators, tends to limit their applicability to physical systems. In this paper, we explore algorithms and representations to reduce the sample complexity of deep reinforcement learning for continuous control tasks. We propose two complementary techniques for improving the efficiency of such algorithms. First, we derive a continuous variant of the Q-learning algorithm, which we call normalized advantage functions (NAF), as an alternative to the more commonly used policy gradient and actor-critic methods. The NAF representation allows us to apply Q-learning with experience replay to continuous tasks, and substantially improves performance on a set of simulated robotic control tasks. To further improve the efficiency of our approach, we explore the use of learned models for accelerating model-free reinforcement learning. We show that iteratively refitted local linear models are especially effective for this, and demonstrate substantially faster learning on domains where such models are applicable.
Motivation & Objective
- Reduce the sample complexity of deep reinforcement learning for continuous control tasks.
- Develop a Q-learning variant suitable for continuous actions that avoids dual actor-critic complexity.
- Investigate model-based acceleration techniques that preserve model-free benefits.
- Evaluate the proposed methods on simulated robotic control benchmarks.
Proposed method
- Propose a continuous Q-learning variant (NAF) that decomposes Q(x,u) into V(x) + A(x,u) with A being quadratic in (u - mu(x)).
- Parameterize the Q-function so that the maximizing action mu(x) is analytically obtainable.
- Use a deep network to output V, mu, and a positive-definite matrix P(x) defining A via A(x,u) = -1/2 (u - mu(x))^T P(x) (u - mu(x)).
- Train with standard deep Q-learning tooling: experience replay, target networks, and Bellman backups.
- Introduce imagination rollouts: augment real experiences with synthetic on-policy rollouts from a learned local linear dynamics model to speed learning (Dyna-like).
- Fit the dynamics locally as time-varying linear models and use short rollouts around sampled states to generate additional training data.
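The NAF decomposition described above can be illustrated with a small NumPy sketch. It is a minimal example under stated assumptions, not the paper's implementation: the function name `naf_q_value` and the argument `l_entries` are hypothetical, and building P(x) as L L^T from a lower-triangular L with an exponentiated diagonal is one common way to guarantee positive definiteness. In practice V, mu, and the entries of L would be outputs of the network for a given state x.

```python
import numpy as np

def naf_q_value(v, mu, l_entries, u):
    """Evaluate Q(x,u) = V(x) + A(x,u) for one state (hypothetical helper).

    v         : scalar state value V(x)
    mu        : (d,) greedy action mu(x)
    l_entries : (d*(d+1)/2,) raw lower-triangular entries, row by row
    u         : (d,) action to evaluate
    """
    d = mu.shape[0]
    L = np.zeros((d, d))
    L[np.tril_indices(d)] = l_entries  # fill the lower triangle
    # Assumption: exponentiate the diagonal so it is strictly positive,
    # which makes P = L L^T positive definite.
    L[np.diag_indices(d)] = np.exp(np.diag(L))
    P = L @ L.T
    diff = u - mu
    advantage = -0.5 * diff @ P @ diff  # A(x,u) <= 0, zero at u = mu
    return v + advantage, advantage, P

# Example: a 2-D action space with made-up network outputs.
v = 1.5
mu = np.array([0.2, -0.1])
l_entries = np.array([0.0, 0.5, 0.0])
q_at_mu, adv_at_mu, P = naf_q_value(v, mu, l_entries, mu)
q_off, adv_off, _ = naf_q_value(v, mu, l_entries, mu + np.array([1.0, 0.0]))
```

Because the advantage is a negative-definite quadratic centered at mu(x), the maximum of Q over actions is attained at u = mu(x) and equals V(x), so the Bellman target needs no separate actor or inner optimization.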
Experimental results
Research questions
- RQ1: Does the normalized advantage function (NAF) provide sample-efficient Q-learning in continuous action spaces compared to actor-critic methods like DDPG?
- RQ2: Can model-based imagination rollouts using locally fitted dynamics meaningfully accelerate model-free Q-learning without compromising final performance?
- RQ3: What is the impact of using true vs. learned dynamics on the benefits of imagination rollouts?
- RQ4: How do off-policy planning signals (e.g., iLQG trajectories) compare to on-policy imagination rollouts for accelerating learning?
- RQ5: What are the limitations and sensitivity of the imagination rollout approach to imperfect dynamics models?
Key findings
| Domains | Random agent reward | DDPG reward | DDPG episodes | NAF reward | NAF episodes |
|---|---|---|---|---|---|
| Cartpole | -2.1 | -0.601 | 420 | -0.604 | 190 |
| Reacher | -2.3 | -0.509 | 1370 | -0.331 | 1260 |
| Peg | -11 | -0.950 | 690 | -0.438 | 130 |
| Gripper | -29 | 1.03 | 2420 | 1.81 | 1920 |
| GripperM | -90 | -20.2 | 1350 | -12.4 | 730 |
| Canada2d | -12 | -4.64 | 1040 | -4.21 | 900 |
| Cheetah | -0.3 | 8.23 | 1590 | 7.91 | 2390 |
| Swimmer6 | -325 | -174 | 220 | -172 | 190 |
| Ant | -4.8 | -2.54 | 2450 | -2.58 | 1350 |
| Walker2d | 0.3 | 2.96 | 850 | 1.85 | 1530 |
- NAF outperforms DDPG on many manipulation tasks, converging faster and settling more precisely at target states.
- On locomotion tasks, NAF and DDPG have more comparable performance, with NAF sometimes slightly better or worse depending on the domain.
- Imagination rollouts with iteratively fitted time-varying linear dynamics substantially improve data efficiency (2–5x) for manipulation tasks like reacher and gripper.
- Using true dynamics for imagination rollouts yields strong gains, whereas learned neural network dynamics can negate benefits; locally fitted linear models are preferred.
- Off-policy iLQG exploration provides limited or inconsistent improvements over imagination rollouts alone; on-policy imagination rollouts are consistently beneficial.
- Imagination rollouts are most beneficial early in learning; the benefit may wane as the Q-function becomes more accurate, which supports a hybrid scheme that finishes training with purely model-free updates.
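The locally fitted linear dynamics behind the imagination rollouts can be sketched as follows. This is a simplified illustration, not the paper's procedure: it fits a single linear model x' ≈ A x + B u + c by ordinary least squares, whereas a time-varying version would fit one such model per time step from recent trajectories. The function names `fit_linear_dynamics` and `imagination_rollout` are hypothetical, and reward computation for the synthetic transitions is omitted.

```python
import numpy as np

def fit_linear_dynamics(X, U, X_next):
    """Least-squares fit of x' ~ A x + B u + c from N observed transitions.

    X: (N, dx) states, U: (N, du) actions, X_next: (N, dx) next states.
    """
    n, dx = X.shape
    du = U.shape[1]
    Z = np.hstack([X, U, np.ones((n, 1))])          # regressors [x, u, 1]
    W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)  # W has shape (dx+du+1, dx)
    A = W[:dx].T
    B = W[dx:dx + du].T
    c = W[-1]
    return A, B, c

def imagination_rollout(x0, policy, A, B, c, horizon):
    """Roll the fitted model forward under `policy`, collecting synthetic
    (x, u, x_next) tuples that could be added to a replay buffer."""
    transitions = []
    x = x0
    for _ in range(horizon):
        u = policy(x)
        x_next = A @ x + B @ u + c
        transitions.append((x, u, x_next))
        x = x_next
    return transitions

# Example with a made-up, noise-free linear system.
rng = np.random.default_rng(0)
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])
c_true = np.array([0.0, -0.01])
X = rng.normal(size=(50, 2))
U = rng.normal(size=(50, 1))
X_next = X @ A_true.T + U @ B_true.T + c_true

A, B, c = fit_linear_dynamics(X, U, X_next)
synthetic = imagination_rollout(np.zeros(2), lambda x: np.array([1.0]), A, B, c, horizon=5)
```

Keeping the rollouts short and starting them from states sampled from real experience limits how far model error can compound, which is consistent with the sensitivity to imperfect models noted above.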
This review was created by AI and reviewed by human editors.