[Paper Review] Dueling Network Architectures for Deep Reinforcement Learning
This paper introduces the dueling network architecture for deep reinforcement learning, which decouples the state value function $V(s)$ and action advantage function $A(s,a)$ into separate streams that share a common feature encoder. By combining these streams to produce $Q(s,a) = V(s) + \left(A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')\right)$, the architecture enables more efficient and stable learning, especially in environments with many similar-valued actions. The method achieves state-of-the-art performance on the Atari 2600 benchmark when combined with prioritized experience replay.
In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain.
Motivation & Objective
- To improve policy evaluation in deep reinforcement learning by decoupling the estimation of state value and action advantage functions.
- To enable more efficient learning across actions, particularly in states with many similar or redundant actions.
- To design a neural network architecture that generalizes well across actions without modifying the underlying RL algorithm.
- To achieve superior performance on the Atari 2600 reinforcement learning benchmark compared to existing single-stream Q-networks.
Proposed method
- The dueling architecture uses two parallel streams: one estimating the state value function $V(s)$, and another estimating the state-action advantage function $A(s,a)$.
- Both streams share a common convolutional feature extraction module to learn shared representations from raw observations.
- The final $Q$-value is computed via $Q(s,a) = V(s) + \left(A(s,a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s,a')\right)$. Subtracting the mean advantage makes $V(s)$ and $A(s,a)$ identifiable: without it, a constant could be moved between the two streams without changing $Q$.
- The network is trained using standard deep Q-learning with experience replay and target networks, without requiring changes to the learning algorithm.
- Saliency maps are computed from the Jacobian of the value and advantage streams with respect to the input frames, visualizing which regions of the input each stream attends to.
- The architecture is combined with prioritized experience replay and gradient clipping to further improve sample efficiency and training stability.
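The aggregation step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `dueling_q`, the use of a single linear layer per stream, and all shapes and random weights are assumptions made for the example, and the shared convolutional encoder is replaced by a random feature matrix.

```python
import numpy as np

def dueling_q(features, w_v, b_v, w_a, b_a):
    """Combine a value stream and an advantage stream into Q-values.

    features: (batch, d) output of the shared encoder (stand-in here)
    w_v, b_v: weights of a hypothetical one-layer value stream, (d, 1) and (1,)
    w_a, b_a: weights of a hypothetical one-layer advantage stream,
              (d, n_actions) and (n_actions,)
    """
    v = features @ w_v + b_v                      # (batch, 1)
    a = features @ w_a + b_a                      # (batch, n_actions)
    # Q(s,a) = V(s) + (A(s,a) - mean over a' of A(s,a'))
    q = v + (a - a.mean(axis=1, keepdims=True))   # broadcasts V over actions
    return q, v, a

rng = np.random.default_rng(0)
batch, d, n_actions = 4, 8, 6                     # illustrative sizes only
features = rng.normal(size=(batch, d))
w_v, b_v = rng.normal(size=(d, 1)), np.zeros(1)
w_a, b_a = rng.normal(size=(d, n_actions)), np.zeros(n_actions)

q, v, a = dueling_q(features, w_v, b_v, w_a, b_a)

# Shifting every advantage by the same constant leaves Q unchanged,
# which is what the mean subtraction is for.
q_shifted, _, _ = dueling_q(features, w_v, b_v, w_a, b_a + 3.0)
```

Two properties follow from the formula and can be checked on the arrays above: the mean of `q` over actions equals `v`, and the greedy action under `q` is the same as the action with the largest advantage, so the combination does not change action selection.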
Experimental results
Research questions
- RQ1: Can decoupling value and advantage functions in deep Q-networks lead to more stable and efficient policy evaluation?
- RQ2: Does the dueling architecture improve learning performance in environments with a large number of actions, especially when action values are similar?
- RQ3: Can the dueling architecture generalize across actions without modifying the underlying reinforcement learning algorithm?
- RQ4: How does the dueling architecture compare to standard single-stream Q-networks in terms of sample efficiency and final performance on the Atari 2600 benchmark?
Key findings
- The dueling architecture significantly improves policy evaluation in environments with many similar-valued actions, reducing instability caused by small value differences.
- Combined with prioritized experience replay, the method achieves a mean score of 591% and a median of 172% of human performance across the 57-game Atari 2600 benchmark.
- The saliency maps show that the value stream focuses on long-term state-relevant features (e.g., road horizon and score), while the advantage stream activates only when actions have immediate impact (e.g., nearby cars in Enduro).
- The dueling network outperforms both the single-stream DQN baseline and the prioritized DQN baseline, establishing a new state-of-the-art on the Atari 2600 domain.
- Every Q-value update also updates the value stream, regardless of which action was taken, whereas a single-stream network updates only the value of one action per step. This yields a better approximation of $V(s)$ and more stable temporal-difference learning.
- The combination of dueling networks with prioritized replay and gradient clipping yields substantial performance gains, and the method is robust to noise when the differences between action Q-values are small.
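To make the mean and median percentages above concrete, the sketch below shows one common way such figures are computed on Atari: each game's score is rescaled so that a random policy maps to 0% and a human player to 100%, and the per-game percentages are then aggregated. The normalization formula is an assumption about the convention used (the review does not state it), the helper name `human_normalized` is hypothetical, and the raw scores are made-up numbers, not results from the paper.

```python
import numpy as np

def human_normalized(agent_score, random_score, human_score):
    """Per-game score as a percentage of human performance (assumed convention)."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

# Hypothetical raw scores for three games (illustrative only).
agent_score = np.array([500.0, 30.0, 1200.0])
random_score = np.array([100.0, 10.0, 200.0])
human_score = np.array([300.0, 50.0, 1200.0])

scores = human_normalized(agent_score, random_score, human_score)
mean_score = scores.mean()
median_score = np.median(scores)
print(scores, mean_score, median_score)
```

Because a few games with very large normalized scores can dominate the mean, the median is usually reported alongside it, which is consistent with the large gap between the two figures quoted above.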
This review was created by AI and reviewed by human editors.