[Paper Review] Parametrized Deep Q-Networks Learning: Reinforcement Learning with Discrete-Continuous Hybrid Action Space
Introduces P-DQN, an off-policy deep Q-network variant that directly handles discrete-continuous hybrid action spaces without discretization or relaxation, by learning a deterministic mapping from states to continuous parameters for each discrete action and jointly training a Q-network and a parameterization policy.
Most existing deep reinforcement learning (DRL) frameworks consider either discrete action space or continuous action space solely. Motivated by applications in computer games, we consider the scenario with discrete-continuous hybrid action space. To handle hybrid action space, previous works either approximate the hybrid space by discretization, or relax it into a continuous set. In this paper, we propose a parametrized deep Q-network (P-DQN) framework for the hybrid action space without approximation or relaxation. Our algorithm combines the spirits of both DQN (dealing with discrete action space) and DDPG (dealing with continuous action space) by seamlessly integrating them. Empirical results on a simulation example, scoring a goal in simulated RoboCup soccer and the solo mode in game King of Glory (KOG) validate the efficiency and effectiveness of our method.
Motivation & Objective
- Address reinforcement learning in environments with discrete-continuous hybrid action spaces, as found in computer games.
- Develop a framework that directly optimizes over hybrid actions without discretization or relaxation.
- Provide a scalable off-policy learning method that couples a Q-network with a deterministic parameterization policy.
Proposed method
- Define the hybrid action space A = {(k, x_k) | k in [K], x_k in X_k} and the action-value function Q(s, k, x_k).
- Use a deterministic policy x_k = x_k(s; θ) to map states to continuous parameters for each discrete action.
- Approximate the optimal continuous parameter x_k^Q(s) with a corresponding policy network while keeping a Q-network Q(s, k, x_k; ω).
- Train with a two-timescale stochastic approximation in which ω is updated more slowly than θ, using an n-step Bellman target y_t.
- Employ experience replay and ε-greedy exploration, with an off-policy objective for θ and ω.
- Provide asynchronous n-step P-DQN variants to speed up training across multiple workers.
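The action-selection and target steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the linear `param_policy`, the quadratic `q_value`, the parameter range [-1, 1], the uniform exploration distribution, and every function and variable name are hypothetical stand-ins for the neural networks x_k(s; θ) and Q(s, k, x_k; ω).

```python
import random

def param_policy(state, theta):
    """Toy deterministic x_k(s; theta): one scalar parameter per discrete action.
    Assumption: a linear map clipped to [-1, 1]; theta[k] is the weight vector for action k."""
    return [max(-1.0, min(1.0, sum(w * s for w, s in zip(theta[k], state))))
            for k in range(len(theta))]

def q_value(state, k, x_k, omega):
    """Toy Q(s, k, x_k; omega): a concave quadratic in x_k (hypothetical form).
    omega[k] is a (bias, weights) pair; the peak location depends on the state."""
    bias, weights = omega[k]
    peak = sum(w * s for w, s in zip(weights, state))
    return bias - (x_k - peak) ** 2

def select_action(state, theta, omega, epsilon, rng):
    """Epsilon-greedy over the hybrid action (k, x_k)."""
    xs = param_policy(state, theta)
    if rng.random() < epsilon:
        # Explore: assumed uniform over discrete actions and over [-1, 1].
        return rng.randrange(len(xs)), rng.uniform(-1.0, 1.0)
    # Exploit: plug each x_k(s; theta) into Q and take the best discrete action.
    qs = [q_value(state, k, xs[k], omega) for k in range(len(xs))]
    best_k = max(range(len(xs)), key=qs.__getitem__)
    return best_k, xs[best_k]

def n_step_target(rewards, bootstrap_state, theta, omega, gamma):
    """n-step Bellman target: discounted rewards plus a bootstrapped max over k."""
    n = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    xs = param_policy(bootstrap_state, theta)
    best_q = max(q_value(bootstrap_state, k, xs[k], omega) for k in range(len(xs)))
    return discounted + gamma ** n * best_q

if __name__ == "__main__":
    rng = random.Random(0)
    state = [1.0, 0.5]
    theta = [[0.2, 0.0], [0.0, 1.0], [-0.5, 0.0]]
    omega = [(0.0, [0.2, 0.0]), (1.0, [0.0, 0.0]), (0.5, [-0.5, 0.0])]
    print(select_action(state, theta, omega, epsilon=0.0, rng=rng))
    print(n_step_target([1.0, 0.0, 2.0], state, theta, omega, gamma=0.5))
```

The point of the sketch is that the maximization over the continuous parameter is never solved explicitly: the parameter policy proposes one x_k per discrete action, and only a max over the K discrete choices remains.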
Experimental results
Research questions
- RQ1: Can a deep Q-network be extended to handle discrete-continuous hybrid actions without discretization or relaxation?
- RQ2: How can we jointly learn the discrete action selection and the continuous parameterization for each action efficiently?
- RQ3: Does the proposed P-DQN outperform relaxation-based or discretization-based methods in hybrid-action tasks?
Key findings
- P-DQN directly optimizes over discrete actions with associated continuous parameters, avoiding the need to discretize or relax the action space.
- Empirical results show P-DQN achieving faster convergence and more stable learning than relaxation-based methods in simulated tasks.
- P-DQN outperforms baselines in RoboCup soccer and King of Glory experiments in terms of efficiency and effectiveness.
- Asynchronous n-step P-DQN variants accelerate training across multiple workers.
- The approach integrates ideas from DQN and DDPG to handle hybrid actions in an off-policy setting.
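To make the DQN and DDPG connection concrete, the training objectives can be written roughly as follows. This is a sketch in the notation used in this review (Q-network weights ω, parameter-policy weights θ, K discrete actions); the discount factor γ, the rewards r_t, and the loss symbols are standard notation assumed here rather than taken from the review text, and ω is treated as fixed inside the θ loss.

```latex
\begin{aligned}
y_t &= \sum_{i=0}^{n-1} \gamma^{i} r_{t+i}
      + \gamma^{n} \max_{k \in [K]} Q\bigl(s_{t+n}, k, x_k(s_{t+n}; \theta); \omega\bigr), \\
\ell^{Q}_t(\omega) &= \tfrac{1}{2}\bigl[\, Q(s_t, k_t, x_{k_t}; \omega) - y_t \,\bigr]^{2}, \\
\ell^{\Theta}_t(\theta) &= -\sum_{k=1}^{K} Q\bigl(s_t, k, x_k(s_t; \theta); \omega\bigr).
\end{aligned}
```

The first loss is a DQN-style squared Bellman error for ω. The second pushes each x_k(s; θ) toward parameters that raise Q, in the same spirit as the deterministic policy gradient used by DDPG.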
This review was created by AI and reviewed by human editors.