QUICK REVIEW

[Paper Review] Parametrized Deep Q-Networks Learning: Reinforcement Learning with Discrete-Continuous Hybrid Action Space

Jiechao Xiong, Qing Wang | arXiv (Cornell University) | Oct 10, 2018

Reinforcement Learning in Robotics | References: 3 | Cited by: 151

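One-Sentence Summary

P-DQN is an off-policy deep Q-network variant that handles discrete-continuous hybrid action spaces directly, without discretization or relaxation, by learning a deterministic mapping from states to the continuous parameters of each discrete action and jointly training a Q-network and a parameterization policy.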

ABSTRACT

Most existing deep reinforcement learning (DRL) frameworks consider either discrete action space or continuous action space solely. Motivated by applications in computer games, we consider the scenario with discrete-continuous hybrid action space. To handle hybrid action space, previous works either approximate the hybrid space by discretization, or relax it into a continuous set. In this paper, we propose a parametrized deep Q-network (P-DQN) framework for the hybrid action space without approximation or relaxation. Our algorithm combines the spirits of both DQN (dealing with discrete action space) and DDPG (dealing with continuous action space) by seamlessly integrating them. Empirical results on a simulation example, scoring a goal in simulated RoboCup soccer and the solo mode in game King of Glory (KOG) validate the efficiency and effectiveness of our method.

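Motivation and Objectives

Motivate reinforcement learning for environments with discrete-continuous hybrid actions, such as those found in computer games.
Develop a framework that optimizes over hybrid actions directly, without discretization or relaxation.
Introduce a scalable off-policy learning method that combines a Q-network with a deterministic parameterization policy.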

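Proposed Method

Define the hybrid action space as A = {(k, x_k) | k ∈ [K], x_k ∈ X_k}, with action-value function Q(s, k, x_k).
Use a deterministic policy x_k = x_k(s; θ) that maps each state to the continuous parameter of every discrete action k.
Maintain a Q-network Q(s, k, x_k; ω) and use the policy network x_k(s; θ) to approximate the optimal (Q-maximizing) continuous parameter x_k^Q(s).
Train with a two-timescale stochastic approximation in which ω is updated more slowly than θ, using an n-step Bellman target y_t.
Use experience replay and ε-greedy exploration, with off-policy objectives for both θ and ω.
Provide an asynchronous n-step P-DQN variant that speeds up training across multiple workers.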
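
The action-selection and target-computation steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the two networks are replaced by hypothetical toy functions (toy_x_policy, toy_q), the exploration branch samples uniformly from hypothetical parameter bounds, and the gradient updates of ω and θ are omitted entirely.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-ins for the two networks (both would be neural nets in P-DQN).
ParamPolicy = Callable[[Sequence[float]], List[float]]      # s -> [x_1(s), ..., x_K(s)]
QFunction = Callable[[Sequence[float], int, float], float]  # (s, k, x_k) -> Q(s, k, x_k)


def greedy_hybrid_action(state: Sequence[float], x_policy: ParamPolicy,
                         q_fn: QFunction) -> Tuple[int, float]:
    """Pick k maximizing Q(s, k, x_k(s)) and return (k, x_k(s))."""
    params = x_policy(state)
    q_values = [q_fn(state, k, x_k) for k, x_k in enumerate(params)]
    best_k = max(range(len(params)), key=q_values.__getitem__)
    return best_k, params[best_k]


def epsilon_greedy_hybrid_action(state: Sequence[float], x_policy: ParamPolicy,
                                 q_fn: QFunction, epsilon: float,
                                 param_bounds: Sequence[Tuple[float, float]],
                                 rng: random.Random) -> Tuple[int, float]:
    """With probability epsilon sample a random hybrid action, else act greedily.

    Assumption: the exploration distribution is uniform over k and over
    the interval param_bounds[k]; other choices are possible.
    """
    if rng.random() < epsilon:
        k = rng.randrange(len(param_bounds))
        low, high = param_bounds[k]
        return k, rng.uniform(low, high)
    return greedy_hybrid_action(state, x_policy, q_fn)


def n_step_target(rewards: Sequence[float], bootstrap_state: Sequence[float],
                  x_policy: ParamPolicy, q_fn: QFunction,
                  gamma: float, terminal: bool) -> float:
    """n-step Bellman target: discounted rewards plus a bootstrapped max over k."""
    target = sum((gamma ** i) * r for i, r in enumerate(rewards))
    if not terminal:
        params = x_policy(bootstrap_state)
        best_q = max(q_fn(bootstrap_state, k, x_k) for k, x_k in enumerate(params))
        target += (gamma ** len(rewards)) * best_q
    return target


# Toy example with K = 2 discrete actions and a 1-D state (illustrative only).
def toy_x_policy(state: Sequence[float]) -> List[float]:
    return [state[0], -state[0]]


def toy_q(state: Sequence[float], k: int, x_k: float) -> float:
    centers = [state[0], 0.0]  # parameter value at which each action's Q peaks
    bias = [1.0, 0.5]          # peak Q value of each discrete action
    return bias[k] - (x_k - centers[k]) ** 2


action = greedy_hybrid_action([2.0], toy_x_policy, toy_q)
y = n_step_target([1.0, 0.0, 2.0], [2.0], toy_x_policy, toy_q, gamma=0.5, terminal=False)
print(action, y)
```

The point of the sketch is that the maximization over the continuous parameter never appears explicitly: the parameter policy supplies one candidate x_k per discrete action, so both acting and bootstrapping reduce to a max over K values, as in DQN.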

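Experimental Results

Research Questions

RQ1: Can a deep Q-network be extended to handle discrete-continuous hybrid actions without discretization or relaxation?
RQ2: How can the discrete action choice and the continuous parameters of each action be learned jointly and efficiently?
RQ3: On hybrid-action tasks, does the proposed P-DQN outperform relaxation-based or discretization-based methods in efficiency and effectiveness?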

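Key Findings

P-DQN optimizes directly over discrete actions together with their associated continuous parameters, avoiding any discretization or relaxation of the action space.
Empirical results show that, on the simulated tasks, P-DQN converges faster and learns more stably than relaxation-based methods.
In the RoboCup soccer and King of Glory experiments, P-DQN outperforms the baseline methods in both efficiency and effectiveness.
The asynchronous n-step P-DQN variant speeds up training across multiple workers.
The method combines the ideas of DQN and DDPG to handle hybrid actions in an off-policy setting.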

This review was generated by AI and reviewed by a human editor.