QUICK REVIEW

[Paper Review] A Deep Policy Inference Q-Network for Multi-Agent Systems

Zhang-Wei Hong, Shih-Yang Su | arXiv (Cornell University) | Dec 21, 2017

Reinforcement Learning in Robotics | References: 33 | Cited by: 42

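One-Sentence Summary

This paper proposes DPIQN, a deep Q-network that infers policy features from raw observations of collaborators and opponents and incorporates them as a hidden vector to improve Q-value prediction, thereby improving multi-agent reinforcement learning performance. The model outperforms DQN and DRQN in 1 vs. 1 and 2 vs. 2 soccer environments, particularly when policies change dynamically, and DRPIQN shows better stability and generalization in non-stationary environments.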

ABSTRACT

We present DPIQN, a deep policy inference Q-network that targets multi-agent systems composed of controllable agents, collaborators, and opponents that interact with each other. We focus on one challenging issue in such systems---modeling agents with varying strategies---and propose to employ "policy features" learned from raw observations (e.g., raw images) of collaborators and opponents by inferring their policies. DPIQN incorporates the learned policy features as a hidden vector into its own deep Q-network (DQN), such that it is able to predict better Q values for the controllable agents than the state-of-the-art deep reinforcement learning models. We further propose an enhanced version of DPIQN, called deep recurrent policy inference Q-network (DRPIQN), for handling partial observability. Both DPIQN and DRPIQN are trained by an adaptive training procedure, which adjusts the network's attention to learn the policy features and its own Q-values at different phases of the training process. We present a comprehensive analysis of DPIQN and DRPIQN, and highlight their effectiveness and generalizability in various multi-agent settings. Our models are evaluated in a classic soccer game involving both competitive and collaborative scenarios. Experimental results performed on 1 vs. 1 and 2 vs. 2 games show that DPIQN and DRPIQN demonstrate superior performance to the baseline DQN and deep recurrent Q-network (DRQN) models. We also explore scenarios in which collaborators or opponents dynamically change their policies, and show that DPIQN and DRPIQN do lead to better overall performance in terms of stability and mean scores.

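Research Motivation and Objectives

Address the challenge of modeling agents with varying strategies in non-stationary multi-agent systems (MAS), where opponents and collaborators may change their policies dynamically.
Overcome the limitation of prior approaches that rely on prior knowledge of agent structure or on rule-based assumptions, which are impractical in real-world scenarios.
Enable the controllable agent to learn effective policies from raw observations alone (e.g., images), without access to the internal logic of other agents.
Improve training stability and convergence speed in multi-agent settings through an adaptive loss function that prioritizes learning policy features early in training and then shifts emphasis toward Q-values.
Demonstrate generalization to unseen scenarios in which collaborators or opponents change their policies unpredictably at test time.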

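Proposed Method

Design a DQN-based deep policy inference Q-network (DPIQN) with three modules: a feature extraction module, a Q-value learning module, and an auxiliary policy feature learning module.
Use a separate network branch to learn policy features from raw observations (e.g., images) of collaborators and opponents, and inject these features into the main DQN as a hidden vector.
Introduce an adaptive loss function that combines the Q-value loss $L^Q$ with the policy inference loss $L^{PI}$ through a dynamic weighting coefficient $\lambda$, gradually shifting the focus of training from policy feature learning to Q-value learning.
Extend DPIQN to DRPIQN by adding a recurrent network (LSTM) to handle partial observability in environments with delayed or incomplete observations.
Adopt an adaptive training procedure that dynamically adjusts the network's attention between policy feature learning and Q-value optimization, improving training stability and convergence.
Introduce an auxiliary task into representation learning to enrich the feature space, so that non-stationary collaborators and opponents can be modeled more accurately.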
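The three modules and the adaptive loss described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `TinyDPIQN`, the layer sizes, the use of dense layers in place of a convolutional encoder, and the elementwise-product fusion of the two hidden vectors are all hypothetical choices. The weighting rule $\lambda = 1/\sqrt{L^{PI}}$ is likewise an assumed form, chosen only because it reproduces the behavior the summary describes (a small $\lambda$ while the policy inference loss is large, growing as that loss falls).

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class TinyDPIQN:
    """Toy stand-in for the three modules (all names and sizes are hypothetical)."""

    def __init__(self, obs_dim, hidden, n_actions, n_other_actions, scale=0.1):
        self.W_feat = rng.normal(0, scale, (obs_dim, hidden))            # feature extraction
        self.W_pi = rng.normal(0, scale, (hidden, hidden))               # policy-feature layer
        self.W_pi_out = rng.normal(0, scale, (hidden, n_other_actions))  # inferred policy head
        self.W_q = rng.normal(0, scale, (hidden, hidden))                # Q-value hidden layer
        self.W_q_out = rng.normal(0, scale, (hidden, n_actions))         # Q-value head

    def forward(self, obs):
        h = relu(obs @ self.W_feat)             # shared features from the raw observation
        h_pi = relu(h @ self.W_pi)              # policy features (the "hidden vector")
        pi_hat = softmax(h_pi @ self.W_pi_out)  # predicted action distribution of the other agent
        h_q = relu(h @ self.W_q)
        # Assumption: the policy features are fused by elementwise product.
        q = (h_q * h_pi) @ self.W_q_out
        return q, pi_hat

def adaptive_loss(q, actions, td_targets, pi_hat, other_actions, eps=1e-8):
    """Return (lambda * L^Q + L^PI, lambda, L^Q, L^PI) for one batch."""
    idx = np.arange(len(actions))
    loss_q = np.mean((td_targets - q[idx, actions]) ** 2)         # squared TD error
    loss_pi = -np.mean(np.log(pi_hat[idx, other_actions] + eps))  # cross-entropy
    # Assumed weighting rule; in an autodiff framework lambda would be
    # treated as a constant (no gradient flows through it).
    lam = 1.0 / np.sqrt(loss_pi + eps)
    return lam * loss_q + loss_pi, lam, loss_q, loss_pi

if __name__ == "__main__":
    net = TinyDPIQN(obs_dim=8, hidden=16, n_actions=5, n_other_actions=5)
    obs = rng.normal(size=(4, 8))
    q, pi_hat = net.forward(obs)
    total, lam, lq, lpi = adaptive_loss(
        q, np.array([0, 1, 2, 3]), np.zeros(4), pi_hat, np.array([1, 1, 1, 1])
    )
    print(f"L^Q={lq:.4f}  L^PI={lpi:.4f}  lambda={lam:.4f}  total={total:.4f}")
```

Under this assumed rule the policy inference term dominates while the other agent's actions are poorly predicted, and the Q-value term gains weight as those predictions sharpen. A recurrent variant in the spirit of DRPIQN would place an LSTM between the shared features and the two branches.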

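Experimental Results

Research Questions

RQ1: Can a deep reinforcement learning agent effectively infer and exploit the policy features of collaborators and opponents from raw observations in a multi-agent system?
RQ2: How much does injecting the learned policy features as a hidden vector improve Q-value prediction and overall agent performance compared with standard DQN and DRQN?
RQ3: To what extent do DPIQN and DRPIQN generalize to scenarios in which collaborators or opponents dynamically change their policies at test time?
RQ4: Does the proposed adaptive loss function improve training stability and convergence speed in multi-agent reinforcement learning settings?
RQ5: Under partial observability, how does DRPIQN (the recurrent variant) perform relative to the non-recurrent DPIQN?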

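Key Findings

In the 1 vs. 1 scenario, DPIQN reaches an average reward of 0.909 and DRPIQN reaches 0.947 against unfamiliar opponents, significantly outperforming the DQN and DRQN baselines.
In the 2 vs. 2 scenario, DPIQN and DRPIQN maintain strong performance across all test cases; in the unfamiliar-opponent setting, DRPIQN (O) achieves an average reward reported as 1.36 times higher than that of DQN.
DRPIQN shows better stability and faster convergence thanks to the adaptive loss function, which effectively reduces fluctuations in the Q-value loss $L^Q$ during training.
Ablation experiments confirm that both the policy inference loss $L^{PI}$ and the dynamic weighting $\lambda$ are essential: with both in use, the model converges faster and its loss fluctuates less.
When collaborating with unfamiliar agents, DPIQN and DRPIQN agents score more goals on their own, indicating robustness without prior knowledge of a collaborator's intentions.
The models generalize well to dynamic policy switching: when opponents or collaborators change their policies unpredictably every 4 to 10 time steps, DPIQN and DRPIQN still maintain high mean scores and stability, outperforming the baseline models in all such test scenarios.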


This review was generated by AI and reviewed by human editors.