QUICK REVIEW

[Paper Review] Learning to Coordinate via Quantum Entanglement in Multi-Agent Reinforcement Learning

John Gardiner, Orlando Romero|arXiv (Cornell University)|Feb 9, 2026

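Quantum Computing Algorithms and Architecture | Cited by 0

One-Sentence Summary

This paper introduces a differentiable framework for training multi-agent reinforcement learning (MARL) agents to use shared quantum entanglement as a coordination resource, enabling communication-free correlated policies beyond what shared randomness allows, and demonstrates quantum advantage both in single-round games and, via a quantum coordinator/advisor architecture integrated with MAPPO, in a multi-agent Dec-POMDP.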

ABSTRACT

The inability to communicate poses a major challenge to coordination in multi-agent reinforcement learning (MARL). Prior work has explored correlating local policies via shared randomness, sometimes in the form of a correlation device, as a mechanism to assist in decentralized decision-making. In contrast, this work introduces the first framework for training MARL agents to exploit shared quantum entanglement as a coordination resource, which permits a larger class of communication-free correlated policies than shared randomness alone. This is motivated by well-known results in quantum physics which posit that, for certain single-round cooperative games with no communication, shared quantum entanglement enables strategies that outperform those that only use shared randomness. In such cases, we say that there is quantum advantage. Our framework is based on a novel differentiable policy parameterization that enables optimization over quantum measurements, together with a novel policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors. To illustrate the effectiveness of our proposed method, we first show that we can learn, purely from experience, strategies that attain quantum advantage in single-round games that are treated as black box oracles. We then demonstrate how our machinery can learn policies with quantum advantage in an illustrative multi-agent sequential decision-making problem formulated as a decentralized partially observable Markov decision process (Dec-POMDP).

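Motivation and Objectives

Establish a hierarchy of joint policy classes for communication-free cooperative MARL, highlighting quantum entanglement as a richer coordination resource than shared randomness.
Develop a differentiable policy parameterization that enables end-to-end optimization over quantum measurements.
Propose an advice-based policy architecture that decomposes the joint policy into a quantum coordinator and decentralized local actors.
Show that learned entangled policies achieve quantum advantage in single-round games and in a Dec-POMDP-style multi-agent queueing problem.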

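Proposed Method

Introduce QuantumSoftmax, a differentiable transformation that maps complex matrices to valid quantum POVMs (positive operator-valued measures), enabling gradient-based optimization over quantum measurements.
Formulate shared-entanglement policies as $\pi(a \mid h) = \operatorname{tr}\big(\rho \bigotimes_i M_i(a_i \mid h_i)\big)$ and prove that they strictly extend shared-randomness policies.
Propose an advisor-based policy architecture in which a coordinator samples a quantum advice input x and the local actors act conditioned on x, enabling decentralized execution.
Combine the framework with a modified multi-agent PPO (MAPPO) to learn sequential decision-making policies under quantum entanglement constraints.
Provide training procedures for nonlocal games (REINFORCE) and for the Dec-POMDP setting, including entropy regularization and PPO-based objectives.
Discuss how q(x|h) can encode quantum measurements, shared randomness, or other coordination signals, and how it can be sampled in a decentralized manner.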
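The first two items of the method can be illustrated with a minimal NumPy sketch. It is not the paper's QuantumSoftmax, whose exact definition is not given in this review. The function `povm_from_matrices` is a hypothetical stand-in that uses one standard normalization, $M_k = S^{-1/2} A_k^\dagger A_k S^{-1/2}$ with $S = \sum_k A_k^\dagger A_k$, to turn arbitrary complex matrices into a valid POVM. The function `joint_policy` then evaluates $\operatorname{tr}(\rho\, M_a \otimes M_b)$ for two agents with fixed local histories. The two-qubit Bell state and all names below are assumptions made for illustration.

```python
import numpy as np

def povm_from_matrices(A):
    """Map K arbitrary complex d x d matrices to a K-outcome POVM.

    Assumed normalization (not necessarily the paper's QuantumSoftmax):
    M_k = S^{-1/2} A_k^† A_k S^{-1/2}, where S = sum_k A_k^† A_k.
    """
    # G[k] = A_k^† A_k is positive semidefinite by construction.
    G = np.einsum("kji,kjl->kil", A.conj(), A)
    S = G.sum(axis=0)
    # Inverse square root of the Hermitian matrix S (assumed invertible).
    w, V = np.linalg.eigh(S)
    S_inv_sqrt = (V / np.sqrt(w)) @ V.conj().T
    # Broadcasts over k; the elements now sum to the identity.
    return S_inv_sqrt @ G @ S_inv_sqrt

def joint_policy(rho, M_a, M_b):
    """pi(a, b) = tr(rho (M_a ⊗ M_b)) for two agents."""
    return np.array([[np.trace(rho @ np.kron(Ma, Mb)).real for Mb in M_b]
                     for Ma in M_a])

rng = np.random.default_rng(0)

def random_params(K=2, d=2):
    # Unconstrained parameters, e.g. the output of a policy network.
    return rng.normal(size=(K, d, d)) + 1j * rng.normal(size=(K, d, d))

# Shared entangled state: the Bell state (|00> + |11>) / sqrt(2).
phi = np.array([1, 0, 0, 1]) / np.sqrt(2)
rho = np.outer(phi, phi.conj())

M_a = povm_from_matrices(random_params())
M_b = povm_from_matrices(random_params())
pi = joint_policy(rho, M_a, M_b)  # 2 x 2 table of joint action probabilities
```

Because every step is a smooth matrix operation wherever `S` is invertible, the same construction could be written in an autodiff framework so that gradients flow from the return back to the unconstrained parameters. NumPy is used here only to keep the sketch short.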

Figure 1 : Hierarchy of policies. Here, $\bm{\Pi}_{\mathsf{F}}$ is the space of factorized policies, $\bm{\Pi}_{\mathsf{SR}}$ the space of shared randomness policies, $\bm{\Pi}_{\mathsf{Q}}$ the space of shared (quantum) entanglement policies, $\bm{\Pi}_{\mathsf{NS}}$ the space of non-signaling poli
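Read as nested sets, and using only the four classes named in the caption, the hierarchy can be written as the chain below. The ordering is the standard one from the nonlocal-games literature and is an assumption about how the figure arranges the classes, since the caption is cut off here.

$$
\bm{\Pi}_{\mathsf{F}} \subseteq \bm{\Pi}_{\mathsf{SR}} \subseteq \bm{\Pi}_{\mathsf{Q}} \subseteq \bm{\Pi}_{\mathsf{NS}}
$$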

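Experimental Results

Research Questions

RQ1: Without communication, can learning-based MARL exploit shared quantum entanglement to achieve coordination beyond what classical resources allow?
RQ2: How can quantum measurements be parameterized and optimized within a differentiable MARL framework?
RQ3: Do entanglement-based cooperative policies achieve quantum advantage in both single-round nonlocal games and sequential Dec-POMDP-style problems?
RQ4: Can an advisor-based policy architecture effectively separate the quantum coordinator from the local actors while remaining trainable with gradient-based RL methods?

Key Findings

The framework learns policies with quantum advantage in single-round nonlocal games treated as black-box oracles.
In a Dec-POMDP-style multi-router/multi-server queueing problem, the learned entangled policies achieve lower waiting times than the best known shared-randomness policy across multiple throughput settings.
Entropy regularization helps discover quantum advantage in nonlocal games by avoiding convergence to classically optimal deterministic policies.
The queueing experiments agree with prior theoretical results, showing that under the no-communication constraint quantum entanglement improves coordination relative to classical baselines.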
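To make "quantum advantage" in a single-round nonlocal game concrete, the sketch below uses the textbook CHSH game. This is an illustrative assumption; the review does not say which games the paper evaluates. The players receive bits x and y, answer with bits a and b, and win when a XOR b equals x AND y. The code enumerates all deterministic classical strategies and then evaluates a known entangled strategy with the Born rule. The measurement angles are the standard textbook choice, not values learned by the paper's method.

```python
import itertools
import numpy as np

def wins(x, y, a, b):
    # CHSH rule: win iff a XOR b == x AND y.
    return (a ^ b) == (x & y)

# Classical value: best deterministic strategy a = fa[x], b = fb[y].
# Shared randomness only mixes such strategies, so it cannot do better.
classical = max(
    sum(wins(x, y, fa[x], fb[y]) for x in (0, 1) for y in (0, 1)) / 4
    for fa in itertools.product((0, 1), repeat=2)
    for fb in itertools.product((0, 1), repeat=2)
)

def projectors(theta):
    # Projective measurement in the real basis rotated by theta.
    v0 = np.array([np.cos(theta), np.sin(theta)])
    v1 = np.array([-np.sin(theta), np.cos(theta)])
    return [np.outer(v0, v0), np.outer(v1, v1)]

# Shared Bell state (|00> + |11>) / sqrt(2).
phi = np.array([1, 0, 0, 1]) / np.sqrt(2)
rho = np.outer(phi, phi)

# Textbook measurement angles, indexed by each player's input bit.
alice = {0: 0.0, 1: np.pi / 4}
bob = {0: np.pi / 8, 1: -np.pi / 8}

# Average winning probability over uniformly random inputs (x, y).
quantum = sum(
    np.trace(rho @ np.kron(projectors(alice[x])[a], projectors(bob[y])[b]))
    for x in (0, 1) for y in (0, 1)
    for a in (0, 1) for b in (0, 1)
    if wins(x, y, a, b)
) / 4

print(classical, round(float(quantum), 4))  # 0.75 0.8536
```

The gap between 0.75 and roughly 0.8536 is the kind of margin a learner has to find. A policy that collapses early onto one of the deterministic strategies stays at the classical value, which is consistent with the finding above that entropy regularization helps.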

Figure 2 : Decentralized and parameterized implementation of a joint policy with shared quantum entanglement.

This review was generated by AI and reviewed by a human editor.