QUICK REVIEW

[Paper Review] Learning to Coordinate via Quantum Entanglement in Multi-Agent Reinforcement Learning

John Gardiner, Orlando Romero|arXiv (Cornell University)|Feb 9, 2026

Quantum Computing Algorithms and Architecture|Citations: 0

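One-line summary

This paper trains multi-agent reinforcement learning (MARL) agents with a differentiable framework that uses shared quantum entanglement as a coordination resource, enabling communication-free correlated policies beyond what shared randomness allows. It demonstrates quantum advantage in single-round games and in a Dec-POMDP-type multi-agent problem, using a quantum coordinator/advisor architecture integrated with MAPPO.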

ABSTRACT

The inability to communicate poses a major challenge to coordination in multi-agent reinforcement learning (MARL). Prior work has explored correlating local policies via shared randomness, sometimes in the form of a correlation device, as a mechanism to assist in decentralized decision-making. In contrast, this work introduces the first framework for training MARL agents to exploit shared quantum entanglement as a coordination resource, which permits a larger class of communication-free correlated policies than shared randomness alone. This is motivated by well-known results in quantum physics which posit that, for certain single-round cooperative games with no communication, shared quantum entanglement enables strategies that outperform those that only use shared randomness. In such cases, we say that there is quantum advantage. Our framework is based on a novel differentiable policy parameterization that enables optimization over quantum measurements, together with a novel policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors. To illustrate the effectiveness of our proposed method, we first show that we can learn, purely from experience, strategies that attain quantum advantage in single-round games that are treated as black box oracles. We then demonstrate how our machinery can learn policies with quantum advantage in an illustrative multi-agent sequential decision-making problem formulated as a decentralized partially observable Markov decision process (Dec-POMDP).

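Motivation and objectives

Establish a hierarchy of joint policy classes for communication-free cooperative MARL, highlighting quantum entanglement as a richer coordination resource than shared randomness.
Develop a differentiable policy parameterization that allows quantum measurements to be optimized end to end.
Propose an advice-based policy architecture that decomposes joint policies into a quantum coordinator and decentralized local actors.
Demonstrate learning of entangled policies that attain quantum advantage in both single-round games and a Dec-POMDP-style multi-agent queueing problem.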

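Proposed method

Introduces QuantumSoftmax, a differentiable transformation that maps complex matrices to valid quantum POVMs (positive operator-valued measures), enabling gradient-based optimization over quantum measurements.
Formulates shared entanglement policies as $\pi(a \mid h) = \operatorname{tr}\big(\rho \bigotimes_i M_i(a_i \mid h_i)\big)$ and proves that this class strictly contains the shared randomness policies.
Presents an advisor-based policy architecture in which a coordinator samples a quantum advice input $x$ and the local actors act conditioned on $x$.
Integrates the framework with a modified multi-agent PPO (MAPPO) to learn sequential decision-making policies under the constraints of quantum entanglement.
Provides training procedures for nonlocal games (REINFORCE) and the Dec-POMDP setting, including entropy regularization and a PPO-based objective.
Discusses how $q(x \mid h)$ can encode quantum measurements, shared randomness, and other coordination signals, and how it can be sampled in a decentralized manner.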
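To make the measurement parameterization and the trace formula concrete, here is a minimal NumPy sketch. The review does not give the definition of QuantumSoftmax, so the map below is an assumed stand-in: it normalizes arbitrary complex matrices $A_k$ into POVM elements $E_k = S^{-1/2} A_k^\dagger A_k S^{-1/2}$ with $S = \sum_k A_k^\dagger A_k$. The function names `povm_from_matrices` and `joint_policy` are hypothetical, and the two-agent, single-qubit setting with a Bell state is chosen only for illustration.

```python
import numpy as np

def povm_from_matrices(raw):
    """Map unconstrained complex matrices (shape K x d x d) to a K-outcome POVM.

    Assumed construction, not taken from the paper:
    E_k = S^(-1/2) A_k^† A_k S^(-1/2), where S = sum_k A_k^† A_k.
    Requires S to be full rank.
    """
    gram = np.einsum("kji,kjl->kil", raw.conj(), raw)    # A_k^† A_k, each PSD
    evals, evecs = np.linalg.eigh(gram.sum(axis=0))      # S is Hermitian
    inv_sqrt = (evecs * evals ** -0.5) @ evecs.conj().T  # S^(-1/2)
    return inv_sqrt @ gram @ inv_sqrt                    # elements sum to identity

def joint_policy(rho, povm_a, povm_b):
    """pi(a, b) = tr(rho (M_A(a) ⊗ M_B(b))) for one fixed pair of local histories."""
    return np.array([[np.trace(rho @ np.kron(ea, eb)).real for eb in povm_b]
                     for ea in povm_a])

# Random unconstrained parameters for two agents, two outcomes each.
rng = np.random.default_rng(0)
raw_a = rng.normal(size=(2, 2, 2)) + 1j * rng.normal(size=(2, 2, 2))
raw_b = rng.normal(size=(2, 2, 2)) + 1j * rng.normal(size=(2, 2, 2))
povm_a, povm_b = povm_from_matrices(raw_a), povm_from_matrices(raw_b)

# Shared entangled state (|00> + |11>) / sqrt(2) as a density matrix.
bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
rho = np.outer(bell, bell.conj())

probs = joint_policy(rho, povm_a, povm_b)
print(probs, probs.sum())
```

Every step is a smooth matrix operation whenever $S$ is full rank, so in principle the same map could be ported to an autodiff library and trained with gradients. Whether the paper's QuantumSoftmax takes this form is not stated in the review.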

Figure 1 : Hierarchy of policies. Here, $\bm{\Pi}_{\mathsf{F}}$ is the space of factorized policies, $\bm{\Pi}_{\mathsf{SR}}$ the space of shared randomness policies, $\bm{\Pi}_{\mathsf{Q}}$ the space of shared (quantum) entanglement policies, $\bm{\Pi}_{\mathsf{NS}}$ the space of non-signaling poli

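Experimental results

Research questions

RQ1: Can learning-based MARL agents, without communication, use shared quantum entanglement to achieve coordination beyond what classical resources allow?
RQ2: How can quantum measurements be parameterized and optimized in a differentiable MARL framework?
RQ3: Do entanglement-based coordination strategies yield quantum advantage in both single-round nonlocal games and sequential Dec-POMDP-like problems?
RQ4: Can an advisor-based policy architecture effectively separate the quantum coordinator from the local actors while remaining compatible with gradient-based RL methods and implementable in practice?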

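Key findings

The framework learns strategies with quantum advantage in single-round nonlocal games treated as black-box oracles.
In a Dec-POMDP-style multi-router/multi-server queueing problem, the learned entangled policy achieves shorter waiting times than known shared randomness strategies across several throughput settings.
Entropy regularization helps discover quantum advantage in nonlocal games by preventing convergence to deterministic classical optima.
The queueing experiments support the theoretical results, showing that, compared with classical baselines, quantum entanglement improves coordination under the no-communication constraint.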
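As a point of reference for what "quantum advantage" means in a single-round nonlocal game, the sketch below works through the CHSH game, the textbook example. The review does not say which games the paper uses, so CHSH is an assumption made purely for illustration, and all function names are hypothetical. Each player receives a bit ($x$ or $y$), answers with a bit ($a$ or $b$), and the pair wins when $a \oplus b = x \wedge y$. The code brute-forces the best deterministic classical strategy and compares it with the standard entangled strategy, which is fixed here rather than learned.

```python
import itertools
import math

def chsh_win(a, b, x, y):
    """CHSH winning condition: a XOR b must equal x AND y."""
    return (a ^ b) == (x & y)

def classical_value():
    """Best win probability over deterministic local strategies.

    Shared randomness only mixes such strategies, so it cannot exceed this maximum.
    """
    best = 0.0
    for fa in itertools.product((0, 1), repeat=2):      # Alice's answers for x = 0, 1
        for fb in itertools.product((0, 1), repeat=2):  # Bob's answers for y = 0, 1
            wins = sum(chsh_win(fa[x], fb[y], x, y)
                       for x in (0, 1) for y in (0, 1))
            best = max(best, wins / 4)
    return best

def basis_vector(theta, outcome):
    """Real qubit measurement direction for the given outcome at angle theta."""
    if outcome == 0:
        return (math.cos(theta), math.sin(theta))
    return (-math.sin(theta), math.cos(theta))

def entangled_prob(theta_a, theta_b, a, b):
    """Born-rule probability of outcomes (a, b) on the state (|00> + |11>)/sqrt(2)."""
    va, vb = basis_vector(theta_a, a), basis_vector(theta_b, b)
    amplitude = (va[0] * vb[0] + va[1] * vb[1]) / math.sqrt(2)
    return amplitude ** 2

def quantum_value():
    """Win probability of the standard CHSH measurement angles (not learned)."""
    alice = (0.0, math.pi / 4)           # Alice's angle for x = 0, 1
    bob = (math.pi / 8, -math.pi / 8)    # Bob's angle for y = 0, 1
    total = 0.0
    for x, y, a, b in itertools.product((0, 1), repeat=4):
        if chsh_win(a, b, x, y):
            total += entangled_prob(alice[x], bob[y], a, b) / 4  # uniform inputs
    return total

print(classical_value())  # 0.75
print(quantum_value())    # cos^2(pi/8), about 0.854
```

The gap between 0.75 and roughly 0.854 is the kind of advantage the findings above refer to. In the paper's setting the measurement angles would be found by training rather than written down by hand.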

Figure 2 : Decentralized and parameterized implementation of a joint policy with shared quantum entanglement.

This review was written by AI and checked by a human editor.