QUICK REVIEW

[Paper Review] Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems

Risal Shefin, Debashis Gupta|arXiv (Cornell University)|Feb 8, 2026

Adversarial Robustness in Machine Learning | Citations: 0

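One-Sentence Summary

This paper presents a two-stage gradient-based framework for interpretable failure analysis in MARL that identifies the true Patient-0 and traces how failures propagate through learned coordination. It reports 88.2-99.4% Patient-0 detection accuracy on the Simple Spread and StarCraft II benchmarks.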

ABSTRACT

Multi-Agent Reinforcement Learning (MARL) is increasingly deployed in safety-critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two-stage gradient-based framework that provides interpretable diagnostics for three critical failure analysis tasks: (1) detecting the true initial failure source (Patient-0); (2) validating why non-attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per-agent failure detection via Taylor-remainder analysis of policy-gradient costs, declaring an initial Patient-0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives (first-order sensitivity and directional second-order curvature aggregated over causal windows) to construct interpretable contagion graphs. This approach explains "downstream-first" detection anomalies by revealing pathways that amplify upstream deviations. Evaluated across 500 episodes in Simple Spread (3 and 5 agents) and 100 episodes in StarCraft II using MADDPG and HATRPO, our method achieves 88.2-99.4% Patient-0 detection accuracy while providing interpretable geometric evidence for detection decisions. By moving beyond black-box detection to interpretable gradient-level forensics, this framework offers practical tools for diagnosing cascading failures in safety-critical MARL systems.

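Motivation and Objectives

Motivates the need for interpretable failure analysis in safety-critical MARL settings.
Proposes a two-stage gradient-based framework that detects the true failure source and validates propagation pathways.
Provides interpretable contagion graphs that summarize influence, amplification, and failure timing.
Empirically evaluates the method across multiple environments and MARL algorithms, targeting both Patient-0 detection accuracy and practical explainability.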

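Proposed Method

Stage 1: Detects each agent's policy instability via Taylor-remainder analysis of policy-gradient costs and identifies a Patient-0 candidate at the first threshold crossing.
Stage 2: Traces upstream influence using critic derivatives (first-order) and directional second-order curvature, and builds interpretable contagion graphs to validate the Patient-0 candidate.
Uses directed contagion graphs that summarize influence strength, amplification frequency, and propagation timing.
Computes flow-based metrics G_{ij}, H_{ij}, D_{ij} to distinguish accelerating from decaying influence and to expose downstream-first misdetections.
Aggregates information over short causal windows to produce edge-level summaries (IS, CR) and an episode-level influence graph.
Provides an intervention protocol that validates the causal role of high-influence edges by comparing attacks at critical moments with attacks at robust moments.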
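The Stage 1 idea (a Taylor-remainder signal per agent, then the first threshold crossing) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the toy quadratic cost, the per-agent error traces, and the threshold value are all assumptions made for the example, and the review does not say how the paper chooses its threshold.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Vector = Sequence[float]


def taylor_remainder(
    cost: Callable[[Vector], float],
    grad: Callable[[Vector], Vector],
    theta: Vector,
    delta: Vector,
) -> float:
    """First-order Taylor remainder |J(theta+delta) - J(theta) - grad J(theta) . delta|."""
    shifted = [t + d for t, d in zip(theta, delta)]
    linear_term = sum(g * d for g, d in zip(grad(theta), delta))
    return abs(cost(shifted) - cost(theta) - linear_term)


def first_crossing(series: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first value strictly above the threshold, or None."""
    for t, value in enumerate(series):
        if value > threshold:
            return t
    return None


def patient_zero_candidate(
    errors: Dict[str, Sequence[float]], threshold: float
) -> Optional[Tuple[str, int]]:
    """Agent with the earliest crossing; ties are broken by agent name (an assumption)."""
    flagged = []
    for agent, series in errors.items():
        t = first_crossing(series, threshold)
        if t is not None:
            flagged.append((t, agent))
    if not flagged:
        return None
    t, agent = min(flagged)
    return agent, t


# Toy check: for J(theta) = sum(theta_k^2) the remainder equals |delta|^2.
def quadratic_cost(theta: Vector) -> float:
    return sum(t * t for t in theta)


def quadratic_grad(theta: Vector) -> Vector:
    return [2.0 * t for t in theta]


remainder = taylor_remainder(quadratic_cost, quadratic_grad, [1.0, 2.0], [0.1, 0.0])

# Hypothetical per-agent Taylor-error traces, one value per timestep.
errors = {
    "agent_0": [0.01, 0.02, 0.30, 0.40],
    "agent_1": [0.01, 0.25, 0.30, 0.35],
    "agent_2": [0.01, 0.01, 0.02, 0.02],
}
candidate = patient_zero_candidate(errors, threshold=0.2)
print(candidate)  # ('agent_1', 1)
```

Note that the earliest crossing only yields a candidate: as the method description says, Stage 2 is what validates or corrects it when a downstream agent happens to cross first.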

(a) Stage 1: Taylor approximation error in all agents

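Experimental Results

Research Questions

RQ1: Who is the true Patient-0, i.e., which agent is the first to enter a non-robust state?
RQ2: Why are non-attacked agents sometimes flagged first, and can traceback correct this misidentification?
RQ3: How does instability propagate between agents over time through the system's learned coordination pathways?
RQ4: Can the framework provide interpretable contagion graphs that summarize influence, amplification, and failure timing?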

Key Findings

Setting	Algorithm	Stage-1 Accuracy	Correction Rate	Combined Accuracy
SimpleSpread-3	MADDPG	95.7%	66.9%	98.6%
SimpleSpread-3	HATRPO	99.1%	66.7%	99.4%
SimpleSpread-5	MADDPG	88.1%	40.1%	92.8%
SimpleSpread-5	HATRPO	98.9%	48.6%	99.2%
SMAC (3s_v_3z)	MADDPG	84.0%	70.8%	88.2%
SMAC (3s_v_3z)	HATRPO	94.8%	67.7%	98.3%
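To make the table easier to read, the snippet below computes the gain from Stage 1 to the combined result in percentage points. It assumes that the Combined Accuracy column reports accuracy after Stage 2 has been applied on top of Stage 1; the numbers are copied from the table, and the helper name `stage2_gain` is ours.

```python
# (setting, algorithm, Stage-1 accuracy %, Combined accuracy %), copied from the table.
rows = [
    ("SimpleSpread-3", "MADDPG", 95.7, 98.6),
    ("SimpleSpread-3", "HATRPO", 99.1, 99.4),
    ("SimpleSpread-5", "MADDPG", 88.1, 92.8),
    ("SimpleSpread-5", "HATRPO", 98.9, 99.2),
    ("SMAC (3s_v_3z)", "MADDPG", 84.0, 88.2),
    ("SMAC (3s_v_3z)", "HATRPO", 94.8, 98.3),
]


def stage2_gain(stage1: float, combined: float) -> float:
    """Gain in percentage points, rounded to one decimal like the table."""
    return round(combined - stage1, 1)


gains = {(setting, algo): stage2_gain(s1, comb) for setting, algo, s1, comb in rows}

for (setting, algo), gain in gains.items():
    print(f"{setting:15s} {algo:7s} +{gain:.1f} pts")
```

Under that reading, the gains range from 0.3 points (both HATRPO runs on Simple Spread, where Stage 1 is already near 99%) to 4.7 points (MADDPG on SimpleSpread-5).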

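The full two-stage method achieves 88.2% to 99.4% Patient-0 identification accuracy across settings (Combined Accuracy column); Stage 1 alone ranges from 84.0% to 99.1%.
Stage 2 correction improves accuracy over Stage 1 in every setting, with notable gains in cooperative environments (e.g., SMAC).
Instability Occupancy (IO) outperforms conventional performance metrics (AUC-Q, AUC-Reward, and other reward-based metrics) by roughly 20 points or more in many cases.
HATRPO generally yields cleaner Taylor-error signals and higher Stage 1 accuracy than MADDPG, which is attributed to a smoother gradient landscape.
Stage 2 traceback effectively recovers the true upstream source in downstream-first failure episodes and produces interpretable contagion graphs.
Interventions at critical (accelerating) moments cause significantly stronger downstream instability than interventions at robust moments, validating the causal usefulness of the influence metrics.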
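The distinction between accelerating and robust moments rests on first-order sensitivity and directional second-order curvature of the critic. The sketch below estimates both by central finite differences on toy critics and labels each directed edge. It is only an illustration under stated assumptions: the review does not define G_{ij}, H_{ij}, or D_{ij}, so the quantities `slope` and `curvature`, the sign-agreement rule for "accelerating", the one-scalar-action-per-agent setup, and the toy critics `q0` and `q1` are all ours and are not claimed to match the paper.

```python
from typing import Callable, List, NamedTuple, Sequence

Critic = Callable[[Sequence[float]], float]


class Edge(NamedTuple):
    source: int      # agent whose action is perturbed
    target: int      # agent whose critic is probed
    slope: float     # first-order sensitivity along the perturbation
    curvature: float # directional second-order curvature
    label: str       # "accelerating" or "decaying" (illustrative rule)


def _shift(actions: Sequence[float], direction: Sequence[float], step: float) -> List[float]:
    return [a + step * d for a, d in zip(actions, direction)]


def directional_slope(q: Critic, actions, direction, h: float = 1e-3) -> float:
    """Central-difference estimate of the directional derivative of q."""
    return (q(_shift(actions, direction, h)) - q(_shift(actions, direction, -h))) / (2.0 * h)


def directional_curvature(q: Critic, actions, direction, h: float = 1e-3) -> float:
    """Central-difference estimate of d^T (Hessian of q) d."""
    plus = q(_shift(actions, direction, h))
    minus = q(_shift(actions, direction, -h))
    return (plus - 2.0 * q(actions) + minus) / (h * h)


def influence_edges(critics: List[Critic], actions: Sequence[float], h: float = 1e-3) -> List[Edge]:
    """One edge per ordered pair (source != target); assumes one scalar action per agent."""
    n = len(actions)
    edges = []
    for target, q in enumerate(critics):
        for source in range(n):
            if source == target:
                continue
            direction = [1.0 if k == source else 0.0 for k in range(n)]
            slope = directional_slope(q, actions, direction, h)
            curvature = directional_curvature(q, actions, direction, h)
            # Assumed rule: same sign means the effect grows with the deviation;
            # a non-positive product (including zero) is labelled "decaying".
            label = "accelerating" if slope * curvature > 0 else "decaying"
            edges.append(Edge(source, target, slope, curvature, label))
    return edges


# Toy critics over the joint action (a_0, a_1); purely hypothetical.
def q0(a: Sequence[float]) -> float:
    return a[0] + 2.0 * a[1] + 3.0 * a[1] ** 2


def q1(a: Sequence[float]) -> float:
    return a[1] + 0.5 * a[0] - a[0] ** 2


edges = influence_edges([q0, q1], [0.0, 1.0])
for edge in edges:
    print(edge.source, "->", edge.target, edge.label)
```

In a real system the critics would be learned networks and the derivatives would more likely come from automatic differentiation than from finite differences; the finite-difference form is used here only to keep the example dependency-free.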

(b) Stage 1,2: Influence timeline from the detection time of Patient-0 to the detection of the last faulty agent

This review was generated by AI and checked by a human editor.