QUICK REVIEW

[Paper Review] Interpretable Failure Analysis in Multi-Agent Reinforcement Learning Systems

Risal Shefin, Debashis Gupta|arXiv (Cornell University)|Feb 8, 2026

Adversarial Robustness in Machine Learning | Cited by 0

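One-Sentence Summary

This paper proposes a two-stage gradient-based framework for interpretable failure analysis in multi-agent reinforcement learning (MARL). It identifies the true Patient-0 and traces how failures propagate through learned coordination, reaching 88.2–99.4% Patient-0 detection accuracy on the Simple Spread and StarCraft II benchmarks.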

ABSTRACT

Multi-Agent Reinforcement Learning (MARL) is increasingly deployed in safety-critical domains, yet methods for interpretable failure detection and attribution remain underdeveloped. We introduce a two-stage gradient-based framework that provides interpretable diagnostics for three critical failure analysis tasks: (1) detecting the true initial failure source (Patient-0); (2) validating why non-attacked agents may be flagged first due to domino effects; and (3) tracing how failures propagate through learned coordination pathways. Stage 1 performs interpretable per-agent failure detection via Taylor-remainder analysis of policy-gradient costs, declaring an initial Patient-0 candidate at the first threshold crossing. Stage 2 provides validation through geometric analysis of critic derivatives (first-order sensitivity and directional second-order curvature), aggregated over causal windows to construct interpretable contagion graphs. This approach explains "downstream-first" detection anomalies by revealing pathways that amplify upstream deviations. Evaluated across 500 episodes in Simple Spread (3 and 5 agents) and 100 episodes in StarCraft II using MADDPG and HATRPO, our method achieves 88.2–99.4% Patient-0 detection accuracy while providing interpretable geometric evidence for detection decisions. By moving beyond black-box detection to interpretable gradient-level forensics, this framework offers practical tools for diagnosing cascading failures in safety-critical MARL systems.

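Motivation and Objectives

Explain why interpretable failure analysis is needed in safety-critical MARL settings.
Propose a two-stage gradient-based framework that detects the true failure source and validates propagation pathways.
Provide interpretable contagion graphs that summarize the influence, amplification, and timing of failures.
Evaluate empirically across multiple environments and MARL algorithms, showing high Patient-0 detection accuracy and informative explanations.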

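Proposed Method

Stage 1: Detect per-agent policy instability through Taylor-remainder analysis of policy-gradient costs, and declare a Patient-0 candidate at the first threshold crossing.
Stage 2: Validate the Patient-0 candidate by tracing upstream influence, using the critic's first-order derivatives and directional second-order curvature to build an interpretable contagion graph.
Represent propagation as a directed contagion graph whose edges summarize influence strength, amplification frequency, and propagation timing.
Compute flow-based metrics such as G_{ij}, H_{ij}, and D_{ij} to distinguish accelerating from damping influence and to expose downstream-first false detections.
Aggregate information over short causal windows to produce edge-level summaries (IS, CR) and episode-level influence graphs.
Provide an intervention protocol that validates the causal role of high-influence edges by comparing attacks at critical moments with attacks at robust moments.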
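
To make Stage 1 concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the Taylor remainder is the absolute gap between a cost and its first-order approximation, and that the Patient-0 candidate is the agent whose remainder first exceeds a fixed threshold. The function names, the quadratic toy cost, the threshold value 0.1, and the remainder trace are all hypothetical; the paper's actual policy-gradient cost and threshold rule are not given in this review.

```python
import numpy as np

def taylor_remainder(cost, grad, x, delta):
    """Absolute first-order Taylor remainder |cost(x+delta) - cost(x) - grad(x)·delta|."""
    linear_prediction = cost(x) + grad(x) @ delta
    return abs(cost(x + delta) - linear_prediction)

def first_crossings(remainders, threshold):
    """remainders has shape (timesteps, agents).
    Returns, per agent, the first timestep whose remainder exceeds threshold, or -1."""
    crossed = remainders > threshold
    first = np.argmax(crossed, axis=0)        # index of the first True per column
    first[~crossed.any(axis=0)] = -1          # agents that never cross
    return first

def patient_zero_candidate(remainders, threshold):
    """Agent with the earliest threshold crossing, or None if nobody crosses."""
    first = first_crossings(remainders, threshold)
    flagged = np.flatnonzero(first >= 0)
    if flagged.size == 0:
        return None
    return int(flagged[np.argmin(first[flagged])])

# Toy check of the remainder on a quadratic cost (hypothetical, for illustration only).
toy_cost = lambda x: 0.5 * (x @ x)
toy_grad = lambda x: x
toy_r = taylor_remainder(toy_cost, toy_grad, np.array([1.0, -2.0]), np.array([0.2, 0.0]))

# Hypothetical remainder trace: 5 timesteps, 3 agents.
remainders = np.array([
    [0.01, 0.02, 0.01],
    [0.02, 0.03, 0.01],
    [0.02, 0.30, 0.02],   # agent 1 crosses at t = 2
    [0.25, 0.40, 0.03],   # agent 0 crosses at t = 3
    [0.30, 0.50, 0.04],
])
threshold = 0.1  # assumed value, not taken from the paper

crossing_times = first_crossings(remainders, threshold)
candidate = patient_zero_candidate(remainders, threshold)
print("first crossings:", crossing_times.tolist())
print("Patient-0 candidate:", candidate)
```

In this toy trace agent 1 crosses the threshold one step before agent 0, so it becomes the candidate that Stage 2 would then validate or correct.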

(a) Stage 1: Taylor approximation error in all agents

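Experimental Results

Research Questions

RQ1: Who is the true Patient-0, that is, the first agent to enter a non-robust state?
RQ2: Why might a non-attacked agent be flagged first, and can tracing correct this misidentification?
RQ3: How does instability propagate between agents over time through the system's learned coordination pathways?
RQ4: Can the framework provide interpretable contagion graphs that summarize influence, amplification, and failure timing?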

Key Findings

Setting	Algorithm	Stage-1 Accuracy	Correction Rate	Combined Accuracy
SimpleSpread-3	MADDPG	95.7%	66.9%	98.6%
SimpleSpread-3	HATRPO	99.1%	66.7%	99.4%
SimpleSpread-5	MADDPG	88.1%	40.1%	92.8%
SimpleSpread-5	HATRPO	98.9%	48.6%	99.2%
SMAC (3s_v_3z)	MADDPG	84.0%	70.8%	88.2%
SMAC (3s_v_3z)	HATRPO	94.8%	67.7%	98.3%
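
As a quick arithmetic check on the table above, this short Python sketch computes the Stage-2 gain as Combined Accuracy minus Stage-1 Accuracy, in percentage points. The numbers are copied from the table; the helper name `stage2_gain` is illustrative, and no assumption is made about how the paper derives Combined Accuracy from Correction Rate.

```python
# (setting, algorithm, Stage-1 accuracy %, combined accuracy %), copied from the table.
rows = [
    ("SimpleSpread-3", "MADDPG", 95.7, 98.6),
    ("SimpleSpread-3", "HATRPO", 99.1, 99.4),
    ("SimpleSpread-5", "MADDPG", 88.1, 92.8),
    ("SimpleSpread-5", "HATRPO", 98.9, 99.2),
    ("SMAC (3s_v_3z)", "MADDPG", 84.0, 88.2),
    ("SMAC (3s_v_3z)", "HATRPO", 94.8, 98.3),
]

def stage2_gain(stage1, combined):
    """Gain from Stage-2 correction in percentage points, rounded to one decimal."""
    return round(combined - stage1, 1)

gains = {(setting, algo): stage2_gain(s1, comb) for setting, algo, s1, comb in rows}
for (setting, algo), gain in gains.items():
    print(f"{setting:15s} {algo:7s} +{gain:.1f} pp")
```

By this measure the gains for MADDPG range from 2.9 to 4.7 points and HATRPO on SMAC gains 3.5 points, while HATRPO on Simple Spread gains only 0.3 points because Stage 1 alone is already near 99%.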

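Across settings, Stage-1 detection alone reaches 84.0%–99.1% Patient-0 identification accuracy; with Stage-2 correction, combined accuracy is 88.2%–99.4%.
Stage-2 correction improves accuracy in every setting, with notable gains in cooperative environments such as SMAC (from 84.0% to 88.2% for MADDPG and from 94.8% to 98.3% for HATRPO).
In many cases, Instability Occupancy (IO) outperforms traditional performance metrics (AUC-Q, AUC-Reward, and other reward-based measures) by roughly 20 percentage points or more.
Compared with MADDPG, HATRPO generally yields cleaner Taylor-error signals and higher Stage-1 accuracy, which the authors attribute to a smoother gradient landscape.
Stage-2 backtracking effectively recovers the true upstream source in downstream-first failure scenarios and produces interpretable contagion graphs.
Interventions at critical (accelerating) moments affect downstream instability significantly more than interventions at robust moments, which supports the causal usefulness of the influence measures.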
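
The Stage-2 quantities referred to above (first-order sensitivity and directional second-order curvature of a critic, aggregated over a window) can be illustrated with a rough Python sketch. This is an assumption-laden toy, not the paper's procedure: central finite differences stand in for automatic differentiation, each agent has a scalar action, aggregation is a plain mean, and the critics, the window, and the `min_sensitivity` cutoff are hypothetical. How the paper maps curvature to accelerating versus damping influence is not stated in this review, so the sketch only reports the value.

```python
import numpy as np

def directional_derivatives(critic, joint_action, j, eps=1e-3):
    """Central finite differences of `critic` along agent j's scalar action.
    Returns (first derivative, second derivative) in that direction."""
    step = np.zeros_like(joint_action)
    step[j] = eps
    q_plus = critic(joint_action + step)
    q_mid = critic(joint_action)
    q_minus = critic(joint_action - step)
    first = (q_plus - q_minus) / (2 * eps)
    second = (q_plus - 2 * q_mid + q_minus) / eps**2
    return first, second

def contagion_edges(critics, window, min_sensitivity=0.1):
    """Build directed edges (j, i): agent j's action influences agent i's critic.
    Each edge stores (mean |sensitivity|, mean curvature) over the window."""
    edges = {}
    for i, critic in enumerate(critics):
        for j in range(len(critics)):
            if i == j:
                continue
            stats = np.array([directional_derivatives(critic, a, j) for a in window])
            sensitivity = float(np.abs(stats[:, 0]).mean())
            curvature = float(stats[:, 1].mean())
            if sensitivity >= min_sensitivity:   # assumed cutoff for keeping an edge
                edges[(j, i)] = (sensitivity, curvature)
    return edges

# Hypothetical toy critics over a joint action a = [a0, a1, a2].
critics = [
    lambda a: -(a[0] - a[1]) ** 2,   # agent 0's value depends on agent 1
    lambda a: -a[1] ** 2,            # agent 1's value depends only on itself
    lambda a: -(a[2] - a[0]) ** 2,   # agent 2's value depends on agent 0
]
# Hypothetical window of joint actions (two timesteps).
window = [np.array([0.0, 0.5, 0.0]), np.array([0.3, 0.8, 0.0])]

edges = contagion_edges(critics, window)
for (src, dst), (sens, curv) in sorted(edges.items()):
    print(f"agent {src} -> agent {dst}: sensitivity={sens:.3f}, curvature={curv:.3f}")
```

In this toy setup the surviving edges form the chain 1 -> 0 -> 2, the kind of pathway along which a deviation at agent 1 could surface first at a downstream agent.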

(b) Stage 1,2: Influence timeline from the detection time of Patient-0 to the detection of the last faulty agent

This review was generated by AI and reviewed by a human editor.