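[Paper Explained] Who Deserves the Reward? SHARP: Shapley Credit-based Optimization for Multi-Agent System
SHARP introduces Shapley-based hierarchical attribution and a tripartite reward design to stabilize and improve the training of tool-augmented multi-agent LLM systems, significantly outperforming single-agent and other multi-agent baselines.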
Integrating Large Language Models (LLMs) with external tools via multi-agent systems offers a promising new paradigm for decomposing and solving complex problems. However, training these systems remains notoriously difficult due to the credit assignment challenge, as it is often unclear which specific functional agent is responsible for the success or failure of decision trajectories. Existing methods typically rely on sparse or globally broadcast rewards, failing to capture individual contributions and leading to inefficient reinforcement learning. To address these limitations, we introduce the Shapley-based Hierarchical Attribution for Reinforcement Policy (SHARP), a novel framework for optimizing multi-agent reinforcement learning via precise credit attribution. SHARP effectively stabilizes training by normalizing agent-specific advantages across trajectory groups, primarily through a decomposed reward mechanism comprising a global broadcast-accuracy reward, a Shapley-based marginal-credit reward for each agent, and a tool-process reward to improve execution efficiency. Extensive experiments across various real-world benchmarks demonstrate that SHARP significantly outperforms recent state-of-the-art baselines, achieving average match improvements of 23.66% and 14.05% over single-agent and multi-agent approaches, respectively.
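Research Motivation and Goals
- Address the credit assignment challenge in tool-integrated multi-agent systems (MAS).
- Design a principled reward decomposition that isolates each agent's contribution while keeping every agent aligned with the global task.
- Stabilize and accelerate multi-agent training through Shapley-based marginal credit and a tripartite reward framework.
- Demonstrate cross-task generalization and scalability across benchmarks and model sizes.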
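Proposed Method
- Propose SHARP, a Shapley-based hierarchical attribution framework with a tripartite reward design: a global broadcast-accuracy reward, a Shapley-based marginal-credit reward, and a tool-process reward.
- Use a counterfactual masking mechanism that estimates each agent's causal impact by removing that agent from the trajectory.
- Normalize agent-specific advantages within trajectory groups to obtain low-variance, coherent gradient updates (group-relative policy gradient).
- Adopt a parameter-sharing self-play setup in which the planner and the workers are instantiated from a single policy via role prompts.
- Approximate Shapley values with a counterfactual credit formula based on trajectory ablation, giving the marginal credit $\mathrm{credit}_{i,m} = R_{\mathrm{acc}}(\tau_i) - R_{\mathrm{acc}}(\tau_i \setminus m)$, where $\tau_i \setminus m$ is trajectory $i$ with agent $m$ masked out.
- Train with a clipped surrogate objective (the SHARP objective) that aggregates agent-level clipped advantages over multiple trajectories.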
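The credit and normalization steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the weighted-sum combination of the three reward terms, the weights `w_credit` and `w_tool`, and all toy numbers are hypothetical. The clipped term follows the standard PPO-style form, which may differ in detail from the SHARP objective.

```python
from statistics import mean, pstdev


def marginal_credit(r_acc_full: float, r_acc_masked: float) -> float:
    """credit_{i,m} = R_acc(tau_i) - R_acc(tau_i with agent m masked out)."""
    return r_acc_full - r_acc_masked


def agent_reward(r_acc: float, credit: float, r_tool: float,
                 w_credit: float = 1.0, w_tool: float = 1.0) -> float:
    """Combine the three reward terms. The weighted sum is an assumption."""
    return r_acc + w_credit * credit + w_tool * r_tool


def group_normalize(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: subtract the group mean, divide by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped term: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Toy group of four trajectories for one agent m (all numbers are made up).
r_acc_full = [1.0, 1.0, 0.0, 0.0]    # accuracy reward of the full trajectory
r_acc_masked = [0.0, 1.0, 0.0, 0.0]  # accuracy reward with agent m masked out
r_tool = [0.1, 0.0, 0.1, 0.0]        # tool-process reward

credits = [marginal_credit(f, k) for f, k in zip(r_acc_full, r_acc_masked)]
rewards = [agent_reward(a, c, t) for a, c, t in zip(r_acc_full, credits, r_tool)]
advantages = group_normalize(rewards)

for i, (c, r, adv) in enumerate(zip(credits, rewards, advantages)):
    print(f"trajectory {i}: credit={c:.1f} reward={r:.2f} advantage={adv:.3f}")
```

In this toy group only the first trajectory loses its accuracy reward when agent m is masked, so it is the only one that earns marginal credit and it ends up with the largest normalized advantage. The second trajectory succeeds with or without the agent, so the agent receives no extra credit for it.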

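Experimental Results
Research Questions
- RQ1: How does SHARP perform against single-agent and multi-agent baselines across diverse benchmarks?
- RQ2: How does marginal-credit modeling affect performance, and which components contribute most?
- RQ3: How stable and scalable is SHARP under task heterogeneity, model scale, and training budget?
- RQ4: How does SHARP affect planner–worker collaboration and the usefulness of sub-agents?
Key Findings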
| Method | MAS | TRN | BOR | MCR | MuSiQue | GAIA-text | WebWalkerQA | FRAMES | AVG |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3.1-8B RAG | ✗ | ✗ | ✗ | ✗ | 7.20 | 8.82 | 0.77 | 5.81 | 5.65 |
| Qwen3-8B RAG | ✗ | ✗ | ✗ | ✗ | 8.60 | 15.40 | 1.23 | 6.78 | 8.00 |
| Plan-Search † | ✗ | ✗ | ✗ | ✗ | 26.66 | 10.04 | 3.32 | 10.76 | 12.70 |
| Plan-Search | ✗ | ✗ | ✗ | ✗ | 36.35 | 27.48 | 6.77 | 28.48 | 24.77 |
| Search-R1 ‡ | ✗ | ✓ | ✗ | ✗ | 18.11 | 14.15 | 2.30 | 11.30 | 11.47 |
| Single-agent GRPO | ✗ | ✓ | ✗ | ✗ | 45.93 | 27.97 | 7.47 | 30.20 | 27.89 |
| Planner–Worker † | ✓ | ✗ | ✗ | ✗ | 35.22 | 13.36 | 5.57 | 21.21 | 18.84 |
| Planner–Worker | ✓ | ✗ | ✗ | ✗ | 38.23 | 27.53 | 7.42 | 32.18 | 26.34 |
| G-Designer | ✓ | ✗ | ✗ | ✗ | 38.50 | 28.15 | 4.70 | 28.28 | 24.90 |
| CARD | ✓ | ✓ | ✗ | ✗ | 45.00 | 32.89 | 7.38 | 27.31 | 28.15 |
| COA | ✓ | ✓ | ✗ | ✗ | 44.28 | 32.00 | 7.22 | 32.10 | 28.90 |
| AceSearcher † | ✓ | ✓ | ✗ | ✗ | 36.41 | 20.05 | 7.04 | 27.38 | 22.72 |
| MATPO | ✓ | ✓ | ✗ | ✗ | 47.00 | 31.65 | 7.47 | 37.10 | 30.81 |
| SHARP † | ✓ | ✓ | ✓ | ✓ | 46.14 | 23.23 | 7.60 | 25.71 | 25.67 |
| SHARP | ✓ | ✓ | ✓ | ✓ | 50.76 | 33.70 | 8.50 | 37.29 | 32.56 |
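- SHARP achieves the highest average score of all compared methods (32.56 AVG), with reported average match improvements of 23.66% over single-agent baselines and 14.05% over multi-agent baselines across benchmarks.
- Marginal-credit modeling consistently yields the best overall performance, contributing more than architectural or optimization choices.
- SHARP scales effectively with model size (e.g., backbones from 0.6B to 8B), with larger gains at larger scales (up to 14.41 points on the 8B backbone).
- Collaboration analysis shows that SHARP raises planner scores and increases the share of useful sub-agents while reducing harmful interactions.
- SHARP shows cross-task generalization (DocMath-Eval) and stable, monotonic training improvement, indicating a stable optimization process.
- Ablations show that joint planner–worker credit yields synergistic gains: planner credit refines task decomposition, while worker credit improves execution and tool use.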
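As a quick sanity check on the results table, the AVG column appears to be the unweighted mean of the four benchmark scores. The sketch below recomputes it for three rows copied from the table and reports SHARP's margin in absolute points. It does not attempt to reproduce the 23.66% and 14.05% figures, because this summary does not state how they are aggregated. The variable and function names are illustrative.

```python
# Benchmark scores copied from the table: MuSiQue, GAIA-text, WebWalkerQA, FRAMES.
rows = {
    "Single-agent GRPO": [45.93, 27.97, 7.47, 30.20],
    "MATPO": [47.00, 31.65, 7.47, 37.10],
    "SHARP": [50.76, 33.70, 8.50, 37.29],
}
# AVG values as reported in the table, used only for comparison.
reported_avg = {"Single-agent GRPO": 27.89, "MATPO": 30.81, "SHARP": 32.56}


def unweighted_mean(scores: list[float]) -> float:
    """Plain arithmetic mean over the four benchmarks."""
    return sum(scores) / len(scores)


computed_avg = {name: unweighted_mean(scores) for name, scores in rows.items()}

for name, value in computed_avg.items():
    print(f"{name}: computed {value:.4f}, reported {reported_avg[name]:.2f}")

# Absolute gap (in points) between SHARP and the two strongest baselines by AVG.
gap_vs_matpo = computed_avg["SHARP"] - computed_avg["MATPO"]
gap_vs_grpo = computed_avg["SHARP"] - computed_avg["Single-agent GRPO"]
print(f"SHARP vs MATPO: {gap_vs_matpo:.2f} points")
print(f"SHARP vs Single-agent GRPO: {gap_vs_grpo:.2f} points")
```

The recomputed means agree with the reported AVG values to within rounding, which supports reading AVG as a simple mean rather than a weighted one.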

This explainer was generated by AI and reviewed by a human editor.