[Paper Explained] Multi-Agent Teams Hold Experts Back
The paper shows that self-organizing multi-agent LLM teams consistently fail to harness their expert members, underperforming the best individual by 8.1%–37.6% across psychology tasks and ML benchmarks. The main bottleneck is leveraging the expert's knowledge during interaction, not identifying who the expert is.
Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy, where team performance matches or exceeds the best individual member. Across human-inspired and frontier ML benchmarks, we find that -- unlike human teams -- LLM teams consistently fail to match their expert agent's performance, even when explicitly told who the expert is, incurring performance losses of up to 37.6%. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise -- averaging expert and non-expert views rather than appropriately weighting expertise -- which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.
Research Motivation and Goals
- Investigate whether self-organizing heterogeneous LLM teams can achieve strong synergy, matching or exceeding their best member.
- Examine whether failures arise from expert identification or from leveraging expertise during interaction.
- Identify structural and interactional factors that correlate with absence of strong synergy in self-organizing AI teams.
Proposed Method
- Replicate classic human teamwork tasks (NASA Moon Survival, Lost at Sea, Student Body President) with AI agents under controllable expert distribution.
- Evaluate frontier ML benchmarks (MMLU Pro, GPQA Diamond, HLE, MATH-500, SimpleQA) with naturally distributed expert knowledge.
- Compare conditions where expertise is not disclosed, disclosed, or represented by the best individual to decompose performance gaps.
- Measure performance with L1 error for ranking tasks and relative synergy gaps across configurations.
- Conduct ablations to separate expert identification from expert leveraging.
- Perform conversational analysis to link dynamics (epistemic deference vs. integrative compromise) to performance outcomes.
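The L1 error used for the ranking tasks can be illustrated with a minimal sketch. It assumes the metric is the sum of absolute differences between each item's position in the team's ranking and its position in a reference ranking; the item names and orderings below are illustrative placeholders, not the tasks' official answer keys.

```python
def l1_ranking_error(team_ranking, reference_ranking):
    """Sum of absolute position differences between two rankings of the same items.

    Assumption: this is one common way to score ranking tasks; the paper's
    exact scoring details are not given in this summary.
    """
    if set(team_ranking) != set(reference_ranking):
        raise ValueError("rankings must cover the same items")
    reference_position = {item: i for i, item in enumerate(reference_ranking)}
    return sum(
        abs(i - reference_position[item]) for i, item in enumerate(team_ranking)
    )


# Hypothetical five-item example (lower error is better).
reference = ["oxygen", "water", "star map", "food", "radio"]
team = ["water", "oxygen", "food", "star map", "radio"]

print(l1_ranking_error(team, reference))       # 4: two adjacent swaps
print(l1_ranking_error(reference, reference))  # 0: identical rankings
```

Because lower is better for this metric, a team that swaps items the expert had placed correctly is penalized in proportion to how far each item moved.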
Experimental Results
Research Questions
- RQ1: Can heterogeneous LLM teams self-organize to achieve strong synergy and match or exceed their strongest member?
- RQ2: Is the shortfall due to failing to identify who the expert is, or to failing to leverage the expert once identified?
- RQ3: What structural and interactional factors (team size, negotiation style) correlate with missing strong synergy?
Key Findings
Table 1: Relative Synergy Gaps across Human Psychology Tasks

| Task | | | | |
|---|---|---|---|---|
| NASA Moon Survival | 78.7% ± 11.6% | 81.8% ± 12.9% | 113.4% ± 19.0% | 110.1% ± 19.0% |
| Lost at Sea | 55.6% ± 8.4% | 58.6% ± 11.5% | 50.1% ± 8.3% | 42.1% ± 6.9% |
| Student Body President | 98.7% ± 19.3% | 73.5% ± 17.6% | 66.0% ± 16.6% | 17.3% ± 17.7% |

Table 2: Performance on ML benchmarks

| Benchmark | | | | Relative synergy gap |
|---|---|---|---|---|
| SimpleQA | 50.0% | 54.0% | 61.5% | 18.7% |
| GPQA Diamond | 74.0% | 82.0% | 88.5% | 16.4% |
| HLE Text-Only | 29.0% | 35.0% | 46.5% | 37.6% |
| MATH-500 | 67.0% | 73.0% | 79.0% | 15.2% |
| MMLU Pro | 85.0% | 89.0% | 92.5% | 8.1% |
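The final column of the ML benchmark rows can be checked with a short calculation. The sketch below assumes that the first numeric column is the team's score and the third is the best individual's score; the column labels are not given in this summary, so that mapping is an inference. Under that assumption, the relative synergy gap is the shortfall as a fraction of the best individual's score, and it reproduces the reported values to one decimal place.

```python
def relative_synergy_gap(team_score, best_individual_score):
    """Shortfall of the team relative to the best individual (higher score is better)."""
    return (best_individual_score - team_score) / best_individual_score


# (first numeric column, third numeric column, reported gap), copied from the table.
# Assumption: first column = team score, third column = best individual score.
rows = {
    "SimpleQA": (50.0, 61.5, 18.7),
    "GPQA Diamond": (74.0, 88.5, 16.4),
    "HLE Text-Only": (29.0, 46.5, 37.6),
    "MATH-500": (67.0, 79.0, 15.2),
    "MMLU Pro": (85.0, 92.5, 8.1),
}

computed = {}
for name, (team, best, reported) in rows.items():
    computed[name] = 100 * relative_synergy_gap(team, best)
    print(f"{name}: computed {computed[name]:.1f}%, reported {reported}%")
```

A gap of 0% would mean the team matches its best member (strong synergy); every benchmark here falls short of that.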
- LLM teams consistently fail to match their best member, with relative synergy gaps ranging from 8.1% to 37.6% across tasks.
- The primary bottleneck is expert leveraging rather than identification; revealing the expert yields only modest gains.
- Teams show integrative compromise, averaging expert and non-expert views, which correlates negatively with performance and worsens with larger team sizes.
- Consensus-seeking behavior provides robustness to adversarial agents, indicating a trade-off between expertise leveraging and manipulation resistance.
- Expertise dilution increases with team size, reducing performance relative to the expert across tasks (significant correlation, p<0.05).
- In psychology tasks, even with optimized prompts to defer to the expert, teams underperform the expert by substantial margins (e.g., Lost at Sea concentrated: ~55.6% relative synergy gap under Expert Not Mentioned).
- ML benchmarks show relative synergy gaps from 8.1% (MMLU Pro) to 37.6% (HLE Text-Only) under various conditions, even when the best per-problem expert is known.
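A toy model can make the contrast between integrative compromise and epistemic deference concrete. This is an illustration under strong assumptions, not the paper's analysis: each agent holds a single numeric estimate, the expert is exactly right, all non-experts share the same wrong estimate, and "compromise" means an equal-weight average. All values and function names are hypothetical.

```python
from statistics import mean


def integrative_compromise(estimates):
    """Team answer as an equal-weight average of all members' estimates."""
    return mean(estimates)


def epistemic_deference(estimates, expert_index):
    """Team answer taken directly from the designated expert."""
    return estimates[expert_index]


truth = 10.0
expert_estimate = 10.0      # hypothetical: the expert is exactly right
non_expert_estimate = 16.0  # hypothetical: non-experts share one wrong view

compromise_error = {}
deference_error = {}
for team_size in (2, 4, 8):
    estimates = [expert_estimate] + [non_expert_estimate] * (team_size - 1)
    compromise_error[team_size] = abs(integrative_compromise(estimates) - truth)
    deference_error[team_size] = abs(epistemic_deference(estimates, 0) - truth)
    print(team_size, compromise_error[team_size], deference_error[team_size])
# Under these assumptions the compromise error is 3.0, 4.5, 5.25 for
# team sizes 2, 4, 8, while the deference error stays at 0.0.
```

In this simplified setting the expert's weight in the average is 1/n, so the team's error grows as members are added. That mirrors the qualitative pattern of expertise dilution described above, though real conversations are far less mechanical than an arithmetic mean.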
This explainer was generated by AI and reviewed by a human editor.