QUICK REVIEW

[Paper Review] Multi-Agent Teams Hold Experts Back

Aneesh Pappu, Batu El|arXiv (Cornell University)|Feb 1, 2026

Mobile Crowdsensing and Crowdsourcing | Citations: 2

One-line summary

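Summary: The paper shows that self-organizing multi-agent LLM teams fail to leverage their expert members. Across psychology tasks and ML benchmarks, teams fall short of their best individual member, with relative synergy gaps of 8.1% to 37.6% on the ML benchmarks. The shortfall stems mainly from how teams use (or fail to use) expertise, not from failing to identify it.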

ABSTRACT

Multi-agent LLM systems are increasingly deployed as autonomous collaborators, where agents interact freely rather than execute fixed, pre-specified workflows. In such settings, effective coordination cannot be fully designed in advance and must instead emerge through interaction. However, most prior work enforces coordination through fixed roles, workflows, or aggregation rules, leaving open the question of how well self-organizing teams perform when coordination is unconstrained. Drawing on organizational psychology, we study whether self-organizing LLM teams achieve strong synergy, where team performance matches or exceeds the best individual member. Across human-inspired and frontier ML benchmarks, we find that -- unlike human teams -- LLM teams consistently fail to match their expert agent's performance, even when explicitly told who the expert is, incurring performance losses of up to 37.6%. Decomposing this failure, we show that expert leveraging, rather than identification, is the primary bottleneck. Conversational analysis reveals a tendency toward integrative compromise -- averaging expert and non-expert views rather than appropriately weighting expertise -- which increases with team size and correlates negatively with performance. Interestingly, this consensus-seeking behavior improves robustness to adversarial agents, suggesting a trade-off between alignment and effective expertise utilization. Our findings reveal a significant gap in the ability of self-organizing multi-agent teams to harness the collective expertise of their members.

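Research motivation and objectives

Investigate whether self-organizing, heterogeneous LLM teams can achieve strong synergy, matching or exceeding their best member.
Examine whether the shortfall comes from failing to identify the expert or from under-using expertise during the conversation.
Identify structural and interactional factors that correlate with the absence of strong synergy in self-organizing AI teams.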

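Proposed method

Recreate classic human teamwork tasks (NASA Moon Survival, Lost at Sea, Student Body President) with AI agents, under experimentally controlled distributions of expertise.
Evaluate frontier ML benchmarks with naturally distributed expertise (MMLU Pro, GPQA Diamond, HLE, MATH-500, SimpleQA).
Decompose the performance gap by comparing conditions in which the expert is not revealed, is revealed, or is represented by the best individual.
Measure performance with L1 error on the ranking tasks and with the relative synergy gap across configurations.
Run ablations that separate expert identification from expert leveraging.
Analyze the conversations to link the dynamics of epistemic deference and integrative compromise to performance outcomes.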
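
To make the ranking metric concrete, here is a minimal sketch of an L1 error between a submitted ranking and a reference ranking. The item names and both rankings are hypothetical, and the paper may normalize or aggregate the error differently.

```python
def l1_ranking_error(submitted, reference):
    """Sum of absolute rank differences over all items (lower is better)."""
    if sorted(submitted) != sorted(reference):
        raise ValueError("both rankings must contain the same items")
    # Position of each item in the reference ranking.
    ref_pos = {item: i for i, item in enumerate(reference)}
    return sum(abs(i - ref_pos[item]) for i, item in enumerate(submitted))


# Hypothetical five-item survival ranking, most important item first.
reference = ["oxygen", "water", "map", "food", "radio"]
team = ["water", "oxygen", "map", "radio", "food"]

print(l1_ranking_error(team, reference))  # 4
```

Two adjacent swaps each move two items by one position, so the error is 4. An identical ranking scores 0.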

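Experimental results

Research questions

RQ1: Can heterogeneous LLM teams autonomously produce strong synergy, matching or exceeding their strongest member?
RQ2: Is the shortfall caused by a failure to identify the expert, or by a failure to leverage the expert once identified?
RQ3: Which structural and interactional factors, such as team size and negotiation style, correlate with the absence of strong synergy?

Key findings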

Table 1: Relative Synergy Gaps across Human Psychology Tasks
NASA Moon Survival	78.7% ± 11.6%	81.8% ± 12.9%	113.4% ± 19.0%	110.1% ± 19.0%
Lost at Sea	55.6% ± 8.4%	58.6% ± 11.5%	50.1% ± 8.3%	42.1% ± 6.9%
Student Body President	98.7% ± 19.3%	73.5% ± 17.6%	66.0% ± 16.6%	17.3% ± 17.7%

Table 2: Performance on ML benchmarks
SimpleQA	50.0%	54.0%	61.5%	18.7%
GPQA Diamond	74.0%	82.0%	88.5%	16.4%
HLE Text-Only	29.0%	35.0%	46.5%	37.6%
MATH-500	67.0%	73.0%	79.0%	15.2%
MMLU Pro	85.0%	89.0%	92.5%	8.1%
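
The last column of the ML benchmark rows is consistent with a relative gap computed as (third column - first column) / third column. The sketch below checks that arithmetic against the values above. Reading the first column as the team's accuracy and the third as the expert's accuracy is an assumption, since the column headers are not given here, and `relative_gap` is a name introduced only for this illustration.

```python
def relative_gap(team, expert):
    """Relative shortfall of the team versus the expert, in percent."""
    return 100.0 * (expert - team) / expert


# (first column, third column, last column) copied from the ML benchmark rows.
# Interpreting these as (team, expert, reported gap) is an assumption.
rows = {
    "SimpleQA": (50.0, 61.5, 18.7),
    "GPQA Diamond": (74.0, 88.5, 16.4),
    "HLE Text-Only": (29.0, 46.5, 37.6),
    "MATH-500": (67.0, 79.0, 15.2),
    "MMLU Pro": (85.0, 92.5, 8.1),
}

for name, (team, expert, reported) in rows.items():
    computed = relative_gap(team, expert)
    print(f"{name}: computed {computed:.1f}%, reported {reported:.1f}%")
```

Under this reading, the computed value matches the reported one to one decimal place for all five benchmarks.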

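LLM teams consistently fail to match their best member, with relative synergy gaps ranging from 8.1% to 37.6% across the ML benchmarks.
The main bottleneck is leveraging the expert; identifying the expert does not by itself yield large gains.
Teams exhibit integrative compromise, tending to average expert and non-expert views; this correlates negatively with performance and worsens as team size grows.
Consensus-seeking behavior makes teams more robust to adversarial agents, which points to a trade-off between expertise utilization and adversarial robustness.
Expertise dilution increases with team size, lowering team performance relative to the expert (significant correlation, p<0.05).
On the psychology tasks, teams fall well short of the expert even with prompts optimized to encourage deferring to the expert, whether or not the team is told who the expert is (e.g., Lost at Sea, concentrated expertise: a relative synergy gap of about 55.6% in the Expert Not Mentioned condition).
On the ML benchmarks, relative synergy gaps range from 8.1% (MMLU Pro) to 37.6% (HLE Text-Only), even when the best expert for each question is known.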
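
As an illustration of why averaging can dilute expertise as teams grow, the following toy model is a sketch, not the paper's analysis. It assumes one expert whose estimate equals the true value, non-experts who all share the same fixed bias, and a team answer formed as an unweighted mean. All numbers are hypothetical.

```python
def team_error_when_averaging(team_size, non_expert_bias=10.0):
    """Absolute error of an unweighted mean with exactly one perfect expert.

    Toy assumptions: the expert is exactly right, and every non-expert is
    off by the same amount in the same direction.
    """
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    truth = 0.0
    estimates = [truth] + [truth + non_expert_bias] * (team_size - 1)
    team_answer = sum(estimates) / len(estimates)
    return abs(team_answer - truth)


for n in (2, 4, 8):
    print(n, team_error_when_averaging(n))
# 2 5.0
# 4 7.5
# 8 8.75
```

In this toy setting, deferring to the expert would give zero error, while the averaged answer drifts toward the non-experts' shared bias as the team grows.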

This review was generated by AI and checked by a human editor.