QUICK REVIEW

[Paper Review] Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

Shuo Liu, Tianle Chen|arXiv (Cornell University)|Jan 29, 2026

Reinforcement Learning in Robotics | Cited by 0

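One-Sentence Summary

This paper proposes two multi-agent actor-critic methods, CoLLM-CC (centralized critic) and CoLLM-DC (decentralized critics), to optimize decentralized LLM collaboration, and compares them with Monte Carlo methods on writing, coding, and game-playing tasks. CoLLM-CC generally outperforms the other methods on long-horizon or sparse-reward tasks, while CoLLM-DC is competitive in dense-reward, short-horizon settings.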

ABSTRACT

Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues, so we developed Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose 2 MAAC approaches, CoLLM-CC with a Centralized Critic and CoLLM-DC with Decentralized Critics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge. Our code is available at https://github.com/OpenMLRL/CoMLRL/releases/tag/v1.3.6.

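Research Motivation and Goals

Motivate and enable decentralized collaboration among multiple LLM agents without the constraint of centralized execution.
Analyze when and why MAAC methods outperform Monte Carlo baselines for fine-tuning.
Propose two MAAC-based frameworks: CoLLM-CC (centralized critic) and CoLLM-DC (decentralized critics).
Evaluate performance across writing, coding, and game-playing domains to identify strengths and limitations.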

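Proposed Method

Develop MAAC methods that optimize decentralized LLM collaboration through RL fine-tuning.
Introduce CoLLM-CC, whose centralized critic estimates the value of the joint history.
Introduce CoLLM-DC, whose decentralized critics each estimate the value of an individual agent's history.
Use Transformer-based history representations, handling long dialogue histories through the KV cache.
Apply a teacher-forcing (TF) forward pass to compute sequence-level probabilities of macro-actions (entire responses).
Provide a theoretical analysis of the MAAC methods, including bias/variance considerations and stability.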
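
The method items above can be illustrated with a minimal sketch. It assumes NumPy, toy logits standing in for the output of a teacher-forced forward pass, made-up critic values, and hypothetical helper names (`sequence_log_prob`, `one_step_advantage`); none of these come from the paper or its released code. The one-step TD advantage is a common actor-critic choice used here for illustration, and the paper's exact estimator may differ. The sketch shows that the log-probability of a whole response (a macro-action) is the sum of its per-token log-probabilities, and that a centralized critic yields one shared advantage from a joint-history value, whereas decentralized critics yield one advantage per agent from individual-history values.

```python
import numpy as np


def sequence_log_prob(logits, token_ids):
    """Log-probability of a full response given teacher-forced logits.

    logits: array of shape (T, V), one row of vocabulary logits per response
        token, as a single teacher-forced forward pass would produce.
    token_ids: the T token ids that were actually generated.
    """
    logits = np.asarray(logits, dtype=np.float64)
    token_ids = np.asarray(token_ids)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Select the log-probability of each generated token, then sum over tokens.
    return float(log_probs[np.arange(len(token_ids)), token_ids].sum())


def one_step_advantage(reward, value, next_value, gamma=1.0, done=False):
    """One-step TD advantage: r + gamma * V(next) - V(current)."""
    bootstrap = 0.0 if done else gamma * next_value
    return reward + bootstrap - value


# Toy setting: two agents and a shared team reward (all numbers are made up).
team_reward = 1.0

# Centralized critic: one value of the joint history, shared by both agents.
v_joint, v_joint_next = 0.4, 0.7
adv_cc = one_step_advantage(team_reward, v_joint, v_joint_next, gamma=0.9)
advantages_cc = [adv_cc, adv_cc]

# Decentralized critics: each agent values only its own history.
v_local = [0.2, 0.6]
v_local_next = [0.5, 0.9]
advantages_dc = [
    one_step_advantage(team_reward, v, v_next, gamma=0.9)
    for v, v_next in zip(v_local, v_local_next)
]

# Policy-gradient surrogate loss for agent 0: -(advantage * log pi(response)).
logits = np.array([[2.0, 0.5, 0.1], [0.3, 1.5, 0.2]])
response = [0, 1]
loss_agent0 = -advantages_cc[0] * sequence_log_prob(logits, response)
```

In both variants each agent's policy only needs its own history at inference time; the critics are used during training only, which is what keeps execution decentralized.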

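Experimental Results

Research Questions

RQ1: Under what conditions do MAAC methods outperform Monte Carlo fine-tuning in decentralized LLM collaboration?
RQ2: How do centralized and decentralized critics affect learning efficiency, convergence, and performance on short-horizon versus long-horizon tasks?
RQ3: What are the trade-offs between CoLLM-CC and CoLLM-DC in sample efficiency, convergence, and scalability across domains?
RQ4: Do CoLLM-CC and CoLLM-DC preserve decentralized execution while gaining the training benefits?
RQ5: How does the history representation (KV cache) affect learning and performance?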

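Key Findings

On dense-reward, short-horizon writing tasks, Monte Carlo methods and CoLLM-DC perform comparably to CoLLM-CC.
On sparse-reward coding tasks and long-horizon Minecraft tasks, CoLLM-CC outperforms both baselines: Monte Carlo methods require substantially more samples, and CoLLM-DC struggles to converge.
Across tasks, CoLLM-CC matches or outperforms Monte Carlo methods and CoLLM-DC, with the largest gap on long-horizon tasks.
CoLLM-DC delivers competitive results in short-horizon, dense-reward settings but may struggle with unstable learning signals and convergence.
The centralized critic (CoLLM-CC), conditioned on the joint history, provides more stable value estimates on challenging tasks.
The study releases its code on GitHub so that the experiments can be reproduced.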
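
The sample-efficiency gap reported above relates to how the learning target is formed. The following is a minimal sketch, assuming a single sparse-reward episode with made-up rewards and critic values; the function names are hypothetical and not taken from the paper's code. A Monte Carlo target at each turn is the full sampled return, so it inherits the randomness of every later turn, while a critic-based one-step target uses only the immediate reward plus the critic's estimate for the next history.

```python
def monte_carlo_returns(rewards, gamma=1.0):
    """Discounted return from each turn to the end of the episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def td_targets(rewards, values, gamma=1.0):
    """One-step bootstrapped targets: r_t + gamma * V(h_{t+1}).

    values[t] is the critic's estimate for the history at turn t; the value
    after the final turn is taken to be 0.
    """
    targets = []
    for t, reward in enumerate(rewards):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        targets.append(reward + gamma * next_value)
    return targets


# Sparse reward: only the last of four turns is rewarded (made-up numbers).
rewards = [0.0, 0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.8, 0.9]  # hypothetical critic estimates per turn

mc = monte_carlo_returns(rewards, gamma=0.9)
td = td_targets(rewards, values, gamma=0.9)
```

The bootstrapped targets trade some bias (they are only as good as the critic) for lower variance, which is the usual motivation for actor-critic methods over Monte Carlo estimates.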

This review was generated by AI and checked by a human editor.