QUICK REVIEW

[Paper Review] Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du, Shuang Li|arXiv (Cornell University)|May 23, 2023

Topic Modeling | Cited by 85

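One-Sentence Summary

The paper proposes a multiagent debate framework in which multiple LLM instances generate, critique, and debate answers over several rounds, yielding significant gains over single-model baselines on reasoning and factuality tasks.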

ABSTRACT

Large language models (LLMs) have demonstrated remarkable capabilities in language generation, understanding, and few-shot learning in recent years. An extensive body of work has explored how their performance may be further improved through the tools of prompting, ranging from verification, self-consistency, or intermediate scratchpads. In this paper, we present a complementary approach to improve language responses where multiple language model instances propose and debate their individual responses and reasoning processes over multiple rounds to arrive at a common final answer. Our findings indicate that this approach significantly enhances mathematical and strategic reasoning across a number of tasks. We also demonstrate that our approach improves the factual validity of generated content, reducing fallacious answers and hallucinations that contemporary models are prone to. Our approach may be directly applied to existing black-box models and uses identical procedure and prompts for all tasks we investigate. Overall, our findings suggest that such "society of minds" approach has the potential to significantly advance the capabilities of LLMs and pave the way for further breakthroughs in language generation and understanding.

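Research Motivation and Goals

Motivate: Address hallucinations and reasoning errors in large language models (LLMs).
Propose: A multiagent debate framework in which multiple LLM instances generate answers and critique one another's solutions.
Demonstrate: Improved reasoning and factual accuracy across diverse tasks, using only black-box access to the models.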

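Proposed Method

Instantiate multiple LLM agents, either identical or mixed, for a given task.
Each agent independently generates a candidate answer.
Agents read and critique the other agents' responses, converging toward a consensus over multiple rounds.
Prompts control the debate length and how stubborn each agent is, which affects convergence.
The approach is shown to be orthogonal to other prompting methods and is combined with step-by-step reasoning prompts.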
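The steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Agent` interface, the prompt wording in `build_debate_prompt`, the final majority vote, and the toy `make_stub_agent` (a stand-in for a real black-box model call) are all hypothetical.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical black-box interface: a prompt goes in, an answer string comes out.
Agent = Callable[[str], str]

RESPONSE_PREFIX = "- Agent response: "


def build_debate_prompt(question: str, other_answers: List[str]) -> str:
    """Assemble an illustrative prompt that shows the other agents' answers."""
    listed = "\n".join(RESPONSE_PREFIX + answer for answer in other_answers)
    return (
        f"Question: {question}\n"
        f"These are the responses from other agents:\n{listed}\n"
        "Using these responses as additional information, give an updated answer."
    )


def run_debate(agents: List[Agent], question: str, rounds: int) -> List[List[str]]:
    """Return every agent's answer per round; round 0 is answered independently."""
    history = [[agent(f"Question: {question}") for agent in agents]]
    for _ in range(rounds):
        previous = history[-1]
        current = []
        for i, agent in enumerate(agents):
            # Each agent sees the previous-round answers of all *other* agents.
            others = previous[:i] + previous[i + 1:]
            current.append(agent(build_debate_prompt(question, others)))
        history.append(current)
    return history


def final_answer(history: List[List[str]]) -> str:
    """Assumed aggregation: majority vote over the last round's answers."""
    return Counter(history[-1]).most_common(1)[0][0]


def make_stub_agent(initial: str) -> Agent:
    """Toy stand-in for an LLM: votes with its initial answer plus what it is shown."""
    def agent(prompt: str) -> str:
        seen = [
            line[len(RESPONSE_PREFIX):]
            for line in prompt.splitlines()
            if line.startswith(RESPONSE_PREFIX)
        ]
        return Counter([initial] + seen).most_common(1)[0][0]
    return agent


if __name__ == "__main__":
    agents = [make_stub_agent("12"), make_stub_agent("12"), make_stub_agent("15")]
    history = run_debate(agents, "What is 3 * 4?", rounds=2)
    print(history, final_answer(history))
```

In this toy run the dissenting stub adopts the majority answer after one round. With real models, each stub would be replaced by a call to the model, and the prompt wording would be one of the design choices that affects convergence.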

Figure 1 : Multiagent Debate Improves Reasoning and Factual Accuracy. Accuracy of traditional inference and our multi-agent debate over six benchmarks (chess move optimality reported as a normalized score)

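Experimental Results

Research Questions

RQ1: Does multiagent debate improve reasoning compared with single-agent baselines?
RQ2: Does the debate framework improve factual accuracy and reduce hallucinations across diverse tasks?
RQ3: Which design choices (number of agents, number of rounds, prompts) optimize performance?
RQ4: Is the method compatible with other prompting strategies and model types?
RQ5: Can the method produce a robust consensus even when individual agents are uncertain or wrong?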

Key Findings

Model	Arithmetic (%) ↑	Grade School Math (%) ↑	Chess (ΔPS) ↑
Single Agent	67.0 ± 4.7	77.0 ± 4.2	91.4 ± 10.6
Single Agent (Reflection)	72.1 ± 4.5	75.0 ± 4.3	102.1 ± 11.9
Multi-Agent (Majority)	69.0 ± 4.6	81.0 ± 3.9	102.2 ± 6.2
Multi-Agent (Debate)	81.8 ± 2.3	85.0 ± 3.5	122.9 ± 7.6
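To make the size of the improvement concrete, the short script below computes the absolute difference in mean score between debate and each other method. It is only a reading aid: the means are copied from the table above, the ± terms are ignored, and the dictionary keys (`arithmetic`, `gsm`, `chess`) and the helper `gain_over` are names chosen here for illustration.

```python
# Mean scores copied from the table above (the ± terms are omitted here).
# Arithmetic and grade school math are percentages; chess is the ΔPS score.
SCORES = {
    "Single Agent": {"arithmetic": 67.0, "gsm": 77.0, "chess": 91.4},
    "Single Agent (Reflection)": {"arithmetic": 72.1, "gsm": 75.0, "chess": 102.1},
    "Multi-Agent (Majority)": {"arithmetic": 69.0, "gsm": 81.0, "chess": 102.2},
    "Multi-Agent (Debate)": {"arithmetic": 81.8, "gsm": 85.0, "chess": 122.9},
}


def gain_over(baseline: str, method: str = "Multi-Agent (Debate)") -> dict:
    """Absolute difference in mean score between `method` and `baseline`."""
    return {
        task: round(SCORES[method][task] - SCORES[baseline][task], 1)
        for task in SCORES[method]
    }


if __name__ == "__main__":
    for baseline in SCORES:
        if baseline != "Multi-Agent (Debate)":
            print(baseline, gain_over(baseline))
```

Against the single-agent baseline, this gives a gain of 14.8 points on arithmetic, 8.0 points on grade school math, and 31.5 ΔPS on chess.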

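Compared with the single-agent baseline and reflection, multiagent debate markedly improves reasoning on arithmetic, GSM8K (grade school math), and chess move prediction. (Debate scores: arithmetic 81.8 ± 2.3; GSM8K 85.0 ± 3.5; chess 122.9 ± 7.6 on the ΔPS metric, versus 67.0 ± 4.7, 77.0 ± 4.2, and 91.4 ± 10.6 for the single-agent baseline)
Debate also improves factual accuracy on the biographies, MMLU, and chess move validity tasks, outperforming reflection and the single-agent approach. (Biographies: 73.8 ± 2.3; MMLU: 71.1 ± 4.6; chess move validity: 45.2 ± 2.9)
Increasing the number of agents and debate rounds generally improves performance, with diminishing returns beyond a certain point.
Longer debate prompts slow convergence but may yield a higher-quality consensus.
Different initialization prompts (agent personas) bring further gains on some tasks.
Debate lets agents converge on a consensus even when the initial answers are wrong, and it can reduce the inclusion of uncertain facts.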

Figure 2 : Illustration of Debate. Illustration of the debate procedure.


This review was generated by AI and reviewed by human editors.