QUICK REVIEW

[Paper Review] Self-Reflection in LLM Agents: Effects on Problem-Solving Performance

Matthew Renze, Erhan Guven|arXiv (Cornell University)|May 5, 2024

Multi-Agent Systems and Negotiation|Cited by 11

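One-Sentence Summary

The paper shows that nine popular large language models significantly improve their multiple-choice question answering (MCQA) performance across models and problem domains when allowed to reflect on their own mistakes, and that more informative reflection types yield larger gains.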

ABSTRACT

In this study, we investigated the effects of self-reflection in large language models (LLMs) on problem-solving performance. We instructed nine popular LLMs to answer a series of multiple-choice questions to provide a performance baseline. For each incorrectly answered question, we instructed eight types of self-reflecting LLM agents to reflect on their mistakes and provide themselves with guidance to improve problem-solving. Then, using this guidance, each self-reflecting agent attempted to re-answer the same questions. Our results indicate that LLM agents are able to significantly improve their problem-solving performance through self-reflection ($p < 0.001$). In addition, we compared the various types of self-reflection to determine their individual contribution to performance. All code and data are available on GitHub at https://github.com/matthewrenze/self-reflection

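Motivation and Objectives

Advance the use of metacognitive self-reflection to improve LLM problem-solving.
Systematically decompose self-reflection into distinct components and evaluate the contribution of each.
Compare multiple LLMs and problem domains to identify where reflection yields the largest gains.
Provide practical guidance for engineering agentic LLM systems that use self-reflection.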

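Proposed Method

Assemble a 1,000-question multi-domain MCQA exam from multiple benchmarks (ARC, AGIEval, HellaSwag, MedMCQA, and others).
Evaluate nine LLMs with a baseline prompt (no self-reflection) to establish a performance baseline.
For each question the Baseline agent answered incorrectly, run eight self-reflection types (Retry, Keywords, Advice, Explanation, Instructions, Solution, Composite, Unredacted) to generate guidance, using the correct answer as feedback.
Inject the self-reflection into the re-answer prompt and re-attempt only the previously incorrect questions.
Redact the answers from the self-reflections (except for the Unredacted agent) to prevent answer leakage.
Compute accuracy as (Baseline correct + re-answered correct) / total Baseline questions, and assess significance with McNemar's test.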
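The scoring rule in the last step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `accuracy_with_reflection`, the `reanswer` callback, and the toy question IDs and labels are all hypothetical. The callback stands in for a self-reflecting agent and is invoked only for questions the baseline got wrong.

```python
from typing import Callable, Dict


def accuracy_with_reflection(
    answer_key: Dict[str, str],
    baseline_answers: Dict[str, str],
    reanswer: Callable[[str], str],
) -> float:
    """Return (baseline correct + re-answered correct) / total questions."""
    baseline_correct = 0
    reanswered_correct = 0
    for question_id, correct_label in answer_key.items():
        if baseline_answers[question_id] == correct_label:
            # Correct baseline answers are kept and never re-attempted.
            baseline_correct += 1
        elif reanswer(question_id) == correct_label:
            # Only previously incorrect questions get a second attempt.
            reanswered_correct += 1
    return (baseline_correct + reanswered_correct) / len(answer_key)


# Toy data (hypothetical): four questions, two wrong at baseline.
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
baseline_answers = {"q1": "A", "q2": "B", "q3": "B", "q4": "A"}
second_attempts = {"q2": "C", "q4": "B"}  # q2 is fixed, q4 is still wrong

toy_accuracy = accuracy_with_reflection(
    answer_key, baseline_answers, second_attempts.__getitem__
)
print(toy_accuracy)  # 0.75
```

Applied to the reported GPT-4 figures, and assuming all 1,000 exam questions are scored, a baseline accuracy of 0.786 plus 41 corrected answers gives 0.827, which matches the Retry row in the results table.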

Figure 1: Diagram of the self-reflection experiment.

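Experimental Results

Research Questions

RQ1: Does self-reflection improve MCQA performance across multiple LLMs?
RQ2: Which types of self-reflection contribute most to the performance gain?
RQ3: How do the gains from reflection vary across problem domains and models?
RQ4: What limitations and leakage risks are associated with self-reflection prompts?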

Key Findings

Agent	Accuracy	Difference	Test statistic	p-value
Baseline	0.786	N/A	N/A	N/A
Retry	0.827	0.041	39.024	<0.001
Keywords	0.832	0.046	44.022	<0.001
Advice	0.840	0.054	52.019	<0.001
Instructions	0.849	0.063	61.016	<0.001
Explanation	0.876	0.090	88.011	<0.001
Solution	0.925	0.139	137.007	<0.001
Composite	0.932	0.146	144.007	<0.001
Unredacted	0.971	0.185	183.005	<0.001
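The test statistics in the table above are numerically consistent with the continuity-corrected McNemar statistic, (|b - c| - 1)^2 / (b + c), under two assumptions that this summary does not state outright: the exam has 1,000 scored questions, and b = 0 because correct baseline answers are never re-attempted and so cannot become wrong. The sketch below recomputes the column from the Difference values. The helper names are hypothetical, and the p-value uses the chi-square survival function with one degree of freedom, written as `erfc(sqrt(x / 2))`.

```python
import math


def mcnemar_statistic(b: int, c: int) -> float:
    """Continuity-corrected McNemar chi-square: (|b - c| - 1)^2 / (b + c).

    b = questions that went from correct to incorrect,
    c = questions that went from incorrect to correct.
    """
    if b + c == 0:
        raise ValueError("McNemar's test needs at least one discordant pair")
    return (abs(b - c) - 1) ** 2 / (b + c)


def chi2_sf_df1(x: float) -> float:
    """Survival function of the chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


TOTAL_QUESTIONS = 1000  # assumption: every exam question is scored

# Accuracy differences versus Baseline, copied from the table.
differences = {
    "Retry": 0.041,
    "Keywords": 0.046,
    "Advice": 0.054,
    "Instructions": 0.063,
    "Explanation": 0.090,
    "Solution": 0.139,
    "Composite": 0.146,
    "Unredacted": 0.185,
}

results = {}
for agent, difference in differences.items():
    corrected = round(difference * TOTAL_QUESTIONS)  # c: wrong -> right
    statistic = mcnemar_statistic(0, corrected)      # b = 0 by assumption
    results[agent] = (round(statistic, 3), chi2_sf_df1(statistic))
    print(f"{agent}\t{statistic:.3f}\t{results[agent][1]:.2e}")
```

Under these assumptions the printed statistics agree with the table to three decimals (for example, 39.024 for Retry and 183.005 for Unredacted), and every p-value falls well below 0.001.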

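All self-reflection types significantly improved accuracy over the baseline on every LLM tested (p < 0.001).
More informative self-reflection types (e.g., Instructions, Explanation, Solution, Composite) yielded larger accuracy gains than lighter-weight types (Retry, Keywords, Advice).
The Unredacted agent achieved the highest accuracy among all GPT-4 agents (0.971), establishing an upper bound when no leakage control is applied.
Across models, LSAT-AR showed the largest improvement, while some domains such as SAT-English showed smaller gains.
Even the simple Retry reflection produced a significant gain, indicating that merely telling the model its previous answer was wrong improves the subsequent attempt.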

Figure 2: All self-reflection types improved the accuracy of GPT-4 agents.

This review was generated by AI and reviewed by a human editor.