QUICK REVIEW

[Paper Review] CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing

Zhibin Gou, Zhihong Shao | arXiv (Cornell University) | May 19, 2023

Topic Modeling | Cited by 57

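One-Sentence Summary

CRITIC enables a frozen LLM to verify and progressively revise its own outputs by interacting with external tools, improving factuality and mathematical program synthesis and reducing toxicity without any additional training.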

ABSTRACT

Recent developments in large language models (LLMs) have been impressive. However, these models sometimes show inconsistencies and problematic behavior, such as hallucinating facts, generating flawed code, or creating offensive and toxic content. Unlike these models, humans typically utilize external tools to cross-check and refine their initial content, like using a search engine for fact-checking, or a code interpreter for debugging. Inspired by this observation, we introduce a framework called CRITIC that allows LLMs, which are essentially "black boxes", to validate and progressively amend their own outputs in a manner similar to human interaction with tools. More specifically, starting with an initial output, CRITIC interacts with appropriate tools to evaluate certain aspects of the text, and then revises the output based on the feedback obtained during this validation process. Comprehensive evaluations involving free-form question answering, mathematical program synthesis, and toxicity reduction demonstrate that CRITIC consistently enhances the performance of LLMs. Meanwhile, our research highlights the crucial importance of external feedback in promoting the ongoing self-improvement of LLMs.

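Motivation and Goals

Reduce inconsistent and unsafe LLM behavior without requiring costly data or fine-tuning.
Enable LLMs to self-verify and correct their outputs through external tools, forming a human-like feedback loop.
Show that tool feedback is essential for reliable self-improvement across multiple tasks.
Demonstrate the generality of the CRITIC framework across different LLMs and tasks.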

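Proposed Method

Proposes CRITIC, a plug-and-play framework in which the LLM first generates an initial output and then interacts with external tools (e.g., a search engine or a code interpreter) to obtain critiques and verify that output.
Uses in-context learning with few-shot prompts, so tool-based verification and iterative correction work without task-specific training.
Applies a verify–correct–verify loop (Algorithm 1) to refine the output iteratively until a stopping condition is met.
Represents critiques as natural-language feedback from tool-augmented verification, which guides subsequent generation.
Shows that correction depends on the initial output, the critique, and the tool results.
Evaluates CRITIC on free-form question answering, mathematical program synthesis, and toxicity reduction across multiple LLMs.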
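The verify–correct–verify loop described above can be illustrated with a minimal Python sketch for the mathematical program synthesis setting, where the external tool is a code interpreter. This is not the paper's Algorithm 1 or its prompts. The function names (`run_program`, `critique`, `critic_loop`), the convention that a program stores its result in a variable called `answer`, the stopping condition (stop once the program runs without error, or after `max_iters` rounds), and the two toy "LLM" callables are all assumptions made for illustration. A real setup would replace the toy callables with few-shot prompted calls to a model.

```python
from typing import Callable, List, Tuple


def run_program(program: str) -> Tuple[bool, str]:
    """Tool step: execute a candidate program and report what happened.

    Assumption: the program stores its result in a variable named `answer`.
    """
    scope: dict = {}
    try:
        exec(program, scope)  # the "code interpreter" tool
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    if "answer" not in scope:
        return False, "the program did not define `answer`"
    return True, repr(scope["answer"])


def critique(ok: bool, tool_output: str) -> str:
    """Turn the raw tool result into natural-language feedback."""
    if ok:
        return f"The program ran and produced answer = {tool_output}."
    return f"The program failed: {tool_output}. Please fix it."


def critic_loop(
    question: str,
    generate: Callable[[str], str],
    correct: Callable[[str, str, str], str],
    max_iters: int = 3,
) -> Tuple[str, List[str]]:
    """Verify the current output with the tool, then correct it if needed."""
    output = generate(question)  # initial output
    history: List[str] = []
    for _ in range(max_iters):
        ok, tool_output = run_program(output)  # verify
        feedback = critique(ok, tool_output)
        history.append(feedback)
        if ok:  # assumed stopping condition
            break
        # Correction is conditioned on the question, the previous output,
        # and the critique (which embeds the tool result).
        output = correct(question, output, feedback)
    return output, history


# Hypothetical stand-ins for an LLM, used only to make the sketch runnable.
def toy_generate(question: str) -> str:
    return "answer = totl / 4"  # contains a typo, so it raises NameError


def toy_correct(question: str, previous: str, feedback: str) -> str:
    return "total = 12\nanswer = total / 4"


if __name__ == "__main__":
    final, history = critic_loop("What is 12 divided by 4?", toy_generate, toy_correct)
    print(final)
    for line in history:
        print(line)
```

In this sketch the frozen model is never updated. Only the text passed back to it changes from one round to the next, which is the sense in which the framework needs no additional training.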

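Experimental Results

Research Questions

RQ1: Can interaction with external tools improve the factuality and quality of black-box LLM outputs without additional training?
RQ2: How does the verify–correct–verify loop affect performance on question answering, mathematical program synthesis, and toxicity reduction?
RQ3: What roles do external feedback and self-correction play in achieving reliable improvement?
RQ4: Do CRITIC's improvements generalize across different base LLMs and tool configurations?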

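Key Findings

CRITIC improves ChatGPT's F1 by 7.7 points across three question answering tasks.
CRITIC achieves an absolute gain of 7.0 percentage points across three mathematical reasoning tasks.
In the toxicity reduction experiments, CRITIC lowers toxicity probability by 79.2%.
CRITIC consistently outperforms existing methods without task-specific training or additional data.
External feedback from tool interaction is essential for reliable self-improvement, whereas self-correction alone can be unreliable.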

This review was generated by AI and reviewed by a human editor.