QUICK REVIEW

[Paper Digest] ART: Automatic multi-step reasoning and tool-use for large language models

Bhargavi Paranjape, Scott Lundberg|arXiv (Cornell University)|Mar 16, 2023

Topic Modeling | Cited by 48

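One-Sentence Summary

ART automatically generates multi-step reasoning programs with integrated tool use for previously unseen tasks, outperforming few-shot prompting and Auto-CoT while remaining extensible and easy for humans to edit.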

ABSTRACT

Large language models (LLMs) can perform complex reasoning in few- and zero-shot settings by generating intermediate chain of thought (CoT) reasoning steps. Further, each reasoning step can rely on external tools to support computation beyond the core LLM capabilities (e.g. search/running code). Prior work on CoT prompting and tool use typically requires hand-crafting task-specific demonstrations and carefully scripted interleaving of model generations with tool use. We introduce Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as a program. Given a new task to solve, ART selects demonstrations of multi-step reasoning and tool use from a task library. At test time, ART seamlessly pauses generation whenever external tools are called, and integrates their output before resuming generation. ART achieves a substantial improvement over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks, and matches performance of hand-crafted CoT prompts on a majority of these tasks. ART is also extensible, and makes it easy for humans to improve performance by correcting errors in task-specific programs or incorporating new tools, which we demonstrate by drastically improving performance on select tasks with minimal human intervention.
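The pause-and-resume behaviour described in the abstract can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Qn: [tool] argument` / `#n: result` line format, the `run_with_tools` function, the `toy_model` stand-in for a frozen LLM, and the `add_integers` tool are all hypothetical names introduced here.

```python
import re
from typing import Callable, Dict

# Hypothetical step format for this sketch: "Q1: [tool_name] argument"
STEP = re.compile(r"^Q(\d+): \[(\w+)\] (.*)$")


def run_with_tools(
    generate: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    prompt: str,
    max_steps: int = 8,
) -> str:
    """Alternate between model generation and tool execution.

    `generate` receives the text so far and returns the next program line.
    """
    text = prompt
    for _ in range(max_steps):
        line = generate(text)
        text += line + "\n"
        match = STEP.match(line)
        if match is None:
            # Not a tool call: treat the line as the final answer and stop.
            break
        index, name, argument = match.groups()
        # Generation is paused here while the tool runs.
        if name in tools:
            result = tools[name](argument)
        else:
            result = f"unknown tool '{name}'"
        # The tool output is appended so the next generation step can read it.
        text += f"#{index}: {result}\n"
    return text


def add_integers(expression: str) -> str:
    """Toy tool (assumption): sums integers separated by '+'."""
    return str(sum(int(part) for part in expression.split("+")))


def toy_model(text: str) -> str:
    """Stand-in for a frozen LLM: requests a calculation, then reads the result back."""
    last_line = text.strip().splitlines()[-1]
    if last_line.startswith("#1: "):
        return "Ans: " + last_line[len("#1: "):]
    return "Q1: [calc] 17+25"


if __name__ == "__main__":
    print(run_with_tools(toy_model, {"calc": add_integers}, "Input: What is 17 + 25?\n"))
```

Because the tool result is written back into the running text, the model's next step can condition on it. Supporting a new tool in this sketch only requires adding an entry to the `tools` dictionary.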

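Motivation and Objectives

Decompose new tasks into multi-step reasoning with tool use in zero-shot or few-shot settings.
Retrieve demonstrations from a task library and use them to guide the LLM in constructing a reasoning program.
Integrate external tools (search, code execution) into the reasoning process, resuming generation once a tool returns its output.
Demonstrate cross-task generalization on BigBench, MMLU, and related tool-use benchmarks, with a focus on arithmetic and algorithmic tasks.
Show how human feedback and updates to the tool and task libraries can further improve performance without retraining the LLM.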

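Proposed Method

Retrieve relevant task demonstrations from a structured task library to form a few-shot prompt.
Represent each decomposition as a program of sub-steps and tool calls, written in a grammar (PeG) inspired by Beurer-Kellner.
Pause generation at each tool call, execute the tool, and resume generation after integrating the tool's output.
Draw on a tool library (search, code execution) to supply external computation during reasoning.
Allow optional human edits to the task or tool library to inject corrections or add new tools, without fine-tuning the model.
Evaluate on BigBench, MMLU, and QA tasks using frozen LLMs (InstructGPT) and a code tool (Codex).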
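The retrieval step at the start of the method can be illustrated with a short sketch. It assumes a simple word-overlap (Jaccard) score as the similarity measure, which is a stand-in chosen for brevity and not the selection strategy used in the paper. The `Demo` class, the `select_demos` and `build_prompt` functions, and the example library entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Demo:
    task: str      # short description of the task
    program: str   # worked multi-step program, including tool calls


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity (assumption: a stand-in for a real retriever)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def select_demos(library: List[Demo], new_task: str, k: int = 2) -> List[Demo]:
    """Return the k library entries whose descriptions best match the new task."""
    ranked = sorted(library, key=lambda d: jaccard(d.task, new_task), reverse=True)
    return ranked[:k]


def build_prompt(demos: List[Demo], new_task: str, new_input: str) -> str:
    """Concatenate the selected demonstrations, then open a program for the new input."""
    parts = [f"Task: {d.task}\n{d.program}" for d in demos]
    parts.append(f"Task: {new_task}\nInput: {new_input}\nQ1:")
    return "\n\n".join(parts)


# Hypothetical task library; a human could correct a program or append an entry here.
LIBRARY = [
    Demo("solve arithmetic word problems",
         "Input: Tom has 3 apples and buys 4 more.\nQ1: [calc] 3+4\n#1: 7\nAns: 7"),
    Demo("answer trivia questions with search",
         "Input: Who wrote Hamlet?\nQ1: [search] author of Hamlet\n"
         "#1: William Shakespeare\nAns: William Shakespeare"),
    Demo("translate english to french",
         "Input: cat\nAns: chat"),
]

if __name__ == "__main__":
    task = "solve arithmetic problems about dates"
    chosen = select_demos(LIBRARY, task, k=1)
    print(build_prompt(chosen, task, "How many days are in 3 weeks?"))
```

Keeping the library as plain data is what makes the human-editing step cheap in this sketch: fixing a faulty demonstration means editing one `program` string, with no change to the model.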

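Experimental Results

Research Questions

RQ1: Without task-specific supervision, can a frozen LLM use demonstrations from the task library to decompose unseen tasks into multi-step reasoning with automatic tool use?
RQ2: Do tool calls integrated into the reasoning chain yield measurable gains on complex tasks over baseline prompting and automatically generated CoT?
RQ3: How does ART perform on cross-task transfer benchmarks (BigBench, MMLU) without task-specific supervision for decomposition or tool use?
RQ4: To what extent can human edits to the task and tool libraries further improve performance without fine-tuning the model?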

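Key Findings

ART consistently matches or outperforms automatically generated CoT on 32 of 34 BigBench tasks and on all MMLU tasks, by more than 22 percentage points on average.
Compared with a no-tool baseline, tool use improves test-time performance by more than 12.3 percentage points on average.
On unseen BigBench and MMLU tasks, ART improves over direct few-shot prompting by 10.8 percentage points on average.
On 12 tasks with human feedback, ART exceeds the best known GPT-3 results by more than 20 percentage points on average.
ART makes human intervention easy: updating the task and tool libraries yields targeted improvements with minimal human effort.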

This digest was generated by AI and reviewed by a human editor.