QUICK REVIEW

[Paper Review] Multi-Step Reasoning with Large Language Models, a Survey

Aske Plaat, Annie Wong|arXiv (Cornell University)|Jul 16, 2024

Natural Language Processing Techniques | Cited by 12

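One-Sentence Summary

This survey reviews prompt-based multi-step reasoning in large language models, proposes a three-stage taxonomy (generation, evaluation, control), and summarizes benchmarks and future research directions.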

ABSTRACT

Large language models (LLMs) with billions of parameters exhibit in-context learning abilities, enabling few-shot learning on tasks that the model was not specifically trained for. Traditional models achieve breakthrough performance on language tasks, but do not perform well on basic reasoning benchmarks. However, a new in-context learning approach, Chain-of-thought, has demonstrated strong multi-step reasoning abilities on these benchmarks. The research on LLM reasoning abilities started with the question whether LLMs can solve grade school math word problems, and has expanded to other tasks in the past few years. This article reviews the field of multi-step reasoning with LLMs. We propose a taxonomy that identifies different ways to generate, evaluate, and control multi-step reasoning. We provide an in-depth coverage of core approaches and open problems, and we propose a research agenda for the near future. We find that multi-step reasoning approaches have progressed beyond math word problems, and can now successfully solve challenges in logic, combinatorial games, and robotics, sometimes by first generating code that is then executed by external tools. Many studies in multi-step methods use reinforcement learning for finetuning, external optimization loops, in-context reinforcement learning, and self-reflection.

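Research Motivation and Goals

Assess how prompt-based methods enable multi-step reasoning in large language models (LLMs).
Provide a taxonomy for generating, evaluating, and controlling the reasoning steps in prompts.
Summarize benchmark progress and identify open problems and a research agenda.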

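Proposed Method

Define a three-stage reasoning pipeline: generate steps, evaluate steps, and control the reasoning process.
Group step-generation methods into hand-written prompts, prompts based on external knowledge, and model-generated prompts.
Survey evaluation strategies, including self-assessment, tool-based validation, and external critics.
Map control strategies from greedy selection to ensemble and search-based methods (such as breadth-first search/depth-first search and reinforcement learning).
Review applications beyond math word problems (coding, autonomous agents) and discuss alignment/grounding.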
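The three stages above can be illustrated with a toy sketch. Everything here is a hypothetical stand-in rather than code from the survey: `generate` replaces an LLM call with two fixed arithmetic moves, `evaluate` scores a partial chain by its distance to a target number, and `control` is a simple beam search in which `beam_width=1` behaves greedily and wider beams move toward breadth-first search.

```python
from typing import List, Tuple

# A state is (current value, reasoning steps taken so far).
State = Tuple[int, Tuple[str, ...]]


def generate(state: State) -> List[State]:
    """Generation stage: propose candidate next steps (stand-in for an LLM call)."""
    value, steps = state
    return [
        (value + 3, steps + (f"{value} + 3 = {value + 3}",)),
        (value * 2, steps + (f"{value} * 2 = {value * 2}",)),
    ]


def evaluate(state: State, target: int) -> float:
    """Evaluation stage: score a partial chain; higher is better."""
    return -abs(target - state[0])


def control(start: int, target: int, beam_width: int, max_depth: int) -> State:
    """Control stage: keep the beam_width best partial chains at each depth."""
    frontier: List[State] = [(start, ())]
    best = frontier[0]
    for _ in range(max_depth):
        candidates = [nxt for s in frontier for nxt in generate(s)]
        candidates.sort(key=lambda s: evaluate(s, target), reverse=True)
        frontier = candidates[:beam_width]
        if evaluate(frontier[0], target) > evaluate(best, target):
            best = frontier[0]
        if best[0] == target:
            break
    return best


greedy = control(start=1, target=10, beam_width=1, max_depth=4)
beam = control(start=1, target=10, beam_width=2, max_depth=4)
print(greedy[0], beam[0], beam[1])
```

In this toy setting the greedy controller commits to a locally best step and overshoots the target, while a beam of width 2 keeps an alternative chain alive and reaches it. That is the kind of error accumulation that search-based control is meant to reduce.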

Figure 1: Taxonomy of LLM-Reasoning Approaches: Prompt Generation, Evaluation, and Control

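Experimental Results

Research Questions

RQ1: Which prompt-based techniques enable effective multi-step reasoning in LLMs across domains?
RQ2: How can the generation, evaluation, and control of reasoning steps be organized to improve performance and robustness?
RQ3: Which benchmarks (such as GSM8K and related datasets) reveal the strengths and limitations of current reasoning methods?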

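Key Findings

Chain-of-thought prompting yields significant performance gains over direct answering on math word problems (e.g., GSM8K).
Zero-shot prompts such as "Let's think step by step" improve reasoning on arithmetic, symbolic, and logical tasks.
Benchmarks differ markedly in difficulty, and current methods perform differently across datasets (GSM8K, ASDiv, MAWPS, SVAMP, AQuA).
Automatically generated prompts can match or even exceed hand-written prompts on several benchmarks.
Diverse reasoning control strategies (self-verification, majority voting, tool-based evaluation, BFS/DFS, reinforcement learning) help mitigate error accumulation.
Reasoning research is connected to self-reflection, metacognition, and paths toward artificial general intelligence.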
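Two of the findings above, the zero-shot trigger phrase and majority voting over several sampled reasoning chains, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the prompt layout, the helper names, and the rule "the last number in a chain is its answer" are all hypothetical choices, and the chains are hard-coded strings rather than real model output.

```python
import re
from collections import Counter
from typing import List, Optional

ZERO_SHOT_TRIGGER = "Let's think step by step."


def build_zero_shot_prompt(question: str) -> str:
    """Append the trigger phrase so the model is nudged to write out its steps."""
    return f"Q: {question}\nA: {ZERO_SHOT_TRIGGER}"


def extract_answer(chain: str) -> Optional[str]:
    """Assumption: treat the last number in a reasoning chain as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", chain)
    return numbers[-1] if numbers else None


def majority_vote(chains: List[str]) -> Optional[str]:
    """Return the most frequent final answer across the sampled chains."""
    answers = [a for a in map(extract_answer, chains) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


# Hard-coded stand-ins for three sampled model outputs.
sampled_chains = [
    "3 + 4 = 7. The answer is 7.",
    "3 * 4 = 12. The answer is 12.",
    "4 + 3 = 7, so the answer is 7",
]
print(build_zero_shot_prompt("What is 3 + 4?"))
print(majority_vote(sampled_chains))
```

Voting only helps when errors across chains are not all the same, so in practice the chains would be sampled with some randomness. That detail is omitted here.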

Figure 2: Example of input and target for supervised learning on a long addition problem of adding two numbers. The carry is recorded in the C: digit. Comments (after #) are not part of the learning target (Nye et al., 2021).
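Since the figure itself is not reproduced here, the idea behind such a scratchpad target can be sketched as follows. The trace format below is an illustrative assumption, not the exact format used by Nye et al.; only the right-to-left digit processing and the `C:` carry field follow the caption. The helper name `addition_scratchpad` is hypothetical and it assumes non-negative integers.

```python
from typing import List


def addition_scratchpad(a: int, b: int) -> List[str]:
    """Return one trace line per digit position (right to left), then the result."""
    xs, ys = str(a)[::-1], str(b)[::-1]  # reversed so index 0 is the ones digit
    lines: List[str] = []
    digits: List[str] = []
    carry = 0
    for i in range(max(len(xs), len(ys))):
        x = int(xs[i]) if i < len(xs) else 0
        y = int(ys[i]) if i < len(ys) else 0
        carry_in = carry
        total = x + y + carry_in
        digit, carry = total % 10, total // 10
        digits.append(str(digit))
        lines.append(f"{x} + {y} + {carry_in} = {total}, write {digit}, C: {carry}")
    if carry:
        digits.append(str(carry))  # leftover carry becomes the leading digit
    lines.append("".join(reversed(digits)))
    return lines


for line in addition_scratchpad(29, 57):
    print(line)
```

Writing out each intermediate digit and carry turns one hard prediction into several easy ones, which is the motivation for training on such targets.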

This review was generated by AI and reviewed by human editors.