QUICK REVIEW

[Paper Review] Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Gu|arXiv (Cornell University)|May 24, 2022

Topic Modeling|Cited by 1,101

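One-Sentence Summary

A single fixed prompt, "Let's think step by step," elicits zero-shot chain-of-thought reasoning across diverse tasks, substantially improving on standard zero-shot prompting and approaching few-shot chain-of-thought performance on several benchmarks.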

ABSTRACT

Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetics (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with large InstructGPT model (text-davinci-002), as well as similar magnitudes of improvements with another off-the-shelf large model, 540B parameter PaLM. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted by simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.

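Motivation and Objectives

Motivate research into the zero-shot reasoning abilities of large language models (LLMs).
Assess whether a single, task-agnostic prompt can elicit multi-step reasoning without in-task examples.
Quantify the performance gains on diverse reasoning benchmarks (arithmetic, symbolic, commonsense, logical).
Assess how model scale affects the zero-shot prompting method and how well the method generalizes.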

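Proposed Method

Introduces Zero-shot-CoT: a single, fixed prompt that induces chain-of-thought reasoning without in-task examples.
Uses a two-stage prompting process: first elicit a reasoning path, then extract the final answer from the reasoning text.
Applies a simple trigger sentence (e.g., "Let's think step by step") before each answer to prompt step-by-step reasoning.
Uses an answer-extraction prompt to parse the final answer from the model output.
Evaluates on 12 datasets covering arithmetic, commonsense, symbolic, and other logical reasoning tasks, using greedy decoding so that outputs are deterministic.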
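The two-stage process described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `complete` callable stands in for any text-completion model, `fake_model` is a hypothetical stub used only so the sketch runs, and the exact wording of `ANSWER_TRIGGER` and the `Q:`/`A:` layout are assumptions, since this review does not quote the answer-extraction prompt.

```python
import re
from typing import Callable, Optional

# Trigger sentence quoted in the paper's abstract.
REASONING_TRIGGER = "Let's think step by step."
# Assumed wording for the answer-extraction prompt (not quoted in this review).
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def build_reasoning_prompt(question: str) -> str:
    """Stage 1 prompt: the question followed by the reasoning trigger."""
    return f"Q: {question}\nA: {REASONING_TRIGGER}"


def build_answer_prompt(reasoning_prompt: str, reasoning: str) -> str:
    """Stage 2 prompt: stage 1 prompt, generated reasoning, then the answer trigger."""
    return f"{reasoning_prompt} {reasoning.strip()}\n{ANSWER_TRIGGER}"


def parse_number(text: str) -> Optional[str]:
    """Return the first number found in the text, or None if there is none."""
    match = re.search(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return match.group(0) if match else None


def zero_shot_cot(question: str, complete: Callable[[str], str]) -> Optional[str]:
    """Run both stages with a caller-supplied completion function."""
    reasoning_prompt = build_reasoning_prompt(question)
    reasoning = complete(reasoning_prompt)            # stage 1: elicit reasoning
    answer_prompt = build_answer_prompt(reasoning_prompt, reasoning)
    return parse_number(complete(answer_prompt))      # stage 2: extract answer


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model, for demonstration only."""
    if prompt.rstrip().endswith(ANSWER_TRIGGER):
        return " 8."
    return "There are 16 balls in total. Half of them are golf balls, so 8."


if __name__ == "__main__":
    question = "A juggler has 16 balls. Half are golf balls. How many golf balls are there?"
    print(zero_shot_cot(question, fake_model))
```

With a real model, `complete` would wrap an API call configured for greedy decoding (for example, temperature set to zero), matching the deterministic setup described above. The number parser suits arithmetic tasks only; multiple-choice or yes/no tasks would need a different extraction rule.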

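Experimental Results

Research Questions

RQ1: Can a single zero-shot prompt elicit multi-step reasoning across very different reasoning tasks?
RQ2: How does Zero-shot-CoT prompting perform against standard zero-shot prompting on arithmetic, symbolic, commonsense, and logical benchmarks?
RQ3: Does model scale affect the effectiveness of zero-shot chain-of-thought prompting?
RQ4: Is Zero-shot-CoT a versatile baseline that works across many tasks without per-task prompt engineering?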

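Key Findings

Zero-shot-CoT significantly outperforms standard zero-shot prompting on many arithmetic and symbolic tasks, such as MultiArith and GSM8K.
With InstructGPT (text-davinci-002), accuracy rises from 17.7% to 78.7% on MultiArith and from 10.4% to 40.7% on GSM8K.
PaLM (540B) shows improvements of similar magnitude on arithmetic benchmarks.
Few-shot-CoT remains stronger when carefully designed exemplars are available, but Zero-shot-CoT provides a strong, broadly applicable zero-shot baseline.
Model scale amplifies the benefit of chain-of-thought prompting: smaller models gain little from it, whereas larger models improve markedly with Zero-shot-CoT.
Zero-shot-CoT also produces plausible reasoning paths across tasks, even when the final answer is wrong, which hints that broader cognitive abilities are being tapped.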
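To put the reported InstructGPT (text-davinci-002) numbers in perspective, the short sketch below computes the absolute gain in percentage points and the ratio of accuracies for each benchmark. The accuracy figures come from the abstract; the `gain` helper and the `RESULTS` table are illustrative names introduced here.

```python
# (zero-shot accuracy %, Zero-shot-CoT accuracy %) as reported in the abstract.
RESULTS = {
    "MultiArith": (17.7, 78.7),
    "GSM8K": (10.4, 40.7),
}


def gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, ratio of after to before)."""
    return round(after - before, 1), round(after / before, 2)


if __name__ == "__main__":
    for name, (before, after) in RESULTS.items():
        points, ratio = gain(before, after)
        print(f"{name}: +{points} points, {ratio}x the zero-shot accuracy")
```

The gains work out to 61.0 points on MultiArith and 30.3 points on GSM8K, which is roughly a fourfold increase in accuracy on each.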

This review was generated by AI and checked by human editors.