QUICK REVIEW

[Paper Review] Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Gu|arXiv (Cornell University)|May 24, 2022
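Topic Modeling | Cited by 1,101
One-Sentence Summary

A single fixed prompt, "Let's think step by step," elicits zero-shot chain-of-thought reasoning across diverse tasks, substantially improving on standard zero-shot prompting and approaching few-shot chain-of-thought performance on several benchmarks.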

ABSTRACT

Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetics (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with large InstructGPT model (text-davinci-002), as well as similar magnitudes of improvements with another off-the-shelf large model, 540B parameter PaLM. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted by simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.

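Motivation and Objectives

  • Encourage research into the zero-shot reasoning abilities of large language models (LLMs).
  • Assess whether a single, task-agnostic prompt can elicit multi-step reasoning without any in-task exemplars.
  • Quantify the performance gains on diverse reasoning benchmarks (arithmetic, symbolic, commonsense, logical).
  • Evaluate how model scale affects the zero-shot prompting method and how well the method generalizes.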

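Proposed Method

  • Introduce Zero-shot-CoT: a single, fixed prompt that induces chain-of-thought reasoning without in-task exemplars.
  • Use a two-stage prompting process: first elicit a reasoning path, then extract the final answer from the reasoning text.
  • Insert a simple trigger sentence (e.g. "Let's think step by step") before each answer to prompt step-by-step reasoning.
  • Use an answer-extraction prompt to parse the final answer from the model output.
  • Evaluate on 12 datasets covering arithmetic, commonsense, symbolic, and other logical reasoning tasks, using greedy decoding for deterministic outputs.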

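Experimental Results

Research Questions

  • RQ1: Can a single zero-shot prompt elicit multi-step reasoning across very different reasoning tasks?
  • RQ2: How does Zero-shot-CoT prompting compare with standard zero-shot prompting on arithmetic, symbolic, commonsense, and logical benchmarks?
  • RQ3: Does model scale affect the effectiveness of zero-shot chain-of-thought prompting?
  • RQ4: Is Zero-shot-CoT a versatile baseline that works across many tasks without per-task prompt engineering?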

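Key Findings

  • Zero-shot-CoT significantly outperforms standard zero-shot prompting on many arithmetic and symbolic tasks, such as MultiArith and GSM8K.
  • The gains over the zero-shot baseline are large: with InstructGPT (text-davinci-002), MultiArith accuracy rises from 17.7% to 78.7% and GSM8K accuracy from 10.4% to 40.7%.
  • Improvements of similar magnitude are observed with PaLM (540B) on arithmetic benchmarks.
  • Few-shot-CoT remains stronger when carefully designed exemplars are available, but Zero-shot-CoT provides a strong, broadly applicable zero-shot baseline.
  • Model scale amplifies the benefit of chain-of-thought prompting: smaller models see limited gains, whereas larger models improve substantially under Zero-shot-CoT.
  • Zero-shot-CoT produces plausible reasoning paths across tasks even when the final answer is wrong, which hints that broader cognitive capabilities are being tapped.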
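To put the reported accuracies in perspective, the short sketch below computes the absolute gain (in percentage points) and the improvement ratio for the two benchmarks quoted above. The accuracy figures come from the summary; the dictionary layout and helper names are illustrative only.

```python
# (zero-shot accuracy %, Zero-shot-CoT accuracy %) as quoted in the summary.
REPORTED = {
    "MultiArith": (17.7, 78.7),
    "GSM8K": (10.4, 40.7),
}


def absolute_gain(baseline: float, cot: float) -> float:
    """Gain in percentage points, rounded to one decimal place."""
    return round(cot - baseline, 1)


def improvement_ratio(baseline: float, cot: float) -> float:
    """How many times higher the Zero-shot-CoT accuracy is than the baseline."""
    return round(cot / baseline, 1)


if __name__ == "__main__":
    for name, (baseline, cot) in REPORTED.items():
        print(name, absolute_gain(baseline, cot), improvement_ratio(baseline, cot))
    # MultiArith: +61.0 points, about 4.4x
    # GSM8K: +30.3 points, about 3.9x
```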

This review was generated by AI and reviewed by human editors.