QUICK REVIEW

[Paper Review] Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Gu | arXiv (Cornell University) | May 24, 2022

Topic Modeling | Citations: 1,101

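One-Line Summary

Prepending the fixed prompt "Let's think step by step" elicits zero-shot chain-of-thought (CoT) reasoning across a variety of tasks, yielding large gains over standard zero-shot prompting and approaching few-shot CoT performance on some benchmarks.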

ABSTRACT

Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetics (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with large InstructGPT model (text-davinci-002), as well as similar magnitudes of improvements with another off-the-shelf large model, 540B parameter PaLM. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted by simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.

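Motivation and Objectives

Encourage investigation of the zero-shot reasoning capabilities of large language models (LLMs).
Evaluate whether a single task-agnostic prompt can elicit multi-step reasoning without task-specific exemplars.
Quantify the performance gains on a range of reasoning benchmarks, including arithmetic, symbolic, commonsense, and logical reasoning.
Assess the effect of model size on the zero-shot prompting approach, as well as its generality.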

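Proposed Method

Introduce Zero-shot-CoT: a single fixed prompt that elicits chain-of-thought reasoning without any in-task exemplars.
Use a two-stage prompting process: first elicit a reasoning path, then extract the final answer from the reasoning text.
Place a simple trigger sentence (e.g., "Let's think step by step") before each answer to prompt step-by-step reasoning.
Use an answer-extraction prompt to parse the final answer from the model output.
Evaluate on 12 datasets spanning arithmetic, commonsense, symbolic, and other logical reasoning tasks, using greedy decoding so that outputs are deterministic.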
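The two-stage process above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `generate` callable, the `Q:`/`A:` prompt layout, and the wording of `ANSWER_TRIGGER` are assumptions made for this example, and `fake_generate` is a hypothetical stand-in for a greedy-decoded model call so that the sketch runs without any API.

```python
from typing import Callable, Tuple

# Trigger sentence quoted in this review.
REASONING_TRIGGER = "Let's think step by step"
# Assumed wording: the review only says that an answer-extraction prompt is used.
ANSWER_TRIGGER = "Therefore, the answer is"


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> Tuple[str, str]:
    """Run two-stage Zero-shot-CoT and return (reasoning, answer)."""
    # Stage 1: elicit a reasoning path with the fixed, task-agnostic trigger.
    reasoning_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = generate(reasoning_prompt).strip()

    # Stage 2: feed the reasoning back and ask for the final answer only.
    extraction_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    answer = generate(extraction_prompt).strip()
    return reasoning, answer


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; returns canned text."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 8."
    return " There are 3 + 5 = 8 apples in total."


if __name__ == "__main__":
    reasoning, answer = zero_shot_cot(
        "Tom has 3 apples and buys 5 more. How many apples does he have?",
        fake_generate,
    )
    print(reasoning)
    print(answer)
```

In practice `generate` would wrap a real model call with sampling disabled (greedy decoding), and the extracted answer string would still need task-specific cleanup, for example keeping only the first number for arithmetic tasks.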

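Experimental Results

Research Questions

RQ1: Can a single zero-shot prompt elicit multi-step reasoning across very different reasoning tasks?
RQ2: Compared with standard zero-shot prompting, how does Zero-shot-CoT prompting affect performance on arithmetic, symbolic, commonsense, and logical benchmarks?
RQ3: Does model size affect the effectiveness of zero-shot chain-of-thought prompting?
RQ4: Can Zero-shot-CoT serve as a general-purpose baseline across diverse tasks without per-task prompt design?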

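Key Findings

Zero-shot-CoT substantially outperforms standard zero-shot prompting on many arithmetic and symbolic tasks (e.g., MultiArith, GSM8K).
With InstructGPT (text-davinci-002), accuracy rises from 17.7% to 78.7% on MultiArith and from 10.4% to 40.7% on GSM8K relative to the zero-shot baseline.
Similar magnitudes of improvement are observed on arithmetic benchmarks with PaLM (540B).
Few-shot CoT remains stronger when carefully crafted exemplars are available, but Zero-shot-CoT provides a strong, broadly applicable zero-shot baseline.
Model size amplifies the benefit of chain-of-thought prompting: smaller models show only limited gains, while the improvement from Zero-shot-CoT grows as models get larger.
Zero-shot-CoT can produce plausible reasoning paths across tasks even when the final answer is wrong, suggesting that broader cognitive capabilities are being elicited.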
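To make the size of the reported improvements concrete, the short sketch below computes the absolute gain in percentage points from the accuracies quoted above for InstructGPT (text-davinci-002). The dictionary layout and the `absolute_gain` helper are illustrative choices for this example only.

```python
# Accuracies (%) quoted in this review: (standard zero-shot, Zero-shot-CoT).
reported = {
    "MultiArith": (17.7, 78.7),
    "GSM8K": (10.4, 40.7),
}


def absolute_gain(zero_shot: float, zero_shot_cot: float) -> float:
    """Gain in percentage points, rounded to one decimal place."""
    return round(zero_shot_cot - zero_shot, 1)


gains = {name: absolute_gain(*pair) for name, pair in reported.items()}
print(gains)  # {'MultiArith': 61.0, 'GSM8K': 30.3}
```

These are differences in percentage points, not relative improvements; the two should not be mixed when comparing against few-shot CoT numbers.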


This review was generated by AI and checked by a human editor.