QUICK REVIEW

[Paper Review] Large Language Models are Zero-Shot Reasoners

Takeshi Kojima, Shixiang Gu | arXiv (Cornell University) | 2022-05-24

Topic Modeling | Citations: 1,101

One-Line Summary

A single fixed prompt, "Let's think step by step," enables zero-shot chain-of-thought reasoning across diverse tasks, yielding large gains over standard zero-shot prompting and, on some benchmarks, approaching few-shot CoT performance.

ABSTRACT

Pretrained large language models (LLMs) are widely used in many sub-fields of natural language processing (NLP) and generally known as excellent few-shot learners with task-specific exemplars. Notably, chain of thought (CoT) prompting, a recent technique for eliciting complex multi-step reasoning through step-by-step answer examples, achieved the state-of-the-art performances in arithmetics and symbolic reasoning, difficult system-2 tasks that do not follow the standard scaling laws for LLMs. While these successes are often attributed to LLMs' ability for few-shot learning, we show that LLMs are decent zero-shot reasoners by simply adding "Let's think step by step" before each answer. Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetics (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with large InstructGPT model (text-davinci-002), as well as similar magnitudes of improvements with another off-the-shelf large model, 540B parameter PaLM. The versatility of this single prompt across very diverse reasoning tasks hints at untapped and understudied fundamental zero-shot capabilities of LLMs, suggesting high-level, multi-task broad cognitive capabilities may be extracted by simple prompting. We hope our work not only serves as the minimal strongest zero-shot baseline for the challenging reasoning benchmarks, but also highlights the importance of carefully exploring and analyzing the enormous zero-shot knowledge hidden inside LLMs before crafting finetuning datasets or few-shot exemplars.

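Motivation and Objectives

Encourage investigation of the zero-shot reasoning abilities of large language models (LLMs).
Evaluate whether a single task-agnostic prompt can elicit multi-step reasoning without task-specific exemplars.
Quantify performance gains across diverse reasoning benchmarks, including arithmetic, symbolic, commonsense, and logical reasoning.
Assess the effect of model size on the zero-shot prompting approach, and the approach's generality.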

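Proposed Method

Introduce Zero-shot-CoT: a single fixed prompt that elicits chain-of-thought reasoning without task-specific exemplars.
Use a two-stage prompting process: first elicit a reasoning path, then extract the final answer from the reasoning text.
Place a simple trigger sentence (e.g., "Let's think step by step") before each answer to induce step-by-step reasoning.
Use an answer-extraction prompt to parse the final answer from the model output.
Evaluate on 12 datasets spanning arithmetic, commonsense, symbolic, and other logical reasoning tasks, using greedy decoding for determinism.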
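The two-stage process described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the `complete` callable, the `Q:`/`A:` prompt layout, the wording of `ANSWER_TRIGGER`, and the `fake_complete` stand-in are all assumptions introduced here. In practice `complete` would wrap a greedy-decoded (temperature 0) call to whatever LLM is being evaluated.

```python
from typing import Callable

# Fixed reasoning trigger quoted in the review.
REASONING_TRIGGER = "Let's think step by step."
# Assumed wording for the answer-extraction trigger; the review does not quote it.
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def build_reasoning_prompt(question: str) -> str:
    """Stage 1 prompt: the question followed by the fixed reasoning trigger."""
    return f"Q: {question}\nA: {REASONING_TRIGGER}"


def build_answer_prompt(question: str, reasoning: str) -> str:
    """Stage 2 prompt: stage 1 prompt, the generated reasoning, then the extraction trigger."""
    return f"{build_reasoning_prompt(question)} {reasoning}\n{ANSWER_TRIGGER}"


def zero_shot_cot(question: str, complete: Callable[[str], str]) -> str:
    """Run both stages with a caller-supplied text-completion function."""
    reasoning = complete(build_reasoning_prompt(question)).strip()
    return complete(build_answer_prompt(question, reasoning)).strip()


def fake_complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call, used only to make the sketch runnable."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 8."
    return " There are 3 + 5 = 8 apples in total."


if __name__ == "__main__":
    question = "Tom has 3 apples and buys 5 more. How many apples does he have?"
    print(zero_shot_cot(question, fake_complete))
```

Passing the model call in as a function keeps the prompt construction independent of any particular API; the same two prompts are reused for every task, with only the extraction trigger likely to vary by answer format.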

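Experimental Results

Research Questions

RQ1: Can a single zero-shot prompt elicit multi-step reasoning across very different reasoning tasks?
RQ2: Compared with standard zero-shot prompting, how does the Zero-shot-CoT prompt affect performance on arithmetic, symbolic, commonsense, and logical benchmarks?
RQ3: Does model size affect the effectiveness of zero-shot chain-of-thought prompting?
RQ4: Is Zero-shot-CoT a versatile baseline that generalizes across tasks without task-specific prompt engineering?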

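Key Findings

Zero-shot-CoT substantially outperforms standard zero-shot prompting on many arithmetic and symbolic tasks (e.g., MultiArith and GSM8K).
Zero-shot-CoT achieves large gains over the zero-shot baseline (e.g., with InstructGPT (text-davinci-002), MultiArith rises from 17.7% to 78.7% and GSM8K from 10.4% to 40.7%).
Improvements of similar magnitude are observed with PaLM (540B).
Few-shot-CoT remains stronger when carefully crafted exemplars are available, but Zero-shot-CoT provides a strong, broadly applicable zero-shot baseline.
Model scale amplifies the benefit of chain-of-thought prompting: smaller models show only marginal gains, whereas larger models improve substantially with Zero-shot-CoT.
Zero-shot-CoT often produces plausible reasoning paths across tasks, even when the final answer is wrong, suggesting that broad cognitive capabilities are being tapped.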
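To put the reported accuracies in perspective, the short sketch below computes the absolute gain in percentage points and the improvement ratio for the two text-davinci-002 results quoted in the abstract. The `gains` helper and the `reported` dictionary are illustrative names introduced here; the numbers themselves are the ones stated above.

```python
from typing import Dict, Tuple

# Accuracies (%) quoted in the abstract for text-davinci-002:
# (standard zero-shot, Zero-shot-CoT).
reported: Dict[str, Tuple[float, float]] = {
    "MultiArith": (17.7, 78.7),
    "GSM8K": (10.4, 40.7),
}


def gains(zero_shot: float, zero_shot_cot: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, improvement ratio)."""
    absolute = round(zero_shot_cot - zero_shot, 1)
    ratio = round(zero_shot_cot / zero_shot, 2)
    return absolute, ratio


if __name__ == "__main__":
    for name, (base, cot) in reported.items():
        absolute, ratio = gains(base, cot)
        print(f"{name}: +{absolute} points ({ratio}x)")
```

On these figures, MultiArith improves by 61.0 points (about 4.45x) and GSM8K by 30.3 points (about 3.91x).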

This review was generated by AI and reviewed by a human editor.