QUICK REVIEW

[Paper Review] Complexity-Based Prompting for Multi-Step Reasoning

Yao Fu, Hao Peng | arXiv (Cornell University) | 2022-10-03

Topic Modeling | Citations: 73

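One-Line Summary

This paper introduces complexity-based prompting and complexity-based consistency, which favor more complex reasoning chains both when selecting prompt examples and when decoding, and it achieves new state-of-the-art results on several multi-step reasoning benchmarks with GPT-3 and Codex.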

ABSTRACT

We study the task of prompting large-scale language models to perform multi-step reasoning. Existing work shows that when prompted with a chain of thoughts (CoT), sequences of short sentences describing intermediate reasoning steps towards a final answer, large language models can generate new reasoning chains and predict answers for new inputs. A central question is which reasoning examples make the most effective prompts. In this work, we propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning. We show that prompts with higher reasoning complexity, i.e., chains with more reasoning steps, achieve substantially better performance on multi-step reasoning tasks over strong baselines. We further extend our complexity-based criteria from prompting (selecting inputs) to decoding (selecting outputs), where we sample multiple reasoning chains from the model, then choose the majority of generated answers from complex reasoning chains (over simple chains). When used to prompt GPT-3 and Codex, our approach substantially improves multi-step reasoning accuracy and achieves new state-of-the-art (SOTA) performance on three math benchmarks (GSM8K, MultiArith, and MathQA) and two BigBenchHard tasks (Date Understanding and Penguins), with an average +5.3 and up to +18 accuracy improvements. Compared with existing example selection schemes like manual tuning or retrieval-based selection, selection based on reasoning complexity is intuitive, easy to implement, and annotation-efficient. Further results demonstrate the robustness of performance gains from complex prompts under format perturbation and distribution shift.

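Motivation and Objectives

Improve multi-step reasoning by prompting with more complex reasoning chains.
Propose a simple, annotation-efficient method for selecting complex prompt examples.
Extend the approach to decoding through complexity-based consistency, which votes among complex reasoning chains.
Demonstrate robust performance gains across a range of datasets and model types.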

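Proposed Method

Define complexity by the number of reasoning steps in a chain of thought (CoT): samples with more steps count as complex.
Compare against handcrafted and random CoT prompts using GPT-3 and Codex.
Extend to decoding by voting among the top-K most complex chains instead of all sampled chains (complexity-based consistency).
Evaluate on GSM8K, MultiArith, MathQA, Date Understanding, Penguins, and StrategyQA.
Show robustness across prompt distributions and perturbations, and analyze confounders.
Compare with other example selection schemes (random, centroid, retrieval).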
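
The example selection step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes each annotated example is a dict with hypothetical `question`, `chain`, and `answer` fields, and that one reasoning step corresponds to one non-empty line of the chain; the function names and the prompt template are also made up for this sketch.

```python
def count_steps(chain: str) -> int:
    """Count reasoning steps, assuming one step per non-empty line (an assumption)."""
    return sum(1 for line in chain.splitlines() if line.strip())


def select_complex_examples(pool, k):
    """Return the k annotated examples whose chains have the most steps."""
    # sorted() is stable even with reverse=True, so ties keep their pool order.
    return sorted(pool, key=lambda ex: count_steps(ex["chain"]), reverse=True)[:k]


def build_prompt(examples, question):
    """Concatenate the selected examples, then append the new question."""
    shots = [
        f"Question: {ex['question']}\n{ex['chain']}\nThe answer is {ex['answer']}."
        for ex in examples
    ]
    return "\n\n".join(shots + [f"Question: {question}"])


# Toy pool (made-up data, for illustration only).
pool = [
    {"question": "Q1", "chain": "Step A.", "answer": "1"},
    {"question": "Q2", "chain": "Step A.\nStep B.\nStep C.", "answer": "2"},
    {"question": "Q3", "chain": "Step A.\nStep B.", "answer": "3"},
]
chosen = select_complex_examples(pool, k=2)
prompt = build_prompt(chosen, "Q4")
```

Here `chosen` holds the three-step and two-step examples, in that order, and `prompt` ends with the new question. Counting lines is only one way to measure steps; any consistent step counter could be swapped in for `count_steps`.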

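Experimental Results

Research Questions

RQ1: Do prompts with more complex reasoning chains improve multi-step reasoning accuracy compared with simpler prompts?
RQ2: At decoding time, is selecting outputs from the most complex chains better than voting over all chains?
RQ3: Are the gains from complexity-based prompting robust to distribution shift, prompt perturbation, and different complexity proxies?
RQ4: How does complexity-based prompting compare with existing example selection methods (random, centroid, retrieval) on these datasets?
RQ5: Is this improvement an ability that emerges only in very large models?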

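Key Findings

Complex prompts yield substantially higher accuracy than handcrafted or random CoT prompts on GPT-3 and Codex.
Voting among the top-K most complex chains (complexity-based consistency) outperforms voting over all chains and voting over simple chains.
The method achieves new state-of-the-art performance on GSM8K, MultiArith, and MathQA, with strong results on Date Understanding and Penguins; average gains are +5.3 on GPT-3 and +6.2 on Codex.
The gains are robust across prompt distributions (in-distribution, noisy, and distribution shift) and persist under perturbations of the prompt format.
Complexity-based prompting is robust and annotation-efficient relative to retrieval-based selection or methods that use the full training set, and simple quantities such as question length or formula length can serve as proxies for complexity.
The benefit of complexity-based prompting is an ability that emerges in very large models, where it offers a substantial advantage over basic CoT prompts.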
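
To make the difference between plain majority voting and complexity-based consistency concrete, here is a minimal sketch. It is not the authors' code: it assumes the sampled outputs arrive as `(chain, answer)` pairs, that a step is one non-empty line, and the function names and toy samples are invented for illustration.

```python
from collections import Counter


def count_steps(chain: str) -> int:
    """Count reasoning steps, assuming one step per non-empty line (an assumption)."""
    return sum(1 for line in chain.splitlines() if line.strip())


def complexity_based_vote(samples, k):
    """Majority vote over the answers of the k most complex sampled chains."""
    if not samples:
        raise ValueError("need at least one sampled chain")
    # Keep only the k chains with the most reasoning steps.
    top_k = sorted(samples, key=lambda s: count_steps(s[0]), reverse=True)[:k]
    votes = Counter(answer for _, answer in top_k)
    return votes.most_common(1)[0][0]


# Toy samples (made-up chains and answers, for illustration only).
samples = [
    ("s1\ns2\ns3", "18"),
    ("s1\ns2\ns3\ns4", "18"),
    ("s1", "20"),
    ("s1", "20"),
    ("s1\ns2", "20"),
]
vote_top2 = complexity_based_vote(samples, k=2)            # only the two longest chains
vote_all = complexity_based_vote(samples, k=len(samples))  # ordinary majority vote
```

In this toy case the vote over all five chains picks "20", while the vote restricted to the two longest chains picks "18". Setting `k` to the number of samples recovers ordinary majority voting, so `k` is the only extra knob the sketch introduces.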


This review was generated by AI and reviewed by a human editor.