QUICK REVIEW

[Paper Review] To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

Zayne Sprague, Fangcong Yin | arXiv (Cornell University) | September 18, 2024

Computability, Logic, AI Algorithms | Citations: 9

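One-line summary

A meta-analysis and new experiments show that Chain-of-Thought (CoT) prompting helps mainly on math and symbolic reasoning. On non-symbolic tasks it yields little or no benefit, and even on symbolic tasks it can be outperformed by tool-augmented solving. The study recommends using CoT selectively and calls for alternatives that leverage intermediate computation beyond prompting.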

ABSTRACT

Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra "thinking" really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model's response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT's gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.
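The idea of applying CoT selectively can be illustrated with a small prompt router. This is a minimal sketch under stated assumptions, not the authors' method: the function names (`looks_symbolic`, `choose_prompt`), the prompt suffixes, and the trigger rule (an equals sign, or digits joined by an arithmetic operator) are all hypothetical choices made for illustration.

```python
import re

# Hypothetical prompt suffixes; the paper's exact prompts are not reproduced here.
COT_SUFFIX = "Let's think step by step."
DIRECT_SUFFIX = "Answer with the final answer only."

# Assumed heuristic: an equals sign, or two digits joined by an arithmetic operator.
_SYMBOLIC = re.compile(r"=|\d\s*[-+*/^]\s*\d")


def looks_symbolic(question: str) -> bool:
    """Return True when the question shows surface signs of symbolic work."""
    return bool(_SYMBOLIC.search(question))


def choose_prompt(question: str) -> str:
    """Append the CoT suffix only to questions that look symbolic."""
    suffix = COT_SUFFIX if looks_symbolic(question) else DIRECT_SUFFIX
    return f"{question}\n{suffix}"


if __name__ == "__main__":
    print(choose_prompt("If 3x + 2 = 11, what is x?"))
    print(choose_prompt("Which city is the capital of France?"))
```

A surface check like this is crude (a date such as 2024-09-18 would also trigger it), so in practice the routing rule would need validation against held-out accuracy for the model in question.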

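Research motivation and goals

Evaluate when prompt-based CoT improves performance across a broad set of tasks and models.
Quantify the effect of CoT on symbolic, mathematical, logical, and non-symbolic domains through a literature meta-analysis and new experiments.
Analyze the separation of planning and execution to understand where CoT adds value in symbolic reasoning.
Compare CoT with tool-augmented approaches to determine their relative strengths and limitations.
Propose directions for leveraging intermediate computation more effectively, beyond prompt-based CoT.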

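Proposed method

A systematic meta-analysis of 1,218 CoT vs. direct-answer comparisons drawn from 110 papers (14 models, 264 datasets) across the 2024 ICLR, NAACL, and EACL conferences.
Tasks are classified into 14 categories (e.g., symbolic/algorithmic, math, logical reasoning, encyclopedic knowledge, mixed datasets).
Large-scale experiments on 14 contemporary LLMs over 20 datasets under zero-shot and few-shot prompting.
The evaluation compares zero-shot CoT with direct-answer prompting and tracks whether the question or the model's output contains an equals sign (a marker of symbolic operations).
A probing study generates a symbolic plan and separates planning from execution using three configurations: Plan+Direct Solver, Plan+CoT Solver, and Plan+Tool Solver.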
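The separation of planning and execution described above can be sketched as a two-stage pipeline. This is an illustrative sketch only: `toy_planner` is a hypothetical stand-in for an LLM that emits a plan, the plan format (a plain arithmetic expression) is an assumption, and `tool_solver` is a small deterministic evaluator standing in for whatever symbolic solver is actually used. In this framing, the Plan+Direct and Plan+CoT configurations would swap `tool_solver` for an LLM call that executes the same plan.

```python
import ast
import operator
from typing import Callable

# Arithmetic operators the toy solver is allowed to execute.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def tool_solver(plan: str) -> float:
    """Execute an arithmetic plan deterministically by walking its syntax tree."""
    def ev(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported plan element")

    return ev(ast.parse(plan, mode="eval").body)


def toy_planner(question: str) -> str:
    """Hypothetical planner: a real system would ask an LLM to write the plan."""
    return "48 + 48 / 2"


def run_pipeline(
    question: str,
    planner: Callable[[str], str],
    executor: Callable[[str], float],
) -> float:
    """Stage 1 produces a plan; stage 2 executes it with the chosen executor."""
    return executor(planner(question))


if __name__ == "__main__":
    q = "A shop sells 48 pens in April and half as many in May. How many in total?"
    print(run_pipeline(q, toy_planner, tool_solver))
```

Keeping the planner fixed and varying only the executor is what lets the comparison isolate execution quality from planning quality.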

Figure 1: Left: meta-analysis of CoT literature; each point is a reported delta of CoT over direct answering for some (LLM, task) pair. Right: average performance of using zero-shot CoT v.s. direct answer prompts across five general reasoning categories, covering 20 datasets with 14 LLMs evaluated o

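Experimental results

Research questions

RQ1: Which task types (symbolic, mathematical, logical, non-symbolic) benefit from Chain-of-Thought prompting?
RQ2: How much does CoT improve performance across datasets and models, and how does it compare with direct-answer prompting?
RQ3: Can separating planning from execution (with an external tool) outperform CoT on symbolic reasoning tasks?
RQ4: Is CoT cost-effective in terms of inference cost compared with alternatives?
RQ5: What are the implications for moving beyond prompt-based CoT toward more integrated reasoning paradigms?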

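Key results

CoT provides substantial gains mainly on math, symbolic reasoning, and logical reasoning tasks.
Across both the literature and the experiments, non-symbolic tasks see little or no benefit from CoT, and direct-answer prompting performs comparably on many of them.
On MMLU, up to 95% of the CoT gain comes from questions or outputs that contain an equals sign, that is, cases involving symbolic reasoning.
Separating planning from execution shows that CoT improves execution, but an external symbolic solver executing the same plan can outperform CoT.
Tool-augmented solving (Plan+Tool Solver) often outperforms Plan+CoT in symbolic domains, pointing to the limits of CoT without external tools.
Overall, CoT can be applied selectively to save inference cost while maintaining performance, and the results call for approaches beyond prompt-based CoT that make better use of intermediate computation.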
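The equals-sign result can be made concrete by computing what fraction of the total CoT gain comes from items whose question or CoT output contains "=". The sketch below is an assumption about how such a share could be computed, not the paper's analysis code; the `Item` record, the function names, and the five toy items are made up for illustration and carry no empirical meaning.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Item:
    question: str
    cot_output: str
    direct_correct: bool  # was the direct-answer prediction correct?
    cot_correct: bool     # was the CoT prediction correct?


def has_equals(item: Item) -> bool:
    """True when the question or the CoT output contains an equals sign."""
    return "=" in item.question or "=" in item.cot_output


def equals_gain_share(items: Sequence[Item]) -> Optional[float]:
    """Fraction of the net CoT gain that comes from equals-sign items.

    Returns None when CoT yields no net gain, since the share is undefined.
    """
    gains = [(has_equals(it), int(it.cot_correct) - int(it.direct_correct)) for it in items]
    total = sum(g for _, g in gains)
    if total == 0:
        return None
    return sum(g for eq, g in gains if eq) / total


# Synthetic toy data, invented purely to exercise the function.
TOY_ITEMS = [
    Item("Solve 2x = 6", "x = 3", False, True),
    Item("What is 7 times 8?", "7 * 8 = 56", False, True),
    Item("Who wrote Hamlet?", "Shakespeare wrote it.", True, True),
    Item("Which gas do plants absorb?", "Plants absorb carbon dioxide.", False, True),
    Item("Name the largest ocean.", "The Pacific Ocean.", True, True),
]

if __name__ == "__main__":
    print(equals_gain_share(TOY_ITEMS))
```

The share is computed on net gain (items CoT fixes minus items it breaks), so it can exceed 1 or go negative when CoT hurts accuracy on one of the two subsets.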

Figure 2: Results from our meta-analysis (grey dots) aggregated by paper and category (blue dots).

This review was generated by AI and reviewed by a human editor.