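[Paper Review] Chain of Thoughtlessness? An Analysis of CoT in Planning
This study shows that chain-of-thought prompting contributes little to generalization in planning: its gains are confined to prompts that are highly specific to the problem class and deteriorate as problem size grows, suggesting that CoT relies on pattern matching rather than on learning a general algorithm.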
Large language model (LLM) performance on reasoning problems typically does not generalize out of distribution. Previous work has claimed that this can be mitigated with chain of thought prompting, a method of demonstrating solution procedures, with the intuition that it is possible to in-context teach an LLM an algorithm for solving the problem. This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain, and examines the performance of two state-of-the-art LLMs across two axes: generality of examples given in prompt, and complexity of problems queried with each prompt. While our problems are very simple, we only find meaningful performance improvements from chain of thought prompts when those prompts are exceedingly specific to their problem class, and that those improvements quickly deteriorate as the size n of the query-specified stack grows past the size of stacks shown in the examples. We also create scalable variants of three domains commonly studied in previous CoT papers and demonstrate the existence of similar failure modes. Our results hint that, contrary to previous claims in the literature, CoT's performance improvements do not stem from the model learning general algorithmic procedures via demonstrations but depend on carefully engineering highly problem-specific prompts. This spotlights drawbacks of chain of thought, especially the sharp tradeoff between possible performance gains and the amount of human labor necessary to generate examples with correct reasoning traces.
Research Motivation and Objectives
- Evaluate whether chain-of-thought prompts enable generalizable reasoning in large language models for planning tasks.
- Assess how prompt specificity and problem size affect CoT effectiveness in planning.
- Compare performance across GPT-4, GPT-4-Turbo, and Claude-3-Opus on planning tasks.
- Examine whether CoT prompts teach general algorithms or merely enable pattern matching in planning and related benchmarks.
Proposed Method
- Use Blocksworld PDDL instances with stack heights ranging from small to large (3–20 blocks).
- Translate model outputs to PDDL and verify plans with VAL.
- Test multiple models (GPT-4, GPT-4-Turbo, Claude-3-Opus) under zero-shot, n-shot, and chain-of-thought prompts (including domain-specific and stacking variants).
- Extend evaluation to scalable synthetic benchmarks (CoinFlip, LastLetterConcatenation, multi-step arithmetic) to assess generalization of CoT.
- Analyze accuracy as problem class and prompt granularity vary, and examine self-consistency effects.
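The generate-then-verify loop described above can be illustrated with a minimal sketch. This is not VAL and not the authors' pipeline: it is a simplified stand-in that checks a plan against the four standard Blocksworld actions, assuming a dictionary-based state instead of PDDL. The names `step`, `validate_plan`, and `stacking_instance` are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Optional, Tuple

State = Dict[str, str]    # block -> "table" or the block directly beneath it
Action = Tuple[str, ...]  # e.g. ("pick-up", "b1") or ("stack", "b1", "b2")


def is_clear(state: State, block: str) -> bool:
    """A block is clear if it is placed somewhere and nothing rests on it."""
    return block in state and block not in state.values()


def step(state: State, holding: Optional[str],
         action: Action) -> Tuple[State, Optional[str]]:
    """Apply one action; raise ValueError if its preconditions do not hold."""
    name, args = action[0], action[1:]
    nxt = dict(state)
    if name == "pick-up" and len(args) == 1:
        (b,) = args
        if holding is None and state.get(b) == "table" and is_clear(state, b):
            del nxt[b]  # a held block has no position
            return nxt, b
    elif name == "put-down" and len(args) == 1:
        (b,) = args
        if holding == b:
            nxt[b] = "table"
            return nxt, None
    elif name == "stack" and len(args) == 2:
        b, target = args
        if holding == b and is_clear(state, target):
            nxt[b] = target
            return nxt, None
    elif name == "unstack" and len(args) == 2:
        b, below = args
        if (holding is None and below != "table"
                and state.get(b) == below and is_clear(state, b)):
            del nxt[b]
            return nxt, b
    raise ValueError(f"inapplicable action: {action}")


def validate_plan(initial: State, goal: State, plan: List[Action]) -> bool:
    """Return True only if every action applies and the goal holds at the end."""
    state, holding = dict(initial), None
    try:
        for action in plan:
            state, holding = step(state, holding, action)
    except ValueError:
        return False
    return all(state.get(block) == below for block, below in goal.items())


def stacking_instance(n: int) -> Tuple[State, State, List[Action]]:
    """All n blocks start on the table; the goal is a single stack b1 on ... on bn."""
    blocks = [f"b{i}" for i in range(1, n + 1)]
    initial = {b: "table" for b in blocks}
    goal = {blocks[i]: blocks[i + 1] for i in range(n - 1)}
    goal[blocks[-1]] = "table"
    plan: List[Action] = []
    for i in range(n - 2, -1, -1):  # build the stack bottom-up
        plan += [("pick-up", blocks[i]), ("stack", blocks[i], blocks[i + 1])]
    return initial, goal, plan
```

In an evaluation loop, a model's parsed output would take the place of the reference plan returned by `stacking_instance`, and `validate_plan` would decide whether the instance counts as solved. Varying `n` gives the problem-size axis the review refers to.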
Experimental Results
Research Questions
- RQ1: Does chain-of-thought prompting improve out-of-distribution generalization for planning problems in LLMs?
- RQ2: How does prompt granularity/generality affect performance as problem size increases in Blocksworld and its variants?
- RQ3: Do LLMs learn general algorithmic procedures from CoT demonstrations across planning tasks, or is improvement due to pattern matching?
- RQ4: Do findings extend to scalable synthetic benchmarks commonly used in CoT research?
Key Results
| Prompt | GPT-4-Turbo | Claude-3-Opus | GPT-4 |
|---|---|---|---|
| Zero-shot | 19.1% | 9.96% | 3.83% |
| Zero-shot CoT | 21% | 10.34% | 4.98% |
| Domain-specific n-shot | 13.7% | 16.4% | 6.13% |
| Progression proof CoT | 15.3% | 4.59% | 6.89% |
| Blocksworld universal algorithm | 37.1% | 37.1% | 51.3% |
| Problem-class-specific n-shot | 18% | 15.7% | 8.81% |
| Stacking prompt | 40.6% | 24.5% | 59.3% |
- CoT prompts yield meaningful improvements only for the narrowest problem distributions; gains vanish as the height of the goal stack grows.
- Increasing prompt generality often reduces performance even on small problems, and can underperform standard prompting.
- Self-consistency and other advanced CoT variants show similar or worse results in this setting.
- Table-to-stack prompts with high specificity can retain improvement, but there is no robust generalization to larger instances.
- Extending CoT to scalable synthetic benchmarks (CoinFlip, LastLetterConcatenation, arithmetic) reveals similar brittleness, with improvements largely due to syntactic pattern matching rather than learning general algorithms.
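To make "scalable" concrete for the synthetic benchmarks, here is a minimal sketch of generating CoinFlip and LastLetterConcatenation instances of a chosen size `n` and measuring accuracy per size. The prompt wording is an assumption and does not reproduce the paper's templates. `solver` is a hypothetical callable that stands in for an LLM call, and all function names are illustrative.

```python
import random
from typing import Callable, Dict, List, Tuple


def coin_flip_instance(n: int, rng: random.Random) -> Tuple[str, str]:
    """Build a CoinFlip question with n people and its ground-truth answer."""
    flips = [rng.random() < 0.5 for _ in range(n)]
    steps = [
        f"Person {i + 1} {'flips' if flipped else 'does not flip'} the coin."
        for i, flipped in enumerate(flips)
    ]
    question = ("A coin is heads up. " + " ".join(steps)
                + " Is the coin still heads up?")
    # An even number of flips leaves the coin heads up.
    answer = "yes" if sum(flips) % 2 == 0 else "no"
    return question, answer


def last_letter_instance(words: List[str]) -> Tuple[str, str]:
    """Build a LastLetterConcatenation question; size scales with len(words)."""
    question = ('Take the last letters of the words in "'
                + " ".join(words) + '" and concatenate them.')
    return question, "".join(word[-1] for word in words)


def accuracy_by_size(solver: Callable[[str], str], sizes: List[int],
                     trials: int, seed: int = 0) -> Dict[int, float]:
    """Fraction of CoinFlip instances answered correctly at each size n."""
    rng = random.Random(seed)
    results: Dict[int, float] = {}
    for n in sizes:
        correct = 0
        for _ in range(trials):
            question, answer = coin_flip_instance(n, rng)
            correct += solver(question).strip().lower() == answer
        results[n] = correct / trials
    return results
```

Plotting the output of `accuracy_by_size` against `n` would show whether accuracy holds up beyond the sizes shown in the prompt's examples, which is the kind of degradation the results above describe.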
This review was generated by AI and reviewed by a human editor.