QUICK REVIEW

[Paper Review] Chain of Thoughtlessness? An Analysis of CoT in Planning

Kaya Stechly, Karthik Valmeekam | arXiv (Cornell University) | May 8, 2024
Educational Tools and Methods | Citations: 5
One-Sentence Summary

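This study shows that chain-of-thought (CoT) prompting rarely generalizes in planning: gains are limited to highly problem-specific prompts and degrade as problem size grows, which suggests that CoT relies on pattern matching rather than on learning general algorithms.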

ABSTRACT

Large language model (LLM) performance on reasoning problems typically does not generalize out of distribution. Previous work has claimed that this can be mitigated with chain of thought prompting (a method of demonstrating solution procedures), with the intuition that it is possible to in-context teach an LLM an algorithm for solving the problem. This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain, and examines the performance of two state-of-the-art LLMs across two axes: generality of examples given in prompt, and complexity of problems queried with each prompt. While our problems are very simple, we only find meaningful performance improvements from chain of thought prompts when those prompts are exceedingly specific to their problem class, and that those improvements quickly deteriorate as the size n of the query-specified stack grows past the size of stacks shown in the examples. We also create scalable variants of three domains commonly studied in previous CoT papers and demonstrate the existence of similar failure modes. Our results hint that, contrary to previous claims in the literature, CoT's performance improvements do not stem from the model learning general algorithmic procedures via demonstrations but depend on carefully engineering highly problem specific prompts. This spotlights drawbacks of chain of thought, especially the sharp tradeoff between possible performance gains and the amount of human labor necessary to generate examples with correct reasoning traces.

Motivation and Objectives

  • Evaluate whether chain-of-thought prompts enable generalizable reasoning in large language models for planning tasks.
  • Assess how prompt specificity and problem size affect CoT effectiveness in planning.
  • Compare performance across GPT-4, GPT-4-Turbo, and Claude-3-Opus on planning tasks.
  • Examine whether CoT prompts teach general algorithms or merely enable pattern matching in planning and related benchmarks.

Proposed Method

  • Use Blocksworld PDDL instances with stack heights ranging from small to large (3–20 blocks).
  • Translate model outputs to PDDL and verify plans with VAL.
  • Test multiple models (GPT-4, GPT-4-Turbo, Claude-3-Opus) under zero-shot, n-shot, and chain-of-thought prompts (including domain-specific and stacking variants).
  • Extend evaluation to scalable synthetic benchmarks (CoinFlip, LastLetterConcatenation, multi-step arithmetic) to assess generalization of CoT.
  • Analyze accuracy as problem class and prompt granularity vary, and examine self-consistency effects.
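To make the plan-verification step concrete, here is a minimal Python sketch of checking a plan on a table-to-stack Blocksworld instance. It is a simplified stand-in written for illustration only: it is not VAL, it does not parse PDDL, and it is not the authors' pipeline. The state representation, the action tuples, and the function names (`apply_plan`, `plan_is_valid`, `table_to_stack_instance`) are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical simplified state: maps each placed block to its support,
# where "table" means the block rests on the table. Not PDDL, not VAL.
State = Dict[str, str]
Action = Tuple[str, ...]  # e.g. ("pickup", "b1") or ("stack", "b1", "b2")


def apply_plan(initial: State, plan: List[Action]) -> Optional[State]:
    """Simulate the plan; return the final state, or None if a step is illegal."""
    on = dict(initial)
    holding: Optional[str] = None

    def clear(b: str) -> bool:
        # A block is clear if it is placed and nothing rests on it.
        return b in on and b not in on.values()

    for name, *args in plan:
        if name == "pickup" and len(args) == 1:
            (x,) = args
            if holding is not None or on.get(x) != "table" or not clear(x):
                return None
            del on[x]
            holding = x
        elif name == "unstack" and len(args) == 2:
            x, y = args
            if holding is not None or y == "table" or on.get(x) != y or not clear(x):
                return None
            del on[x]
            holding = x
        elif name == "putdown" and len(args) == 1:
            (x,) = args
            if holding != x:
                return None
            on[x] = "table"
            holding = None
        elif name == "stack" and len(args) == 2:
            x, y = args
            if holding != x or not clear(y):
                return None
            on[x] = y
            holding = None
        else:
            return None  # unknown action or wrong number of arguments
    return on


def plan_is_valid(initial: State, goal: State, plan: List[Action]) -> bool:
    """A plan is valid if every step is legal and the goal relations hold at the end."""
    final = apply_plan(initial, plan)
    return final is not None and all(final.get(b) == s for b, s in goal.items())


def table_to_stack_instance(n: int) -> Tuple[State, State, List[Action]]:
    """n blocks on the table; the goal is one stack with b1 on b2 on ... on bn."""
    blocks = [f"b{i}" for i in range(1, n + 1)]
    initial = {b: "table" for b in blocks}
    goal = {blocks[i]: blocks[i + 1] for i in range(n - 1)}
    plan: List[Action] = []
    for i in range(n - 2, -1, -1):  # build the stack from the bottom up
        plan += [("pickup", blocks[i]), ("stack", blocks[i], blocks[i + 1])]
    return initial, goal, plan
```

In this sketch, the reference plan returned by `table_to_stack_instance(n)` passes `plan_is_valid` for any stack height, while a plan that tries to build the stack top-down is rejected at the first pickup of a covered block. A model's output would first have to be translated into such action tuples before it could be checked this way.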

Experimental Results

Research Questions

  • RQ1: Does chain-of-thought prompting improve out-of-distribution generalization for planning problems in LLMs?
  • RQ2: How does prompt granularity/generality affect performance as problem size increases in Blocksworld and its variants?
  • RQ3: Do LLMs learn general algorithmic procedures from CoT demonstrations across planning tasks, or is improvement due to pattern matching?
  • RQ4: Do findings extend to scalable synthetic benchmarks commonly used in CoT research?

Key Findings

| Prompt | GPT-4-Turbo | Claude-3-Opus | GPT-4 |
| zero-shot | 19.1% | 9.96% | 3.83% |
| zero-shot CoT | 21% | 10.34% | 4.98% |
| Domain-Specific n-shot | 13.7% | 16.4% | 6.13% |
| Progression Proof CoT | 15.3% | 4.59% | 6.89% |
| Blocksworld Universal Algorithm | 37.1% | 37.1% | 51.3% |
| Problem Class Specific n-shot | 18% | 15.7% | 8.81% |
| Stacking Prompt | 40.6% | 24.5% | 59.3% |
  • CoT prompts yield meaningful improvements only for the narrowest problem distributions; gains vanish as the height of the goal stack grows.
  • Increasing prompt generality often reduces performance even on small problems, and can underperform standard prompting.
  • Self-consistency and other advanced CoT variants show similar or worse results in this setting.
  • Table-to-stack prompts with high specificity can retain improvement, but there is no robust generalization to larger instances.
  • Extending CoT to scalable synthetic benchmarks (CoinFlip, LastLetterConcatenation, arithmetic) reveals similar brittleness, with improvements largely due to syntactic pattern matching rather than learning general algorithms.
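The idea of a "scalable" synthetic benchmark can be sketched as follows: each task takes a size parameter n, so accuracy can be measured as n grows past the sizes shown in the prompt. This is an illustrative Python sketch under stated assumptions, not the authors' generators: the name and word pools, the question wording, and the `solver` callable (a placeholder for an LLM call) are all hypothetical.

```python
import random
from typing import Callable, Dict, Iterable, Tuple

NAMES = ["Ana", "Ben", "Caro", "Dev", "Eli", "Fay"]  # hypothetical name pool
WORDS = ["apple", "river", "stone", "cloud", "tiger", "lemon"]  # hypothetical word pool


def coin_flip_instance(n: int, rng: random.Random) -> Tuple[str, str]:
    """Return (question, answer) for a coin-flip problem with n people."""
    heads = True
    steps = []
    for _ in range(n):
        name = rng.choice(NAMES)
        flips = rng.random() < 0.5
        if flips:
            heads = not heads  # each flip toggles the coin's state
        steps.append(f"{name} {'flips' if flips else 'does not flip'} the coin.")
    question = "A coin is heads up. " + " ".join(steps) + " Is the coin still heads up?"
    return question, "yes" if heads else "no"


def last_letter_instance(n: int, rng: random.Random) -> Tuple[str, str]:
    """Return (question, answer) for last-letter concatenation over n words."""
    words = [rng.choice(WORDS) for _ in range(n)]
    question = f'Take the last letters of the words in "{" ".join(words)}" and concatenate them.'
    return question, "".join(w[-1] for w in words)


def accuracy_by_size(
    solver: Callable[[str], str],
    make_instance: Callable[[int, random.Random], Tuple[str, str]],
    sizes: Iterable[int],
    trials: int,
    seed: int = 0,
) -> Dict[int, float]:
    """Fraction of exactly correct answers at each problem size n."""
    rng = random.Random(seed)
    result = {}
    for n in sizes:
        instances = [make_instance(n, rng) for _ in range(trials)]
        result[n] = sum(solver(q) == a for q, a in instances) / trials
    return result
```

Passing a function that wraps a model call as `solver` and sweeping `sizes` beyond the sizes used in the demonstrations would expose the kind of degradation described above, if it occurs for that model and prompt.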

This review was generated by AI and checked by a human editor.