QUICK REVIEW

[Paper Review] When More is Less: Understanding Chain-of-Thought Length in LLMs

Yuyang Wu, Yifei Wang | arXiv.org | 2025-02-11

Scientific Computing and Data Management | Citations: 3

One-line summary

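The paper shows that a longer chain of thought (CoT) is not always better: an optimal CoT length exists, depending on model capability and task difficulty and supported by both theory and experiments, and the authors propose Length-filtered Vote to exploit it at inference time.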

ABSTRACT

Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper challenges that notion, arguing that longer is not always better. Drawing on combined evidence from real-world observations, controlled experiments, and theoretical analysis, we demonstrate that task accuracy typically follows an inverted U-shaped curve with CoT length, where performance initially improves but eventually decreases as the number of CoT steps increases. With controlled experiments, we further uncover the scaling behaviors of the optimal CoT length: it increases with task difficulty but decreases with model capability, exposing an inherent simplicity bias where more capable models favor shorter, more efficient CoT reasoning. This bias is also evident in Reinforcement Learning (RL) training, where models gravitate towards shorter CoTs as their accuracy improves. To have a deep understanding of these dynamics, we establish a simple theoretical model that formally proves these phenomena, including the optimal length's scaling laws and the emergence of simplicity bias during RL. Guided by this framework, we demonstrate significant practical benefits from training with optimally-lengthed CoTs and employing length-aware filtering at inference. These findings offer both a principled understanding of the "overthinking" phenomenon and multiple practical guidelines for CoT calibration, enabling LLMs to achieve optimal reasoning performance with adaptive CoTs tailored to task complexity and model capability.

Research motivation and goals

Investigate how CoT length affects multi-step reasoning in LLMs across model sizes and task difficulties.
Characterize the non-monotonic relationship between CoT length and final accuracy.
Develop a theoretical framework for the existence of an optimal CoT length and its scaling with model capability and task difficulty.
Empirically validate theoretical insights on synthetic arithmetic tasks and real-world datasets (MATH) and show training/inference benefits of optimal CoT lengths.

Proposed method

Define a controlled synthetic arithmetic task, represented as a binary tree of depth T, in which each CoT step processes a fixed number t of operators (a t-hop CoT solution).
Model CoT as an N-step process with t = ceil(T/N) operators per step and insert control tokens to enforce CoT length.
Train GPT-2 variants with different layer counts to study how model capability M affects optimal CoT length.
Derive a differentiable final-accuracy function A(N) = alpha * ((1 - E(N,M,T)) (1 - sigma(T)))^N and the optimal length N(M,T) under simplified and extended error models.
Empirically validate on real LLMs (MATH dataset with Qwen2.5 series) and examine training with optimal CoT lengths versus random lengths.
Propose Length-filtered Vote, an inference method that selects among CoT lengths based on prediction uncertainty via entropy across length-based groups.
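The Length-filtered Vote step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes sampled answers are bucketed by CoT length, each bucket is scored by the Shannon entropy of its final answers, and a majority vote is taken over the lowest-entropy buckets. The names `length_filtered_vote`, `bin_width`, `keep_groups`, and `demo_samples` are hypothetical, and the bucketing rule and default values are assumptions not stated in this review.

```python
import math
from collections import Counter, defaultdict


def answer_entropy(answers):
    """Shannon entropy (in nats) of the final answers within one group."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def length_filtered_vote(samples, bin_width=5, keep_groups=2):
    """Majority vote restricted to the most self-consistent length groups.

    samples: list of (cot_length, final_answer) pairs.
    bin_width, keep_groups: ASSUMED hyperparameters, chosen for illustration.
    """
    # Group answers by CoT length bucket (assumed rule: integer division).
    groups = defaultdict(list)
    for length, answer in samples:
        groups[length // bin_width].append(answer)

    # Keep the groups whose answers agree most (lowest entropy first).
    ranked = sorted(groups.values(), key=answer_entropy)
    kept = [answer for group in ranked[:keep_groups] for answer in group]

    # Plain majority vote over the surviving answers.
    return Counter(kept).most_common(1)[0][0]


# Hypothetical samples: short CoTs agree on "42", long CoTs scatter.
demo_samples = [
    (2, "42"), (3, "42"), (4, "42"),                 # short, unanimous
    (6, "42"), (7, "17"),                            # medium, split
    (10, "17"), (11, "17"), (12, "17"), (13, "17"),  # long, noisy
    (14, "13"), (10, "13"), (11, "9"),
]

if __name__ == "__main__":
    print(length_filtered_vote(demo_samples))
```

On these made-up samples a plain majority vote would pick "17", because the long and noisy group contributes the most answers, while the filtered vote returns "42". One caveat of this sketch is that a group containing a single sample has zero entropy and is therefore always favored; a practical version would likely need a minimum group size.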

Experimental results

Research questions

RQ1: Does increasing CoT length consistently improve reasoning performance across model sizes and task difficulties?
RQ2: What is the relationship between model capability, task difficulty, and the optimal CoT length?
RQ3: Can a theoretical framework predict the optimal CoT length, and can it be observed empirically on synthetic and real-world data?
RQ4: Can training or inference procedures leverage optimal CoT length to improve performance, possibly with smaller models?
RQ5: Is a length-aware inference method (Length-filtered Vote) effective in practice across datasets and models?

Key results

There is a non-monotonic relationship between CoT length and final accuracy: longer CoT can initially improve but eventually degrade performance.
The optimal CoT length increases with task difficulty but decreases with model size; stronger models require fewer steps.
A theoretical framework shows an optimal N(M,T) exists and depends on model capability and task difficulty, with an eventual loss as N grows without bound.
On real math problems (MATH) larger models favor shorter optimal CoT lengths, and optimal length correlates with task difficulty.
Training on data with optimal CoT lengths can yield strong performance, sometimes surpassing larger models trained on random CoT lengths.
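The inverted-U behavior and the scaling of the optimal length can be reproduced qualitatively with a toy version of the accuracy function A(N) = alpha * ((1 - E(N,M,T)) (1 - sigma(T)))^N from the method section. This review does not give the forms of E or sigma, so the sketch below assumes E = min(1, T / (N * M)) and a constant sigma for a fixed task. Both are illustrative assumptions, not the paper's error model, and the function names are hypothetical.

```python
def final_accuracy(n_steps, capability, difficulty, alpha=1.0, sigma=0.01):
    """A(N) = alpha * ((1 - E) * (1 - sigma)) ** N with an ASSUMED error term E."""
    # ASSUMPTION: the per-step error grows with the per-step load T / N and
    # shrinks with model capability M; it is clipped to at most 1.
    step_error = min(1.0, difficulty / (n_steps * capability))
    # ASSUMPTION: sigma is treated as a constant per-step noise rate.
    return alpha * ((1.0 - step_error) * (1.0 - sigma)) ** n_steps


def optimal_length(capability, difficulty, max_steps=200):
    """Brute-force the integer N in [1, max_steps] that maximizes A(N)."""
    return max(
        range(1, max_steps + 1),
        key=lambda n: final_accuracy(n, capability, difficulty),
    )


if __name__ == "__main__":
    # (capability M, difficulty T) pairs chosen only for illustration.
    for capability, difficulty in [(16, 16), (16, 32), (32, 16)]:
        best = optimal_length(capability, difficulty)
        accuracy = final_accuracy(best, capability, difficulty)
        print(f"M={capability} T={difficulty} best N={best} A={accuracy:.3f}")
```

Under these assumptions, too few steps overload each step (E near 1), while too many steps compound the per-step noise, so accuracy peaks at an intermediate N. The peak moves to larger N as T grows and to smaller N as M grows, matching the direction of the findings listed above.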

This review was generated by AI and checked by a human editor.