QUICK REVIEW

[Paper Review] When More is Less: Understanding Chain-of-Thought Length in LLMs

Yuyang Wu, Yifei Wang | arXiv | Feb 11, 2025

Scientific Computing and Data Management | Cited by 3

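One-Sentence Summary

This paper shows that longer chain-of-thought (CoT) reasoning is not always better: an optimal CoT length exists that depends on model capability and task difficulty, a claim supported by both theory and experiments, and the paper proposes Length-filtered Vote to exploit that optimal length at inference time.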

ABSTRACT

Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper challenges that notion, arguing that longer is not always better. Drawing on combined evidence from real-world observations, controlled experiments, and theoretical analysis, we demonstrate that task accuracy typically follows an inverted U-shaped curve with CoT length, where performance initially improves but eventually decreases as the number of CoT steps increases. With controlled experiments, we further uncover the scaling behaviors of the optimal CoT length: it increases with task difficulty but decreases with model capability, exposing an inherent simplicity bias where more capable models favor shorter, more efficient CoT reasoning. This bias is also evident in Reinforcement Learning (RL) training, where models gravitate towards shorter CoTs as their accuracy improves. To have a deep understanding of these dynamics, we establish a simple theoretical model that formally proves these phenomena, including the optimal length's scaling laws and the emergence of simplicity bias during RL. Guided by this framework, we demonstrate significant practical benefits from training with optimally-lengthed CoTs and employing length-aware filtering at inference. These findings offer both a principled understanding of the "overthinking" phenomenon and multiple practical guidelines for CoT calibration, enabling LLMs to achieve optimal reasoning performance with adaptive CoTs tailored to task complexity and model capability.

Research Motivation and Objectives

Investigate how CoT length affects multi-step reasoning in LLMs across model sizes and task difficulties.
Characterize the non-monotonic relationship between CoT length and final accuracy.
Develop a theoretical framework for the existence of an optimal CoT length and its scaling with model capability and task difficulty.
Empirically validate theoretical insights on synthetic arithmetic tasks and real-world datasets (MATH) and show training/inference benefits of optimal CoT lengths.

Proposed Method

Define a controlled synthetic arithmetic task, represented as a binary tree of depth T, in which each step of a t-hop CoT solution has a fixed per-step length t.
Model CoT as an N-step process with t = ceil(T/N) operators per step and insert control tokens to enforce CoT length.
Train GPT-2 variants with different layer counts to study how model capability M affects optimal CoT length.
Formulate a differentiable final-accuracy function A(N) = alpha * ((1 - E(N,M,T)) (1 - sigma(T)))^N and derive the optimal N(M,T) under simplified and extended error models.
Empirically validate on real LLMs (MATH dataset with Qwen2.5 series) and examine training with optimal CoT lengths versus random lengths.
Propose Length-filtered Vote, an inference method that selects among CoT lengths based on prediction uncertainty via entropy across length-based groups.
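
The Length-filtered Vote idea can be illustrated with a minimal Python sketch. The summary only states that the method groups CoTs by length and uses answer entropy as the uncertainty signal; everything else here is an assumption made for illustration: the fixed-width binning (`bin_width`), the rule of keeping the `keep_groups` lowest-entropy groups, the final majority vote, and the function names themselves. It should not be read as the authors' implementation.

```python
import math
from collections import Counter, defaultdict


def answer_entropy(answers):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def length_filtered_vote(samples, bin_width=4, keep_groups=1):
    """Pick a final answer from (cot_length, answer) pairs.

    Assumed procedure (hypothetical, for illustration only):
      1. bucket samples into length groups of width `bin_width`;
      2. score each group by the entropy of its answers;
      3. keep the `keep_groups` groups with the lowest entropy;
      4. majority-vote over the answers in the kept groups.
    """
    groups = defaultdict(list)
    for length, answer in samples:
        groups[length // bin_width].append(answer)

    # Lower entropy means the group agrees with itself more.
    # Ties are broken in favor of the shorter length bin.
    # Caveat: a group holding a single sample trivially has entropy 0.
    ranked = sorted(groups.items(), key=lambda kv: (answer_entropy(kv[1]), kv[0]))

    kept = [a for _, answers in ranked[:keep_groups] for a in answers]
    return Counter(kept).most_common(1)[0][0]


# Made-up samples: short CoTs agree on "42", longer ones scatter.
samples = [
    (4, "42"), (5, "42"), (6, "42"),
    (8, "17"), (9, "18"), (10, "17"), (11, "19"),
    (12, "17"), (13, "17"), (14, "20"), (15, "17"),
]
print(length_filtered_vote(samples, bin_width=4, keep_groups=1))
```

In this made-up example a plain majority vote over all samples would return "17", while the filtered vote trusts only the most self-consistent length group. In practice a minimum group size would probably be needed so that single-sample groups do not win by default.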

Experimental Results

Research Questions

RQ1: Does increasing CoT length consistently improve reasoning performance across model sizes and task difficulties?
RQ2: What is the relationship between model capability, task difficulty, and the optimal CoT length?
RQ3: Can a theoretical framework predict the optimal CoT length, and can it be observed empirically on synthetic and real-world data?
RQ4: Can training or inference procedures leverage optimal CoT length to improve performance, possibly with smaller models?
RQ5: Is a length-aware inference method (Length-filtered Vote) effective in practice across datasets and models?

Key Findings

There is a non-monotonic relationship between CoT length and final accuracy: lengthening the CoT initially improves accuracy but eventually degrades it.
The optimal CoT length increases with task difficulty but decreases with model size; stronger models require fewer steps.
A theoretical framework shows that an optimal N(M,T) exists and depends on model capability and task difficulty, and that accuracy eventually declines as N grows without bound.
On real math problems (MATH) larger models favor shorter optimal CoT lengths, and optimal length correlates with task difficulty.
Training on data with optimal CoT lengths can yield strong performance, sometimes surpassing larger models trained on random CoT lengths.
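
The inverted-U shape and the two scaling trends can be reproduced numerically with a toy instance of the accuracy function A(N) = alpha * ((1 - E(N,M,T)) (1 - sigma(T)))^N. The summary does not give the form of E or sigma, so the sketch below assumes a hypothetical per-step error E = min(1, T / (N * M)) (error grows with the work packed into each step and shrinks with capability) and a constant sigma. These choices are illustrative assumptions, not the paper's error models.

```python
def per_step_error(n_steps, capability, difficulty):
    # Hypothetical error model (an assumption, not from the paper):
    # each step handles difficulty / n_steps operators, and the chance
    # of a mistake scales with that load divided by model capability.
    return min(1.0, difficulty / (n_steps * capability))


def final_accuracy(n_steps, capability, difficulty, sigma=0.02, alpha=1.0):
    # A(N) = alpha * ((1 - E(N, M, T)) * (1 - sigma))^N,
    # with sigma held constant here for simplicity (assumption).
    e = per_step_error(n_steps, capability, difficulty)
    return alpha * ((1.0 - e) * (1.0 - sigma)) ** n_steps


def optimal_length(capability, difficulty, max_steps=64, sigma=0.02):
    # Brute-force search over integer CoT lengths 1..max_steps.
    return max(
        range(1, max_steps + 1),
        key=lambda n: final_accuracy(n, capability, difficulty, sigma),
    )


if __name__ == "__main__":
    for capability, difficulty in [(4, 8), (4, 16), (8, 8)]:
        n_star = optimal_length(capability, difficulty)
        print(
            f"M={capability}, T={difficulty}: N*={n_star}, "
            f"A(N*)={final_accuracy(n_star, capability, difficulty):.3f}"
        )
```

Under these assumptions, accuracy is near zero for very short CoTs (each step is overloaded), peaks at an intermediate length, and then decays because every extra step carries its own failure probability. The peak moves to larger N when T increases and to smaller N when M increases, which matches the direction of the trends reported above.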

This review was generated by AI and reviewed by human editors.