QUICK REVIEW

[Paper Review] Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs

Kiran Tomlinson, Tobias Schnabel|arXiv (Cornell University)|Feb 2, 2026
Advanced Graph Neural Networks | 0 citations
TL;DR

The paper proves $Ω(n)$ lower bounds on chain-of-thought tokens for three BAPO-hard tasks and shows near-linear token scaling in frontier models, highlighting fundamental limits on inference-time reasoning costs.

ABSTRACT

Inference-time scaling via chain-of-thought (CoT) reasoning is a major driver of state-of-the-art LLM performance, but it comes with substantial latency and compute costs. We address a fundamental theoretical question: how many reasoning tokens are required to solve a problem as input size grows? By extending the bounded attention prefix oracle (BAPO) model--an abstraction of LLMs that quantifies the information flow required to solve a task--we prove lower bounds on the CoT tokens required for three canonical BAPO-hard tasks: binary majority, triplet matching, and graph reachability. We show that each requires $Ω(n)$ reasoning tokens when the input size is $n$. We complement these results with matching or near-matching upper bounds via explicit constructions. Finally, our experiments with frontier reasoning models show approximately linear reasoning token scaling on these tasks and failures when constrained to smaller reasoning budgets, consistent with our theoretical lower bounds. Together, our results identify fundamental bottlenecks in inference-time compute through CoT and offer a principled tool for analyzing optimal reasoning length.
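As a reminder of what the asymptotic claim means, the lower bound can be spelled out as follows. The symbol $T(n)$, standing for the minimum number of reasoning tokens needed on inputs of size $n$, is introduced here only for illustration and is not taken from the paper.

```latex
T(n) = \Omega(n)
\quad\Longleftrightarrow\quad
\exists\, c > 0,\ \exists\, n_0 \in \mathbb{N}
\ \text{such that}\
T(n) \ge c \cdot n \ \text{for all } n \ge n_0 .
```

In words: beyond some input size, the reasoning length grows at least proportionally to $n$. Where an upper bound of $O(n)$ also holds, the two together give $T(n) = \Theta(n)$; the abstract states that the upper bounds are matching or near-matching, so this need not hold exactly for every task.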

Motivation & Objective

  • Address the compute and latency costs of inference-time scaling through chain-of-thought (CoT) reasoning in LLMs.
  • Introduce and extend the BAPO model to quantify information flow and reasoning token requirements.
  • Establish lower bounds on CoT tokens for canonical BAPO-hard tasks as input size grows.
  • Provide matching or near-matching upper bounds with explicit constructions.
  • Validate the theoretical findings with experiments on frontier reasoning models, which show approximately linear reasoning token scaling.

Proposed method

  • Extend the bounded attention prefix oracle (BAPO) model to quantify information flow in solving tasks.
  • Derive lower bounds on the number of reasoning tokens required for three canonical BAPO-hard tasks: binary majority, triplet matching, and graph reachability.
  • Provide explicit constructions to obtain matching or near-matching upper bounds on reasoning token requirements.
  • Conduct experiments with frontier reasoning models to observe token scaling and the impact of limited reasoning budgets.
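For readers unfamiliar with the three tasks, here is a minimal Python sketch of plain reference solvers that could be used to label test instances. These are not the BAPO constructions from the paper. The function names, the input encodings, the strict tie-breaking rule for majority, and in particular the exact form of the triplet-matching predicate are assumptions made for illustration; the paper's formal definitions may differ.

```python
from collections import deque


def binary_majority(bits):
    """Return True if the 0/1 sequence has strictly more 1s than 0s.

    Assumption: ties count as False.
    """
    ones = sum(bits)
    return ones > len(bits) - ones


def triplet_match(xs, m):
    """Return True if some three distinct positions sum to 0 modulo m.

    Assumption: this is one common formulation of triplet matching;
    the paper's exact predicate may be different.
    """
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n):
                if (xs[i] + xs[j] + xs[k]) % m == 0:
                    return True
    return False


def reachable(edges, source, target):
    """Return True if target can be reached from source.

    edges is a list of directed (u, v) pairs; the search is a
    standard breadth-first traversal.
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False
```

Each function takes an instance of size roughly n and returns the ground-truth answer, which is the quantity a reasoning model would be asked to produce in an experiment of the kind described above.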

Experimental results

Research questions

  • RQ1: How many reasoning tokens are necessary as input size grows for canonical BAPO-hard tasks?
  • RQ2: Do lower bounds on CoT token requirements hold for binary majority, triplet matching, and graph reachability?
  • RQ3: Can we construct upper bounds that match the lower bounds to characterize the token economy of CoT reasoning?
  • RQ4: Do frontier reasoning models exhibit approximately linear scaling of reasoning tokens consistent with the theoretical bounds?
  • RQ5: What implications do these bounds have for inference-time compute and optimization of CoT strategies?

Key findings

  • Each of the three canonical BAPO-hard tasks requires $Ω(n)$ reasoning tokens when the input size is $n$.
  • Explicit constructions yield matching or near-matching upper bounds for the token requirements.
  • Experiments with frontier reasoning models show approximately linear reasoning token scaling on these tasks.
  • Models fail when constrained to smaller reasoning budgets, which is consistent with the theoretical lower bounds.
  • The work identifies fundamental bottlenecks in inference-time compute through CoT and provides a tool for analyzing optimal reasoning length.
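One simple way to check for approximately linear scaling in one's own measurements is to fit a slope on log-log axes: a slope near 1 suggests linear growth in reasoning tokens. The sketch below uses only the standard library. The function name is hypothetical and the numbers in the usage note are synthetic placeholders, not data or methodology from the paper.

```python
import math


def scaling_exponent(sizes, token_counts):
    """Least-squares slope of log(tokens) against log(input size).

    A result close to 1.0 is consistent with linear scaling,
    close to 2.0 with quadratic scaling, and so on.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(t) for t in token_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance
```

For example, with synthetic counts that grow exactly proportionally to the input size, such as sizes `[10, 20, 40, 80]` and token counts `[120, 240, 480, 960]`, the fitted exponent is 1 up to floating-point error. Real measurements would be noisier, and a slope near 1 is only suggestive evidence, not a proof of the asymptotic bound.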

This review was created by AI and reviewed by human editors.