QUICK REVIEW

[Paper Review] Fuel Gauge: Estimating Chain-of-Thought Length Ahead of Time in Large Multimodal Models

Yuedong Yang, Xiwen Wei|arXiv (Cornell University)|Mar 11, 2026
Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

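This paper proposes Fuel Gauge, a lightweight predictor that estimates the chain-of-thought (CoT) length of large multimodal models before generation, enabling predictive KV cache allocation and CoT length modulation to improve efficiency and accuracy.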

ABSTRACT

Reasoning Large Multi-modality Models (LMMs) have become the de facto choice for many applications. However, these models rely on a Chain-of-Thought (CoT) process that is lengthy and unpredictable at runtime, often resulting in inefficient use of computational resources (due to memory fragmentation) and sub-optimal accuracy (due to under- and over-thinking). We observe empirically that the CoT process follows a very simple form, whose behavior is independent of the specific generated samples. This suggests that the CoT length can be estimated ahead of time based on a hidden parameter representing the amount of "fuel" available to support the reasoning process. Based on this insight, we propose Fuel Gauge, the first method that extracts this hidden signal and predicts CoT length ahead of time. We demonstrate the utility of Fuel Gauge on two downstream tasks: predictive KV cache allocation, which addresses memory fragmentation in LMM serving systems, and CoT length modulation, which mitigates under-thinking and over-thinking. Extensive experiments on LMMs across text-only, image-text, and video-text question answering benchmarks demonstrate the effectiveness, generalizability, and practical value of our Fuel Gauge. For example, on the GPQA-Diamond benchmark, our Fuel Gauge achieves less than half the CoT length prediction error compared to the baseline; this translates into a 13.37x reduction in the memory allocation frequency.

Motivation and Objectives

  • Address the inefficiencies caused by unpredictable CoT lengths in LMMs: memory fragmentation, and under- and over-thinking.
  • Propose a framework to predict CoT length ahead of time using an internal fuel-level signal.
  • Demonstrate two practical applications: predictive KV cache allocation and CoT length modulation.
  • Validate generalizability across text-only, image-text, and video-text benchmarks.

Proposed Method

  • Characterize CoT length as a Bernoulli-like process and hypothesize predictability from input prompts.
  • Postulate an internal fuel-level signal that decreases as reasoning progresses and can be mapped to a scalar fuel level.
  • Develop a two-stage predictor: Stage 1 extracts the hidden signal S_i and estimates the fuel level r_i; Stage 2 fits a linear model to the fuel levels and extrapolates to the CoT length at which the fuel reaches zero.
  • Implement lightweight neural components: f_sig (1D depth-wise + 1D point-wise conv) and f_fuel (2-layer MLP) with low overhead.
  • Train f_sig and f_fuel on 200 CoT traces from MMLU/MMMU and evaluate using relative mean absolute error (rMAE) against ground truth N.
  • Apply Fuel Gauge to two tasks: predictive KV cache allocation, which reduces the number of memory allocations, and CoT length modulation, which applies gradient-based, normalized updates to h_i to reach a target fuel level.
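
The Stage 2 extrapolation and the rMAE metric described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the assumption that fuel levels start near 1 and fall toward 0, and the exact rMAE formula (mean of |predicted − N| / N) are all assumptions made for illustration; the summary only states that a linear model is fitted and extrapolated to zero fuel.

```python
from typing import Sequence


def extrapolate_cot_length(steps: Sequence[float], fuel: Sequence[float]) -> float:
    """Fit fuel ~ slope * step + intercept by ordinary least squares and
    return the step at which the fitted line reaches zero fuel.

    `steps` are token positions at which the fuel level was estimated and
    `fuel` are the corresponding estimates r_i (hypothetical inputs).
    """
    n = len(steps)
    if n < 2 or n != len(fuel):
        raise ValueError("need at least two (step, fuel) pairs of equal length")
    mean_x = sum(steps) / n
    mean_y = sum(fuel) / n
    sxx = sum((x - mean_x) ** 2 for x in steps)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(steps, fuel))
    if sxx == 0:
        raise ValueError("steps must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    if slope >= 0:
        # A non-decreasing fit never crosses zero, so no length can be inferred.
        raise ValueError("fuel is not decreasing; cannot extrapolate to zero")
    return -intercept / slope


def rmae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Relative mean absolute error, assumed here to be mean(|p - N| / N)."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    # Fuel drops by 0.1 every 100 tokens, so the fitted line hits zero near 1000.
    print(extrapolate_cot_length([0, 100, 200, 300], [1.0, 0.9, 0.8, 0.7]))
    print(rmae([900, 1100], [1000, 1000]))
```

Raising an error on a non-decreasing fit is a design choice of this sketch; a serving system would more likely fall back to a default length estimate.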
Figure 1: Example output of a reasoning LMM, consisting of a long CoT section wrapped in the special symbols “<think>” and “</think>”, followed by a short Conclusion section.

Experimental Results

Research Questions

  • RQ1: Can CoT length be predicted before CoT generation using an input-prompt conditioned parameter?
  • RQ2: Is there an internal fuel-level signal in LMMs that correlates with CoT progression and can be estimated from hidden states?
  • RQ3: Can a compact predictor accurately estimate CoT length at runtime to enable practical downstream controls?
  • RQ4: Do predictive CoT length estimates translate into tangible improvements in memory efficiency and reasoning control across modalities?

Key Findings

  • Fuel Gauge significantly outperforms baselines in fuel-level estimation (lower rMAE than End-of-CoT probability or mean/median baselines).
  • CoT length can be predicted with strong generalization across text-only, image-text, and video-text benchmarks, with improvements over baselines on GPQA-Diamond and MathVision-m tasks.
  • Using Fuel Gauge for predictive KV cache allocation yields substantially fewer memory allocations and reduced fragmentation (e.g., up to 13.37× reduction in memory allocations on certain benchmarks).
  • CoT length modulation guided by the Fuel Gauge yields linear control over CoT length and model accuracy across multiple models and benchmarks.
  • Stage-wise design (fuel level extraction followed by linear extrapolation) enables runtime CoT length estimates with negligible overhead.
  • Training on 200 CoT traces suffices to generalize across tasks and modalities, demonstrating practical generalizability.
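
To see why a length prediction can cut the number of memory allocations, consider the following toy sketch. It is an illustration only and does not reproduce the paper's allocator or its 13.37× result: the block-based growth policy, the function names, and the block size are all hypothetical.

```python
import math


def count_allocations_incremental(cot_length: int, block_size: int) -> int:
    """Baseline (assumed): allocate one fixed-size block each time the
    KV cache fills up, with no knowledge of the final CoT length."""
    return math.ceil(cot_length / block_size)


def count_allocations_predictive(
    cot_length: int, predicted_length: int, block_size: int
) -> int:
    """Predictive (assumed): reserve space for the predicted length in one
    up-front allocation, then grow block by block only if the actual CoT
    runs past the prediction."""
    overflow = max(0, cot_length - predicted_length)
    return 1 + math.ceil(overflow / block_size)


if __name__ == "__main__":
    actual = 1000  # tokens actually generated in the CoT
    print(count_allocations_incremental(actual, block_size=16))
    print(count_allocations_predictive(actual, predicted_length=900, block_size=16))
    print(count_allocations_predictive(actual, predicted_length=1200, block_size=16))
```

In this toy model an under-prediction costs extra allocations while an over-prediction costs unused memory, which is why the accuracy of the length estimate matters in both directions.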
Figure 2: Correlation between Chain-of-Thought (CoT) length and LMM accuracy collected from Qwen3 [1], Qwen3VL [6], Intern-S1 [3] and GLM [31] across multiple text-only, image-text and video-text benchmarks. Using accuracy as a proxy for task difficulty, we observe a clear negative correlation


This review was generated by AI and reviewed by human editors.