QUICK REVIEW

[Paper Review] EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models

Yahong Wang, Jiande Wu|arXiv (Cornell University)|Feb 19, 2026
Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

TL;DR: EntropyPrune uses a matrix entropy framework to identify an Entropy Collapse Layer and prune visual tokens without attention maps, achieving substantial FLOPs reduction with minimal accuracy loss across multiple MLLMs and modalities.

ABSTRACT

Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limit interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an "Entropy Collapse Layer" (ECL), where the information content of visual representations exhibits a sharp and consistent drop, which provides a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64x theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting the strong robustness and scalability in practical MLLM acceleration. The code will be publicly available at https://github.com/YahongWang1/EntropyPrune.

Research Motivation and Objectives

  • Objective 1: Investigate layer-wise information density in MLLMs to identify principled pruning stages.
  • Objective 2: Develop a training-free token pruning method based on matrix entropy without relying on attention maps.
  • Objective 3: Reduce inference cost by pruning redundant visual tokens while preserving model performance.
  • Objective 4: Provide an efficient computation strategy by exploiting dual Gram matrices for entropy calculation.

Proposed Method

  • Method 1: Define matrix entropy via the trace-normalized covariance matrix to quantify the information content of visual tokens.
  • Method 2: Identify the Entropy Collapse Layer, where the entropy drops sharply, to select the pruning stage.
  • Method 3: Score tokens by head-wise token covariance entropy and prune low-entropy tokens without using attention maps.
  • Method 4: Use a spectral acceleration strategy based on dual Gram matrices to compute entropy with O(h^3) complexity instead of O(d_h^3).
  • Method 5: Provide a theoretical FLOPs reduction analysis and show that the practical overhead is negligible.
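Methods 1, 3, and 4 can be illustrated with a small NumPy sketch. It is not the authors' implementation. Several points are assumptions made for illustration: the function names are hypothetical, the entropy is taken as the Shannon entropy of the eigenvalues of the trace-normalized matrix, the matrix is an uncentered Gram matrix rather than a mean-centered covariance, and each token is treated as an h × d_h matrix (one row per attention head). The only fact the sketch relies on is that X^T X (d_h × d_h) and X X^T (h × h) share the same nonzero eigenvalues, so both give the same entropy.

```python
import numpy as np


def matrix_entropy(mat: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy of the eigenvalues of a trace-normalized PSD matrix."""
    eigvals = np.linalg.eigvalsh(mat)       # symmetric input, real eigenvalues
    eigvals = np.clip(eigvals, 0.0, None)   # remove tiny negative round-off
    total = eigvals.sum()
    if total <= eps:                        # all-zero matrix carries no information
        return 0.0
    probs = eigvals / total                 # trace normalization: values sum to 1
    probs = probs[probs > eps]              # treat 0 * log(0) as 0
    return float(-(probs * np.log(probs)).sum())


def token_entropy_direct(x: np.ndarray) -> float:
    """Entropy from the d_h x d_h matrix X^T X (the larger eigenproblem)."""
    return matrix_entropy(x.T @ x)


def token_entropy_dual(x: np.ndarray) -> float:
    """Entropy from the h x h dual Gram matrix X X^T (same nonzero spectrum)."""
    return matrix_entropy(x @ x.T)


def prune_tokens(tokens: np.ndarray, keep: int):
    """Keep the `keep` highest-entropy tokens; tokens has shape (n, h, d_h)."""
    scores = np.array([token_entropy_dual(t) for t in tokens])
    order = np.argsort(scores)[::-1][:keep]  # indices of the highest scores
    return np.sort(order), scores            # restore original token order


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 128))       # hypothetical h = 32, d_h = 128
    print(abs(token_entropy_direct(x) - token_entropy_dual(x)) < 1e-8)
```

If h = 32 and d_h = 128 (values assumed here, typical of a 7B LLaMA-style backbone), the cubic cost ratio of the two eigendecompositions is (128 / 32)^3 = 64, which is consistent with the 64x theoretical speedup quoted in the abstract.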
Figure 1: (a) Comparison between vanilla LLaVA-1.5-7B and EntropyPrune. Correct answers are highlighted in green, while hallucinations are marked in red. By removing low-information tokens, EntropyPrune encourages the model to concentrate on more critical details (e.g., the person’s state and th

Experimental Results

Research Questions

  • RQ1: What is the information-flow pattern of visual tokens across layers in MLLMs as captured by matrix entropy?
  • RQ2: Can a principled pruning layer (Entropy Collapse Layer) improve token pruning decisions over heuristic layer selection?
  • RQ3: Does entropy-based token scoring effectively retain important visual information while reducing token count?
  • RQ4: How can entropy computations be accelerated to be practical for real-time MLLM inference?

Key Findings

  • Key Finding 1: The Entropy Collapse Layer is a consistent pruning cue across models and datasets, marking where the information content of visual tokens declines sharply.
  • Key Finding 2: EntropyPrune achieves substantial token reduction (e.g., removing 66.7%–77.8% of visual tokens) with minimal performance loss (e.g., ~1–2% on average) in image tasks.
  • Key Finding 3: Spectral acceleration using dual Gram matrices yields up to a 64× theoretical speedup in entropy computation.
  • Key Finding 4: On LLaVA-1.5-7B, EntropyPrune reduces FLOPs by 68.2% while preserving 96.0% of the original performance without extra training.
  • Key Finding 5: The method generalizes to high-resolution and video models, maintaining robustness across diverse benchmarks.
  • Key Finding 6: EntropyPrune outperforms state-of-the-art pruning methods in the accuracy-efficiency trade-off.
Figure 2: Layer-wise matrix entropy of visual tokens (query and key states) in LLaVA-1.5-7B and LLaVA-Next-7B across eight datasets. A consistent layer-wise trend is observed across different datasets, with a precipitous entropy drop after the second layer.
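The layer-selection idea behind the Entropy Collapse Layer can be sketched as a search for the largest layer-to-layer entropy drop. This is an illustrative reading only: the function name is hypothetical, the "largest consecutive drop" rule is an assumed criterion rather than the paper's stated definition, and the entropy values are synthetic numbers chosen to mimic a sharp drop after the second layer, not measurements.

```python
import numpy as np


def find_entropy_collapse_layer(layer_entropy) -> int:
    """Return the 0-based index of the first layer after the largest entropy drop."""
    values = np.asarray(layer_entropy, dtype=float)
    if values.size < 2:
        raise ValueError("need entropy values for at least two layers")
    drops = values[:-1] - values[1:]   # drops[i] = entropy(layer i) - entropy(layer i + 1)
    return int(np.argmax(drops)) + 1   # layer that sits just below the steepest drop


# Synthetic per-layer entropies (not from the paper): high for the first two
# layers, then a sharp fall.
synthetic_entropy = [3.4, 3.3, 1.1, 1.0, 0.9, 0.9]
ecl_index = find_entropy_collapse_layer(synthetic_entropy)
print(ecl_index)  # 2, i.e. the third layer when counting from 0
```

In practice the per-layer values would come from averaging a matrix-entropy measure of the visual tokens over a set of inputs; a fixed drop threshold could replace the argmax if several layers show comparable drops.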


This review was generated by AI and reviewed by human editors.