[Paper Digest] EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models
tldr: EntropyPrune uses a matrix entropy framework to identify an Entropy Collapse Layer and prune visual tokens without attention maps, achieving substantial FLOPs reduction with minimal accuracy loss across multiple MLLMs and modalities.
Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limit interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an "Entropy Collapse Layer" (ECL), where the information content of visual representations exhibits a sharp and consistent drop, which provides a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64x theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting the strong robustness and scalability in practical MLLM acceleration. The code will be publicly available at https://github.com/YahongWang1/EntropyPrune.
Research Motivation and Objectives
- Objective 1: Investigate layer-wise information density in MLLMs to identify principled pruning stages.
- Objective 2: Develop a training-free token pruning method based on matrix entropy that does not rely on attention maps.
- Objective 3: Reduce inference cost by pruning redundant visual tokens while preserving model performance.
- Objective 4: Make the entropy calculation efficient by exploiting dual Gram matrices.
Proposed Method
- Method 1: Define matrix entropy via the trace-normalized covariance matrix to quantify the information content of visual tokens.
- Method 2: Identify the Entropy Collapse Layer (ECL), where entropy drops sharply, and use it to select the pruning stage.
- Method 3: Score each token by the entropy of its head-wise covariance and prune low-entropy tokens without using attention maps.
- Method 4: Use a spectral acceleration strategy based on dual Gram matrices to compute entropy with O(h^3) complexity instead of O(d_h^3).
- Method 5: Provide a theoretical FLOPs reduction analysis and show that the practical overhead is negligible.
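
The scoring and acceleration steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes matrix entropy takes the von Neumann form -sum(p_i log p_i) over the eigenvalues of the trace-normalized matrix, that each token's hidden vector is reshaped into an (h, d_h) matrix with h the number of attention heads and d_h the per-head dimension, and that the covariance is left uncentered. The function names (`matrix_entropy`, `token_entropy_scores`, `prune_tokens`) are hypothetical.

```python
import numpy as np

def matrix_entropy(gram, eps=1e-12):
    """Entropy of a PSD matrix after trace normalization.

    Assumed form: -sum(p_i * log p_i), where p_i are the eigenvalues
    divided by their sum (the trace).
    """
    eigvals = np.clip(np.linalg.eigvalsh(gram), 0.0, None)
    total = eigvals.sum()
    if total <= eps:
        return 0.0
    p = eigvals / total
    p = p[p > eps]  # zero eigenvalues contribute nothing to the entropy
    return float(-(p * np.log(p)).sum())

def token_entropy_scores(hidden, num_heads, use_dual=True):
    """Score each token by the entropy of its head-wise covariance.

    hidden: (n_tokens, d) array with d = num_heads * d_h.
    use_dual=True  -> eigendecompose the (h, h) Gram matrix Z Z^T.
    use_dual=False -> eigendecompose the (d_h, d_h) matrix Z^T Z.
    Both share the same nonzero eigenvalues, so the scores agree.
    """
    n_tokens, dim = hidden.shape
    head_dim = dim // num_heads
    scores = np.empty(n_tokens)
    for i in range(n_tokens):
        z = hidden[i].reshape(num_heads, head_dim)
        gram = z @ z.T if use_dual else z.T @ z
        scores[i] = matrix_entropy(gram)
    return scores

def prune_tokens(hidden, num_heads, keep):
    """Keep the `keep` highest-entropy tokens, preserving their order."""
    scores = token_entropy_scores(hidden, num_heads)
    order = np.argsort(scores)  # ascending: low-entropy tokens first
    keep_idx = np.sort(order[len(order) - keep:])
    return hidden[keep_idx], keep_idx
```

Because Z Z^T and Z^T Z have identical nonzero spectra, switching to the smaller (h, h) matrix changes the cost of the eigendecomposition but not the resulting score, which is the property the dual Gram strategy relies on.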

Experimental Results
Research Questions
- RQ1: What is the information-flow pattern of visual tokens across layers in MLLMs, as captured by matrix entropy?
- RQ2: Can a principled pruning layer (the Entropy Collapse Layer) improve token pruning decisions over heuristic layer selection?
- RQ3: Does entropy-based token scoring effectively retain important visual information while reducing the token count?
- RQ4: How can entropy computation be accelerated enough to be practical for real-time MLLM inference?
Key Findings
- Finding 1: The Entropy Collapse Layer is a consistent pruning cue across models and datasets, marking where the information content of visual tokens declines sharply.
- Finding 2: EntropyPrune prunes a large share of visual tokens (e.g., 66.7%–77.8%) with minimal performance loss (e.g., ~1–2% on average) on image tasks.
- Finding 3: Spectral acceleration using dual Gram matrices yields up to a 64× theoretical speedup in entropy computation.
- Finding 4: On LLaVA-1.5-7B, EntropyPrune reduces FLOPs by 68.2% while preserving 96.0% of the original performance without extra training.
- Finding 5: EntropyPrune generalizes to high-resolution and video models, remaining robust across diverse benchmarks.
- Finding 6: EntropyPrune outperforms multiple state-of-the-art pruning baselines in the accuracy-efficiency trade-off.
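
Two of the findings above lend themselves to a short worked sketch. The digest does not state how the "sharp drop" is detected, so the helper below simply assumes the Entropy Collapse Layer is the layer reached by the largest single-step decrease in layer-wise entropy; that criterion and the function names are hypothetical. The second helper shows one reading of the 64× figure: if eigendecomposition cost is cubic in matrix size, moving from a (d_h, d_h) matrix to an (h, h) matrix gives a ratio of (d_h / h)^3, and a 7B LLaMA-style backbone with h = 32 heads and d_h = 128 would give exactly 64.

```python
import numpy as np

def find_entropy_collapse_layer(layer_entropies):
    """Index of the layer reached by the steepest entropy drop.

    Assumption: 'collapse' means the largest decrease between two
    consecutive layers; the paper's exact criterion may differ.
    """
    values = np.asarray(layer_entropies, dtype=float)
    drops = -np.diff(values)  # positive where entropy falls
    return int(np.argmax(drops)) + 1

def cubic_speedup(head_dim, num_heads):
    """Theoretical ratio of O(d_h^3) to O(h^3) eigendecomposition cost."""
    return (head_dim / num_heads) ** 3

# Synthetic entropy curve for illustration only, not measured results.
example_curve = [5.1, 5.0, 4.9, 2.0, 1.9]
collapse_layer = find_entropy_collapse_layer(example_curve)

# Assumed configuration: 32 heads with per-head dimension 128.
speedup = cubic_speedup(head_dim=128, num_heads=32)
```

In practice the layer-wise entropies would be measured on real visual token representations, for example with a matrix entropy routine applied to each layer's hidden states.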

This digest was generated by AI and reviewed by a human editor.