QUICK REVIEW

[Paper Review] EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models

Yahong Wang, Jiande Wu|arXiv (Cornell University)|Feb 19, 2026

Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

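EntropyPrune uses a matrix-entropy framework to identify the Entropy Collapse Layer and prunes visual tokens without attention maps. Across multiple MLLMs and modalities, it substantially reduces FLOPs while keeping accuracy loss to a minimum.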

ABSTRACT

Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limit interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an "Entropy Collapse Layer" (ECL), where the information content of visual representations exhibits a sharp and consistent drop, which provides a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64x theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting the strong robustness and scalability in practical MLLM acceleration. The code will be publicly available at https://github.com/YahongWang1/EntropyPrune.

Motivation and Objectives

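Examine the layer-wise information density in MLLMs to identify a principled pruning stage.
Develop a training-free token pruning method based on matrix entropy that does not depend on attention maps.
Reduce inference cost by pruning redundant visual tokens while maintaining model performance.
Make the entropy computation efficient by exploiting dual Gram matrices, bringing its cost down from O(h^3) to a tractable complexity.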

Proposed Method

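Define matrix entropy on the trace-normalized covariance to quantify the information content of visual tokens.
Identify the Entropy Collapse Layer, the point where entropy drops sharply, and use it as the indicator of the pruning stage.
Score tokens by their per-head token-covariance entropy and prune low-entropy tokens without using attention maps.
Apply a spectral acceleration strategy based on dual Gram matrices, so that the entropy computation costs O(d_h^3) rather than O(h^3).
Provide a theoretical FLOPs-reduction analysis and show that the practical overhead is very small.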
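The entropy definition and the dual Gram shortcut described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the mean-centering step, the `eps` cutoff, and the leave-one-out score are all assumptions introduced here. The review does not give the exact scoring formula, so `leave_one_out_scores` is only one plausible way to turn a matrix entropy into a per-token value. The one fact the sketch relies on is standard linear algebra: for a token matrix X, the matrices X^T X and X X^T share the same nonzero eigenvalues, so the entropy can be computed from whichever of the two is smaller.

```python
import numpy as np

def _entropy_from_psd(mat, eps=1e-12):
    """Entropy of the eigenvalues of a trace-normalized PSD matrix."""
    rho = mat / np.trace(mat)          # eigenvalues of rho sum to 1
    lam = np.linalg.eigvalsh(rho)      # real eigenvalues of a symmetric matrix
    lam = lam[lam > eps]               # treat 0 * log(0) as 0
    return float(-np.sum(lam * np.log(lam)))

def matrix_entropy(X):
    """Entropy from the (d, d) covariance; X is (n_tokens, d) for one head."""
    Xc = X - X.mean(axis=0, keepdims=True)   # centering is an assumption
    return _entropy_from_psd(Xc.T @ Xc)

def matrix_entropy_dual(X):
    """Same entropy from the (n_tokens, n_tokens) dual Gram matrix."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return _entropy_from_psd(Xc @ Xc.T)

def matrix_entropy_fast(X):
    """Eigendecompose whichever of the two matrices is smaller."""
    n, d = X.shape
    return matrix_entropy_dual(X) if n < d else matrix_entropy(X)

def leave_one_out_scores(X):
    """Hypothetical per-token score: entropy lost when that token is removed.

    This is NOT taken from the paper; it only illustrates the idea that
    tokens contributing little entropy are candidates for pruning.
    """
    full = matrix_entropy_fast(X)
    return np.array([
        full - matrix_entropy_fast(np.delete(X, i, axis=0))
        for i in range(X.shape[0])
    ])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((16, 64))   # 16 toy tokens, 64-dim head
    print(matrix_entropy(X), matrix_entropy_dual(X))
    print(leave_one_out_scores(X).round(4))
```

On random input the two entropy values should agree up to floating-point error, which is the property the acceleration depends on. Note that the sketch divides by the trace, so it assumes the tokens are not all identical.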

Figure 1: (a) Comparison between vanilla LLaVA-1.5-7B and EntropyPrune. Correct answers are highlighted in green, while hallucinations are marked in red. By removing low-information tokens, EntropyPrune encourages the model to concentrate on more critical details (e.g., the person’s state and th

Experimental Results

Research Questions

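RQ1: What patterns of visual-token information flow across MLLM layers does matrix entropy capture?
RQ2: Can a principled pruning layer (the Entropy Collapse Layer) improve token-pruning decisions over heuristic layer selection?
RQ3: Does entropy-based token scoring effectively preserve important visual information while reducing the token count?
RQ4: How can entropy computation be accelerated to a speed that is practical for real-time MLLM inference?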

Key Findings

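The Entropy Collapse Layer is a consistent pruning cue across models and datasets, marking the point where visual-token information drops sharply.
On image tasks, EntropyPrune removes a large share of the visual tokens (e.g., 66.7%–77.8% of tokens) while keeping performance loss minimal (roughly 1–2% on average).
Spectral acceleration with dual Gram matrices speeds up the entropy computation by up to 64× (a theoretical figure, per the abstract).
On LLaVA-1.5-7B, EntropyPrune cuts FLOPs by 68.2% and retains 96.0% of the original performance without additional training.
The method generalizes to high-resolution and video models and remains robust across diverse benchmarks.
It outperforms state-of-the-art pruning methods on the accuracy-efficiency trade-off against multiple baselines.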
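To make the pruning ratios above concrete, the short calculation below converts them into retained-token counts. It assumes 576 visual tokens per image, which is the count LLaVA-1.5 produces from a 24 x 24 patch grid. The review itself does not state the token budgets, so the resulting numbers are an inference, and `retained_tokens` is a hypothetical helper.

```python
# Assumption: 576 visual tokens per image, as in LLaVA-1.5 (24 x 24 patches).
TOTAL_VISUAL_TOKENS = 576

def retained_tokens(prune_ratio, total=TOTAL_VISUAL_TOKENS):
    """Number of visual tokens left after pruning the given fraction."""
    return round(total * (1.0 - prune_ratio))

if __name__ == "__main__":
    for ratio in (0.667, 0.778):
        kept = retained_tokens(ratio)
        print(f"prune {ratio:.1%} -> keep {kept} of {TOTAL_VISUAL_TOKENS}")
```

Under that assumption, pruning 66.7% and 77.8% of the tokens leaves 192 and 128 tokens respectively.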

Figure 2: Layer-wise matrix entropy of visual tokens (query and key states) in LLaVA-1.5-7B and LLaVA-Next-7B across eight datasets. A consistent layer-wise trend is observed across different datasets, with a precipitous entropy drop after the second layer.
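A layer-wise entropy curve like the one in Figure 2 suggests a simple way to locate the collapse point: take the largest drop between consecutive layers. The sketch below is an assumption about how that could be done, not the paper's selection rule. The function name `find_entropy_collapse_layer`, the "largest single drop" criterion, and the toy entropy values are all hypothetical.

```python
import numpy as np

def find_entropy_collapse_layer(layer_entropies):
    """Return the 0-based index of the layer after which entropy drops most.

    layer_entropies: one matrix-entropy value per layer, in layer order.
    """
    h = np.asarray(layer_entropies, dtype=float)
    if h.size < 2:
        raise ValueError("need entropies for at least two layers")
    drops = -np.diff(h)              # drops[l] = h[l] - h[l + 1]
    return int(np.argmax(drops))     # layer l just before the sharpest drop

if __name__ == "__main__":
    # Toy values for illustration only; they are not taken from the paper.
    toy_entropies = [3.9, 3.8, 2.1, 2.0, 1.9]
    layer = find_entropy_collapse_layer(toy_entropies)
    print(f"sharpest drop occurs after layer index {layer}")
```

With the toy values, the sharpest drop follows index 1, i.e. the second layer, which mirrors the trend described in the caption. A real curve may be noisier, so averaging over several datasets before taking the maximum would be a reasonable refinement.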


This review was generated by AI and checked by a human editor.