QUICK REVIEW

[Paper Review] PIO-FVLM: Rethinking Training-Free Visual Token Reduction for VLM Acceleration from an Inference-Objective Perspective

Haokui Zhang, Congyang Ou|arXiv (Cornell University)|Feb 4, 2026

Multimodal Machine Learning Applications|Citations: 0

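One-Sentence Summary

PIO-FVLM is a training-free visual token reduction method for vision-language models, framed from an inference-objective perspective: it selects tokens using gradient saliency and an NMS-based strategy, delivering substantial speedups with minimal accuracy loss while remaining compatible with FlashAttention.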

ABSTRACT

Recently, reducing redundant visual tokens in vision-language models (VLMs) to accelerate VLM inference has emerged as a hot topic. However, most existing methods rely on heuristics constructed based on inter-visual-token similarity or cross-modal visual-text similarity, which gives rise to certain limitations in compression performance and practical deployment. In contrast, we propose PIO-FVLM from the perspective of inference objectives, which transforms visual token compression into preserving output result invariance and selects tokens primarily by their importance to this goal. Specially, vision tokens are reordered with the guidance of token-level gradient saliency generated by our designed layer-local proxy loss, a coarse constraint from the current layer to the final result. Then the most valuable vision tokens are selected following the non-maximum suppression (NMS) principle. The proposed PIO-FVLM is training-free and compatible with FlashAttention, friendly to practical application and deployment. It can be deployed independently as an encoder-free method, or combined with encoder compression approaches like VisionZip for use as an encoder-involved method. On LLaVA-Next-7B, PIO-FVLM retains just 11.1% of visual tokens but maintains 97.2% of the original performance, with a 2.67$\times$ prefill speedup, 2.11$\times$ inference speedup, 6.22$\times$ lower FLOPs, and 6.05$\times$ reduced KV Cache overhead. Our code is available at https://github.com/ocy1/PIO-FVLM.

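Motivation and Objectives

Motivate VLM token reduction from an inference-driven perspective rather than from attention or similarity heuristics
Develop a lightweight layer-local proxy loss that estimates token importance at inference time
Propose gradient-saliency-based token reordering with NMS-based selection to efficiently pick a non-redundant subset within the token budget
Ensure compatibility with FlashAttention and enable both encoder-free and encoder-involved deployment options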

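Proposed Method

Compute a layer-local proxy loss at each pruning layer to estimate each token's influence on the current output
Backpropagate at each pruning layer to obtain a per-token gradient saliency score
Reorder tokens in descending order of saliency and apply a gradient-based NMS strategy to select a non-redundant subset within the budget
Apply the pruning layers from shallow to deep (e.g., [1, 10, 15] for LLaVA, [1, 8, 14] for Qwen); K_pos controls the proximity supervision window
Preserve the full forward attention structure, so the method stays compatible with FlashAttention and works plug-and-play as either an encoder-free or an encoder-involved option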
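The reorder-then-suppress steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the saliency rule (absolute gradient-times-feature sum), the suppression criterion (cosine similarity between token features above `sim_threshold`), the backfill step, and all function names are hypothetical. The gradient values stand in for what backpropagating the proxy loss would produce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def saliency_scores(hidden, grads):
    """Per-token saliency as |sum_d grad_d * hidden_d| (an assumed scoring rule)."""
    return [abs(sum(g * h for g, h in zip(gr, hd))) for gr, hd in zip(grads, hidden)]


def nms_select(hidden, scores, budget, sim_threshold=0.9):
    """Greedy NMS-style selection: visit tokens by descending saliency and
    skip any token too similar to one already kept (assumed criterion)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, suppressed = [], []
    for i in order:
        if len(kept) == budget:
            break
        if any(cosine(hidden[i], hidden[j]) > sim_threshold for j in kept):
            suppressed.append(i)
        else:
            kept.append(i)
    # Backfill with the most salient suppressed tokens if the budget is not met.
    for i in suppressed:
        if len(kept) == budget:
            break
        kept.append(i)
    return sorted(kept)  # restore original token order


# Toy data: token 1 is nearly a duplicate of token 0.
hidden = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.5, 0.5]]
grads = [[0.9, 0.0], [0.8, 0.0], [0.0, 0.5], [0.1, 0.1]]  # made-up gradients
scores = saliency_scores(hidden, grads)
selected = nms_select(hidden, scores, budget=2)
print(selected)  # [0, 2]
```

With a budget of two, the second most salient token is skipped because it nearly duplicates the first, so the less salient but distinct token 2 is kept instead. Note that nothing here reads attention weights, which is the property that would let such a selector sit alongside a fused attention kernel.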

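Experimental Results

Research Questions

RQ1: Can training-free, objective-driven token reduction preserve VLM output quality even under aggressive visual token reduction?
RQ2: How can gradient saliency and non-maximum suppression be combined to balance token importance and diversity at inference time?
RQ3: How large are the practical efficiency gains and memory savings when PIO-FVLM is integrated into existing VLM backbones and attention backends?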

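Key Findings

PIO-FVLM is training-free and FlashAttention-compatible, and can be used for either encoder-free or encoder-involved acceleration
On LLaVA-1.5-7B, retaining 11.1% of visual tokens causes only a minimal accuracy drop (competitive with strong baselines) while substantially speeding up inference; at the 11.1% budget the review cites a 2.11x inference speedup and a 6.22x FLOPs reduction as the paper's example
On LLaVA-Next-7B, retaining 11.1% of tokens yields a 2.67x prefill speedup, a 2.11x overall speedup, and a 6.05x KV cache reduction, with more than 96% of the original performance retained across multiple settings
On Qwen-2.5-VL-7B, at 11.1% token retention the method reaches about 91.7% average performance retention versus 84.8% for DART
A two-stage scheme (SCOPE-style prefiltering followed by PIO-FVLM refinement) can outperform prefiltering alone under the same budget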
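As a rough check on the budget arithmetic, the sketch below converts the 11.1% keep ratio into a token count. The 576-token input (the usual visual sequence length for LLaVA-1.5) and the helper name `retained_tokens` are assumptions for illustration, not figures taken from this review.

```python
def retained_tokens(total_visual_tokens: int, keep_ratio: float) -> int:
    """Number of visual tokens kept under a fractional budget (rounded to nearest)."""
    return round(total_visual_tokens * keep_ratio)


total = 576  # assumed visual tokens per image; not stated in the review
kept = retained_tokens(total, 0.111)
reduction = total / kept
print(kept, reduction)  # 64 9.0
```

Under this assumption the visual sequence shrinks by 9x. The reported end-to-end factors (6.22x FLOPs, 6.05x KV cache) are smaller, which would be consistent with text tokens and the layers before each pruning point still paying the unpruned cost; that reading is an inference, not a statement from the paper.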

This review was generated by AI and checked by a human editor.