QUICK REVIEW

[Paper Review] Chain-of-Thought Compression Should Not Be Blind: V-Skip for Efficient Multimodal Reasoning via Dual-Path Anchoring

Dongxu Zhang, Yiding Sun|arXiv (Cornell University)|Jan 20, 2026

Multimodal Machine Learning Applications|Citations: 0

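One-Line Summary

V-Skip recasts CoT compression as a Visual-Anchored Information Bottleneck and prunes tokens with a dual-path (textual and visual) gating mechanism, achieving up to a 2.9× speedup and better visual grounding with minimal accuracy loss.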

ABSTRACT

While Chain-of-Thought (CoT) reasoning significantly enhances the performance of Multimodal Large Language Models (MLLMs), its autoregressive nature incurs prohibitive latency constraints. Current efforts to mitigate this via token compression often fail by blindly applying text-centric metrics to multimodal contexts. We identify a critical failure mode termed Visual Amnesia, where linguistically redundant tokens are erroneously pruned, leading to hallucinations. To address this, we introduce V-Skip that reformulates token pruning as a Visual-Anchored Information Bottleneck (VA-IB) optimization problem. V-Skip employs a dual-path gating mechanism that weighs token importance through both linguistic surprisal and cross-modal attention flow, effectively rescuing visually salient anchors. Extensive experiments on Qwen2-VL and Llama-3.2 families demonstrate that V-Skip achieves a $2.9\times$ speedup with negligible accuracy loss. Specifically, it preserves fine-grained visual details, outperforming other baselines over 30\% on the DocVQA.

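Motivation and Objectives

Identify why text-centric CoT compression fails in multimodal contexts (Visual Amnesia).
Propose a grounding-aware compression framework that prunes redundancy while preserving visual anchors.
Formulate a Visual-Anchored Information Bottleneck (VA-IB) objective for multimodal token pruning.
Develop a dual-path scoring mechanism that assesses both linguistic redundancy and visual necessity.
Distill the pruning policy into a lightweight adapter for efficient inference.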

Proposed Method

Formulate token compression as VA-IB: maximize I(C_hat; A) + λ I(C_hat; V | Q) subject to |C_hat| <= γ|C|.
Introduce the Visual Anchoring Score (VAS): S_text is based on token likelihood, and S_vis is the cross-modal attention mass Ω_t aggregated over the focus layers and heads.
Apply the Union-of-Saliency gate m_t = I(S_text >= tau_text) OR I(S_vis >= tau_vis) to select tokens.
Train an efficient decoder via offline distillation: prune with the gate to build D_distill, then fine-tune the base model with LoRA.
Evaluate the Qwen2-VL and Llama-3.2 families on MMMU and DocVQA using Accuracy, ANLS, Latency, and ActRatio.
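
As a rough illustration of the Union-of-Saliency gate, the sketch below scores a toy CoT in pure Python. It is not the authors' implementation: the helper names (`text_saliency`, `visual_saliency`, `union_gate`), the use of surprisal -log p for S_text, the plain mean over (layer, head) pairs for S_vis, the thresholds, and all numbers are assumptions made for this example.

```python
import math


def text_saliency(token_probs):
    """Assumed form of S_text: surprisal -log p(token) for each CoT token."""
    return [-math.log(p) for p in token_probs]


def visual_saliency(attn, visual_positions):
    """Assumed form of S_vis: attention mass on visual positions,
    averaged over the selected (layer, head) pairs.

    attn[k][t][j] is the attention weight from CoT token t to context
    position j for the k-th selected (layer, head) pair.
    """
    num_pairs = len(attn)
    num_tokens = len(attn[0])
    scores = []
    for t in range(num_tokens):
        mass = sum(attn[k][t][j] for k in range(num_pairs) for j in visual_positions)
        scores.append(mass / num_pairs)
    return scores


def union_gate(s_text, s_vis, tau_text, tau_vis):
    """m_t = 1 if S_text >= tau_text OR S_vis >= tau_vis, else 0."""
    return [int(a >= tau_text or b >= tau_vis) for a, b in zip(s_text, s_vis)]


# Toy data (all values invented for illustration).
tokens = ["the", "red", "sign", "is", "therefore"]
token_probs = [0.9, 0.6, 0.2, 0.95, 0.1]
# One (layer, head) pair; context positions 0 and 1 are image patches.
attn = [[
    [0.05, 0.05, 0.5, 0.4],  # "the"
    [0.5, 0.3, 0.1, 0.1],    # "red"
    [0.3, 0.3, 0.2, 0.2],    # "sign"
    [0.1, 0.0, 0.5, 0.4],    # "is"
    [0.0, 0.1, 0.5, 0.4],    # "therefore"
]]
visual_positions = [0, 1]

s_text = text_saliency(token_probs)
s_vis = visual_saliency(attn, visual_positions)
mask = union_gate(s_text, s_vis, tau_text=1.0, tau_vis=0.5)
text_only_mask = [int(a >= 1.0) for a in s_text]
kept = [tok for tok, m in zip(tokens, mask) if m]
print(kept)
```

In this toy case `red` has low surprisal and would be dropped by a text-only criterion, but its attention mass on the image positions keeps it. That is the kind of rescue the visual path is meant to provide against Visual Amnesia.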

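Experimental Results

Research Questions

RQ1: Can multimodal CoT sequences be compressed while preserving visual grounding, without excessive accuracy loss?
RQ2: Does the dual-path (textual and visual) saliency mechanism prevent Visual Amnesia and object hallucination during pruning?
RQ3: How does VA-IB-guided pruning compare with text-centric and other multimodal pruning baselines in speed and accuracy?
RQ4: Is the distilled V-Skip adapter effective for efficient inference across model sizes?
RQ5: How does compression affect fine-grained visual tasks such as DocVQA and the retention of visual attributes?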

Key Findings

Method	Ratio γ	MMMU Acc (%)	MMMU Tokens	MMMU Latency (s)	MMMU ActRatio	DocVQA ANLS	DocVQA Tokens	DocVQA Latency (s)	DocVQA ActRatio
Original (Full)	-	54.1	245.0	6.42	-	91.6	189.0	4.87	-
Truncation	0.9	50.8 (-3.3)	220.5	5.84	0.90	84.2 (-7.4)	170.1	4.52	0.90
Truncation	0.7	44.5 (-9.6)	171.5	4.51	0.70	71.5 (-20.1)	132.3	3.51	0.70
Truncation	0.5	38.5 (-15.6)	122.5	3.23	0.50	62.5 (-29.1)	94.5	2.57	0.50
LLMLingua-2	0.9	49.5 (-4.6)	223.2	6.03	0.91	78.4 (-13.2)	173.8	4.74	0.92
LLMLingua-2	0.7	40.2 (-13.9)	166.6	4.97	0.68	55.6 (-36.0)	130.4	3.81	0.69
LLMLingua-2	0.5	32.4 (-21.7)	115.1	3.73	0.47	38.5 (-53.1)	88.8	2.93	0.47
ASCoT	0.9	50.1 (-4.0)	218.0	6.12	0.89	79.8 (-11.8)	168.2	4.78	0.89
ASCoT	0.7	41.8 (-12.3)	176.4	5.20	0.72	58.2 (-33.4)	136.1	4.03	0.72
ASCoT	0.5	33.1 (-21.0)	124.9	3.91	0.51	40.2 (-51.4)	90.7	3.10	0.48
V-Skip (Ours)	1.0	54.1 (0.0)	245.0	6.54	1.00	91.6 (0.0)	189.0	5.09	1.00
V-Skip (Ours)	0.9	53.6 (-0.5)	227.8	5.89	0.93	90.8 (-0.8)	172.1	4.61	0.91
V-Skip (Ours)	0.8	52.8 (-1.3)	193.6	5.24	0.79	89.5 (-2.1)	156.9	4.08	0.83
V-Skip (Ours)	0.7	51.5 (-2.6)	173.4	4.65	0.71	87.9 (-3.7)	132.4	3.68	0.70
V-Skip (Ours)	0.6	50.1 (-4.0)	151.9	4.05	0.62	85.8 (-5.8)	115.3	3.19	0.61
V-Skip (Ours)	0.5	48.2 (-5.9)	120.1	3.49	0.49	83.7 (-7.9)	98.4	2.71	0.52
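
The table's derived quantities can be checked with a few lines of arithmetic. The sketch below copies the γ = 0.5 rows for V-Skip and LLMLingua-2 and computes speedup, score retention, and token ratio relative to the Original (Full) row. The function and variable names are made up for illustration, and reading ActRatio as compressed tokens divided by full tokens is an assumption, although it matches the reported values for these rows.

```python
# Rows copied from the table: (score, tokens, latency in seconds).
FULL = {"MMMU": (54.1, 245.0, 6.42), "DocVQA": (91.6, 189.0, 4.87)}
ROWS_AT_HALF = {
    "V-Skip": {"MMMU": (48.2, 120.1, 3.49), "DocVQA": (83.7, 98.4, 2.71)},
    "LLMLingua-2": {"MMMU": (32.4, 115.1, 3.73), "DocVQA": (38.5, 88.8, 2.93)},
}


def summarize(full, row):
    """Compare one compressed row against the uncompressed reference row."""
    full_score, full_tokens, full_latency = full
    score, tokens, latency = row
    return {
        "speedup": full_latency / latency,           # wall-clock speedup factor
        "retention_pct": 100.0 * score / full_score,  # share of full score kept
        "token_ratio": tokens / full_tokens,          # assumed meaning of ActRatio
    }


summary = {
    method: {bench: summarize(FULL[bench], row) for bench, row in rows.items()}
    for method, rows in ROWS_AT_HALF.items()
}

for method, per_bench in summary.items():
    for bench, s in per_bench.items():
        print(
            f"{method:12s} {bench:7s} "
            f"speedup={s['speedup']:.2f}x "
            f"retention={s['retention_pct']:.1f}% "
            f"token_ratio={s['token_ratio']:.2f}"
        )
```

On these rows V-Skip comes out roughly 1.8× faster than the full model on both benchmarks while retaining about 89% (MMMU) and 91% (DocVQA) of the score. The headline 2.9× figure cannot be reproduced from this table alone.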

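V-Skip speeds up inference by up to 2.9× on Qwen2-VL-7B-Instruct with almost no accuracy loss.
V-Skip preserves visual anchors and outperforms the baselines by more than 30% on DocVQA ANLS.
On MMMU at γ = 0.5, V-Skip's accuracy drop is held to 5.9 points, compared with 15.6 points for Truncation and more than 20 points for LLMLingua-2 and ASCoT.
Even at γ = 0.5, V-Skip retains over 91% of performance on Llama-3.2-11B-Vision-Instruct, indicating robustness across model sizes.
V-Skip's Visual Attribute Retention Rate (VARR) is 89.4% for color, 91.2% for objects, and 86.3% for shape, substantially higher than the baselines.
In the POPE hallucination evaluation, V-Skip suppresses Yes-bias: its Yes-ratio stays close to the baseline (51.2% vs. 50.4%).
In the ablation, union gating (S_text and S_vis) gives the best Acc and VARR, outperforming the Text-Only and Vision-Only variants.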

This review was generated by AI and checked by a human editor.