QUICK REVIEW

[論文レビュー] MACD: Model-Aware Contrastive Decoding via Counterfactual Data

Qixin Xiao|arXiv (Cornell University)|Feb 2, 2026

Adversarial Robustness in Machine Learning被引用数 0

ひとこと要約

MACDはモデル自身の損失に基づいて証拠重要な物体とフレームをマスクすることで、対照的デコードを導くデータを提案し、幻視を抑制する。

ABSTRACT

Video language models (Video-LLMs) are prone to hallucinations, often generating plausible but ungrounded content when visual evidence is weak, ambiguous, or biased. Existing decoding methods, such as contrastive decoding (CD), rely on random perturbations to construct contrastive data for mitigating hallucination patterns. However, such a way is hard to control the visual cues that drive hallucination or well align with model weaknesses. We propose Model-aware Counterfactual Data based Contrastive Decoding (MACD), a new inference strategy that combines model-guided counterfactual construction with decoding. Our approach uses the Video-LLM's own feedback to identify object regions most responsible for hallucination, generating targeted counterfactual inputs at the object level rather than arbitrary frame or temporal modifications. These model-aware counterfactual data is then integrated into CD to enforce evidence-grounded token selection during decoding. Experiments on EventHallusion, MVBench, Perception-test and Video-MME show that MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs, including Qwen and InternVL families. The method is especially effective in challenging scenarios involving small, occluded, or co-occurring objects. Our code and data will be publicly released.

研究の動機と目的

Video-LLMsにおける幻視の動機付けと対処、特に弱いまたは偏った視覚証拠下で。
トレーニング不要・推論時のみの手法を導入し、モデルのフィードバックを用いてターゲットを絞ったカウンターファクチュアル入力を作成。
オブジェクトレベルとフレームレベルのカウンターファクチュアルデータを対照的デコードに統合し、グラウンディングを強制。
多様なベンチマークとバックボーンモデルでの頑健性と改善を実証。

提案手法

YOLO風検出器で動画フレーム中の物体を検出し、時間的一貫性のある物体トラックを形成。
物体レベルとフレームレベルのマスクを組み合わせてカウンターファクチュアル動画を作成し、マスクされたビューを生成。
Video-LLMの再構成損失を勾配上昇で最適化し、エビデンスが重要な領域を特定する。
最適化された強度を{0, r0, 1}へ離散化し、カウンターファクチュアルの安定性と解釈性を確保。
基礎ビューと、チューニング係数を用いてマイニングされたカウンターファクチュアルビューを用いた対照的デコードを適用し、根拠のあるトークンを促進し幻視を抑制。
推論時のみの設定を維持し、デコードステップごとに追加のフォワードパスは1回のみ。

実験結果

リサーチクエスチョン

RQ1モデルのフィードバックをどのように用いて、Video-LLMの弱点を露示するターゲットを絞ったカウンターファクチュアルな視覚変動を生成できるか。
RQ2モデルが導く物体レベル・フレームレベルのマスキングは、ランダムな摂動や内部トークン抑制と比べて対照的デコードを改善するか。
RQ3MACDはトレーニング不要で、さまざまなVideo-LLMsとベンチマークに対応し、タスク精度を維持または向上できるか。

主な発見

Model	Method	Precision	Recall	F1	Accuracy	Accuracy	Accuracy	Accuracy
Qwen3-VL-2B	Baseline	0.7606	0.6131	0.6829	0.5959	0.5467	0.55	0.463
Qwen3-VL-2B	SID	0.7947	0.7190	0.7768	0.7202	0.4799	0.4867	0.56
Qwen3-VL-2B	VCD	0.7485	0.9124	0.8224	0.7202	0.5567	0.5367	0.438
Qwen3-VL-2B	MACD	0.7564	0.9708	0.8471	0.7513	0.7733	0.616	0.643
Qwen2.5-VL-3B	Baseline	0.758064516129032	0.686131386861313	0.720306513409961	0.621761658031088	0.44	0.524476	0.541
Qwen2.5-VL-3B	SID	0.755555555555555	0.744525547445255	0.75	0.647668393782383	0.6508	0.35	0.506
Qwen2.5-VL-3B	VCD	0.735294117647058	0.72992700729927	0.732600732600732	0.621761658031088	0.4515	0.3467	0.513
Qwen2.5-VL-3B	MACD	0.804511278195488	0.781021897810219	0.792592592592592	0.709844559585492	0.67	0.608392	0.621
Qwen2-VL-7B	Baseline	0.614035087719298	0.255474452554744	0.360824742268041	0.357512953367875	0.4633	0.403333	0.445
Qwen2-VL-7B	SID	0.746268656716418	0.364963503649635	0.490196078431372	0.461139896373056	0.47	0.3133	0.429
Qwen2-VL-7B	VCD	0.661157024793388	0.583941605839416	0.620155038759689	0.492227979274611	0.41	0.3433	0.439
Qwen2-VL-7B	MACD	0.712871287128712	0.525547445255474	0.605042016806722	0.512953367875647	0.4967	0.42953	0.455
InternVL3-8B	Baseline	0.5964912280715	0.248175182	0.350515464	0.347150259	0.4633	0.3667	0.479
InternVL3-8B	SID	0.666666666666667	0.364963504	0.471698113	0.419689119	0.37	0.38	0.462
InternVL3-8B	VCD	0.636491228077	0.248175182	0.350515464	0.367150259	0.43	0.3467	0.437
InternVL3-8B	MACD	0.6687	0.4231	0.415273	0.43678	0.5467	0.4567	0.49

MACDは6つのバックボーンと4つのベンチマークで一貫してBaseline、VCD、SIDを上回る。
モデルの損失に導かれたオブジェクトレベル・フレームレベルのマスキングは、再現率を高めつつ精度を損なわず、F1と精度を向上。
MACDはEventHallusionにおけるオブジェクト幻視指標を0.72から0.85の精度へ、0.70から0.80のF1へ改善し、誤陽性率を40.0%から17.0%へ低減。
アブレーションにより、オブジェクトごとの強度とフレームマスクを含む完全なMACD構成がRecallとPrecisionの最良のバランスを提供。
人間評価は、MACDマスクがクエリ関連の証拠に焦点を合わせ、ランダムな遮蔽よりも優れていることを確認。

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。