QUICK REVIEW

[Paper Review] See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs

Yongchang Zhang, Ou Ma|arXiv (Cornell University)|Feb 25, 2026

Multimodal Machine Learning Applications | Citations: 0

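One-line summary

This paper proposes ECRD, a training-free, plug-and-play decoding framework for visually grounded multimodal reasoning. It uses a textual evidence pool and an on-demand visual decider to suppress hallucinations and improve accuracy.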

ABSTRACT

Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps, even if logically valid, can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to "think with images" via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. Differently, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model's reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.

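Motivation and objectives

Suppress the propagation of visual hallucinations in long-chain multimodal reasoning without costly RL fine-tuning.
Develop a test-time supervision mechanism that uses text-based visual evidence to justify each decoding step.
Design a lightweight, model-agnostic framework compatible with frozen LVLM backbones.
Enable reuse of accumulated textual evidence to stabilize subsequent reasoning steps.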

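Proposed method

During decoding, maintain a textual evidence pool alongside the base LVLM.
At each step, reweight the base next-token distribution using an evidence-guided distribution.
Invoke a lightweight visual decider only when uncertainty signals a likely hallucination, and have it add concise textual micro-observations to the pool.
Express all visual grounding as text to avoid repeated pixel processing and cropping during inference.
Use a mean-of-prefix evidence score rather than a min-of-prefix KL, aggregating evidence across multiple sentences.
Mix the evidence-guided distribution with the base distribution in a mass-matched manner to form the final mixed distribution used for token selection.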
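The reweight, gate, and mix steps above can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation: the exponential reweighting, the convex mixture with weight `lam`, the entropy-based trigger, and every function name here are hypothetical stand-ins, since this review does not give the paper's exact formulas (in particular, the mass-matching scheme is approximated by a plain convex combination).

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def evidence_guided(base_probs, evidence_scores, temperature=1.0):
    """Assumed reweighting: scale each base probability by exp(score / T), then renormalize."""
    weighted = [p * math.exp(s / temperature)
                for p, s in zip(base_probs, evidence_scores)]
    total = sum(weighted)
    return [w / total for w in weighted]

def mix(base_probs, guided_probs, lam=0.5):
    """Assumed mixing rule: a convex combination, which stays normalized."""
    return [(1.0 - lam) * b + lam * g for b, g in zip(base_probs, guided_probs)]

def needs_visual_decider(probs, threshold):
    """Assumed trigger: query the visual decider only when entropy is high."""
    return entropy(probs) > threshold

# Toy 3-token vocabulary; the scores are hypothetical per-token evidence support.
base = softmax([2.0, 1.9, 0.1])
scores = [0.0, 1.5, -1.0]
guided = evidence_guided(base, scores)
final = mix(base, guided, lam=0.5)
```

In this toy case the base model slightly prefers token 0, while the evidence scores shift the final mixture toward token 1. The entropy gate is one plausible way to keep the visual decider off the common path, so extra evidence is extracted only when the distribution is uncertain.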

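Experimental results

Research questions

RQ1: Can a training-free, decode-time framework reduce hallucinations in LVLMs without fine-tuning?
RQ2: Does accumulating textual visual evidence and querying a lightweight visual decider improve grounding and end-task accuracy across backbone models?
RQ3: How does uncertainty-driven evidence acquisition balance accuracy and latency in multimodal reasoning?
RQ4: Does the approach transfer across diverse LVLM backbones and benchmarks?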

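Key findings

ECRD yields consistent accuracy gains across open-source backbones, and the gains scale on TreeBench.
On Qwen2.5-VL-7B, overall accuracy rises from 37.0% to 47.9%.
ECRD improves the RH-Bench Reasoning and Perception scores as well as RH-AUC, indicating better grounding and fewer hallucinations.
For models such as Qwen2.5-VL-7B and LLaVA-OneVision-7B, performance improves across five general multimodal benchmarks: V*Bench, MathVista, ChartQA, OCRBench, and HallusionBench.
Ablations show that both the supervisor and the visual decider contribute to the gains, and that the mean-of-prefix evidence score outperforms the min-of-prefix KL.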
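To give some intuition for why a mean-style aggregate might behave differently from a min-style one when evidence comes from several sentences, the sketch below compares the two on made-up numbers. Everything here is an assumption for illustration: the per-sentence support scores, the candidate names, and the `pick` helper are hypothetical, and plain scores stand in for the KL-based quantity the paper's min-of-prefix variant uses.

```python
def mean_aggregate(scores):
    """Average support across evidence sentences."""
    return sum(scores) / len(scores)

def min_aggregate(scores):
    """Worst-case support: a single weak sentence decides the result."""
    return min(scores)

def pick(per_sentence_scores, aggregate):
    """Return the candidate whose aggregated evidence score is highest."""
    return max(per_sentence_scores,
               key=lambda cand: aggregate(per_sentence_scores[cand]))

# Hypothetical support for two candidate tokens from three evidence sentences.
# "red" is strongly supported by two sentences; the third is simply unrelated.
per_sentence = {
    "red": [0.9, 0.8, 0.1],
    "blue": [0.4, 0.4, 0.4],
}

mean_choice = pick(per_sentence, mean_aggregate)
min_choice = pick(per_sentence, min_aggregate)
```

With these numbers the mean selects "red" while the min selects "blue", because one unrelated sentence drags down the worst-case score. This only illustrates the general trade-off; it does not reproduce the ablation reported in the paper.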

This review was generated by AI and checked by a human editor.