QUICK REVIEW

[Paper Review] See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs

Yongchang Zhang, Ou Ma | arXiv (Cornell University) | Feb 25, 2026

Multimodal Machine Learning Applications | Cited by 0

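ONE-SENTENCE SUMMARY

The paper introduces ECRD, a training-free, plug-and-play decoding framework for LVLMs that uses a textual evidence pool and an on-demand visual decider to suppress hallucinations and improve accuracy.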

ABSTRACT

Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps, even if logically valid, can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to "think with images" via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. Differently, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model's reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.

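RESEARCH MOTIVATION AND OBJECTIVES

Reduce the propagation of visual hallucinations in long-chain multimodal reasoning without costly RL fine-tuning.
Develop a test-time supervision mechanism that uses textual visual evidence to justify each decoding step.
Design a lightweight, model-agnostic framework compatible with frozen LVLM backbones.
Enable reuse of accumulated textual evidence to stabilize subsequent reasoning steps.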

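PROPOSED METHOD

During decoding, maintain a textual evidence pool that stays in sync with the base LVLM.
At each step, reweight the base next-token distribution using an evidence-induced distribution.
Trigger a lightweight visual decider only when uncertainty signals a likely hallucination; the decider adds concise textual micro-observations to the evidence pool.
Express all visual grounding as text to avoid repeated pixel processing and cropping during reasoning.
Aggregate multi-sentence evidence using the mean of evidence scores over the prefix rather than the minimum of the prefix KL.
Mix the evidence-induced distribution with the base distribution in a mass-matched manner to form the final mixture distribution used for token selection.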
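The steps above can be illustrated with a minimal sketch of a single decoding step. This is not the paper's implementation: the summary gives no formulas, so the entropy-based uncertainty test, the convex mixture (one simple reading of "mass-matched", since it keeps total probability mass at 1), and every name here (`decode_step`, `pool_scores`, `alpha`, `entropy_threshold`) are assumptions made for illustration. Evidence scores are modeled as hypothetical per-sentence scores over a toy vocabulary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; used here as an assumed uncertainty signal."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_scores(per_sentence_scores):
    """Average the per-sentence evidence scores token by token."""
    n = len(per_sentence_scores)
    return [sum(col) / n for col in zip(*per_sentence_scores)]

def mix(base_probs, evidence_probs, alpha):
    """Convex mixture of two distributions; the result still sums to 1."""
    return [(1.0 - alpha) * b + alpha * e
            for b, e in zip(base_probs, evidence_probs)]

def decode_step(base_logits, pool_scores, alpha=0.5, entropy_threshold=1.0):
    """Pick the next token and report whether the visual decider should run.

    base_logits: next-token logits from the frozen base model.
    pool_scores: one score vector per evidence sentence (hypothetical format).
    """
    base_probs = softmax(base_logits)
    # Assumed trigger: high entropy in the base distribution.
    call_decider = entropy(base_probs) > entropy_threshold
    if not pool_scores:
        final = base_probs  # no evidence yet, fall back to the base model
    else:
        evidence_probs = softmax(mean_scores(pool_scores))
        final = mix(base_probs, evidence_probs, alpha)
    token = max(range(len(final)), key=final.__getitem__)
    return token, call_decider, final

# Toy example: the base model slightly prefers token 0,
# but both evidence sentences favor token 1.
token, call_decider, final = decode_step(
    [2.0, 1.9, 0.0],
    [[0.0, 3.0, 0.0], [0.0, 1.0, 0.0]],
    alpha=0.5,
    entropy_threshold=0.5,
)
print(token, call_decider)
```

In this toy run the evidence shifts the choice from token 0 to token 1, and the near-tie in the base distribution exceeds the assumed entropy threshold, so the sketch would also ask the decider for more evidence. In a real system the decider would append new textual observations to the pool before the next step.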

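EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a training-free, decoding-time framework reduce LVLM hallucinations without fine-tuning?
RQ2: Does accumulating textual visual evidence and querying a lightweight visual decider improve grounding and final task accuracy across different backbones?
RQ3: How does uncertainty-based evidence acquisition balance accuracy and latency in multimodal reasoning?
RQ4: Does the method transfer across different LVLM backbones and benchmarks?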

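Key Findings

ECRD yields consistent accuracy gains on open-source backbones and scales on TreeBench.
On Qwen2.5-VL-7B, overall accuracy improves from 37.0% to 47.9% with ECRD.
ECRD improves RH-Bench Reasoning and Perception scores as well as RH-AUC, indicating better grounding and fewer hallucinations.
ECRD improves models such as Qwen2.5-VL-7B and LLaVA-OneVision-7B on five general multimodal benchmarks (V*Bench, MathVista, ChartQA, OCRBench, HallusionBench).
Ablations show that both the supervisor and the visual decider contribute to the gains, and that the mean of evidence scores over the prefix outperforms the minimum of the prefix KL.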
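To relate the Qwen2.5-VL-7B figures to the percentages quoted in the abstract, it helps to separate absolute from relative gains. The short calculation below uses only the two accuracies reported above; the helper name `improvement` is illustrative.

```python
def improvement(before, after):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = after - before
    relative = absolute / before * 100.0
    return absolute, relative

abs_gain, rel_gain = improvement(37.0, 47.9)
print(f"{abs_gain:.1f} points absolute, {rel_gain:.1f}% relative")
# prints: 10.9 points absolute, 29.5% relative
```

The 29.5% relative gain coincides with the upper end of the 16.5%-29.5% TreeBench range in the abstract. That suggests the abstract reports relative improvements and that the 37.0% to 47.9% result is a TreeBench number, although this summary does not state either point explicitly.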


This review was generated by AI and reviewed by human editors.