QUICK REVIEW

[Paper Review] See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs

Yongchang Zhang, Ou Ma | arXiv (Cornell University) | 2026. 02. 25.

Multimodal Machine Learning Applications | Citations: 0

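One-Line Summary

This paper introduces ECRD, a training-free, plug-and-play decoding framework for visually grounded multimodal reasoning in LVLMs. It uses a textual evidence pool and an on-demand visual decider to suppress hallucinations and improve accuracy.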

ABSTRACT

Recent large vision-language models (LVLMs) have demonstrated impressive reasoning ability by generating long chain-of-thought (CoT) responses. However, CoT reasoning in multimodal contexts is highly vulnerable to visual hallucination propagation: once an intermediate reasoning step becomes inconsistent with the visual evidence, subsequent steps, even if logically valid, can still lead to incorrect final answers. Existing solutions attempt to mitigate this issue by training models to "think with images" via reinforcement learning (RL). While effective, these methods are costly, model-specific, and difficult to generalize across architectures. Differently, we present a lightweight method that bypasses RL training and provides an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence, ensuring that every decoded token is justified by corresponding visual cues. Concretely, we construct a textual visual-evidence pool that guides the model's reasoning generation. When existing evidence is insufficient, a visual decider module dynamically extracts additional relevant evidence from the image based on the ongoing reasoning context, expanding the pool until the model achieves sufficient visual certainty to terminate reasoning and produce the final answer. Extensive experiments on multiple LVLM backbones and benchmarks demonstrate the effectiveness of our approach. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench, substantially reducing hallucination rates while improving reasoning accuracy without additional training.

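Research Motivation and Goals

Reduce the propagation of visual hallucinations in long multimodal reasoning without costly RL fine-tuning.
Develop a test-time supervision mechanism that uses text-based visual evidence to justify each decoding step.
Design a lightweight, model-agnostic framework that is compatible with frozen LVLM backbones.
Enable reuse of accumulated textual evidence to stabilize subsequent reasoning steps.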

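Proposed Method

Maintain a textual evidence pool alongside the base LVLM during decoding.
At each step, reweight the base next-token distribution using an evidence-induced distribution.
Invoke a lightweight visual decider only when uncertainty signals a likely hallucination, and add concise textual micro-observations to the pool.
Express all visual grounding as text to avoid repeated pixel processing and cropping during inference.
Aggregate evidence across multiple sentences using the mean evidence score over the prefix rather than a min-over-prefix KL.
Mix the evidence-induced distribution with the base distribution in a mass-matched manner to form the final mixture distribution used for token selection.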
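
The decoding step described in this list can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the entropy threshold `tau`, the mixing weight `lam`, the `evidence_scores` callable, and the `ask_decider` callable are all hypothetical names introduced here. The review does not specify the exact mass-matching rule, so a plain convex mixture (which keeps the total probability mass at 1) stands in for it.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def normalize(weights):
    """Rescale positive weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def mix(p_base, p_evid, lam):
    """Convex mixture of two distributions (assumed stand-in for mass-matched mixing)."""
    return [(1.0 - lam) * b + lam * e for b, e in zip(p_base, p_evid)]


def decode_step(p_base, evidence_scores, pool, ask_decider, lam=0.5, tau=1.0):
    """One hypothetical evidence-guided decoding step.

    p_base          -- base next-token probabilities from the LVLM
    evidence_scores -- callable(pool) -> positive per-token support scores (assumed)
    pool            -- list of textual evidence sentences gathered so far
    ask_decider     -- callable() -> one short textual micro-observation (assumed)
    """
    # Query the visual decider only when the base distribution is uncertain.
    if entropy(p_base) > tau:
        pool = pool + [ask_decider()]
    # Evidence-induced distribution: base probabilities reweighted by support.
    scores = evidence_scores(pool)
    p_evid = normalize([b * s for b, s in zip(p_base, scores)])
    # Final distribution used for (greedy) token selection.
    p_final = mix(p_base, p_evid, lam)
    token = max(range(len(p_final)), key=p_final.__getitem__)
    return token, p_final, pool


def toy_scores(pool):
    """Toy scorer: token 1 is supported once the pool mentions 'red'."""
    return [1.0, 4.0, 1.0] if any("red" in e for e in pool) else [1.0, 1.0, 1.0]


token, p_final, pool = decode_step(
    p_base=[0.4, 0.35, 0.25],
    evidence_scores=toy_scores,
    pool=[],
    ask_decider=lambda: "the sign is red",
)
```

In this toy run the base distribution is uncertain enough to trigger the decider, the new observation enters the pool, and the selected token shifts from index 0 to index 1. With a higher `tau` the decider is never queried and the base choice is kept, which is the accuracy-versus-latency trade-off that uncertainty gating is meant to control.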

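Experimental Results

Research Questions

RQ1: Can a training-free, decoding-time framework reduce hallucinations in LVLMs without fine-tuning?
RQ2: Does accumulating textual visual evidence and querying a lightweight visual decider improve grounding and final task accuracy across backbone models?
RQ3: How does uncertainty-driven evidence acquisition balance accuracy and latency in multimodal reasoning?
RQ4: Does the approach transfer across diverse LVLM backbones and benchmarks?

Key Results

ECRD delivers consistent accuracy gains across open-source backbones and scales on TreeBench.
On Qwen2.5-VL-7B, overall accuracy rises from 37.0% to 47.9% with ECRD.
ECRD improves the RH-Bench Reasoning and Perception scores as well as RH-AUC, indicating better grounding and fewer hallucinations.
ECRD improves performance on five general multimodal benchmarks (V*Bench, MathVista, ChartQA, OCRBench, HallusionBench) for models such as Qwen2.5-VL-7B and LLaVA-OneVision-7B.
Ablations show that both the supervisor and the visual decider contribute to the gains, and that the mean-over-prefix evidence score outperforms min-over-prefix KL.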
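
To make the aggregation ablation concrete, the toy sketch below contrasts mean and min aggregation over per-sentence evidence scores. The scores and function names are hypothetical and chosen only for illustration; the paper's actual scoring function and KL formulation are not reproduced here.

```python
def aggregate_mean(scores):
    """Average support across all evidence sentences in the prefix."""
    return sum(scores) / len(scores)


def aggregate_min(scores):
    """Worst-case support: a single weak sentence sets the result."""
    return min(scores)


# Hypothetical per-sentence support scores for one candidate continuation.
# Three sentences agree with it; one is only weakly related.
prefix_scores = [0.9, 0.8, 0.1, 0.85]

mean_score = aggregate_mean(prefix_scores)
min_score = aggregate_min(prefix_scores)
```

Here the mean stays near 0.66 while the min drops to 0.1, so one loosely related evidence sentence can veto an otherwise well-supported continuation under min aggregation. That sensitivity is one plausible reading of why the mean variant performed better in the ablation, though the review itself does not state the reason.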


This review was generated by AI and reviewed by a human editor.