QUICK REVIEW

[Paper Review] V-Retrver: Evidence-Driven Agentic Reasoning for Universal Multimodal Retrieval

Dongyang Chen, Wang, Chaoyang | arXiv (Cornell University) | Feb 5, 2026

Multimodal Machine Learning Applications | Cited by: 0

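One-Sentence Summary

V-Retrver introduces an evidence-driven agentic retrieval framework that interleaves multimodal reasoning with targeted visual verification through external tools, significantly improving retrieval accuracy and generalization on multimodal benchmarks.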

ABSTRACT

Multimodal Large Language Models (MLLMs) have recently been applied to universal multimodal retrieval, where Chain-of-Thought (CoT) reasoning improves candidate reranking. However, existing approaches remain largely language-driven, relying on static visual encodings and lacking the ability to actively verify fine-grained visual evidence, which often leads to speculative reasoning in visually ambiguous cases. We propose V-Retrver, an evidence-driven retrieval framework that reformulates multimodal retrieval as an agentic reasoning process grounded in visual inspection. V-Retrver enables an MLLM to selectively acquire visual evidence during reasoning via external visual tools, performing a multimodal interleaved reasoning process that alternates between hypothesis generation and targeted visual verification. To train such an evidence-gathering retrieval agent, we adopt a curriculum-based learning strategy combining supervised reasoning activation, rejection-based refinement, and reinforcement learning with an evidence-aligned objective. Experiments across multiple multimodal retrieval benchmarks demonstrate consistent improvements in retrieval accuracy (with 23.0% improvements on average), perception-driven reasoning reliability, and generalization.

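Motivation and Objectives

Address the limitations of language-driven retrieval that relies on static visual encodings.
Develop an evidence-based agentic reranking framework for universal multimodal retrieval.
Align reasoning, tool use, and ranking through curriculum-based training.
Enable dynamic visual verification during reasoning to resolve fine-grained ambiguities.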

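Proposed Method

Embed-and-propose: An embedding model retrieves the top-K candidates, shrinking the candidate pool from N to K (K << N).
Multimodal interleaved evidence reasoning (MIER): The agent iterates between hypothesis generation and visual evidence verification through external tools.
Visual Tools: SELECT-IMAGE picks a candidate for inspection; ZOOM-IN analyzes local attributes.
Curriculum-based training: Stage I is cold-start SFT on synthesized CoT data; Stage II is rejection-sampling fine-tuning; Stage III is Evidence-Aligned Policy Optimization (EAPO) via GRPO.
Evidence-aligned rewards: Format-compliance, soft-ranking, and tool-use rewards guide policy optimization.
Optimization: A GRPO-based objective with normalized advantages trains the agent.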
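To make the first and last items above concrete, here is a minimal sketch of top-K candidate proposal and of group-normalized advantages in the GRPO style. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy embeddings, the equal reward weights, and the use of the population standard deviation are all hypothetical choices made here.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propose_top_k(query_vec, candidate_vecs, k):
    """Embed-and-propose: indices of the k candidates most similar to the query."""
    order = sorted(
        range(len(candidate_vecs)),
        key=lambda i: cosine(query_vec, candidate_vecs[i]),
        reverse=True,
    )
    return order[:k]


def total_reward(format_r, rank_r, tool_r, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three reward terms. Weights are placeholders (assumption)."""
    w_f, w_r, w_t = weights
    return w_f * format_r + w_r * rank_r + w_t * tool_r


def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts sampled for the same query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy data (hypothetical): 4 candidates in a 2-D embedding space, so N = 4, K = 2.
candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
top_k = propose_top_k(query, candidates, k=2)

# Hypothetical (format, soft-ranking, tool-use) scores for 4 sampled trajectories.
rollouts = [(1.0, 1.0, 1.0), (1.0, 0.5, 0.0), (1.0, 0.0, 1.0), (0.0, 0.0, 0.0)]
rewards = [total_reward(*r) for r in rollouts]
advantages = group_normalized_advantages(rewards)
```

In this toy run the reranking agent would only see the two proposed candidates, and trajectories that score above the group mean receive positive advantages while those below it receive negative ones.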

Figure 1 : Comparison between text-based CoT (left) and multimodal interleaved CoT (right) for multimodal retrieval. Text-based CoT relies on language-driven inference over static visual representations, often failing to resolve fine-grained differences. In contrast, V-Retrver performs multimodal in

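Experimental Results

Research Questions

RQ1: Does V-Retrver outperform strong baselines on universal multimodal retrieval benchmarks?
RQ2: Does interleaving visual reasoning with external tools improve the reliability of grounding and ranking in visually ambiguous cases?
RQ3: How does the three-stage curriculum affect reasoning quality, tool use, and retrieval performance?
RQ4: How well does the method generalize to unseen domains and unseen modality combinations?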

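Key Findings

V-Retrver-7B reaches 69.7% average Recall on M-BEIR, a new state of the art and 4.9 percentage points above the strongest baseline.
The model excels at fine-grained visual discrimination, especially on tasks that require detailed visual inspection (e.g., the (q^i, q^t) → c^i tasks on FIQ and CIRR).
Zero-shot evaluation shows strong generalization to unseen datasets (e.g., CIRCO MAP@5 of 48.2; GeneCIS R@1 of 30.7).
On held-out tasks, average Recall is 61.1%, 10.2% above the previous best.
Ablation studies show that the full three-stage curriculum with tool-based reasoning performs best (67.2% average Recall), and that visual tools clearly outperform the text-only CoT baseline (67.2% vs. 61.8%).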
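The findings above are reported mostly as Recall. As a reading aid, the sketch below shows one common way to compute Recall@k averaged over queries, plus the gap between the two ablation numbers (67.2% and 61.8%) in percentage points. The function names and toy runs are hypothetical, and the exact k and averaging protocol used per benchmark are not stated in this summary.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items that appear among the top-k ranked items."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_recall_at_k(runs, k):
    """Average Recall@k over (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(recall_at_k(ranked, rel, k) for ranked, rel in runs) / len(runs)


# Hypothetical toy runs, not results from the paper.
runs = [
    (["a", "b", "c"], {"a"}),  # relevant item at rank 1
    (["d", "e", "f"], {"f"}),  # relevant item at rank 3
    (["g", "h", "i"], {"z"}),  # relevant item not retrieved
]
r_at_1 = mean_recall_at_k(runs, k=1)
r_at_3 = mean_recall_at_k(runs, k=3)

# Gap between the tool-based and text-only ablation results, in percentage points.
gap_pp = round(67.2 - 61.8, 1)
```

Note that a gap in percentage points is an absolute difference between two rates; it should not be read as a relative improvement.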

Figure 2 : Overview of the V-Retrver framework. The left panel illustrates the inference pipeline, featuring a coarse-to-fine process with embedding-based retrieval and agentic reranking. The right panel details the three training stages we proposed, including Cold Start, Rejection sampling Fine-Tun


This review was generated by AI and reviewed by human editors.