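[Paper Digest] Multimodal Fact-Level Attribution for Verifiable Reasoning
Summary: The paper introduces MURGAT, a multimodal reasoning benchmark that requires models to produce verifiable, citation-backed answers, together with an automatic scoring pipeline (MURGAT-SCORE) that evaluates claim grounding and citation quality.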
Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
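Motivation and Goals
- Enable trustworthy, verifiable multimodal reasoning over heterogeneous inputs.
- Introduce MURGAT to evaluate fact-level attribution with citations that pin down the exact modality and timestamps.
- Decompose evaluation into verifiable-claim identification, atomic-fact decomposition, and attribution quality.
- Develop an automated, scalable metric (MURGAT-SCORE) that correlates strongly with human judgments.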
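Proposed Method
- Define the MURGAT task: an MLLM answers questions with reasoning and citations, where each citation is tied to a specific modality and timestamp range.
- Evaluate three subtasks: verifiable-claim identification, atomic-fact decomposition, and attribution quality.
- Decontextualize each atomic fact and pair it with its citation set to measure whether the citations entail that fact.
- Score attribution quality with recall, precision, and F1, then combine it with coverage to form MURGAT-SCORE.
- Build the automatic evaluation (MURGAT-SCORE) and validate it against human annotations on the WorldSense and Video-MMMU datasets.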
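The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's pipeline: the `Citation` and `AtomicFact` types, the judge fields `set_entails` and `supports`, the fact-level definition of coverage, and the choice to multiply coverage by attribution F1 are all hypothetical, since this digest does not spell out the exact formulas. Recall is taken here as the share of cited facts whose full citation set entails them, and precision as the share of individual citations judged to support their fact. The three facts are toy data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Citation:
    modality: str   # e.g. "video" or "audio"
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds


@dataclass
class AtomicFact:
    text: str                                      # decontextualized atomic fact
    citations: list = field(default_factory=list)
    # Hypothetical judge outputs, supplied by an external verifier:
    set_entails: bool = False                      # do all citations together entail the fact?
    supports: list = field(default_factory=list)   # one bool per citation

    def __post_init__(self):
        if len(self.supports) != len(self.citations):
            raise ValueError("need one support judgment per citation")


def coverage(facts):
    """Fraction of atomic facts that carry at least one citation (assumed definition)."""
    if not facts:
        return 0.0
    return sum(1 for f in facts if f.citations) / len(facts)


def attribution_prf(facts):
    """Precision, recall, and F1 over the facts that have citations."""
    cited = [f for f in facts if f.citations]
    if not cited:
        return 0.0, 0.0, 0.0
    recall = sum(1 for f in cited if f.set_entails) / len(cited)
    total_citations = sum(len(f.citations) for f in cited)
    precision = sum(sum(f.supports) for f in cited) / total_citations
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def combined_score(facts):
    """Assumed combination rule: coverage times attribution F1."""
    return coverage(facts) * attribution_prf(facts)[2]


# Toy facts, invented for illustration only.
facts = [
    AtomicFact("The speaker names the theorem.",
               [Citation("audio", 12.0, 18.5), Citation("video", 40.0, 44.0)],
               set_entails=True, supports=[True, False]),
    AtomicFact("The slide shows a bar chart.",
               [Citation("video", 60.0, 65.0)],
               set_entails=False, supports=[False]),
    AtomicFact("The result follows from the two steps."),  # uncited fact
]
print(coverage(facts), attribution_prf(facts), combined_score(facts))
```

Keeping coverage separate from attribution makes the two failure modes visible: a response can cite every fact and still score poorly if the cited segments do not entail what is claimed.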
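Experimental Results
Research Questions
- RQ1: Can MLLMs produce verifiable, well-cited answers across modalities and temporal segments?
- RQ2: How closely do current models' grounding and citing of evidence agree with human judgments on multimodal reasoning tasks?
- RQ3: What trade-offs exist among reasoning depth, grounding precision, and citation reliability in multimodal tasks?
Key Findings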
| Model | Method | WorldSense Coverage | WorldSense Attribution | WorldSense MURGAT-S | WorldSense Acc | Video-MMMU Coverage | Video-MMMU Attribution | Video-MMMU MURGAT-S | Video-MMMU Acc |
|---|---|---|---|---|---|---|---|---|---|
| Gemini-2.5-Flash | + CITATION | 81.2 | 65.4 | 54.1 | 66.5 | 63.0 | 63.4 | 41.5 | 84.9 |
| Gemini-2.5-Flash | + POST-HOC ATTRIBUTION | 97.4 | 62.3 | 60.8 | 62.3 | 73.8 | 44.9 | 38.0 | 84.2 |
| Gemini-3-Flash | + CITATION | 95.9 | 66.5 | 64.4 | 66.2 | 88.2 | 64.5 | 56.9 | 86.0 |
| Gemini-3-Flash | + POST-HOC ATTRIBUTION | 95.1 | 71.4 | 69.2 | 67.0 | 87.9 | 47.2 | 44.1 | 86.8 |
| Gemini-3-Pro | + CITATION | 78.3 | 64.9 | 51.7 | 70.0 | 63.4 | 67.3 | 41.8 | 86.0 |
| Gemini-3-Pro | + POST-HOC ATTRIBUTION | 97.0 | 67.1 | 65.2 | 71.4 | 68.0 | 43.7 | 36.9 | 85.3 |
| Qwen3-Omni-Instruct | + CITATION | 47.6 | 53.3 | 29.0 | 54.0 | 34.6 | 21.8 | 9.8 | 40.0 |
| Qwen3-Omni-Instruct | + POST-HOC ATTRIBUTION | 99.5 | 45.7 | 45.4 | 57.0 | 95.1 | 17.9 | 17.6 | 45.0 |
| Qwen3-Omni-Thinking | + CITATION | 52.7 | 56.3 | 31.3 | 61.0 | 36.3 | 7.6 | 4.8 | 51.0 |
| Qwen3-Omni-Thinking | + POST-HOC ATTRIBUTION | 93.2 | 60.0 | 56.3 | 56.5 | 76.3 | 16.8 | 12.8 | 53.0 |
| Qwen3-VL-Instruct | + CITATION | 39.0 | 52.0 | 25.5 | 48.0 | 30.2 | 40.1 | 17.5 | 55.0 |
| Qwen3-VL-Instruct | + POST-HOC ATTRIBUTION | 98.9 | 70.2 | 69.4 | 69.4 | 93.4 | 44.6 | 42.3 | 53.0 |
| Qwen3-VL-Thinking | + CITATION | 38.5 | 56.1 | 30.8 | 49.0 | 23.2 | 15.1 | 7.6 | 60.0 |
| Qwen3-VL-Thinking | + POST-HOC ATTRIBUTION | 76.6 | 58.9 | 48.2 | 47.0 | 54.3 | 31.5 | 18.9 | 51.0 |
| Molmo2 | + CITATION | 69.1 | 50.2 | 39.7 | 40.0 | 82.6 | 21.4 | 19.3 | 44.3 |
| Molmo2 | + POST-HOC ATTRIBUTION | 75.0 | 38.3 | 33.2 | 41.0 | 66.4 | 15.0 | 11.4 | 50.5 |
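- Strong MLLMs often answer correctly yet hallucinate citations and fall short on attribution.
- MURGAT-SCORE correlates strongly with human judgments (average end-to-end correlation of 0.84) and outperforms LLM-as-a-judge baselines.
- There is a trade-off: increasing reasoning depth or enforcing structured grounding can reduce accuracy on complex tasks.
- Programmatic grounding and scaled-up thinking improve attribution but can decouple reasoning from verifiable evidence.
- Larger models improve grounding when given more compute, whereas smaller models see MURGAT-SCORE fall as compute increases, suggesting that latent reasoning can drift away from the evidence.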
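To make the human-agreement finding concrete, the sketch below shows how agreement between an automatic metric and human ratings could be measured. It assumes a Pearson correlation, which this digest does not confirm as the paper's choice, and the two score lists are made-up toy values, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Toy per-response scores, invented for illustration only.
human_scores = [0.9, 0.4, 0.7, 0.2, 0.6]
metric_scores = [0.8, 0.5, 0.6, 0.3, 0.7]
print(round(pearson(human_scores, metric_scores), 3))
```

A rank-based coefficient such as Spearman's would be a reasonable alternative when only the ordering of responses matters.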
This digest was generated by AI and reviewed by a human editor.