QUICK REVIEW

[Paper Review] BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence

Biao Xiang, Soyeon Caren Han | arXiv (Cornell University) | Mar 9, 2026

Topic Modeling | Cited by 0

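One-Sentence Summary

BRIDGE is a long-document multimodal QA benchmark built from scientific papers. It requires multi-hop reasoning across text, tables, and figures with explicit grounding evidence, and it provides step-level annotations and evaluation.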

ABSTRACT

Multi-hop question answering (QA) is widely used to evaluate the reasoning capabilities of large language models, yet most benchmarks focus on final answer correctness and overlook intermediate reasoning, especially in long multimodal documents. We introduce BRIDGE, a benchmark for multi-hop reasoning over long scientific papers that require integrating evidence across text, tables, and figures. The dataset supports both chain-like and fan-out structures and provides explicit multi-hop reasoning annotations for step-level evaluation beyond answer accuracy. Experiments with state-of-the-art LLMs and multimodal retrieval-augmented generation (RAG) systems reveal systematic deficiencies in evidence aggregation and grounding that remain hidden under conventional answer-only evaluation. BRIDGE provides a targeted testbed for diagnosing reasoning failures in long multimodal documents.

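Research Motivation and Goals

Evaluate multi-hop reasoning over long, heterogeneous scientific documents, going beyond final-answer accuracy.
Provide explicit intermediate reasoning annotations and a structured error taxonomy to enable fine-grained analysis.
Support both chain-like and fan-out reasoning structures in order to diagnose failures in grounding and evidence coverage.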

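Proposed Method

BRIDGE is built from 262 long scientific papers (mainly from NLP and computer vision), using layout-aware extraction to parse PDF/LaTeX sources into text, tables, and figures.
Multi-hop QA pairs are generated with a two-stage prompting framework (Structure Mining followed by Constraint-Guided Generation), covering three question types: causal, comparative, and abstractive.
A two-stage quality filter is applied: a rule-based pre-filter, then an LLM-based judge that assesses grounding, faithfulness, and reasoning depth.
Each QA pair is annotated with an explicit evidence chain spanning pages and modalities, to support step-level evaluation.
Evaluation uses a unified pipeline: several LLMs serve as generators, a designated LLM judge scores answer correctness and evidence alignment, and ROUGE/BLEU serve as lexical metrics.
Results are analyzed through a detailed error taxonomy and breakdowns by question type, page depth, and evidence modality.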
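To make the idea of step-level evaluation concrete, here is a minimal Python sketch of an evidence chain and a simple evidence-overlap score. It is only an illustration under stated assumptions: the `EvidenceHop` schema, its field names, and the set-overlap metric are hypothetical and are not taken from the paper, which uses an LLM judge for evidence alignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceHop:
    """One step of an evidence chain (hypothetical schema, not BRIDGE's)."""
    page: int       # 1-based page index within the paper
    modality: str   # "text", "table", or "figure"
    ref: str        # illustrative locator such as "Table 3"


def evidence_scores(gold, predicted):
    """Set-overlap precision, recall, and F1 between two evidence chains.

    This is a simplified stand-in for evidence alignment; it ignores hop order.
    """
    gold_set, pred_set = set(gold), set(predicted)
    hits = len(gold_set & pred_set)
    precision = hits / len(pred_set) if pred_set else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def is_cross_modal(chain):
    """True if the chain draws on more than one modality."""
    return len({hop.modality for hop in chain}) > 1


def pages_spanned(chain):
    """Number of distinct pages the chain touches."""
    return len({hop.page for hop in chain})
```

A score of this kind can be low even when the final answer is correct, which is the gap that answer-only evaluation hides.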

Figure 1. Representative examples of comparative (Cp), abstractive (Ab), and causal reasoning (Re) questions (top), and the corresponding pages where the evidence is located (bottom). Mod.: modalities involved (T: text; Tb: table; F: figure).

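Experimental Results

Research Questions

RQ1: How do state-of-the-art LLMs and multimodal retrieval-augmented generation systems perform on long multimodal scientific documents that require multi-hop reasoning?
RQ2: How strongly do models depend on grounding, and how do evidence grounding and cross-modal consistency affect the final answer?
RQ3: What are the main failure modes (evidence aggregation, grounding, coverage) in long multimodal document QA, and how does the retrieval strategy affect end-to-end performance?
RQ4: How do question type (causal, comparative, abstractive) and modality (text, table, figure) affect model performance and grounding?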

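Key Findings

BRIDGE contains 11,857 QA pairs annotated with evidence chains, spanning three question types and diverse hop patterns.
ColPali-based RAG retrieval substantially lowers end-to-end QA performance in the long multimodal multi-hop setting, indicating a mismatch between retrieval and grounding.
Judge-based metrics show ChatGPT reaching the highest judged accuracy across multiple strategies, and stronger models generally outperform smaller ones; however, lexical-overlap metrics (ROUGE/BLEU) can diverge from factual grounding.
Performance generally drops as evidence moves to deeper pages, and table evidence is usually harder than text or figure evidence for most models.
Causal questions are relatively stable for strong models, whereas comparative questions are the most challenging under retrieval-based pipelines.
In the cross-modal evaluation, figure evidence is easier than table evidence for strong models, and table-dominated questions are hurt most by grounding gaps.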
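Breakdowns like those above (by question type or evidence modality) amount to grouping per-question judgments and averaging them. The following sketch shows one way this could be done. The record fields and the `toy_records` values are invented placeholders for illustration only and are not results from the paper.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Fraction of records judged correct, grouped by the given field."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        correct[group] += int(record["correct"])
    return {group: correct[group] / totals[group] for group in totals}


# Placeholder judgments (made up for this example, NOT data from BRIDGE).
toy_records = [
    {"question_type": "causal", "modality": "text", "correct": True},
    {"question_type": "causal", "modality": "table", "correct": False},
    {"question_type": "comparative", "modality": "table", "correct": False},
    {"question_type": "comparative", "modality": "figure", "correct": True},
    {"question_type": "abstractive", "modality": "text", "correct": True},
]
```

Calling `accuracy_by(toy_records, "modality")` groups the same judgments by evidence modality instead of question type, so a single list of judged answers can feed every breakdown.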

Figure 2. Distribution of QA instances by hop depth, number of distinct pages involved, and hop pattern, broken down by question type (Abstractive, Causal, Comparative).

This review was generated by AI and reviewed by a human editor.