QUICK REVIEW

[Paper Review] SciMDR: Benchmarking and Advancing Scientific Multimodal Document Reasoning

Ziyu Chen, Yilun Zhao|arXiv (Cornell University)|Mar 12, 2026

Topic: Topic Modeling | Citations: 0

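One-Sentence Summary

This paper proposes a two-stage synthesize-and-reground framework for generating large-scale, faithful, and realistic scientific multimodal QA data and a benchmark, with the aim of improving both the training and the evaluation of models for multimodal scientific document reasoning.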

ABSTRACT

Constructing scientific multimodal document reasoning datasets for foundation model training involves an inherent trade-off among scale, faithfulness, and realism. To address this challenge, we introduce the synthesize-and-reground framework, a two-stage pipeline comprising: (1) Claim-Centric QA Synthesis, which generates faithful, isolated QA pairs and reasoning on focused segments, and (2) Document-Scale Regrounding, which programmatically re-embeds these pairs into full-document tasks to ensure realistic complexity. Using this framework, we construct SciMDR, a large-scale training dataset for cross-modal comprehension, comprising 300K QA pairs with explicit reasoning chains across 20K scientific papers. We further construct SciMDR-Eval, an expert-annotated benchmark to evaluate multimodal comprehension within full-length scientific workflows. Experiments demonstrate that models fine-tuned on SciMDR achieve significant improvements across multiple scientific QA benchmarks, particularly in those tasks requiring complex document-level reasoning.

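Motivation and Objectives

Resolve the trade-off between faithfulness and realism when synthesizing scientific QA data.
Propose a two-stage pipeline that produces scalable, high-faithfulness data while retaining realistic full-document context.
Create an expert-annotated benchmark for evaluating multimodal scientific document reasoning over long documents.
Assess how the quality of synthetic data affects performance on real-world scientific QA.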

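Proposed Method

Introduce Claim-Centric QA Synthesis, which generates atomic QA pairs with explicit reasoning chains from isolated, focused contexts.
Guide backward reasoning with ground-truth factual claims so that the generated output stays faithful.
Apply Document-Scale Regrounding, which embeds the QA pairs into full-document contexts together with explicit information-locating instructions.
Use information-locating templates to teach the model how to find evidence within a complete paper.
Adopt two-stage training: stage 1 trains on VQA/TQA data and stage 2 on MQA data, followed by a final fine-tuning on SPIQA.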
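The regrounding step can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the `IsolatedQA` and `DocumentQA` types, the `reground` function, and the wording of `LOCATE_TEMPLATE` are all hypothetical, and the document is modeled as a plain mapping from segment labels to text rather than real multimodal content.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class IsolatedQA:
    """A QA pair synthesized from one focused segment (stage 1 output)."""
    question: str
    reasoning: str
    answer: str
    evidence_id: str  # hypothetical label of the source segment, e.g. "Figure 3"


@dataclass
class DocumentQA:
    """The same QA pair re-embedded into a full-document task (stage 2 output)."""
    context: str
    question: str
    target: str


# Hypothetical locating template; the paper's actual templates are not reproduced here.
LOCATE_TEMPLATE = "First locate {evidence_id} in the paper, then reason from it."


def reground(qa: IsolatedQA, document: Dict[str, str]) -> DocumentQA:
    """Attach the full document and prepend an evidence-locating step to the target."""
    if qa.evidence_id not in document:
        raise KeyError(f"evidence {qa.evidence_id!r} not found in document")
    # The context holds every segment, not only the one used during synthesis.
    context = "\n\n".join(f"[{seg_id}]\n{text}" for seg_id, text in document.items())
    locate = LOCATE_TEMPLATE.format(evidence_id=qa.evidence_id)
    target = f"{locate}\n{qa.reasoning}\nAnswer: {qa.answer}"
    return DocumentQA(context=context, question=qa.question, target=target)
```

In this sketch the question and answer stay exactly as synthesized on the focused segment, which is where faithfulness comes from, while the context grows to the whole paper and the target gains a locating step, which is where the realistic difficulty comes from.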

Figure 1: The Faithfulness-Realism Dilemma in scientific data synthesis and our proposed solution. Existing approaches face an inherent trade-off: simplifying context ensures faithfulness but lacks real-world complexity, while generating directly from full documents ensures realism but risks halluci

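Experimental Results

Research Questions

RQ1: Does fine-tuning on the synthetic data improve model performance on scientific reasoning tasks?
RQ2: Does the proposed data synthesis pipeline produce training data that substantially improves real-world scientific reasoning?
RQ3: How do models handle long-context, noisy scientific documents in multimodal QA tasks?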

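Key Findings

A large-scale training dataset was constructed: roughly 300K QA pairs with reasoning chains, drawn from 20K papers.
An expert-annotated benchmark (907 QA pairs) was established for document-level multimodal QA that requires locating evidence in long documents.
Fine-tuning Qwen2.5-VL-7B and LLaVA-1.5-7B on the generated data yields significant gains on multiple benchmarks, especially on tasks requiring complex document-level reasoning.
Ablation studies indicate that high-quality reasoning chains in the synthetic data are valuable for learning robust multimodal reasoning under long-context noise.
Model comparisons show that the proposed approach can outperform baselines on several scientific QA benchmarks (ChartQA, CharXiv, SPIQA).
The paper notes that even relatively small (7B) models, once trained on this high-faithfulness data, can match or exceed larger baselines.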

Figure 2: Overview of the synthesize-and-reground framework. The pipeline operates in two stages: Claim-Centric QA Synthesis ensures faithfulness by extracting atomic claims and employing backward reasoning to generate QA pairs with chain-of-thought; Document-Scale Re-grounding ensures realism by re

This review was generated by AI and checked by a human editor.