QUICK REVIEW

[Paper Review] S1-MMAlign: A Large-Scale, Multi-Disciplinary Dataset for Scientific Figure-Text Understanding

He Wang, Longteng Guo|arXiv (Cornell University)|Jan 1, 2026

Multimodal Machine Learning Applications|Cited by 0

One-Sentence Summary

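Summary: Introduces S1-MMAlign, a dataset of 15.5 million image-text pairs drawn from 2.5 million open-access papers, with AI-enhanced captions that bridge the semantic gap between figures and scientific text. Context-aware recaptioning is shown to improve cross-modal alignment.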

ABSTRACT

Multimodal learning has revolutionized general domain tasks, yet its application in scientific discovery is hindered by the profound semantic gap between complex scientific imagery and sparse textual descriptions. We present S1-MMAlign, a large-scale, multi-disciplinary multimodal dataset comprising over 15.5 million high-quality image-text pairs derived from 2.5 million open-access scientific papers. Spanning disciplines from physics and biology to engineering, the dataset captures diverse visual modalities including experimental setups, heatmaps, and microscopic imagery. To address the pervasive issue of weak alignment in raw scientific captions, we introduce an AI-ready semantic enhancement pipeline that utilizes the Qwen-VL multimodal large model series to recaption images by synthesizing context from paper abstracts and citation contexts. Technical validation demonstrates that this enhancement significantly improves data quality: SciBERT-based pseudo-perplexity metrics show reduced semantic ambiguity, while CLIP scores indicate an 18.21% improvement in image-text alignment. S1-MMAlign provides a foundational resource for advancing scientific reasoning and cross-modal understanding in the era of AI for Science. The dataset is publicly available at https://huggingface.co/datasets/ScienceOne-AI/S1-MMAlign.

Research Motivation and Goals

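Address the semantic gap between complex scientific imagery and the sparse captions found in publications.
Provide a large-scale, multi-disciplinary multimodal corpus to support scientific reasoning models.
Develop an AI-driven semantic enhancement pipeline that generates dense, context-grounded image captions.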

Proposed Method

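Collect image-text pairs from arXiv, bioRxiv, medRxiv, ChemRxiv, and Nature Communications.
Preprocess LaTeX/PDF sources to extract figures and captions, converting visual content to PNG/JPG.
Generate context-informed captions with a semantic enhancement pipeline that uses Qwen3-VL and a SigLIP-2 encoder.
Inject knowledge from paper titles, abstracts, and local citation contexts so that captions are grounded in the scientific narrative.
Run high-throughput parallel inference with vLLM and PagedAttention on an 8x H100 GPU cluster for scalability.
Store outputs as JSONL metadata and sharded image archives with cryptographic integrity checks (Xet).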
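
The context-injection and storage steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the function names, and the JSONL field names are all hypothetical. The multimodal model call (Qwen3-VL served through vLLM) is left out, and a plain SHA-256 digest stands in for the integrity check, which is an assumption and not a description of how Xet works.

```python
import hashlib
import json


def build_recaption_prompt(title, abstract, raw_caption, citation_context):
    """Assemble a text prompt that grounds a figure in its paper's context.

    The wording is illustrative only; the paper's actual prompt is not
    given in this review.
    """
    parts = [
        "You are describing a figure from a scientific paper.",
        f"Paper title: {title}",
        f"Abstract: {abstract}",
        f"Original caption: {raw_caption}",
        f"Text citing the figure: {citation_context}",
        "Write a dense, factual caption using only the image and this context.",
    ]
    return "\n".join(parts)


def make_metadata_line(image_bytes, image_path, raw_caption, enhanced_caption):
    """Serialize one image-text pair as a single JSONL line.

    Field names are hypothetical. The SHA-256 digest is a generic
    integrity check, used here as a stand-in for the Xet-based one.
    """
    record = {
        "image": image_path,
        "sha256": hashlib.sha256(image_bytes).hexdigest(),
        "raw_caption": raw_caption,
        "enhanced_caption": enhanced_caption,
    }
    return json.dumps(record, ensure_ascii=False)


# Placeholder inputs, not taken from the dataset.
prompt = build_recaption_prompt(
    title="An example paper title",
    abstract="An example abstract.",
    raw_caption="Fig. 1: Setup.",
    citation_context="As shown in Fig. 1, the apparatus consists of ...",
)
line = make_metadata_line(
    image_bytes=b"placeholder image bytes",
    image_path="shard-00000/fig1.png",
    raw_caption="Fig. 1: Setup.",
    enhanced_caption="<model output would go here>",
)
print(prompt)
print(line)
```

In a real pipeline the prompt would be sent to the model together with the image, and each returned caption would be appended to the metadata file as one line.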

Experimental Results

Research Questions

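RQ1: How can the semantic gap between scientific images and text be bridged to improve multimodal understanding?
RQ2: Does context-aware recaptioning improve cross-modal alignment for scientific images?
RQ3: Which disciplines and visual modalities are covered in a large-scale scientific image-text dataset?
RQ4: Can a semantically enhanced corpus reduce hallucinations in scientific multimodal models?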

Key Findings

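Enhanced captions improve CLIP image-text alignment by 18.21% on average over the original captions.
Enhanced captions have higher linguistic quality: the SciBERT pseudo-perplexity (pPL) distribution shifts left, indicating lower perplexity.
Caption length increases from 267 ± 261 to 759 ± 251 characters, with a coefficient of variation (CV) of about 33% for the enhanced captions.
The dataset covers physics, computer science, astronomy, biology, mathematics, and engineering; physics and computer science together account for more than half.
Using vLLM on 8x H100 GPUs, the processing pipeline scales to large-scale figure recaptioning.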
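
The caption-length figures above can be checked with a few lines of arithmetic. The sketch below computes the coefficient of variation (standard deviation divided by mean) from the reported values, and shows how a relative improvement such as the 18.21% CLIP gain would be computed. The helper names are hypothetical, and the two CLIP scores passed to `relative_improvement` are made-up placeholders, since this review does not report the underlying scores.

```python
def coefficient_of_variation(mean, std):
    """Return the coefficient of variation as a percentage (std / mean)."""
    return 100.0 * std / mean


def relative_improvement(baseline, improved):
    """Return the percentage gain of `improved` over `baseline`."""
    return 100.0 * (improved - baseline) / baseline


# Caption lengths in characters, as reported in the findings above.
raw_cv = coefficient_of_variation(mean=267, std=261)
enhanced_cv = coefficient_of_variation(mean=759, std=251)

# Hypothetical CLIP scores, chosen only to illustrate the formula.
example_gain = relative_improvement(baseline=0.25, improved=0.30)

print(f"raw caption CV:      {raw_cv:.1f}%")
print(f"enhanced caption CV: {enhanced_cv:.1f}%")
print(f"example gain:        {example_gain:.1f}%")
```

With the reported numbers, the CV falls from roughly 98% for the original captions to roughly 33% for the enhanced ones, so the enhanced captions are both longer and more uniform in length.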

This review was generated by AI and reviewed by human editors.