QUICK REVIEW

[Paper Review] Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility

Honglin Lin, Chonghan Qin | arXiv (Cornell University) | Jan 17, 2026

Multimodal Machine Learning Applications | Cited by 0

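ONE-SENTENCE SUMMARY

The paper compares pixel-based and code-driven scientific image generation, introduces ImgCoder and SciGenBench, and shows that high-fidelity synthetic images improve downstream multimodal scientific reasoning.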

ABSTRACT

While synthetic data has proven effective for improving scientific reasoning in the text domain, multimodal reasoning remains constrained by the difficulty of synthesizing scientifically rigorous images. Existing Text-to-Image (T2I) models often produce outputs that are visually plausible yet scientifically incorrect, resulting in a persistent visual-logic divergence that limits their value for downstream reasoning. Motivated by recent advances in next-generation T2I models, we conduct a systematic study of scientific image synthesis across generation paradigms, evaluation, and downstream use. We analyze both direct pixel-based generation and programmatic synthesis, and propose ImgCoder, a logic-driven framework that follows an explicit "understand - plan - code" workflow to improve structural precision. To rigorously assess scientific correctness, we introduce SciGenBench, which evaluates generated images based on information utility and logical validity. Our evaluation reveals systematic failure modes in pixel-based models and highlights a fundamental expressiveness-precision trade-off. Finally, we show that fine-tuning Large Multimodal Models (LMMs) on rigorously verified synthetic scientific images yields consistent reasoning gains, with potential scaling trends analogous to the text domain, validating high-fidelity scientific synthesis as a viable path to unlocking massive multimodal reasoning capabilities.

RESEARCH MOTIVATION AND OBJECTIVES

Assess the limitations of pixel-based versus code-based scientific image generation systems.
Propose ImgCoder as a logic-driven programmatic framework for higher structural precision.
Create SciGenBench to evaluate information utility and logical validity of scientific images.
Evaluate the downstream utility of synthetic scientific images for training large multimodal models.

PROPOSED METHOD

Compare pixel-based T2I generation with programmatic, code-driven image synthesis.
Develop ImgCoder using an Understand → Plan → Code workflow with a Think-before-Act strategy.
Construct SciGenBench with a two-level Subject–Image Type taxonomy and atomic visual quizzes.
Adopt a hybrid evaluation framework using LMM-as-Judge, inverse validation, standard metrics, and downstream performance.
Analyze precision–expressiveness trade-offs and failure modes across paradigms.
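
The Understand → Plan → Code workflow can be illustrated with a toy sketch. This is not ImgCoder's implementation: each stage below is a hand-written stand-in for what a model would do in such a framework, and the names `understand`, `plan`, `render_svg`, and the `Plan` structure are all hypothetical. The sketch only shows why programmatic synthesis favors structural precision: every coordinate and label in the output is derived from an explicit plan instead of being sampled as pixels.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Plan:
    """Hypothetical explicit layout plan: named points and the segments joining them."""
    points: dict    # label -> (x, y) in pixels
    segments: list  # list of (label_a, label_b) pairs


def understand(prompt: str) -> dict:
    # Stand-in for the "understand" stage. A real system would parse the prompt
    # with a model; this toy version recognises one hard-coded pattern.
    if "right triangle" in prompt:
        return {"shape": "right_triangle", "legs": (3, 4)}
    raise ValueError("toy parser only understands right triangles")


def plan(spec: dict, scale: int = 40, margin: int = 20) -> Plan:
    # "Plan" stage: fix exact coordinates before any drawing happens.
    a, b = spec["legs"]
    points = {
        "A": (margin, margin + b * scale),              # right-angle vertex
        "B": (margin + a * scale, margin + b * scale),  # end of horizontal leg
        "C": (margin, margin),                          # end of vertical leg
    }
    return Plan(points=points, segments=[("A", "B"), ("A", "C"), ("B", "C")])


def render_svg(p: Plan) -> str:
    # "Code" stage: emit deterministic drawing code (SVG) from the plan.
    width = max(x for x, _ in p.points.values()) + 20
    height = max(y for _, y in p.points.values()) + 20
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for a, b in p.segments:
        (x1, y1), (x2, y2) = p.points[a], p.points[b]
        parts.append(f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="black"/>')
    for label, (x, y) in p.points.items():
        parts.append(f'<text x="{x + 4}" y="{y - 4}">{escape(label)}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


svg = render_svg(plan(understand("Draw a right triangle with legs 3 and 4")))
```

Because the legs are placed at exact multiples of `scale`, the hypotenuse in the rendered figure is exactly 5 × 40 = 200 pixels long, a geometric guarantee that a pixel-based generator cannot give.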

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How do current state-of-the-art models perform in scientific image generation across paradigms?
RQ2: What are the trade-offs between generative (pixel-based) and programmatic (code-based) approaches?
RQ3: Do synthetic scientific images improve downstream multimodal reasoning when used for training?

KEY FINDINGS

C&F, L&P, R&O, SP, and E&R are LMM-as-Judge scores (0–2, higher is better). The remaining columns are standard metrics; the last two have no header in the source.

Model	R_inv (%) ↑	C&F ↑	L&P ↑	R&O ↑	SP ↑	E&R ↑	PSNR ↑	SSIM ↑	(unlabeled)	(unlabeled)
HunyuanImage-3.0	30.79	0.39	0.78	1.44	0.56	0.81	12.21	0.82	25.01	93.27
Qwen-Image	38.86	0.24	0.70	1.48	0.30	0.76	9.63	0.78	25.02	120.42
GPT-Image-1	42.97	0.57	1.37	1.90	0.84	1.19	13.07	0.84	25.14	77.31
Seedream-4.0	52.67	0.44	0.94	1.67	0.55	0.95	10.65	0.74	25.02	98.22
Nanobanana	57.75	0.43	0.92	1.60	0.60	1.15	14.12	0.85	25.13	104.70
Flux2-flex	58.83	0.48	1.06	1.70	0.67	1.20	14.11	0.85	25.10	96.74
GPT-Image-1.5	63.52	0.98	1.70	1.97	1.17	1.62	14.79	0.88	25.16	112.52
Nanobanana-Pro	73.41	1.59	1.87	1.98	1.72	1.93	12.02	0.81	25.01	87.72
ImgCoder
Qwen3-ImgCoder	56.38	1.30	1.62	1.39	1.29	–	14.71	0.86	25.21	121.55
Gemini-3-Flash-ImgCoder	76.93	1.88	1.88	1.92	1.91	1.92	14.63	0.85	25.18	117.83
Gemini-3-Pro-ImgCoder	77.87	1.93	1.91	1.93	1.90	1.84	14.59	0.86	25.16	107.67

The Qwen3-ImgCoder row carries only four of the five LMM-as-Judge scores in the source; they are listed in source order and may not align with their column headers.
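
For readers unfamiliar with the PSNR column: peak signal-to-noise ratio is conventionally defined as PSNR = 10 · log10(MAX² / MSE), a pixel-level similarity to a reference image. The following is a minimal pure-Python sketch of that standard formula, assuming 8-bit grayscale images given as nested lists. The function name `psnr` and the toy images are illustrative and are not taken from the paper's evaluation code.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    if len(reference) != len(generated) or any(
        len(r) != len(g) for r, g in zip(reference, generated)
    ):
        raise ValueError("images must have the same shape")
    # Collect the squared error of every pixel pair.
    squared_errors = [
        (r_px - g_px) ** 2
        for r_row, g_row in zip(reference, generated)
        for r_px, g_px in zip(r_row, g_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[0, 255], [255, 0]]
generated = [[10, 245], [245, 10]]  # every pixel is off by 10, so MSE = 100
value = psnr(reference, generated)
```

With every pixel off by 10, the example evaluates to roughly 28.1 dB. PSNR compares pixels only, so it says nothing about whether labels, axes, or geometric relations in a diagram are correct.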

Pixel-based models show strong visual fidelity but poorer structural correctness for scientific diagrams.
Code-driven ImgCoder achieves higher structural precision and reasoning-related scores: the two Gemini-3 variants attain the highest inverse-validation rates in the table (76.93% and 77.87%, versus 73.41% for the best pixel-based model, Nanobanana-Pro) and lead on three of the five LMM-as-Judge dimensions.
SciGenBench reveals a precision–expressiveness trade-off between paradigms and identifies persistent failure modes, especially domain-knowledge and dense-data errors.
Fine-tuning large multimodal models on rigorously verified synthetic scientific images yields consistent improvements in scientific reasoning.
The quality and filtering of synthetic images significantly affect downstream performance, and the gains appear to scale with the amount of training data.


This review was generated by AI and reviewed by human editors.