QUICK REVIEW

[Paper Review] Scientific Image Synthesis: Benchmarking, Methodologies, and Downstream Utility

Honglin Lin, Chonghan Qin|arXiv (Cornell University)|Jan 17, 2026

Multimodal Machine Learning Applications | Cited by: 0

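One-Line Summary

Summary: This paper compares pixel-based and code-driven scientific image generation, introduces ImgCoder and SciGenBench, and shows that high-fidelity synthetic images improve downstream multimodal scientific reasoning.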

ABSTRACT

While synthetic data has proven effective for improving scientific reasoning in the text domain, multimodal reasoning remains constrained by the difficulty of synthesizing scientifically rigorous images. Existing Text-to-Image (T2I) models often produce outputs that are visually plausible yet scientifically incorrect, resulting in a persistent visual-logic divergence that limits their value for downstream reasoning. Motivated by recent advances in next-generation T2I models, we conduct a systematic study of scientific image synthesis across generation paradigms, evaluation, and downstream use. We analyze both direct pixel-based generation and programmatic synthesis, and propose ImgCoder, a logic-driven framework that follows an explicit "understand - plan - code" workflow to improve structural precision. To rigorously assess scientific correctness, we introduce SciGenBench, which evaluates generated images based on information utility and logical validity. Our evaluation reveals systematic failure modes in pixel-based models and highlights a fundamental expressiveness-precision trade-off. Finally, we show that fine-tuning Large Multimodal Models (LMMs) on rigorously verified synthetic scientific images yields consistent reasoning gains, with potential scaling trends analogous to the text domain, validating high-fidelity scientific synthesis as a viable path to unlocking massive multimodal reasoning capabilities.

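Motivation and Objectives

Assess the limitations of pixel-based and code-based scientific image generation systems.
Propose ImgCoder, a logic-driven programmatic framework for higher structural precision.
Create SciGenBench to evaluate the information utility and logical validity of scientific images.
Evaluate the downstream utility of synthetic scientific images for training large multimodal models.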

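Proposed Method

Compare pixel-based T2I generation with programmatic, code-driven image synthesis.
Develop ImgCoder around an Understand → Plan → Code workflow that uses a Think-before-Act strategy.
Build SciGenBench with a two-level Subject–Image Type taxonomy and atomic visual quizzes.
Adopt a hybrid evaluation framework that combines LMM-as-Judge, inverse validation, standard metrics, and downstream performance.
Analyze the precision-expressiveness trade-off and the failure modes across paradigms.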
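The Understand → Plan → Code workflow can be illustrated with a toy pipeline. This is a minimal sketch under stated assumptions, not ImgCoder itself: the function names, the `Plan` structure, the right-triangle task, and the choice of SVG as the output language are all hypothetical, and the "understand" stage is a plain parser standing in for what would presumably be a model call. The point it shows is that fixing coordinates in an explicit plan before emitting drawing code makes the resulting figure deterministic and checkable.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """Explicit drawing plan: canvas size plus labelled points and segments."""
    width: int
    height: int
    points: dict    # label -> (x, y) in pixels
    segments: list  # pairs of point labels to connect


def understand(problem: dict) -> dict:
    """Stage 1 (hypothetical stand-in for a model call): extract quantities."""
    return {"base": float(problem["base"]), "height": float(problem["height"])}


def plan(facts: dict, scale: float = 40.0, margin: int = 20) -> Plan:
    """Stage 2: fix every coordinate before any drawing code is written."""
    b, h = facts["base"] * scale, facts["height"] * scale
    points = {
        "A": (margin, margin + h),      # right-angle vertex, bottom left
        "B": (margin + b, margin + h),  # end of the base
        "C": (margin, margin),          # top of the vertical side
    }
    segments = [("A", "B"), ("B", "C"), ("C", "A")]
    return Plan(int(b + 2 * margin), int(h + 2 * margin), points, segments)


def code(p: Plan) -> str:
    """Stage 3: emit a deterministic SVG program from the plan."""
    lines = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{p.width}" height="{p.height}">'
    ]
    for start, end in p.segments:
        (x1, y1), (x2, y2) = p.points[start], p.points[end]
        lines.append(
            f'<line x1="{x1:g}" y1="{y1:g}" x2="{x2:g}" y2="{y2:g}" stroke="black"/>'
        )
    for label, (x, y) in p.points.items():
        lines.append(f'<text x="{x:g}" y="{y:g}">{label}</text>')
    lines.append("</svg>")
    return "\n".join(lines)


def img_coder_sketch(problem: dict) -> str:
    """Run the three stages in order: understand, then plan, then code."""
    return code(plan(understand(problem)))


svg = img_coder_sketch({"base": 3, "height": 4})
```

Because the plan is explicit, structural properties such as the number of segments or the exact endpoint coordinates can be asserted directly on the output, which is not possible with a pixel-based generator.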

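Experimental Results

Research Questions

RQ1: How do current state-of-the-art models perform on scientific image generation across paradigms?
RQ2: What are the trade-offs between generative (pixel-based) and programmatic (code-based) approaches?
RQ3: Do synthetic scientific images improve downstream multimodal reasoning when used for training?

Key Findings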

Model	R_inv (%) ↑	LMM-as-Judge (0–2) ↑	Standard metrics	C&F	L&P	R&O	SP	PSNR ↑	SSIM ↑	CLIP ↑
HunyuanImage-3.0	30.79	0.39	0.78	1.44	0.56	0.81	12.21	0.82	25.01	93.27
Qwen-Image	38.86	0.24	0.70	1.48	0.30	0.76	9.63	0.78	25.02	120.42
GPT-Image-1	42.97	0.57	1.37	1.90	0.84	1.19	13.07	0.84	25.14	77.31
Seedream-4.0	52.67	0.44	0.94	1.67	0.55	0.95	10.65	0.74	25.02	98.22
Nanobanana	57.75	0.43	0.92	1.60	0.60	1.15	14.12	0.85	25.13	104.70
Flux2-flex	58.83	0.48	1.06	1.70	0.67	1.20	14.11	0.85	25.10	96.74
GPT-Image-1.5	63.52	0.98	1.70	1.97	1.17	1.62	14.79	0.88	25.16	112.52
Nanobanana-Pro	73.41	1.59	1.87	1.98	1.72	1.93	12.02	0.81	25.01	87.72
ImgCoder
Qwen3-ImgCoder	56.38	1.30	1.62	1.39	1.29	14.71	0.86	25.21	121.55
Gemini-3-Flash-ImgCoder	76.93	1.88	1.88	1.92	1.91	1.92	14.63	0.85	25.18	117.83
Gemini-3-Pro-ImgCoder	77.87	1.93	1.91	1.93	1.90	1.84	14.59	0.86	25.16	107.67
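For readers unfamiliar with the standard metrics, PSNR follows the usual definition 10 · log10(MAX² / MSE) between a generated image and a reference image. The sketch below assumes 8-bit grayscale images given as nested lists; the function name is illustrative, and the paper's actual evaluation setup (choice of reference images, resizing, color handling) is not described in this review.

```python
import math


def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized grayscale images."""
    same_rows = len(reference) == len(generated)
    if not same_rows or any(len(r) != len(g) for r, g in zip(reference, generated)):
        raise ValueError("images must have the same shape")
    n_pixels = sum(len(row) for row in reference)
    if n_pixels == 0:
        raise ValueError("images must not be empty")
    # Mean squared error over all pixel pairs.
    squared_error = sum(
        (r - g) ** 2
        for ref_row, gen_row in zip(reference, generated)
        for r, g in zip(ref_row, gen_row)
    )
    mse = squared_error / n_pixels
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of one gray level gives about 48.13 dB.
example = psnr([[10, 20]], [[11, 21]])
```

PSNR measures pixel-level agreement only, so it says nothing by itself about whether labels, geometry, or plotted values in a scientific figure are correct.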

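Pixel-based models achieve high visual fidelity but weaker structural correctness on scientific figures.
Code-driven ImgCoder achieves higher structural precision and reasoning-related scores; its top variant, Gemini-3-Pro-ImgCoder, reaches the highest inverse-validation rate (R_inv, 77.87%) and judge scores at or near the top in every reported dimension.
SciGenBench exposes a precision-expressiveness trade-off across paradigms and identifies persistent failure modes, notably domain-knowledge errors and errors on dense data.
Fine-tuning large multimodal models on rigorously verified synthetic scientific images consistently improves scientific reasoning.
Data quality and the filtering of synthetic images strongly affect downstream performance, and gains appear to scale as the amount of data increases.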
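The review does not define R_inv precisely. One plausible reading, given that SciGenBench uses atomic visual quizzes, is the share of generated images for which a judge model answers every quiz correctly from the image alone. The sketch below implements only that assumed reading; the function name and the input format (one list of pass/fail booleans per image) are hypothetical, and the paper's exact definition may differ.

```python
def inverse_validation_rate(quiz_results):
    """Percentage of images whose atomic quizzes were all answered correctly.

    quiz_results: one list of booleans per image, where each boolean records
    whether a single quiz was answered correctly from the image alone.
    An image with no quizzes is counted as not passing (an assumption).
    """
    if not quiz_results:
        raise ValueError("at least one image is required")
    passed = sum(1 for quizzes in quiz_results if quizzes and all(quizzes))
    return 100.0 * passed / len(quiz_results)


# Four images: two pass every quiz, two fail at least one.
rate = inverse_validation_rate([[True, True], [True, False], [True], [False]])
```

Under this all-or-nothing reading, a single wrong label or misplaced element fails the whole image, which would explain why the metric separates structurally precise outputs from merely plausible ones.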


This review was generated by AI and checked by a human editor.