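[Paper Summary] SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing
SciFlow-Bench is a structure-first benchmark that inverse-parses each generated diagram image back into a canonical graph and scores it by structural recoverability, using a hierarchical multi-agent system to build and parse the graphs. It reveals a gap between visual fidelity and structural correctness in current models.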
Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.
Motivation and Objectives
- Motivate structure-preserving evaluation for scientific diagrams, beyond pixel-level similarity.
- Automatically construct canonical ground-truth graphs from real framework figures in PDFs.
- Evaluate models as black-box image generators using a deterministic inverse-parsing round-trip.
- Analyze the relationship between visual quality and structural correctness across model types.
- Highlight the role of a hierarchical multi-agent system in enabling consistent graph construction and parsing.
Proposed Method
- Define a round-trip evaluation: from text to a diagram image, then inverse-parse the image back to a graph for comparison with a canonical ground-truth graph.
- Construct canonical ground-truth graphs automatically from source framework figures via a hierarchical multi-agent system (planning, perception, and reasoning).
- Use a three-layer HMAS pipeline: Cognitive Planning (Methodologist and Visual Translator), Fine-Grained Perception (Environment Curator, Shape Hunter, Text Spotter), and Structural Reasoning (Topology Coder and Graph Architect).
- Compute graph-level, text-level, and image-level metrics from the ground-truth and predicted graphs in a deterministic, structure-aware fashion.
- Provide a unified evaluation protocol where all models are treated as black-box image generators and evaluated on final rendered outputs.
- Compare pixel-based generators with code-driven baselines (Graphviz) and analyze structural recoverability rather than visual similarity.
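The round-trip protocol above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Graph` container, the label normalization, and the use of set-based F1 over nodes and edges are all assumptions made here, since this summary does not give the exact metric definitions. The `generate` and `inverse_parse` callables are hypothetical stand-ins for a black-box image generator and the inverse parser.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class Graph:
    """Directed graph with text-labelled nodes (hypothetical container)."""
    nodes: frozenset
    edges: frozenset  # set of (source_label, target_label) pairs


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial text differences do not count."""
    return " ".join(label.lower().split())


def make_graph(nodes: Iterable[str], edges: Iterable[Tuple[str, str]]) -> Graph:
    """Build a Graph with normalized labels."""
    return Graph(
        nodes=frozenset(normalize(n) for n in nodes),
        edges=frozenset((normalize(s), normalize(t)) for s, t in edges),
    )


def f1(truth: frozenset, pred: frozenset) -> float:
    """Set-based F1 (an assumed metric, not necessarily the benchmark's)."""
    if not truth and not pred:
        return 1.0
    tp = len(truth & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)


def structural_scores(truth: Graph, pred: Graph) -> dict:
    """Compare a recovered graph against the canonical ground-truth graph."""
    return {
        "node_f1": f1(truth.nodes, pred.nodes),
        "edge_f1": f1(truth.edges, pred.edges),
    }


def round_trip(
    prompt: str,
    generate: Callable[[str], bytes],
    inverse_parse: Callable[[bytes], Graph],
    truth: Graph,
) -> dict:
    """Text -> image -> recovered graph -> structural comparison."""
    image = generate(prompt)       # the model is treated as a black box
    pred = inverse_parse(image)    # only the rendered pixels are inspected
    return structural_scores(truth, pred)
```

Under this sketch, a generated diagram that draws every box correctly but reverses one of two arrows would keep a node F1 of 1.0 while its edge F1 drops to 0.5, which is the kind of error that image-similarity metrics tend to miss.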
Experimental Results
Research Questions
- RQ1: Can generated diagrams be structurally recovered back into coherent graphs that match the canonical ground truth?
- RQ2: How do different model families (diffusion, multimodal LMs, autoregressive VLMs) perform in preserving structure across easy, medium, and hard topology subsets?
- RQ3: Is there a persistent gap between visual plausibility and structural correctness in scientific diagrams across architectures?
- RQ4: What is the impact of individual parsing components (Shape Hunter, Text Spotter) on structural recovery?
Key Findings
- Structural recoverability is a fundamental challenge, with many models preserving visuals but failing to maintain correct topology.
- Across five domains, node-level and edge-level topology metrics reveal large structural differences between models, with autoregressive VLMs achieving the best overall structural scores.
- Diffusion-only models yield high image-level relevance but near-zero graph-level recoverability.
- Emergent multimodal grounding improves structure over vanilla diffusion, as seen in Qwen-Image’s higher graph-level scores compared to PixArt-Σ.
- Autoregressive models like Gemini 3 Pro Image achieve the strongest performance, with graph-level scores rising with diagram complexity.
- Ablation shows that Shape Hunter and Text Spotter are both crucial for balanced structural recovery; removing either one significantly degrades topology recovery or semantic grounding.
- SciFlow-Bench uncovers a decoupling between visual fidelity and structural reasoning in practical diagram generation.
This summary was generated by AI and reviewed by a human editor.