QUICK REVIEW

[Paper Review] SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing

Tong Zhang, Honglin Lin | arXiv (Cornell University) | Feb 10, 2026

Multimodal Machine Learning Applications | Cited by 0

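ONE-SENTENCE SUMMARY

SciFlow-Bench introduces a structure-first benchmark that inverse-parses the final rendered image back into a canonical graph structure, relying on a hierarchical multi-agent system to ensure structural recoverability. It reveals a gap between visual fidelity and structural correctness in current models.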

ABSTRACT

Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.

RESEARCH MOTIVATION AND GOALS

Make the case for structure-preserving evaluation of scientific diagrams that goes beyond pixel-level similarity.
Automatically construct canonical ground-truth graphs from real framework figures in PDFs.
Evaluate models as black-box image generators using a deterministic inverse-parsing round-trip.
Analyze the relationship between visual quality and structural correctness across model types.
Highlight the role of a hierarchical multi-agent system in enabling consistent graph construction and parsing.

PROPOSED METHOD

Define a round-trip evaluation: from text to a diagram image, then inverse-parse the image back to a graph for comparison with a canonical ground-truth graph.
Construct canonical ground-truth graphs automatically from source framework figures via a hierarchical multi-agent system (planning, perception, and reasoning).
Use a three-layer hierarchical multi-agent system (HMAS) pipeline: Cognitive Planning (Methodologist and Visual Translator), Fine-Grained Perception (Environment Curator, Shape Hunter, Text Spotter), and Structural Reasoning (Topology Coder and Graph Architect).
Compute graph-level, text-level, and image-level metrics from the ground-truth and predicted graphs in a deterministic, structure-aware fashion.
Provide a unified evaluation protocol where all models are treated as black-box image generators and evaluated on final rendered outputs.
Compare pixel-based generators with code-driven baselines (Graphviz) and analyze structural recoverability rather than visual similarity.
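
The comparison step of the round-trip protocol above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Graph` type, the label normalization, and the use of set-based precision, recall, and F1 over node labels and directed edges are all assumptions made here, since this review does not give the benchmark's actual metric definitions. The inverse-parsing stage is represented only by its output, a predicted graph.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Graph:
    """Hypothetical canonical graph: node labels plus directed edges."""
    nodes: frozenset = field(default_factory=frozenset)
    edges: frozenset = field(default_factory=frozenset)  # (source, target) pairs


def normalize(label: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(label.lower().split())


def make_graph(nodes, edges) -> Graph:
    return Graph(
        nodes=frozenset(normalize(n) for n in nodes),
        edges=frozenset((normalize(s), normalize(t)) for s, t in edges),
    )


def prf(predicted: frozenset, truth: frozenset) -> dict:
    """Set-based precision, recall, and F1 between two collections."""
    hits = len(predicted & truth)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def compare(predicted: Graph, truth: Graph) -> dict:
    """Score a parsed graph against the ground-truth graph."""
    return {
        "node": prf(predicted.nodes, truth.nodes),
        "edge": prf(predicted.edges, truth.edges),
    }


# Ground truth: Encoder -> Fusion -> Decoder.
truth = make_graph(
    ["Encoder", "Fusion", "Decoder"],
    [("Encoder", "Fusion"), ("Fusion", "Decoder")],
)
# Parsed from a generated image: all labels present, one arrow reversed.
parsed = make_graph(
    ["encoder", "Fusion", "Decoder"],
    [("Encoder", "Fusion"), ("Decoder", "Fusion")],
)
scores = compare(parsed, truth)
```

In this toy case every node label is recovered while one of the two edges points the wrong way, so the node score is perfect and the edge score is not. That is the kind of mismatch a structure-first metric is meant to expose and an image-similarity metric may miss.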

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can generated diagrams be structurally recovered into coherent graphs that match the canonical ground truth?
RQ2: How do different model families (diffusion, multimodal LMs, autoregressive VLMs) perform in preserving structure across easy, medium, and hard topology subsets?
RQ3: Is there a persistent gap between visual plausibility and structural correctness in scientific diagrams across architectures?
RQ4: What is the impact of individual parsing components (Shape Hunter, Text Spotter) on structural recovery?
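
A question like RQ2 amounts to averaging a structural score within each combination of model family and topology subset. The sketch below shows one way to do that. The record layout, the family names, and every number in it are placeholders invented for illustration; none of them are results from the paper, and the review does not say how the easy, medium, and hard subsets are defined.

```python
from collections import defaultdict
from statistics import mean

# Placeholder records: (model_family, difficulty_subset, graph_level_score).
# These values are invented for illustration and are not results from the paper.
records = [
    ("family_a", "easy", 0.25),
    ("family_a", "hard", 0.75),
    ("family_b", "easy", 0.5),
    ("family_b", "easy", 1.0),
]


def aggregate(rows):
    """Average the score within each (family, subset) cell."""
    cells = defaultdict(list)
    for family, subset, score in rows:
        cells[(family, subset)].append(score)
    return {key: mean(values) for key, values in cells.items()}


table = aggregate(records)
```

Keeping the subset as an explicit key makes it easy to see whether a family's score changes as topology gets harder, instead of hiding that trend inside a single overall average.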

KEY FINDINGS

Structural recoverability is a fundamental challenge, with many models preserving visuals but failing to maintain correct topology.
Across five domains, node-level and edge-level topology metrics reveal large structural differences between models, with autoregressive VLMs achieving the best overall structural scores.
Diffusion-only models yield high image-level relevance but near-zero graph-level recoverability.
Emergent multimodal grounding improves structure over vanilla diffusion, as seen in Qwen-Image’s higher graph-level scores compared to PixArt-Σ.
Autoregressive models like Gemini 3 Pro Image achieve the strongest performance, with graph-level scores rising with diagram complexity.
Ablation shows Shape Hunter and Text Spotter are crucial for balanced structural recovery; removing either significantly degrades topology or semantic grounding.
SciFlow-Bench uncovers a decoupling between visual fidelity and structural reasoning in practical diagram generation.

This review was generated by AI and reviewed by a human editor.