QUICK REVIEW

[Paper Digest] SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing

Tong Zhang, Honglin Lin | arXiv (Cornell University) | Feb 10, 2026
Multimodal Machine Learning Applications | Cited by 0
ONE-SENTENCE SUMMARY

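SciFlow-Bench introduces a structure-first benchmark that inverse-parses the final generated image back into a canonical graph structure, using a hierarchical multi-agent system to ensure structural recoverability. It reveals a gap between visual fidelity and structural correctness in current models.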

ABSTRACT

Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.

RESEARCH MOTIVATION AND OBJECTIVES

  • Motivate structure-preserving evaluation for scientific diagrams, beyond pixel-level similarity.
  • Automatically construct canonical ground-truth graphs from real framework figures in PDFs.
  • Evaluate models as black-box image generators using a deterministic inverse-parsing round-trip.
  • Analyze the relationship between visual quality and structural correctness across model types.
  • Highlight the role of a hierarchical multi-agent system in enabling consistent graph construction and parsing.

PROPOSED METHOD

  • Define a round-trip evaluation: from text to a diagram image, then inverse-parse the image back to a graph for comparison with a canonical ground-truth graph.
  • Construct canonical ground-truth graphs automatically from source framework figures via a hierarchical multi-agent system (planning, perception, and reasoning).
  • Use a three-layer HMAS pipeline: Cognitive Planning (Methodologist and Visual Translator), Fine-Grained Perception (Environment Curator, Shape Hunter, Text Spotter), and Structural Reasoning (Topology Coder and Graph Architect).
  • Compute graph-level, text-level, and image-level metrics from the ground-truth and predicted graphs in a deterministic, structure-aware fashion.
  • Provide a unified evaluation protocol where all models are treated as black-box image generators and evaluated on final rendered outputs.
  • Compare pixel-based generators with code-driven baselines (Graphviz) and analyze structural recoverability rather than visual similarity.
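The final comparison step of the round-trip protocol, where a graph inverse-parsed from a generated image is matched against the canonical ground-truth graph, might be sketched as follows. This is a minimal illustration only: this digest does not specify the paper's actual metric definitions, so the `DiagramGraph` type, the label normalization, and the exact-match node and edge F1 scores are all assumptions introduced here, and the example graphs are invented.

```python
from dataclasses import dataclass


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial text differences are ignored."""
    return " ".join(label.lower().split())


@dataclass(frozen=True)
class DiagramGraph:
    """Hypothetical graph container: node labels plus directed (source, target) edges."""
    nodes: frozenset
    edges: frozenset

    @classmethod
    def from_raw(cls, nodes, edges):
        return cls(
            nodes=frozenset(normalize(n) for n in nodes),
            edges=frozenset((normalize(s), normalize(t)) for s, t in edges),
        )


def f1(pred: frozenset, gold: frozenset) -> float:
    """Exact-match F1 between two sets (an assumed metric, not the paper's)."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def structural_scores(pred: DiagramGraph, gold: DiagramGraph) -> dict:
    """Compare a parsed graph with the ground truth at node and edge level."""
    return {
        "node_f1": f1(pred.nodes, gold.nodes),
        "edge_f1": f1(pred.edges, gold.edges),
    }


# Invented example: the ground-truth graph of a source figure ...
gold = DiagramGraph.from_raw(
    ["Encoder", "Decoder", "Loss"],
    [("Encoder", "Decoder"), ("Decoder", "Loss")],
)
# ... and a graph recovered from a generated image with one wrong node.
pred = DiagramGraph.from_raw(
    ["encoder", "Decoder ", "Output"],
    [("Encoder", "Decoder"), ("Decoder", "Output")],
)

scores = structural_scores(pred, gold)
print(scores)
```

In this toy case two of three nodes and one of two edges are recovered, so a single mislabeled box lowers the edge score more than the node score. A comparison of this kind is deterministic and depends only on the recovered structure, not on how the image looks.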

EXPERIMENTAL RESULTS

Research Questions

  • RQ1: Can generated diagrams be structurally recovered back into coherent graphs that match the canonical ground truth?
  • RQ2: How do different model families (diffusion, multimodal LMs, autoregressive VLMs) perform in preserving structure across easy, medium, and hard topology subsets?
  • RQ3: Is there a persistent gap between visual plausibility and structural correctness in scientific diagrams across architectures?
  • RQ4: What is the impact of individual parsing components (Shape Hunter, Text Spotter) on structural recovery?

Key Findings

  • Structural recoverability is a fundamental challenge, with many models preserving visuals but failing to maintain correct topology.
  • Across five domains, node-level and edge-level topology metrics reveal large structural differences between models, with autoregressive VLMs achieving the best overall structural scores.
  • Diffusion-only models yield high image-level relevance but near-zero graph-level recoverability.
  • Emergent multimodal grounding improves structure over vanilla diffusion, as seen in Qwen-Image’s higher graph-level scores compared to PixArt-Σ.
  • Autoregressive models like Gemini 3 Pro Image achieve the strongest performance, with graph-level scores rising with diagram complexity.
  • Ablation shows Shape Hunter and Text Spotter are crucial for balanced structural recovery; removing either significantly degrades topology or semantic grounding.
  • SciFlow-Bench uncovers a decoupling between visual fidelity and structural reasoning in practical diagram generation.

This digest was generated by AI and reviewed by a human editor.