QUICK REVIEW

[Paper Review] Davidsonian Scene Graph: Improving Reliability in Fine-grained Evaluation for Text-to-Image Generation

Jaemin Cho, Yushi Hu|arXiv (Cornell University)|Oct 27, 2023

Multimodal Machine Learning Applications | Cited by 8

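One-Sentence Summary

This paper proposes Davidsonian Scene Graph (DSG), a framework of atomic questions organized as a directed acyclic graph for fine-grained text-to-image evaluation. It improves reliability over earlier QG/A methods and is released together with DSG-1k, a diverse evaluation benchmark.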

ABSTRACT

Evaluating text-to-image models is notoriously difficult. A strong recent approach for assessing text-image faithfulness is based on QG/A (question generation and answering), which uses pre-trained foundational models to automatically generate a set of questions and answers from the prompt, and output images are scored based on whether these answers extracted with a visual question answering model are consistent with the prompt-based answers. This kind of evaluation is naturally dependent on the quality of the underlying QG and VQA models. We identify and address several reliability challenges in existing QG/A work: (a) QG questions should respect the prompt (avoiding hallucinations, duplications, and omissions) and (b) VQA answers should be consistent (not asserting that there is no motorcycle in an image while also claiming the motorcycle is blue). We address these issues with Davidsonian Scene Graph (DSG), an empirically grounded evaluation framework inspired by formal semantics, which is adaptable to any QG/A frameworks. DSG produces atomic and unique questions organized in dependency graphs, which (i) ensure appropriate semantic coverage and (ii) sidestep inconsistent answers. With extensive experimentation and human evaluation on a range of model configurations (LLM, VQA, and T2I), we empirically demonstrate that DSG addresses the challenges noted above. Finally, we present DSG-1k, an open-sourced evaluation benchmark that includes 1,060 prompts, covering a wide range of fine-grained semantic categories with a balanced distribution. We release the DSG-1k prompts and the corresponding DSG questions.

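Motivation and Goals

Enable more reliable, fine-grained evaluation of text-to-image (T2I) alignment than existing QG/A methods provide.
Propose a semantics-inspired framework (DSG) that produces atomic, unique questions with valid dependencies.
Show that DSG reduces duplicated, hallucinated, and invalid queries in the QG/A workflow.
Provide DSG-1k, a diverse, open-source benchmark, to support research on T2I evaluation.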

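Proposed Method

Represent the semantics of a prompt as a directed acyclic graph (DAG) of atomic propositions (entities, attributes, relations, global).
Generate QG/A queries as atomic, unique questions arranged in a dependency graph, so that every VQA query is valid.
Implement an automated three-step DSG pipeline (tuples -> questions -> dependencies) using task-specific in-context LLM prompts.
Use large language models (such as PaLM 2) in the QG stage and state-of-the-art VQA models (such as PaLI) in the QA stage, skipping child questions based on the answers to their parents.
Assess reliability through human and automatic analysis of precision/recall, atomicity, uniqueness, and dependency validity.
Provide DSG-1k, a benchmark of 1,060 prompts drawn from multiple datasets and covering a balanced set of semantic categories.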
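
The dependency-based skipping described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Question` class, the `dsg_score` function, the example questions, and the choice to score a skipped child as 0 are all assumptions made here for clarity.

```python
from dataclasses import dataclass, field


@dataclass
class Question:
    """One atomic yes/no question derived from a single semantic tuple."""
    qid: int
    text: str
    parents: list = field(default_factory=list)  # ids of questions this one depends on


def dsg_score(questions, vqa_answers):
    """Return (mean score, per-question scores).

    questions: Question objects in topological order (parents before children).
    vqa_answers: dict mapping qid -> True ("yes") or False ("no"), as a VQA
        model might return them. Hypothetical input format.
    Assumption: a child whose parent was not answered "yes" is skipped
    and counted as 0, whatever the VQA model said about it.
    """
    if not questions:
        raise ValueError("at least one question is required")
    scores = {}
    for q in questions:
        parents_ok = all(scores[p] == 1.0 for p in q.parents)
        answered_yes = vqa_answers.get(q.qid, False)
        scores[q.qid] = 1.0 if (parents_ok and answered_yes) else 0.0
    return sum(scores.values()) / len(scores), scores


# Hypothetical questions for a prompt about a blue motorcycle next to a garage.
questions = [
    Question(1, "Is there a motorcycle?"),
    Question(2, "Is the motorcycle blue?", parents=[1]),
    Question(3, "Is there a garage?"),
]
# An inconsistent VQA output: no motorcycle, yet the motorcycle is blue.
answers = {1: False, 2: True, 3: True}
mean_score, per_question = dsg_score(questions, answers)
```

Here question 2 receives no credit because its parent, question 1, was answered "no", so the contradictory "the motorcycle is blue" answer cannot inflate the score.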

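Experimental Results

Research Questions

RQ1: How can QG/A-based T2I evaluation be made more reliable by ensuring atomicity, full semantic coverage, uniqueness, and valid question dependencies?
RQ2: Can a semantics-inspired DSG framework improve the evaluation of consistency between prompts and generated images across diverse semantic categories?
RQ3: What are the limitations of current VQA models on fine-grained categories such as text rendering, counting, and abstract attributes?
RQ4: Does DSG-1k provide a robust, open benchmark for diagnosing fine-grained T2I alignment across different model families?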

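Key Findings

On a sample of 30 prompts, manual matching of DSG's semantic tuples gives 92.2% precision and 100% recall; automatic evaluation with GPT-3.5 gives 98.3% precision and 96.0% recall.
Compared with baselines, DSG scores high on both atomicity (96.5%) and uniqueness (97.5%), which addresses the problem of non-atomic and duplicated questions.
DSG's dependency structure ensures that a child question is asked only when its parent is answered affirmatively; automatic verification across the full dataset gives a rate close to 99%.
In per-item correlation between VQA scores and human judgments, DSG+PaLI aligns most strongly with humans (Spearman 0.563, Kendall 0.458).
DSG-1k provides 1,060 diverse, human-annotated prompts covering 10 semantic categories and multiple styles, which makes fine-grained T2I evaluation possible.
The evaluation shows that current VQA models perform well on concrete categories (entities, some spatial relations) but struggle with counting, text rendering, abstract attributes, and subjective judgments.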
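
To make the precision and recall figures above concrete, here is a minimal sketch of how the two metrics could be computed over semantic tuples. It assumes exact matching between generated and reference tuples, which is a simplification: the review reports matching done by humans and by GPT-3.5. The function name and the example tuples are hypothetical.

```python
def precision_recall(generated, reference):
    """Precision and recall of generated tuples against reference tuples.

    Precision drops when the generator hallucinates tuples that are not in
    the reference; recall drops when it omits tuples that are.
    Assumption: tuples match only when they are exactly equal.
    """
    gen, ref = set(generated), set(reference)
    matched = gen & ref
    precision = len(matched) / len(gen) if gen else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical reference tuples for a prompt about a blue motorcycle by a garage.
reference = [
    ("entity", "motorcycle"),
    ("attribute", "motorcycle", "blue"),
    ("relation", "motorcycle", "garage", "next to"),
]
# Hypothetical generator output: one hallucinated tuple, one omitted tuple.
generated = [
    ("entity", "motorcycle"),
    ("attribute", "motorcycle", "blue"),
    ("entity", "helmet"),
]
p, r = precision_recall(generated, reference)
```

In this example the hallucinated `helmet` tuple lowers precision and the missing `next to` relation lowers recall, so both come out at two thirds.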


This review was generated by AI and reviewed by a human editor.