QUICK REVIEW

[Paper Review] GenEval: An Object-Focused Framework for Evaluating Text-to-Image Alignment

Dhruba Ghosh, Hanna Hajishirzi|arXiv (Cornell University)|Oct 17, 2023

Multimodal Machine Learning Applications | Cited by 10

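One-Sentence Summary

GenEval is an automated, object-focused framework that uses object detection and color classification to verify fine-grained properties of text-to-image generations, and it agrees strongly with human judgment on compositional tasks.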

ABSTRACT

Recent breakthroughs in diffusion models, multimodal pretraining, and efficient finetuning have led to an explosion of text-to-image generative models. Given human evaluation is expensive and difficult to scale, automated methods are critical for evaluating the increasingly large number of new models. However, most current automated evaluation metrics like FID or CLIPScore only offer a holistic measure of image quality or image-text alignment, and are unsuited for fine-grained or instance-level analysis. In this paper, we introduce GenEval, an object-focused framework to evaluate compositional image properties such as object co-occurrence, position, count, and color. We show that current object detection models can be leveraged to evaluate text-to-image models on a variety of generation tasks with strong human agreement, and that other discriminative vision models can be linked to this pipeline to further verify properties like object color. We then evaluate several open-source text-to-image models and analyze their relative generative capabilities on our benchmark. We find that recent models demonstrate significant improvement on these tasks, though they are still lacking in complex capabilities such as spatial relations and attribute binding. Finally, we demonstrate how GenEval might be used to help discover existing failure modes, in order to inform development of the next generation of text-to-image models. Our code to run the GenEval framework is publicly available at https://github.com/djghosh13/geneval.

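Motivation and Objectives

Motivate the need for fine-grained, scalable evaluation of text-to-image models that goes beyond holistic metrics such as FID or CLIPScore.
Propose GenEval, an automated, object-centric framework that verifies the objects named in a prompt, and their attributes, in the generated image.
Demonstrate agreement with human judgment and analyze how modern open-source text-to-image models perform on compositional tasks.
Show how GenEval can expose failure modes to guide future model development.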

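Proposed Method

Decompose each prompt into object types, counts, colors, and relative positions.
Use a state-of-the-art object detector from MMDetection (Mask2Former trained on MS COCO) to verify object presence and obtain bounding boxes and segmentation masks.
Derive object counts and relative positions from the detector output to evaluate counting and spatial relations.
Classify color on the cropped object regions with a zero-shot CLIP color classifier.
Compute a binary correctness score per image that indicates whether every prompt element is satisfied, and report the reason for any failure.
Compare GenEval results against human annotations and CLIPScore to assess agreement with human judgment.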
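The verification steps above can be sketched in a few lines of Python. This is a minimal illustration, not the GenEval implementation. The `Detection` and `Requirement` classes, the `verify` function, the exact-count rule, and the center-based position test are all assumptions made for the example. The detector and color classifier are assumed to have already run, so the sketch only covers how their outputs might be combined into a binary score with failure reasons.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1); y grows downward


@dataclass
class Detection:
    """Hypothetical detector output for one object."""
    label: str
    box: Box
    color: Optional[str] = None  # assumed to come from a separate color classifier


@dataclass
class Requirement:
    """Hypothetical encoding of one prompt element."""
    label: str
    count: int = 1
    color: Optional[str] = None
    position: Optional[Tuple[str, str]] = None  # e.g. ("left of", "cat")


def center(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return (x0 + x1) / 2, (y0 + y1) / 2


def relation_holds(relation: str, a: Box, b: Box) -> bool:
    """Compare box centers; a simplification chosen for this sketch."""
    (ax, ay), (bx, by) = center(a), center(b)
    checks = {
        "left of": ax < bx,
        "right of": ax > bx,
        "above": ay < by,
        "below": ay > by,
    }
    return checks[relation]


def verify(requirements: List[Requirement],
           detections: List[Detection]) -> Tuple[bool, List[str]]:
    """Return (correct, failures); correct is True only if nothing failed."""
    failures: List[str] = []
    for req in requirements:
        found = [d for d in detections if d.label == req.label]
        # Assumption: the count must match exactly.
        if len(found) != req.count:
            failures.append(
                f"{req.label}: expected {req.count}, found {len(found)}")
            continue
        if req.color is not None:
            for d in found:
                if d.color != req.color:
                    failures.append(
                        f"{req.label}: expected color {req.color}, found {d.color}")
        if req.position is not None:
            relation, target = req.position
            targets = [d for d in detections if d.label == target]
            if not any(relation_holds(relation, a.box, b.box)
                       for a in found for b in targets):
                failures.append(f"{req.label}: not {relation} {target}")
    return (not failures, failures)


# Illustrative, made-up detections for a prompt such as
# "a brown dog to the left of a white cat".
detections = [
    Detection("dog", (10, 20, 60, 80), color="brown"),
    Detection("cat", (120, 30, 170, 85), color="black"),
]
requirements = [
    Requirement("dog", color="brown", position=("left of", "cat")),
    Requirement("cat", color="white"),
]
correct, failures = verify(requirements, detections)
print(correct, failures)
```

Returning the list of failures alongside the boolean keeps the score binary while preserving the per-element explanation that the summary describes.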

Figure 1: Visualization of GenEval. Modern object detection models can be used to automatically verify text-to-image generations. The detected bounding boxes and segmentation masks can be used to verify object presence, count, and position, and then passed to downstream discriminative vision models

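Experimental Results

Research Questions

RQ1: Can automated, object-centric verification agree with human judgment more closely than holistic metrics on complex compositional prompts?
RQ2: How much have modern T2I models improved on counting, position, and attribute binding, and where do they still struggle?
RQ3: How do the object detector and the color classifier contribute to reliable, interpretable evaluation across T2I models?
RQ4: Which failure modes does GenEval reveal in current open-source models, and how can they guide future improvements?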

Key Findings

Model	Single object	Two objects	Counting	Colors	Position	Attribute binding	Overall	CLIPScore	Human eval
CLIP retrieval	0.89	0.22	0.37	0.62	0.03	0.00	0.35	27.8	0.42
minDALL-E	0.73	0.11	0.12	0.37	0.02	0.01	0.23	27.3	—
SDv1.5	0.97	0.38	0.35	0.76	0.04	0.06	0.43	33.5	—
SDv2.1	0.98	0.51	0.44	0.85	0.07	0.17	0.50	36.2	0.57
SD-XL	0.98	0.74	0.39	0.85	0.15	0.23	0.55	36.7	—
IF-XL	0.97	0.74	0.66	0.81	0.13	0.35	0.61	36.5	0.72

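Overall, GenEval agrees with human annotators 83% of the time, close to the 88% inter-annotator agreement, and it outperforms thresholded CLIPScore on complex tasks.
Across tasks, the gap to human agreement is largest for counting, position, and attribute binding, which underscores the challenges T2I models still face.
IF-XL and SD-XL improve markedly over earlier models; IF-XL achieves the best overall GenEval score (0.61), with SD-XL behind at 0.55.
Position and attribute binding remain difficult for every model: the best position score in the table is 0.15 (SD-XL) and the best attribute-binding score is 0.35 (IF-XL).
GenEval's binary per-image verification and interpretable failure descriptions help with debugging and understanding model behavior.
The framework exposes concrete failure modes (for example, swapped colors and a left/right bias in positioning) that can guide targeted improvements to generative models.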
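The summary does not state how the Overall column is computed. The numbers in the table are consistent, to within rounding, with an unweighted mean of the six task columns. The short check below tests that assumption and also finds the best model per task. The names `TABLE`, `TASKS`, and `best_per_task` are introduced here for illustration only.

```python
# Per-task scores copied from the results table, in the order of TASKS,
# paired with the reported Overall value.
TASKS = ["single object", "two objects", "counting",
         "colors", "position", "attribute binding"]

TABLE = {
    "CLIP retrieval": ([0.89, 0.22, 0.37, 0.62, 0.03, 0.00], 0.35),
    "minDALL-E":      ([0.73, 0.11, 0.12, 0.37, 0.02, 0.01], 0.23),
    "SDv1.5":         ([0.97, 0.38, 0.35, 0.76, 0.04, 0.06], 0.43),
    "SDv2.1":         ([0.98, 0.51, 0.44, 0.85, 0.07, 0.17], 0.50),
    "SD-XL":          ([0.98, 0.74, 0.39, 0.85, 0.15, 0.23], 0.55),
    "IF-XL":          ([0.97, 0.74, 0.66, 0.81, 0.13, 0.35], 0.61),
}


def mean(values):
    return sum(values) / len(values)


def best_per_task(table):
    """Map each task to (model, score) for the highest score in that column."""
    best = {}
    for i, task in enumerate(TASKS):
        model = max(table, key=lambda m: table[m][0][i])
        best[task] = (model, table[model][0][i])
    return best


for model, (scores, overall) in TABLE.items():
    # The task scores are themselves rounded, so allow a 0.01 tolerance.
    print(f"{model}: mean={mean(scores):.3f} reported={overall:.2f}")

print(best_per_task(TABLE))
```

The tolerance matters because the table entries are already rounded to two decimals, so recomputing the mean from them cannot be expected to reproduce the reported value exactly.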

Figure 2: Comparison between GenEval and CLIPScore. CLIPScore returns a scalar value indicating image-text alignment, whereas GenEval breaks the prompt down into correct and incorrect elements before producing a final binary score. Compared to CLIPScore, GenEval obtains higher agreement with human j
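Comparing a binary metric such as GenEval with a scalar metric such as CLIPScore requires turning the scalar into a yes/no decision first. The sketch below shows one plausible way to do this: threshold the scores, then measure the fraction of images on which the result matches the human label. The threshold search, the function names, and all data values are assumptions made up for illustration; they are not taken from the paper.

```python
def agreement(pred, human):
    """Fraction of images on which two lists of binary labels agree."""
    if len(pred) != len(human):
        raise ValueError("label lists must have the same length")
    return sum(p == h for p, h in zip(pred, human)) / len(pred)


def threshold_scores(scores, threshold):
    """Turn scalar scores into binary decisions: correct if score >= threshold."""
    return [s >= threshold for s in scores]


def best_threshold(scores, human):
    """Among the observed scores, pick the threshold with the highest agreement.

    Ties are resolved in favor of the lowest threshold.
    """
    candidates = sorted(set(scores))
    return max(candidates,
               key=lambda t: agreement(threshold_scores(scores, t), human))


# Made-up toy data: five images, human labels, a binary metric, a scalar metric.
human_labels = [True, False, True, True, False]
binary_metric = [True, False, True, False, False]
scalar_metric = [31.0, 29.5, 30.2, 33.1, 30.8]

t = best_threshold(scalar_metric, human_labels)
print("binary metric agreement:", agreement(binary_metric, human_labels))
print("scalar metric threshold:", t)
print("scalar metric agreement:",
      agreement(threshold_scores(scalar_metric, t), human_labels))
```

Tuning the threshold on the same labels used for evaluation gives the scalar metric its most favorable number, so a comparison made this way is, if anything, generous to the scalar metric.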


This review was generated by AI and checked by a human editor.