QUICK REVIEW

[Paper Review] Benchmarking Spatial Relationships in Text-to-Image Generation

Tejas Gokhale, Hamid Palangi|arXiv (Cornell University)|Dec 20, 2022

Multimodal Machine Learning Applications | Cited by 25

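One-Sentence Summary

The paper introduces VISOR, a new automated metric for evaluating spatial understanding in text-to-image models, and $\mathrm{SR}_{2D}$, a large-scale dataset of 25,280 sentences describing spatial relationships between two objects, used to compare recent T2I models. It finds that photorealism does not imply spatial accuracy and reveals significant biases in object generation and relationship rendering.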

ABSTRACT

Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image. To benchmark existing models, we introduce a dataset, $\mathrm{SR}_{2D}$, that contains sentences describing two or more objects and the spatial relationships between them. We construct an automated evaluation pipeline to recognize objects and their spatial relationships, and employ it in a large-scale evaluation of T2I models. Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them. Our analyses demonstrate several biases and artifacts of T2I models such as the difficulty with generating multiple objects, a bias towards generating the first object mentioned, spatially inconsistent outputs for equivalent relationships, and a correlation between object co-occurrence and spatial understanding capabilities. We conduct a human study that shows the alignment between VISOR and human judgement about spatial understanding. We offer the $\mathrm{SR}_{2D}$ dataset and the VISOR metric to the community in support of T2I reasoning research.

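Research Motivation and Goals

Assess whether modern text-to-image models can accurately render the spatial relationships described in text prompts.
Provide an automated, human-aligned metric that quantifies the spatial understanding shown in T2I outputs.
Create a large-scale dataset ($\mathrm{SR}_{2D}$) covering common objects and 2D spatial relationships for benchmarking models.
Investigate biases, failure modes, and the correlation between object co-occurrence and spatial understanding.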

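Proposed Method

Define the VISOR metric to verify the spatial relationships between generated objects in an image.
Build the $\mathrm{SR}_{2D}$ dataset from 25,280 prompts describing left/right/above/below relationships across 80 MS-COCO objects.
Use OWL-ViT with a CLIP backbone as an automated object detector to detect objects in generated images and infer their spatial relationships.
Benchmark several leading T2I models (GLIDE, DALLE-mini, CogView2, DALLE-v2, Stable Diffusion and its variants) on four images per prompt.
Conduct a human study (MTurk) to validate the agreement between VISOR and human judgments of spatial understanding.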
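
The prompt count and the relationship check can be illustrated with a minimal sketch. It assumes that each prompt pairs two distinct objects in a fixed order with one of four relations, which gives 80 × 79 × 4 = 25,280 prompts, and that a relation is decided by comparing bounding-box centroids. The object names, the relation phrasing, and the helper names `build_prompts` and `relation_holds` are hypothetical placeholders, not the authors' code, and the detector itself is left out.

```python
from itertools import permutations

# Hypothetical stand-ins for the 80 MS-COCO category names.
OBJECTS = [f"object_{i}" for i in range(80)]

# Assumed phrasing for the four 2D relations; the paper's templates may differ.
RELATIONS = {
    "left": "to the left of",
    "right": "to the right of",
    "above": "above",
    "below": "below",
}


def build_prompts(objects, relations):
    """Enumerate one prompt per ordered pair of distinct objects and relation."""
    return [
        (a, rel, b, f"{a} {phrase} {b}")
        for a, b in permutations(objects, 2)
        for rel, phrase in relations.items()
    ]


def centroid(box):
    """Return the center (x, y) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def relation_holds(box_a, box_b, rel):
    """Check whether object A stands in relation `rel` to object B.

    Assumes image coordinates: x grows to the right, y grows downward.
    """
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    if rel == "left":
        return ax < bx
    if rel == "right":
        return ax > bx
    if rel == "above":
        return ay < by
    if rel == "below":
        return ay > by
    raise ValueError(f"unknown relation: {rel}")


prompts = build_prompts(OBJECTS, RELATIONS)
print(len(prompts))  # 80 * 79 ordered pairs * 4 relations
```

In a full pipeline the boxes would come from the object detector run on each generated image; here they are plain tuples so the geometric check can be read on its own.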

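Experimental Results

Research Questions

RQ1: Can state-of-the-art text-to-image models reliably render specified spatial relationships among multiple objects?
RQ2: How well do existing automated multimodal evaluations (such as CLIPScore and caption-based metrics) correlate with actual spatial correctness?
RQ3: What are the main failure modes and biases when generating multiple objects and their spatial relationships?
RQ4: To what extent does VISOR reflect human judgments of spatial understanding in T2I outputs?
RQ5: Which factors (such as object co-occurrence and prompt structure) affect spatial rendering performance?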

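Key Findings

All models exhibit strong photorealism but weak spatial understanding of multi-object relationships.
The best model (DALLE-v2) reaches about 60% on $\mathrm{VISOR}_{\mathrm{uncond}}$ and about 7.5% on $\mathrm{VISOR}_{4}$, indicating a large gap in strict spatial correctness.
Object accuracy (OA), which requires both objects to appear, remains low for most models, exposing a gap between object generation and relationship accuracy.
Models exhibit biases, such as favoring the first-mentioned object, higher success rates on commonly co-occurring object pairs, and object merging.
VISOR correlates with human judgments, supporting its validity for evaluating spatial reasoning in T2I models.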
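
To make the scores named above concrete, the following sketch aggregates per-image outcomes into OA and the VISOR variants. The definitions are assumptions based on one reading of the metric family and are not spelled out in this review: OA is the share of images containing both objects, the unconditional score is the share of all images with both objects and the correct relation, the conditional score restricts to images where both objects appear, and the score with index n is the share of prompts for which at least n of the four images are correct. The function name `visor_scores` and the input layout are hypothetical.

```python
def visor_scores(results, n_images=4):
    """Aggregate per-image outcomes into OA and VISOR-style scores.

    `results` maps each prompt to a list of (both_present, relation_correct)
    pairs, one pair per generated image (assumed layout).
    """
    images = [outcome for per_prompt in results.values() for outcome in per_prompt]
    total = len(images)
    both = sum(1 for present, _ in images if present)
    correct = sum(1 for present, ok in images if present and ok)

    scores = {
        "OA": both / total,
        "VISOR_uncond": correct / total,
        # Conditional on both objects being detected; 0.0 if that never happens.
        "VISOR_cond": correct / both if both else 0.0,
    }
    # VISOR_n: share of prompts with at least n spatially correct images.
    for n in range(1, n_images + 1):
        hits = sum(
            1
            for per_prompt in results.values()
            if sum(1 for present, ok in per_prompt if present and ok) >= n
        )
        scores[f"VISOR_{n}"] = hits / len(results)
    return scores


# Toy data for two prompts with four images each (made up for illustration).
toy = {
    "prompt_1": [(True, True)] * 4,
    "prompt_2": [(True, True), (True, False), (False, False), (False, False)],
}
print(visor_scores(toy))
```

Under these assumed definitions the strictest score requires every image for a prompt to be correct, which is one way to see why it can sit far below the per-image scores.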

This review was generated by AI and reviewed by human editors.