QUICK REVIEW

[Paper Review] DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models

Jaemin Cho, Abhay Zala|arXiv (Cornell University)|Feb 8, 2022

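Multimodal Machine Learning Applications | Cited by 24

One-Sentence Summary

The paper introduces PaintSkills, a diagnostic dataset for measuring the visual reasoning skills of text-to-image models (object recognition, object counting, and spatial relation understanding), and uses automated and human evaluation to assess gender and skin tone biases in generated images.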

ABSTRACT

Recently, DALL-E, a multimodal transformer language model, and its variants, including diffusion models, have shown high-quality text-to-image generation capabilities. However, despite the realistic image generation results, there has not been a detailed analysis of how to evaluate such models. In this work, we investigate the visual reasoning capabilities and social biases of different text-to-image models, covering both multimodal transformer language models and diffusion models. First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding. For this, we propose PaintSkills, a compositional diagnostic evaluation dataset that measures these skills. Despite the high-fidelity image generation capability, a large gap exists between the performance of recent models and the upper bound accuracy in object counting and spatial relation understanding skills. Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images across various professions and attributes. We demonstrate that recent text-to-image generation models learn specific biases about gender and skin tone from web image-text pairs. We hope our work will help guide future progress in improving text-to-image generation models on visual reasoning skills and learning socially unbiased representations. Code and data: https://github.com/j-min/DallEval

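Motivation and Objectives

Introduce PaintSkills, a diagnostic dataset for evaluating the compositional visual reasoning of text-to-image (T2I) models (object recognition, object counting, spatial relation understanding).
Quantify how current models perform on counting and spatial reasoning relative to the upper-bound accuracy.
Assess gender and skin tone biases in generated images using automated detectors and human evaluation.
Analyze how biases in generated images reflect the web image-text pairs used as training data.
Provide guidance for improving visual reasoning and reducing social biases in T2I models.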

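Proposed Method

Define three visual reasoning skills (object recognition, object counting, spatial relation understanding) and measure them by running DETR-based object detection on the generated images.
Build PaintSkills with a Unity-based 3D simulator, sampling objects and relations from uniform distributions to avoid dataset bias.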
Train a DETR detector on PaintSkills and apply it to the ground-truth test images to obtain an upper-bound (oracle) accuracy.
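Generate diagnostic prompts for bias analysis (gender and profession), and detect gender, skin tone, and attributes with automated detectors (BLIP-2, FAN, TRUST), complemented by human verification.
Quantify gender and skin tone bias by comparing the detected distributions against a uniform baseline, using the mean absolute deviation (MAD).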
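
The MAD comparison against a uniform baseline can be illustrated with a short sketch. It assumes MAD is the mean, over the N categories, of the absolute difference between each category's observed proportion and 1/N. The function name `mad_from_uniform` and the sample labels are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def mad_from_uniform(labels, categories):
    """Mean absolute deviation of a label distribution from uniform.

    Assumed definition: (1/N) * sum_i |p_i - 1/N|, where p_i is the
    share of labels falling in category i and N = len(categories).
    Labels outside `categories` are ignored.
    """
    counts = Counter(label for label in labels if label in categories)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no labels fall into the given categories")
    n = len(categories)
    return sum(abs(counts[c] / total - 1.0 / n) for c in categories) / n


# Hypothetical detector outputs for images generated from one prompt.
balanced = ["male", "female", "male", "female"]
skewed = ["male", "male", "male", "female"]

print(mad_from_uniform(balanced, ["male", "female"]))  # 0.0
print(mad_from_uniform(skewed, ["male", "female"]))    # 0.25
```

A score of 0 means the detected categories are evenly represented, and larger values mean a more skewed distribution.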

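Experimental Results

Research Questions

RQ1: How well can current text-to-image models count objects and understand spatial relations, compared with the oracle?
RQ2: Do text-to-image models exhibit gender and skin tone biases when prompted with profession-related descriptions?
RQ3: How closely do automated detectors agree with human judgment when evaluating the visual reasoning and biases of generated images?
RQ4: Which factors in the training data contribute to the observed biases, and how can evaluation guide improvements?

Key Findings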

Model / image source	Images	Object Recognition (%)	Object Counting (%)	Spatial Relation Understanding (%)	Average (%)
GT (oracle)	N/A	100.0	97.8	96.2	98.0
GT shuffled (random)	N/A	6.3	1.7	0.3	2.8
DALL-E Small	N/A	57.5	18.2	2.4	26.0
minDALL-E	N/A	89.9	47.5	50.7	62.7
Stable Diffusion	N/A	96.2	37.8	7.9	47.3
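
As a quick consistency check on the table, the sketch below recomputes the Average column as the unweighted mean of the three skill accuracies and derives each model's gap to the oracle row. The numbers are copied from the table. The dictionary layout and function names are illustrative assumptions.

```python
# Accuracies (%) copied from the table:
# (object recognition, object counting, spatial relation understanding)
scores = {
    "GT (oracle)": (100.0, 97.8, 96.2),
    "GT shuffled (random)": (6.3, 1.7, 0.3),
    "DALL-E Small": (57.5, 18.2, 2.4),
    "minDALL-E": (89.9, 47.5, 50.7),
    "Stable Diffusion": (96.2, 37.8, 7.9),
}


def average(row):
    """Unweighted mean of the three skill accuracies, to one decimal."""
    return round(sum(row) / len(row), 1)


def gap_to_oracle(name):
    """Per-skill difference (percentage points) between oracle and model."""
    oracle = scores["GT (oracle)"]
    return tuple(round(o - m, 1) for o, m in zip(oracle, scores[name]))


for name, row in scores.items():
    print(f"{name}: average={average(row)}, gap={gap_to_oracle(name)}")
```

Under this reading, Stable Diffusion sits about 60 points below the oracle on counting and about 88 points below on spatial relations, while its recognition gap is under 4 points.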

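Stable Diffusion achieves the highest object recognition accuracy (96.2%) but lags in counting (37.8%) and spatial relations (7.9%), indicating a gap in more complex reasoning.
minDALL-E is more balanced than Stable Diffusion, scoring higher on object counting (47.5%) and spatial relations (50.7%), but it trails on object recognition (89.9%).
DETR-based evaluation agrees with human judgment across the skills, supporting the validity of the automated metrics.
Gender bias varies by profession: prompts generally skew toward male representations, and the degree of bias differs across models (minDALL-E, Karlo, Stable Diffusion).
Skin tone bias shows a tendency, across models, to concentrate around medium Monk Skin Tone (MST) values (5-6), with MAD scores indicating non-uniform distributions.
The PaintSkills dataset is large enough that training on a subset (50-100% of the data) suffices to learn these skills, suggesting the evaluation framework is robust.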
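
To make the detection-based skill scores more concrete, here is a minimal sketch of how detector outputs for one generated image might be checked against the prompt. It is not the paper's implementation. The `Detection` type, the 0.7 confidence threshold, and the centre-point rule for spatial relations are all assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    box: tuple    # (x_min, y_min, x_max, y_max); y grows downward
    score: float  # detector confidence in [0, 1]


def keep_confident(dets, threshold=0.7):
    """Drop low-confidence detections (threshold is an assumed value)."""
    return [d for d in dets if d.score >= threshold]


def object_correct(dets, target):
    """Object recognition: the target class is detected at least once."""
    return any(d.label == target for d in dets)


def count_correct(dets, target, n):
    """Object counting: exactly n instances of the target are detected."""
    return Counter(d.label for d in dets)[target] == n


def _center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def spatial_correct(dets, a, b, relation):
    """Spatial relation: compare box centres of the top-scoring a and b."""
    best = {}
    for d in dets:
        if d.label in (a, b) and (d.label not in best or d.score > best[d.label].score):
            best[d.label] = d
    if a not in best or b not in best:
        return False
    (ax, ay), (bx, by) = _center(best[a].box), _center(best[b].box)
    return {"left": ax < bx, "right": ax > bx,
            "above": ay < by, "below": ay > by}[relation]


# Hypothetical detections for an image prompted with "two dogs left of a cat".
dets = keep_confident([
    Detection("dog", (10, 10, 50, 50), 0.90),
    Detection("dog", (60, 10, 100, 50), 0.95),
    Detection("cat", (200, 10, 240, 50), 0.80),
    Detection("cat", (0, 0, 5, 5), 0.20),  # filtered out by the threshold
])
print(count_correct(dets, "dog", 2), spatial_correct(dets, "dog", "cat", "left"))
```

Averaging such per-image checks over a test set would give accuracies of the kind reported in the table.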

This review was generated by AI and reviewed by human editors.