QUICK REVIEW

[Paper Review] DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models

Jaemin Cho, Abhay Zala|arXiv (Cornell University)|Feb 8, 2022

Multimodal Machine Learning Applications | Cited by 24

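One-line summary

This paper introduces PaintSkills, a diagnostic dataset that measures the visual reasoning skills of text-to-image models (object recognition, object counting, and spatial relation understanding), and evaluates gender and skin tone bias in generated images using both automated and human evaluation.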

ABSTRACT

Recently, DALL-E, a multimodal transformer language model, and its variants, including diffusion models, have shown high-quality text-to-image generation capabilities. However, despite the realistic image generation results, there has not been a detailed analysis of how to evaluate such models. In this work, we investigate the visual reasoning capabilities and social biases of different text-to-image models, covering both multimodal transformer language models and diffusion models. First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding. For this, we propose PaintSkills, a compositional diagnostic evaluation dataset that measures these skills. Despite the high-fidelity image generation capability, a large gap exists between the performance of recent models and the upper bound accuracy in object counting and spatial relation understanding skills. Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images across various professions and attributes. We demonstrate that recent text-to-image generation models learn specific biases about gender and skin tone from web image-text pairs. We hope our work will help guide future progress in improving text-to-image generation models on visual reasoning skills and learning socially unbiased representations. Code and data: https://github.com/j-min/DallEval

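Motivation and objectives

Introduce PaintSkills, a diagnostic dataset for evaluating compositional visual reasoning in T2I models (object recognition, object counting, spatial relations).
Quantify how far current models fall short of the upper-bound accuracy on counting and spatial reasoning.
Evaluate gender and skin tone bias in generated images using automated detectors and human evaluation.
Analyze how bias in generated images reflects training data drawn from web image-text pairs.
Provide guidance for improving visual reasoning and reducing social bias in T2I models.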

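Proposed method

Define three visual reasoning skills (object recognition, object counting, and spatial relation understanding) and measure them by running DETR-based object detection on the generated images.
Build PaintSkills with a Unity-based 3D simulator, sampling objects and relations from uniform distributions to avoid dataset bias.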
Train a DETR detector on PaintSkills and score it on ground-truth images to obtain the upper-bound (oracle) accuracy.
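Generate diagnostic prompts for bias analysis (gender and profession), then detect gender, skin tone, and attributes in the generated images with automated detectors (BLIP-2, FAN, TRUST) and human verification.
Quantify gender and skin tone bias by comparing the detected distributions with a uniform baseline, using mean absolute deviation (MAD).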
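
To make the detector-based skill checks above more concrete, here is a minimal sketch of how detections could be compared with what a prompt asked for. It is an illustration under stated assumptions, not the authors' implementation: the `Detection` class, the helper names, the relation vocabulary, and the rule of comparing box centers are all hypothetical, and the detections are assumed to have already been produced by some object detector.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Detection:
    """One detected object (hypothetical structure).

    box is (x_min, y_min, x_max, y_max) in image coordinates, y growing downward.
    """
    label: str
    box: tuple


def center(box):
    """Return the (x, y) center of a bounding box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def object_recognized(detections, target_label):
    """Object recognition: the requested class is detected at least once."""
    return any(d.label == target_label for d in detections)


def count_correct(detections, target_label, target_count):
    """Object counting: exactly the requested number of instances is detected."""
    return Counter(d.label for d in detections)[target_label] == target_count


def relation_correct(detections, subject, obj, relation):
    """Spatial relation: compare the centers of the first subject and object boxes.

    Assumption: a simple center comparison decides the relation.
    """
    subjects = [d for d in detections if d.label == subject]
    objects = [d for d in detections if d.label == obj]
    if not subjects or not objects:
        return False  # a missing object means the relation cannot hold
    (sx, sy), (ox, oy) = center(subjects[0].box), center(objects[0].box)
    checks = {
        "left of": sx < ox,
        "right of": sx > ox,
        "above": sy < oy,
        "below": sy > oy,
    }
    return checks[relation]


# Made-up detections for an image generated from a prompt such as
# "two dogs to the left of a bench".
dets = [
    Detection("dog", (10, 40, 60, 90)),
    Detection("dog", (70, 40, 120, 90)),
    Detection("bench", (150, 30, 230, 100)),
]
print(object_recognized(dets, "dog"))
print(count_correct(dets, "dog", 2))
print(relation_correct(dets, "dog", "bench", "left of"))
```

Per-skill accuracy would then be the fraction of generated images for which the relevant check passes.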

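Experimental results

Research questions

RQ1: How well can current text-to-image models count objects and understand spatial relations compared with the oracle?
RQ2: Do text-to-image models show gender and skin tone bias when prompted with profession-related descriptions?
RQ3: How closely do automated detectors agree with human judgment when evaluating visual reasoning and bias in generated images?
RQ4: How do training-data factors contribute to the observed biases, and how can the evaluation guide improvements?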

Key findings

Evaluator	Images	Object Recognition (%)	Object Counting (%)	Spatial Relation Understanding (%)	Avg. (%)
GT (oracle)	N/A	100.0	97.8	96.2	98.0
GT shuffled (random)	N/A	6.3	1.7	0.3	2.8
DALL-E Small	N/A	57.5	18.2	2.4	26.0
minDALL-E	N/A	89.9	47.5	50.7	62.7
Stable Diffusion	N/A	96.2	37.8	7.9	47.3
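
As a quick arithmetic check on the table above, the following sketch recomputes the Avg. column as the unweighted mean of the three skill accuracies and reports each model's gap to the oracle row. The numbers are copied from the table; the dictionary layout and function names are illustrative only.

```python
# Per-skill accuracies (%) copied from the table above, in the order:
# (object recognition, object counting, spatial relation understanding)
scores = {
    "GT (oracle)": (100.0, 97.8, 96.2),
    "DALL-E Small": (57.5, 18.2, 2.4),
    "minDALL-E": (89.9, 47.5, 50.7),
    "Stable Diffusion": (96.2, 37.8, 7.9),
}


def average(model):
    """Unweighted mean over the three skills, rounded to one decimal."""
    row = scores[model]
    return round(sum(row) / len(row), 1)


def gap_to_oracle(model):
    """Per-skill difference (percentage points) between the oracle and a model."""
    oracle = scores["GT (oracle)"]
    return tuple(round(o - m, 1) for o, m in zip(oracle, scores[model]))


for name in scores:
    print(name, average(name), gap_to_oracle(name))
```

For Stable Diffusion this gives an average of 47.3 and gaps of 3.8, 60.0, and 88.3 points, which shows that almost all of the shortfall comes from counting and spatial relations rather than recognition.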

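Stable Diffusion has the highest object recognition accuracy among the generation models (96.2%) but lags on counting (37.8%) and spatial relation understanding (7.9%), which leaves a clear gap on more complex reasoning.
minDALL-E is far more balanced than Stable Diffusion on object counting (47.5%) and spatial relations (50.7%), although it trails on object recognition (89.9%).
The DETR-based evaluation agrees with human judgment across the skills, which supports the validity of the automated metric.
The models show gender bias that varies by profession, with generated images skewing toward male presentation, and the degree of skew differs across models (minDALL-E, Karlo, Stable Diffusion).
Skin tone bias tends to concentrate around mid-range MST values (5-6) across models, and the MAD scores indicate that the distributions are non-uniform.
The PaintSkills dataset is large enough for skill learning even when only part of the data (50-100%) is used, which suggests the evaluation framework is robust.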
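
The MAD scores mentioned above can be illustrated with a short sketch. It assumes MAD is the mean, over categories, of the absolute difference between each category's observed share and the uniform share 1/N; that is one common reading of "mean absolute deviation against a uniform baseline" and may differ in detail from the paper's exact formula. The function name and the example counts are made up for illustration, and the 10-bin skin tone example assumes a 10-level scale.

```python
def mean_absolute_deviation(counts):
    """MAD of a categorical distribution from the uniform distribution.

    counts: number of generated images detected in each category.
    Assumed definition: (1/N) * sum_i |p_i - 1/N|, with p_i = counts[i] / total.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must contain at least one detection")
    n = len(counts)
    uniform = 1.0 / n
    return sum(abs(c / total - uniform) for c in counts) / n


# Two gender categories, perfectly balanced: no deviation from uniform.
print(mean_absolute_deviation([50, 50]))

# Two categories, fully skewed to one of them.
print(mean_absolute_deviation([100, 0]))

# Ten skin tone bins with every image falling in bins 5 and 6 (made-up counts).
print(mean_absolute_deviation([0, 0, 0, 0, 50, 50, 0, 0, 0, 0]))
```

Under this definition a score of 0 means the detected categories are evenly represented, and larger values mean the generated images concentrate in fewer categories.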

This review was generated by AI and checked by a human editor.