QUICK REVIEW

[Paper Review] Benchmarking Spatial Relationships in Text-to-Image Generation

Tejas Gokhale, Hamid Palangi|arXiv (Cornell University)|Dec 20, 2022

Multimodal Machine Learning Applications|Cited by 25

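Summary in Brief

This paper introduces VISOR, a new automatic metric for evaluating the spatial understanding of text-to-image (T2I) models, together with $\mathrm{SR}_{2D}$, a large-scale dataset of 25,280 sentences that each describe a spatial relationship between two objects, and uses both to benchmark recent T2I models. It shows that photorealism does not imply spatial correctness and exposes pronounced biases in how models generate objects and depict relationships.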

ABSTRACT

Spatial understanding is a fundamental aspect of computer vision and integral for human-level reasoning about images, making it an important component for grounded language understanding. While recent text-to-image synthesis (T2I) models have shown unprecedented improvements in photorealism, it is unclear whether they have reliable spatial understanding capabilities. We investigate the ability of T2I models to generate correct spatial relationships among objects and present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image. To benchmark existing models, we introduce a dataset, $\mathrm{SR}_{2D}$, that contains sentences describing two or more objects and the spatial relationships between them. We construct an automated evaluation pipeline to recognize objects and their spatial relationships, and employ it in a large-scale evaluation of T2I models. Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them. Our analyses demonstrate several biases and artifacts of T2I models such as the difficulty with generating multiple objects, a bias towards generating the first object mentioned, spatially inconsistent outputs for equivalent relationships, and a correlation between object co-occurrence and spatial understanding capabilities. We conduct a human study that shows the alignment between VISOR and human judgement about spatial understanding. We offer the $\mathrm{SR}_{2D}$ dataset and the VISOR metric to the community in support of T2I reasoning research.

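Motivation and Objectives

Evaluate whether modern text-to-image models can accurately render the spatial relationships described in a text prompt.
Provide an automated metric, aligned with human judgment, that quantifies spatial understanding in T2I outputs.
Build a large-scale dataset ($\mathrm{SR}_{2D}$) covering common objects and 2D spatial relationships, and use it to benchmark models.
Investigate correlations between object co-occurrence and spatial understanding, along with biases and failure modes.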

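Proposed Method

Define the VISOR metric, which verifies the spatial relationship between objects in a generated image.
Build the $\mathrm{SR}_{2D}$ dataset from 25,280 prompts describing left/right/above/below relationships between the 80 MS-COCO object categories.
Use an automatic object detector (OWL-ViT with CLIP backbone) to detect the objects in each generated image and infer their spatial relationship.
Benchmark several leading T2I models (GLIDE, DALLE-mini, CogView2, DALLE-v2, Stable Diffusion, and variants) on four generated images per prompt.
Run a human study (MTurk) to verify that VISOR agrees with human judgment of spatial understanding.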
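The dataset construction and relationship check described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the prompt template, the function names, and the rule of comparing bounding-box centroids are assumptions made for the example. The count of 25,280 is consistent with taking every ordered pair of the 80 categories (80 × 79 = 6,320) and four relations, although the review does not state how the figure is derived.

```python
from itertools import permutations

# Relation keys and a hypothetical English phrasing for each one.
RELATIONS = {
    "left": "to the left of",
    "right": "to the right of",
    "above": "above",
    "below": "below",
}


def build_prompts(categories):
    """Enumerate one prompt per ordered object pair and relation.

    The template "a <A> <relation> a <B>" is an assumption for illustration.
    """
    prompts = []
    for a, b in permutations(categories, 2):
        for key, phrase in RELATIONS.items():
            prompts.append((a, key, b, f"a {a} {phrase} a {b}"))
    return prompts


def centroid(box):
    """Return the center of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def relation_holds(box_a, box_b, relation):
    """Check whether 'A <relation> B' holds, comparing box centroids.

    Image coordinates are assumed, so y grows downward.
    """
    (ax, ay), (bx, by) = centroid(box_a), centroid(box_b)
    if relation == "left":
        return ax < bx
    if relation == "right":
        return ax > bx
    if relation == "above":
        return ay < by
    if relation == "below":
        return ay > by
    raise ValueError(f"unknown relation: {relation}")


if __name__ == "__main__":
    # Placeholder names stand in for the 80 MS-COCO categories.
    placeholder_categories = [f"object{i}" for i in range(80)]
    print(len(build_prompts(placeholder_categories)))
```

In a full pipeline the boxes would come from the object detector; here they are plain tuples so that the geometric check can be read in isolation.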

Experimental Results

Research Questions

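RQ1: Can state-of-the-art text-to-image models reliably reproduce the spatial relationships specified between multiple objects?
RQ2: How well do existing automatic multimodal metrics (e.g., CLIPScore, caption-based metrics) correlate with true spatial correctness?
RQ3: What are the main failure modes and biases when generating multiple objects and their spatial relationships?
RQ4: How closely does VISOR reflect human judgments of spatial understanding in T2I outputs?
RQ5: Which factors (e.g., object co-occurrence, prompt structure) affect spatial rendering performance?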

Key Findings

All models show strong photorealism but weak spatial understanding of relationships between multiple objects.
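The best model (DALLE-v2) reaches a $\mathrm{VISOR}_{\mathrm{cond}}$ of about 60% but a $\mathrm{VISOR}_{4}$ of only about 7.5%, which shows a large gap in strict spatial correctness.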
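Object accuracy (OA), which requires both objects to appear in the image, remains low for many models, highlighting the gap between generating the objects and rendering their relationship correctly.
Models exhibit biases such as favoring the first-mentioned object, succeeding more often on commonly co-occurring pairs, and merging the two objects into one.
VISOR correlates with human judgment, supporting its usefulness for evaluating the spatial reasoning of T2I models.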
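To make the scores named in these findings concrete, the sketch below aggregates per-image outcomes into OA and the VISOR variants. The definitions used here are assumptions, since the review does not spell them out: OA is the fraction of images containing both objects, the unconditional score is the fraction of all images with both objects and the correct relationship, the conditional score restricts to images where both objects were detected, and $\mathrm{VISOR}_{n}$ is the fraction of prompts with at least n correct images out of four. The function name and input layout are hypothetical.

```python
def visor_scores(results, images_per_prompt=4):
    """Aggregate per-image outcomes into OA and VISOR-style scores.

    results maps each prompt to a list of (both_detected, relation_correct)
    booleans, one pair per generated image. All definitions are assumptions
    stated in the accompanying text, not taken from an official implementation.
    """
    if not results:
        raise ValueError("results must contain at least one prompt")

    total = both = correct = 0
    correct_per_prompt = []
    for outcomes in results.values():
        n_correct = 0
        for both_detected, relation_correct in outcomes:
            total += 1
            if both_detected:
                both += 1
                # A relation only counts when both objects were detected.
                if relation_correct:
                    correct += 1
                    n_correct += 1
        correct_per_prompt.append(n_correct)

    scores = {
        "OA": both / total,
        "VISOR_uncond": correct / total,
        "VISOR_cond": correct / both if both else 0.0,
    }
    for n in range(1, images_per_prompt + 1):
        hits = sum(c >= n for c in correct_per_prompt)
        scores[f"VISOR_{n}"] = hits / len(correct_per_prompt)
    return scores


if __name__ == "__main__":
    # Toy data: two prompts with four images each (made up for illustration).
    toy = {
        "prompt A": [(True, True), (True, False), (False, False), (True, True)],
        "prompt B": [(True, True)] * 4,
    }
    for name, value in visor_scores(toy).items():
        print(f"{name}: {value:.3f}")
```

Under these assumed definitions the conditional score can be much higher than the unconditional one whenever a model often fails to generate both objects, and $\mathrm{VISOR}_{4}$ is the strictest variant because every image for a prompt must be correct.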


This review was generated by AI and checked by a human editor.