QUICK REVIEW

[Paper Review] T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation

Kaiyi Huang, Chengqi Duan | arXiv (Cornell University) | Jul 12, 2023

Multimodal Machine Learning Applications | Cited by 23

One-Sentence Summary

Introduces T2I-CompBench, a 6,000-prompt benchmark for open-world compositional text-to-image generation (expanded to 8,000 prompts in T2I-CompBench++), together with composition-specific evaluation metrics and GORS, a fine-tuning approach that improves compositionality in diffusion models.

ABSTRACT

Despite the impressive advances in text-to-image models, they often struggle to effectively compose complex scenes with multiple objects, displaying various attributes and relationships. To address this challenge, we present T2I-CompBench++, an enhanced benchmark for compositional text-to-image generation. T2I-CompBench++ comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions. These are further divided into eight sub-categories, including newly introduced ones like 3D-spatial relationships and numeracy. In addition to the benchmark, we propose enhanced evaluation metrics designed to assess these diverse compositional challenges. These include a detection-based metric tailored for evaluating 3D-spatial relationships and numeracy, and an analysis leveraging Multimodal Large Language Models (MLLMs), i.e. GPT-4V, ShareGPT4v as evaluation metrics. Our experiments benchmark 11 text-to-image models, including state-of-the-art models, such as FLUX.1, SD3, DALLE-3, PixArt-α, and SD-XL on T2I-CompBench++. We also conduct comprehensive evaluations to validate the effectiveness of our metrics and explore the potential and limitations of MLLMs.

Motivation and Objectives

Define a comprehensive benchmark for open-world compositional text-to-image generation covering attribute binding, object relationships, and complex compositions.
Propose evaluation metrics tailored to compositional prompts and assess their correlation with human judgments.
Evaluate existing T2I models on the benchmark to identify strengths and limitations in compositionality.
Introduce GORS, a reward-driven fine-tuning approach to boost compositional generation in pretrained diffusion models.

Proposed Method

Construct 6,000 prompts spanning three categories (attribute binding, object relationships, complex compositions) and six sub-categories (color, shape, texture, spatial, non-spatial, complex).
Propose category-specific evaluation metrics: disentangled BLIP-VQA for attribute binding, a UniDet-based metric for spatial relationships, and a 3-in-1 metric for complex prompts; explore MiniGPT-4 with Chain-of-Thought (CoT) prompting as a multimodal-LLM-based evaluator.
Introduce GORS (Generative mOdel finetuning with Reward-driven Sample selection) to fine-tune Stable Diffusion v2 using reward-weighted losses based on alignment between prompts and generated images.
Use LoRA to fine-tune both the CLIP text encoder and the U-Net in a reinforcement-like setup where high-alignment samples are selected for training.
Benchmark six T2I models (including Stable Diffusion v1-4 and v2, Composable Diffusion, Structured Diffusion, Attend-and-Excite) on the new benchmark and metrics.
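
The reward-driven selection and reward-weighted loss described for GORS can be illustrated with a minimal sketch. The names `Sample`, `select_samples`, and `reward_weighted_loss`, the threshold value, and the scalar `denoise_loss` placeholder are assumptions made for illustration only; they are not taken from the paper's code, and a real setup would compute the loss from the diffusion model itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    prompt: str
    reward: float        # prompt-image alignment score, assumed to lie in [0, 1]
    denoise_loss: float  # hypothetical stand-in for the per-sample diffusion loss


def select_samples(samples: List[Sample], threshold: float) -> List[Sample]:
    """Keep only generated samples whose alignment reward reaches the threshold."""
    return [s for s in samples if s.reward >= threshold]


def reward_weighted_loss(samples: List[Sample]) -> float:
    """Average the per-sample losses, each scaled by its own reward."""
    if not samples:
        return 0.0
    return sum(s.reward * s.denoise_loss for s in samples) / len(samples)


if __name__ == "__main__":
    batch = [
        Sample("a red book and a yellow vase", reward=0.95, denoise_loss=0.2),
        Sample("a red book and a yellow vase", reward=0.50, denoise_loss=0.4),
        Sample("a dog on the left of a bench", reward=0.92, denoise_loss=0.1),
    ]
    kept = select_samples(batch, threshold=0.9)  # threshold chosen arbitrarily
    print(len(kept), reward_weighted_loss(kept))
```

In this sketch, low-reward samples are dropped before training, and the remaining ones contribute to the loss in proportion to how well they match their prompts.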

Experimental Results

Research Questions

RQ1: How well do existing open-world compositional T2I models perform across attribute binding, object relationships, and complex compositions?
RQ2: Can new, composition-specific evaluation metrics better align with human judgments than traditional CLIP/BLIP-based scores?
RQ3: What is the effectiveness of reward-driven fine-tuning (GORS) for improving compositional generation without extensive retraining?
RQ4: Do multimodal LLMs provide reliable unified evaluation signals for compositional T2I outputs?
RQ5: What are the limitations and failure cases of current benchmarks and metrics for open-world compositional T2I?

Key Findings

GORS consistently improves compositional performance across all categories, outperforming baselines on automatic and human evaluations.
Disentangled BLIP-VQA and UniDet-based metrics show higher correlation with human judgments than CLIP-based measures for attribute binding and spatial relations.
The 3-in-1 metric provides a balanced evaluation for complex prompts by averaging the CLIPScore, BLIP-VQA, and UniDet scores.
Stable Diffusion v2 generally outperforms v1-4 on compositional prompts, while some prior methods (e.g., Composable Diffusion) show limited gains on v2 baselines.
MiniGPT-4 with Chain-of-Thought offers potential as a unified evaluation signal, but current correlations with human judgments are limited compared to the proposed metrics.
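
The averaging behind the 3-in-1 metric can be written out as a small sketch. The function name `three_in_one` is hypothetical, and the assumption that all three component scores have already been normalized to [0, 1] is made here for illustration; the summary does not state how the scores are scaled.

```python
from statistics import mean


def three_in_one(clip_score: float, blip_vqa: float, unidet: float) -> float:
    """Average three per-image scores, each assumed to be normalized to [0, 1]."""
    scores = (clip_score, blip_vqa, unidet)
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score {s} is outside the assumed [0, 1] range")
    return mean(scores)


if __name__ == "__main__":
    # Illustrative values only, not results from the paper.
    print(three_in_one(clip_score=0.3, blip_vqa=0.6, unidet=0.9))
```

An unweighted mean treats the three signals as equally informative, so a complex prompt only scores well when the image is reasonable on all three at once.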

This review was generated by AI and reviewed by human editors.