QUICK REVIEW

[Paper Review] T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to-image Generation

Kaiyi Huang, Chengqi Duan | arXiv (Cornell University) | Jul 12, 2023
Multimodal Machine Learning Applications | Cited by 23
ONE-SENTENCE SUMMARY

Introduces T2I-CompBench++ as a comprehensive 6,000-prompt benchmark for open-world compositional text-to-image generation, along with new evaluation metrics and a GORS fine-tuning approach to boost compositionality in diffusion models.

ABSTRACT

Despite the impressive advances in text-to-image models, they often struggle to effectively compose complex scenes with multiple objects displaying various attributes and relationships. To address this challenge, we present T2I-CompBench++, an enhanced benchmark for compositional text-to-image generation. T2I-CompBench++ comprises 8,000 compositional text prompts categorized into four primary groups: attribute binding, object relationships, generative numeracy, and complex compositions. These are further divided into eight sub-categories, including newly introduced ones like 3D-spatial relationships and numeracy. In addition to the benchmark, we propose enhanced evaluation metrics designed to assess these diverse compositional challenges. These include a detection-based metric tailored for evaluating 3D-spatial relationships and numeracy, and an analysis leveraging Multimodal Large Language Models (MLLMs), namely GPT-4V and ShareGPT4V, as evaluation metrics. Our experiments benchmark 11 text-to-image models, including state-of-the-art models such as FLUX.1, SD3, DALL-E 3, PixArt-α, and SD-XL, on T2I-CompBench++. We also conduct comprehensive evaluations to validate the effectiveness of our metrics and explore the potential and limitations of MLLMs.
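
As a rough illustration of what a detection-based numeracy check could look like, the sketch below compares the object counts a prompt asks for with the labels returned by an object detector. The scoring rule (exact-match fraction per class), the function name `numeracy_score`, and the sample data are all assumptions made for illustration; the abstract does not specify the actual metric.

```python
from collections import Counter
from typing import Dict, List


def numeracy_score(expected: Dict[str, int], detected_labels: List[str]) -> float:
    """Fraction of requested object classes whose detected count matches exactly.

    Hypothetical scoring rule for illustration only.
    """
    if not expected:
        return 0.0
    counts = Counter(detected_labels)
    matches = sum(1 for name, n in expected.items() if counts[name] == n)
    return matches / len(expected)


# Prompt: "two dogs and one cat"; the (hypothetical) detector found two dogs and two cats.
expected = {"dog": 2, "cat": 1}
detections = ["dog", "dog", "cat", "cat"]
score = numeracy_score(expected, detections)
print(score)
```

Here the dog count matches and the cat count does not, so half of the requested classes are satisfied.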

RESEARCH MOTIVATION AND OBJECTIVES

  • Define a comprehensive benchmark for open-world compositional text-to-image generation covering attribute binding, object relationships, and complex compositions.
  • Propose evaluation metrics tailored to compositional prompts and assess their correlation with human judgments.
  • Evaluate existing T2I models on the benchmark to identify strengths and limitations in compositionality.
  • Introduce GORS, a reward-driven fine-tuning approach to boost compositional generation in pretrained diffusion models.

PROPOSED METHOD

  • Construct 6,000 prompts spanning three categories (attribute binding, object relationships, complex compositions) and six sub-categories (color, shape, texture, spatial, non-spatial, complex).
  • Propose category-specific evaluation metrics: disentangled BLIP-VQA for attribute binding, UniDet-based spatial relation metric, and a 3-in-1 metric for complex prompts; explore MiniGPT-4 CoT as an LLM-based probe.
  • Introduce GORS (Generative mOdel finetuning with Reward-driven Sample selection) to fine-tune Stable Diffusion v2 using reward-weighted losses based on alignment between prompts and generated images.
  • Use LoRA to fine-tune both CLIP text encoder and U-Net in a reinforcement-like setup where high-alignment samples are selected for training.
  • Benchmark six T2I models (including Stable Diffusion v1/v2, Composable Diffusion, Structured Diffusion, Attend-and-Excite) on the new benchmark and metrics.
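
The reward-driven selection and weighting in GORS can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: it assumes each generated sample already has an alignment reward in [0, 1] and a per-sample denoising loss, and the `Sample` class, the `gors_loss` function, and the threshold value are hypothetical names and settings chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    prompt: str
    reward: float        # assumed: prompt-image alignment score in [0, 1]
    denoise_loss: float  # assumed: per-sample diffusion (noise-prediction) loss


def gors_loss(samples: List[Sample], threshold: float) -> float:
    """Mean reward-weighted loss over samples whose reward meets the threshold."""
    selected = [s for s in samples if s.reward >= threshold]
    if not selected:
        return 0.0
    return sum(s.reward * s.denoise_loss for s in selected) / len(selected)


batch = [
    Sample("a green bench and a red car", reward=0.92, denoise_loss=0.10),
    Sample("a green bench and a red car", reward=0.35, denoise_loss=0.08),
    Sample("a blue bowl on a wooden table", reward=0.80, denoise_loss=0.20),
]
loss = gors_loss(batch, threshold=0.75)
print(round(loss, 4))
```

The low-reward sample is dropped before training, and the remaining samples contribute in proportion to how well they match their prompts. In an actual fine-tuning loop this scalar would be computed on tensors and backpropagated through the LoRA parameters.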

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

  • RQ1: How well do existing open-world compositional T2I models perform across attribute binding, object relationships, and complex compositions?
  • RQ2: Can new, composition-specific evaluation metrics better align with human judgments than traditional CLIP/BLIP-based scores?
  • RQ3: What is the effectiveness of reward-driven fine-tuning (GORS) for improving compositional generation without extensive retraining?
  • RQ4: Do multimodal LLMs provide reliable unified evaluation signals for compositional T2I outputs?
  • RQ5: What are the limitations and failure cases of current benchmarks and metrics for open-world compositional T2I?
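
One common way to quantify how well an automatic metric aligns with human judgments is a rank correlation between metric scores and human ratings for the same images. The sketch below computes Kendall's tau (the simple tau-a variant, which assumes no tied values). The choice of Kendall's tau, the function name, and the sample numbers are assumptions for illustration; the summary does not state which correlation statistic was used.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total pairs, assuming no ties."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical data: automatic metric scores and human ratings for four images.
metric_scores = [0.9, 0.4, 0.7, 0.2]
human_ratings = [5, 2, 3, 4]
tau = kendall_tau(metric_scores, human_ratings)
print(round(tau, 4))
```

A value near 1 means the metric ranks images in nearly the same order as human raters, while a value near 0 means the two rankings are largely unrelated.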

MAIN FINDINGS

  • GORS consistently improves compositional performance across all categories, outperforming baselines on automatic and human evaluations.
  • Disentangled BLIP-VQA and UniDet-based metrics show higher correlation with human judgments than CLIP-based measures for attribute binding and spatial relations.
  • 3-in-1 metric provides a balanced evaluation for complex prompts by averaging CLIPScore, BLIP-VQA, and UniDet scores.
  • Stable Diffusion v2 generally outperforms v1-4 on compositional prompts, while some prior methods (e.g., Composable Diffusion) show limited gains on v2 baselines.
  • MiniGPT-4 with Chain-of-Thought offers potential as a unified evaluation signal, but current correlations with human judgments are limited compared to the proposed metrics.
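
The 3-in-1 metric described above is an average of three component scores, which can be written out directly. This sketch assumes the three scores have already been computed and scaled to a comparable range; the function name and the example values are hypothetical.

```python
def three_in_one(clip_score: float, blip_vqa: float, unidet: float) -> float:
    """Unweighted mean of the three component scores for one complex prompt."""
    return (clip_score + blip_vqa + unidet) / 3.0


# Hypothetical component scores for a single generated image.
score = three_in_one(clip_score=0.30, blip_vqa=0.60, unidet=0.45)
print(round(score, 2))
```

Averaging means a complex prompt scores well only if the image is reasonable on overall text-image similarity, attribute binding, and spatial layout together, rather than on one of them alone.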


This review was generated by AI and reviewed by a human editor.