QUICK REVIEW

[Paper Review] X-IQE: eXplainable Image Quality Evaluation for Text-to-Image Generation with Visual Large Language Models

Yixiong Chen, Liu, Li | arXiv (Cornell University) | May 18, 2023

Multimodal Machine Learning Applications | Cited by 7

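One-Sentence Summary

X-IQE uses a visual LLM (MiniGPT-4 with Vicuna) to generate explainable textual assessments of text-to-image generation, evaluating fidelity, alignment, and aesthetics without any training.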

ABSTRACT

This paper introduces a novel explainable image quality evaluation approach called X-IQE, which leverages visual large language models (LLMs) to evaluate text-to-image generation methods by generating textual explanations. X-IQE utilizes a hierarchical Chain of Thought (CoT) to enable MiniGPT-4 to produce self-consistent, unbiased texts that are highly correlated with human evaluation. It offers several advantages, including the ability to distinguish between real and generated images, evaluate text-image alignment, and assess image aesthetics without requiring model training or fine-tuning. X-IQE is more cost-effective and efficient compared to human evaluation, while significantly enhancing the transparency and explainability of deep image quality evaluation models. We validate the effectiveness of our method as a benchmark using images generated by prevalent diffusion models. X-IQE demonstrates similar performance to state-of-the-art (SOTA) evaluation methods on COCO Caption, while overcoming the limitations of previous evaluation models on DrawBench, particularly in handling ambiguous generation prompts and text recognition in generated images. Project website: https://github.com/Schuture/Benchmarking-Awesome-Diffusion-Models

Motivation and Objectives

Motivate the need for cheap, generalizable, and explainable image quality evaluation beyond human or traditional model-based scores.
Propose an explainable, training-free evaluation framework using visual LLMs to analyze fidelity, alignment, and aesthetics of AI-generated images.
Incorporate expert-driven prompt design and hierarchical chain-of-thought to achieve unbiased, coherent explanations.
Validate X-IQE as a benchmark across real and AI-generated images and compare with state-of-the-art metrics.

Proposed Method

Utilize MiniGPT-4 (ViT-based encoder + Vicuna) as the core visual-LM with in-context learning for evaluation without additional training.
Design expert-informed prompts that encode art-professional criteria for image quality analysis.
Apply a hierarchical chain-of-thought (CoT) flow: fidelity evaluation informs alignment evaluation, which informs aesthetics evaluation, with shared image description across tasks.
Enforce JSON output format and explicit scoring conditions to stabilize CoT-compliant responses.
Incorporate dedicated within-task and between-task CoT reasoning to improve consistency and reuse prior analysis.
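
The staged flow described above (fidelity, then alignment, then aesthetics, each stage seeing the shared image description and the earlier analyses, with JSON-only answers) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `ask` is a hypothetical callable standing in for a MiniGPT-4 query, and the prompt wording, the JSON keys `analysis` and `score`, and the retry count are invented for the example.

```python
import json
from typing import Callable, Dict

# Stage order follows the summary: fidelity -> alignment -> aesthetics.
STAGES = ["fidelity", "alignment", "aesthetics"]

# Hypothetical output contract; the paper's real prompts are not reproduced here.
JSON_RULE = 'Reply only with JSON: {"analysis": "<text>", "score": <number>}'


def build_prompt(stage: str, description: str, text_prompt: str,
                 prior: Dict[str, dict]) -> str:
    """Compose one stage prompt that carries earlier analyses forward."""
    lines = [
        f"Image description: {description}",
        f"Generation prompt: {text_prompt}",
    ]
    for name, result in prior.items():
        lines.append(f"Earlier {name} analysis: {result['analysis']} "
                     f"(score {result['score']})")
    lines.append(f"Task: evaluate the image's {stage}, reasoning step by step.")
    lines.append(JSON_RULE)
    return "\n".join(lines)


def parse_response(raw: str) -> dict:
    """Extract and validate the JSON object embedded in a model reply."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(raw[start:end + 1])  # JSONDecodeError is a ValueError
    analysis, score = data.get("analysis"), data.get("score")
    if not isinstance(analysis, str):
        raise ValueError("'analysis' must be a string")
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("'score' must be a number")
    return {"analysis": analysis, "score": float(score)}


def evaluate(description: str, text_prompt: str,
             ask: Callable[[str], str], max_retries: int = 2) -> Dict[str, dict]:
    """Run the three stages in order, retrying a stage on malformed output."""
    results: Dict[str, dict] = {}
    for stage in STAGES:
        prompt = build_prompt(stage, description, text_prompt, results)
        for _ in range(max_retries + 1):
            try:
                results[stage] = parse_response(ask(prompt))
                break
            except ValueError:
                continue  # malformed reply: ask again with the same prompt
        else:
            raise RuntimeError(f"no valid JSON reply for stage '{stage}'")
    return results
```

Passing the model in as a plain callable keeps the sketch testable with a stub and independent of any particular MiniGPT-4 interface. Validating each reply before moving on ensures that a later stage never reuses an analysis that failed to parse.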

Experimental Results

Research Questions

RQ1: Can a pre-trained visual LLM provide reliable, explainable evaluations of fidelity, alignment, and aesthetics for text-to-image generation without fine-tuning?
RQ2: Does a hierarchical CoT prompting strategy yield results that correlate with human judgments better than traditional metrics like CLIPScore or aesthetic predictors?
RQ3: How do model size and temperature affect the stability and consistency of X-IQE's evaluations?
RQ4: Can X-IQE discriminate real versus AI-generated images and serve as a robust benchmark across multiple diffusion models and prompts?

Key Findings

X-IQE achieves correlations with human judgments that are competitive with or surpass some task-specific models on COCO Caption data.
The hierarchical CoT with expert-informed prompts improves evaluation quality and consistency over baselines that directly ask for scores without reasoning.
X-IQE can reliably distinguish real and AI-generated images through fidelity distributions and related qualitative examples.
X-IQE's alignment and aesthetics scores correlate more strongly with human evaluations than CLIPScore and Aesthetic Predictor on the tested datasets.
A larger model (Vicuna 13B) and a low sampling temperature (0.1) yield more stable and accurate evaluations, supporting the use of larger visual-LM backbones for this task.
X-IQE provides a transparent, training-free benchmarking framework that can compare multiple SOTA text-to-image models (e.g., Stable Diffusion variants, Openjourney, DeepFloyd-IF).
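
Several of the findings above rest on how well automatic scores track human judgments. One common way to quantify that is a rank correlation, sketched below in plain Python. This summary does not say which statistic the paper reports, so the choice of Spearman's coefficient here is an assumption, and the function names are illustrative.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)
```

In use, `x` would hold one metric's score per image and `y` the matching human ratings. Because only ranks matter, metrics on different scales (for example a 1-to-10 rating and a cosine similarity) can be compared against the same human labels.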

This review was generated by AI and reviewed by human editors.