QUICK REVIEW

[Paper Review] A Survey on Quality Metrics for Text-to-Image Generation

Sebastian Hartwig, Dominik Engel|arXiv (Cornell University)|Mar 18, 2024

Computer Graphics and Visualization Techniques|Citations: 5

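One-Line Summary

This survey reviews evaluation metrics for text-to-image (T2I) generation, introduces a taxonomy based on compositional quality and general image quality, and discusses embedding-based and content-based approaches along with practical guidelines and open challenges.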

ABSTRACT

AI-based text-to-image models do not only excel at generating realistic images, they also give designers more and more fine-grained control over the image content. Consequently, these approaches have gathered increased attention within the computer graphics research community, which has been historically devoted towards traditional rendering techniques, that offer precise control over scene parameters (e.g., objects, materials, and lighting). While the quality of conventionally rendered images is assessed through well established image quality metrics, such as SSIM or PSNR, the unique challenges of text-to-image generation require other, dedicated quality metrics. These metrics must be able to not only measure overall image quality, but also how well images reflect given text prompts, whereby the control of scene and rendering parameters is interweaved. Within this survey, we provide a comprehensive overview of such text-to-image quality metrics, and propose a taxonomy to categorize these metrics. Our taxonomy is grounded in the assumption, that there are two main quality criteria, namely compositional quality and general quality, that contribute to the overall image quality. Besides the metrics, this survey covers dedicated text-to-image benchmark datasets, over which the metrics are frequently computed. Finally, we identify limitations and open challenges in the field of text-to-image generation, and derive guidelines for practitioners conducting text-to-image evaluation.

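Motivation and Objectives

Motivates the need for robust T2I evaluation metrics that agree with human judgment.
Presents a taxonomy of T2I quality metrics that distinguishes image-based metrics from text-conditioned metrics.
Reviews embedding-based and content-based alignment metrics and their role in T2I evaluation.
Discusses open challenges and provides guidelines for improving evaluation frameworks for T2I systems.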

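Proposed Approach

Proposes a taxonomy of T2I metrics based on compositional quality and general image quality.
Classifies metrics into embedding-based and content-based text-image alignment approaches.
Analyzes how vision-language pre-training (e.g., CLIP, BLIP) influences embedding-based metrics.
Discusses how image-only metrics, and distribution-based versus single-image metrics, relate to T2I quality.
Maps metrics to human judgment and emphasizes validation through human studies.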
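As a rough illustration of what an embedding-based alignment metric computes, the sketch below scores an image against a prompt by the cosine similarity of their embeddings, using the CLIPScore-style form w * max(cos, 0) with w = 2.5. The function names are hypothetical, and the three-dimensional vectors are toy stand-ins for real encoder outputs (an actual metric would obtain them from a vision-language model such as CLIP). It is not taken from the survey itself.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def embedding_alignment_score(
    image_emb: Sequence[float],
    text_emb: Sequence[float],
    weight: float = 2.5,
) -> float:
    """CLIPScore-style score: weight * max(cosine similarity, 0)."""
    return weight * max(cosine_similarity(image_emb, text_emb), 0.0)


# Toy vectors standing in for real image/text encoder outputs (assumption).
image_embedding = [0.2, 0.9, 0.1]
matching_prompt = [0.25, 0.85, 0.05]
unrelated_prompt = [0.9, -0.3, 0.2]

print(embedding_alignment_score(image_embedding, matching_prompt))
print(embedding_alignment_score(image_embedding, unrelated_prompt))
```

The clamp at zero means that prompts pointing away from the image in embedding space all receive the same minimum score, so the metric only ranks images among plausible matches.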

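Experimental Results

Research Questions

RQ1: Which core dimensions, namely compositional quality and general quality, best reflect human judgment in T2I evaluation?
RQ2: How do embedding-based and content-based metrics compare in capturing text-image alignment in T2I outputs?
RQ3: What are the open challenges and practical guidelines for evaluating text-conditioned image generation systems?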
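One common way to quantify how well a metric reflects human judgment is the rank correlation between metric scores and human ratings over the same set of images. The sketch below computes Spearman's rank correlation in plain Python. The scores and ratings are made-up illustrative values, not data from the survey, and the helper names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores: Sequence[float], human_ratings: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    if len(metric_scores) != len(human_ratings):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))


# Illustrative values only: one metric score and one human rating per image.
metric_scores = [0.31, 0.27, 0.35, 0.22, 0.29]
human_ratings = [4, 3, 5, 1, 2]
print(spearman(metric_scores, human_ratings))
```

A value near 1 means the metric orders images much as the human raters do, while a value near 0 means the metric's ranking carries little information about human preference.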

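Key Findings

The survey presents a taxonomy that distinguishes purely image-based metrics from text-conditioned metrics, focusing on compositional quality and general image quality.
Embedding-based metrics (e.g., CLIP-derived scores) are prominent for text-image alignment, but may need fine-tuning to match nuanced human judgments.
Content-based metrics evaluate explicit content fidelity, such as object accuracy and spatial and attribute relations.
Guidelines and standardized evaluation practices are needed to improve consistency and capture human preferences.
The survey highlights open challenges and suggests directions for advancing evaluation mechanisms and optimization of T2I models.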
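To make the idea of a content-based metric concrete, the sketch below checks a generated image's detected content against what the prompt asks for: first object presence and counts, then (subject, relation, object) triples covering spatial and attribute relations. It assumes that the expected content has already been parsed from the prompt and that the detections come from some upstream detector or VQA model; both inputs are hand-written here, and all names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

Relation = Tuple[str, str, str]  # (subject, relation, object or attribute)


def object_accuracy(expected: Iterable[str], detected: Iterable[str]) -> float:
    """Fraction of expected object instances that were also detected."""
    expected_counts = Counter(expected)
    detected_counts = Counter(detected)
    total = sum(expected_counts.values())
    if total == 0:
        raise ValueError("the prompt must mention at least one object")
    matched = sum(
        min(count, detected_counts[name])
        for name, count in expected_counts.items()
    )
    return matched / total


def relation_accuracy(expected: Set[Relation], detected: Set[Relation]) -> float:
    """Fraction of expected relation triples found among the detections."""
    if not expected:
        raise ValueError("the prompt must imply at least one relation")
    return len(expected & detected) / len(expected)


# Hand-written example for the prompt "two dogs to the left of a red ball".
expected_objects = ["dog", "dog", "ball"]
detected_objects = ["dog", "ball", "tree"]  # assumed detector output

expected_relations = {("dog", "left of", "ball"), ("ball", "has color", "red")}
detected_relations = {("dog", "left of", "ball"), ("ball", "has color", "blue")}

print(object_accuracy(expected_objects, detected_objects))
print(relation_accuracy(expected_relations, detected_relations))
```

Unlike a single embedding similarity, scores of this kind show which part of the prompt was missed, for example a wrong object count versus a wrong color.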


This review was generated by AI and checked by a human editor.