QUICK REVIEW

[論文レビュー] Holistic Evaluation of Text-To-Image Models

Tong Lee, Michihiro Yasunaga|arXiv (Cornell University)|Nov 7, 2023

Multimedia Communication and Technology被引用数 13

ひとこと要約

本論文は HEIM を紹介する、26 の text-to-image モデルを 12 の側面で評価し、62 のシナリオと人間および自動指標を用いた総合ベンチマークで、モデル間に多様な強みが見られ、人間と自動指標の間の相関は一般に弱いことを明らかにする。

ABSTRACT

The stunning qualitative improvement of recent text-to-image models has led to their widespread attention and adoption. However, we lack a comprehensive quantitative understanding of their capabilities and risks. To fill this gap, we introduce a new benchmark, Holistic Evaluation of Text-to-Image Models (HEIM). Whereas previous evaluations focus mostly on text-image alignment and image quality, we identify 12 aspects, including text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. We curate 62 scenarios encompassing these aspects and evaluate 26 state-of-the-art text-to-image models on this benchmark. Our results reveal that no single model excels in all aspects, with different models demonstrating different strengths. We release the generated images and human evaluation results for full transparency at https://crfm.stanford.edu/heim/v1.1.0 and the code at https://github.com/stanford-crfm/helm, which is integrated with the HELM codebase.

研究の動機と目的

画像品質と整合性を超えたテキストから画像へのモデルを評価する総合ベンチマークを確立する。
実用的な展開を念頭に、バイアス、毒性、公正性、多言語性、効率性を含む 12 次元を評価する。
標準化され透明性のある評価を提供するために、多数のモデルとシナリオに対して。
現実的で多様なプロンプティング課題に対する異なるモデルの性能に関する示唆を提供する。

提案手法

広範な能力とリスクを網羅するため、12 の評価側面と 62 の prompting シナリオ（プロンプトと参照）を定義する。
各側面を評価するために、25 の指標を用い、人的判断（クラウドソーシング）と自動指標を組み合わせる。
ゼロショット prompting と一般的な適応戦略の下で、26 の最近の text-to-image モデルに対して評価を標準化する。
独創性、美学、多言語性、頑健性、効率性など未検討の側面について新しいシナリオと指標を作成する。
透明性と再現性のために、生成画像、人間の結果、評価コードを公開する。

Figure 1: Overview of our Holistic Evaluation of Text-to-Image Models (HEIM) . While existing benchmarks focus on limited aspects such as image quality and alignment with text, rely on automated metrics that may not accurately reflect human judgment, and evaluate limited models, HEIM takes a holisti

実験結果

リサーチクエスチョン

RQ1最先端の text-to-image モデルは、従来の整合性や品質を超えた広範な総合的側面でどのように性能を示すか？
RQ2これらのモデルを評価する際の人間の判断と自動指標の関係はどうなるか？
RQ3さまざまな側面で強みや弱みを示すモデルはどれか、また現在の能力から生じる倫理的・社会的リスクは何か？
RQ4多言語性、頑健性、効率性といった要因は実用的な展開にどのように影響するか。）

主な発見

1つのモデルがすべての側面で卓越しているわけではない。モデル間で異なる強みがある（例：整合性には DALL-E 2、美学には Openjourney、偏り/毒性緩和には minDALL-E と Safe Stable Diffusion）。
人間と自動指標の相関は一般に弱く、特にフォトリアリズムと美学ではそうである。人間評価の価値を強調する。
いくつかの側面はより注目が必要：推論能力と多言語性は他に比べ後れ、独創性、毒性、バイアスは倫理的/法的懸念を引き起こす。
プロンプト設計は視覚的魅力を高め、Promptist+Stable Diffusion は美学で優れている一方、整合性を維持する。
美術分野に特化した微調整モデルは美学やリアリズムに優れるが、整合性やバイアス緩和など他の側面を犠牲にすることがある。
DALL-E 2 は人間と整合した性能でしばし主導することが多いが、すべての社会的または多言語的側面で支配的なモデルはなく、異なるモデルが補完的な強みを提供する。

Figure 2: Standardized evaluation . Prior to HEIM ( top panel ), the evaluation of image generation models was not comprehensive: six of our 12 core aspects were not evaluated on existing models, and only 11% of the total evaluation space was studied (the percentage of ✓in the matrix of aspects $\ti

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。