QUICK REVIEW

[Paper Review] Pick-a-Pic: An Open Dataset of User Preferences for Text-to-Image Generation

Yuval Kirstain, Adam Polyak | arXiv (Cornell University) | May 2, 2023

Data Visualization and Analytics | Citations: 41

One-Line Summary

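The authors release Pick-a-Pic, a large open dataset of real users' prompts and image preferences, and use it to train PickScore. PickScore is a CLIP-based scoring function that predicts human preferences with superhuman accuracy and gives better model evaluation than existing metrics.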

ABSTRACT

The ability to collect a large dataset of human preferences from text-to-image users is usually limited to companies, making such datasets inaccessible to the public. To address this issue, we create a web app that enables text-to-image users to generate images and specify their preferences. Using this web app we build Pick-a-Pic, a large, open dataset of text-to-image prompts and real users' preferences over generated images. We leverage this dataset to train a CLIP-based scoring function, PickScore, which exhibits superhuman performance on the task of predicting human preferences. Then, we test PickScore's ability to perform model evaluation and observe that it correlates better with human rankings than other automatic evaluation metrics. Therefore, we recommend using PickScore for evaluating future text-to-image generation models, and using Pick-a-Pic prompts as a more relevant dataset than MS-COCO. Finally, we demonstrate how PickScore can enhance existing text-to-image models via ranking.

Motivation and Objectives

Create a large, open dataset of real user prompts and preferences in text-to-image generation.
Train a scoring function (PickScore) to predict user preferences from prompt–image pairs.
Demonstrate PickScore’s superiority for model evaluation over traditional metrics like FID.
Show how ranking with PickScore can improve image quality and model selection.
Encourage adoption of Pick-a-Pic and PickScore for future T2I research and evaluation.

Proposed Method

Develop a web app that lets users generate images from prompts and indicate preferences between two images per round.
Collect and preprocess real user interactions to form the Pick-a-Pic dataset with prompts, image pairs, and preferences.
Fine-tune a CLIP-H backbone with a reward-model-style objective that maximizes the likelihood of the preferred image, which amounts to minimizing a KL divergence between the observed and predicted preferences.
Score each prompt–image pair with the CLIP text and image encoders and a temperature parameter, and weight the loss per example to account for how often each prompt occurs in the dataset.
Evaluate PickScore against baselines (CLIP-H, an aesthetics predictor, random choice, and human experts) using an accuracy metric adapted to handle ties, together with a tie-threshold analysis.
Compare model evaluation signals (PickScore vs. FID) and show their correlations with human judgments on MS-COCO captions.
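
The scoring and training steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are hand-made toy vectors standing in for CLIP encoder outputs, the score is assumed to be a temperature-scaled cosine similarity, the predicted preference is assumed to be a softmax over the two scores, and `weight` is a hypothetical per-example weight (for instance, the inverse of a prompt's frequency).

```python
import math

def normalize(vec):
    """Scale a vector to unit length, as is usual for CLIP embeddings."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def score(text_emb, image_emb, temperature):
    """Temperature-scaled cosine similarity between prompt and image embeddings."""
    t, v = normalize(text_emb), normalize(image_emb)
    return temperature * sum(a * b for a, b in zip(t, v))

def predicted_preference(s1, s2):
    """Softmax over two scores: predicted probability that each image is preferred."""
    m = max(s1, s2)  # subtract the max for numerical stability
    e1, e2 = math.exp(s1 - m), math.exp(s2 - m)
    return [e1 / (e1 + e2), e2 / (e1 + e2)]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); terms with zero target mass contribute nothing."""
    return sum(p * (math.log(p) - math.log(max(q, eps)))
               for p, q in zip(target, predicted) if p > 0)

def preference_loss(text_emb, image_emb_1, image_emb_2, target, temperature, weight=1.0):
    """Weighted KL loss for one (prompt, image pair, preference) example."""
    s1 = score(text_emb, image_emb_1, temperature)
    s2 = score(text_emb, image_emb_2, temperature)
    return weight * kl_divergence(target, predicted_preference(s1, s2))

# Toy 2-d embeddings (hypothetical; real ones would come from the CLIP encoders).
prompt = [1.0, 0.0]
image_a = [1.0, 0.1]  # closely aligned with the prompt
image_b = [0.0, 1.0]  # orthogonal to the prompt

# Target [1, 0] means the user preferred the first image; [0, 1] the second.
loss_when_a_preferred = preference_loss(prompt, image_a, image_b, [1.0, 0.0], temperature=10.0)
loss_when_b_preferred = preference_loss(prompt, image_a, image_b, [0.0, 1.0], temperature=10.0)
print(loss_when_a_preferred, loss_when_b_preferred)
```

The loss is small when the image the scorer already favors is the one the user picked, and large otherwise. The same `score` function can also rank candidates for a prompt, for example `max(candidates, key=lambda img: score(prompt, img, 10.0))`, where `candidates` is a hypothetical list of image embeddings.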

Experimental Results

Research Questions

RQ1: Can Pick-a-Pic provide a realistic, open distribution of prompts and preferences reflecting real user needs?
RQ2: Does PickScore reliably predict human preferences better than existing automatic scoring functions?
RQ3: Is PickScore a better metric for evaluating and ranking text-to-image models than FID or other baselines?
RQ4: Can ranking via PickScore meaningfully improve the quality of generated images compared to other scoring methods?

Key Findings

Model	Accuracy (%)
Random	56.8
Human Expert	68.0
Aesthetics [14]	56.8
CLIP-H [7]	60.8
ImageReward [18]	61.1
HPS [17]	66.7
PickScore (Ours)	70.5
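
Because the dataset contains ties, plain accuracy does not apply directly. The sketch below shows one way a tie-aware accuracy could be computed. The scoring rule is an assumption made for illustration (full credit for an exact match, half credit when exactly one of prediction and label is a tie, no credit otherwise), as are the names `predict_label`, `tie_aware_accuracy`, and `tie_threshold`; the review itself only states that accuracy was adapted for ties and analyzed across tie thresholds.

```python
def predict_label(p_first, p_second, tie_threshold):
    """Turn two preference probabilities into 'first', 'second', or 'tie'."""
    if abs(p_first - p_second) < tie_threshold:
        return "tie"
    return "first" if p_first > p_second else "second"

def tie_aware_accuracy(predictions, labels):
    """Assumed rule: 1 point for a match, 0.5 if exactly one side is a tie, else 0."""
    total = 0.0
    for pred, label in zip(predictions, labels):
        if pred == label:
            total += 1.0
        elif "tie" in (pred, label):
            total += 0.5
    return total / len(labels)

# Hypothetical predicted probabilities and user labels for four image pairs.
probabilities = [(0.9, 0.1), (0.52, 0.48), (0.3, 0.7), (0.6, 0.4)]
labels = ["first", "tie", "first", "tie"]

predictions = [predict_label(p1, p2, tie_threshold=0.1) for p1, p2 in probabilities]
accuracy = tie_aware_accuracy(predictions, labels)
print(predictions, accuracy)  # ['first', 'tie', 'second', 'first'] 0.625
```

Under a rule of this kind, a scorer can earn partial credit on tied examples, so chance-level accuracy need not be 50%.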

Pick-a-Pic delivers a large open dataset with over 500k examples (and later versions over 1M) of prompts, image pairs, and user preferences.
PickScore reaches 70.5% accuracy at predicting human preferences on the validation/test sets, exceeding both human experts (68.0%) and all automatic baselines.
PickScore's model rankings correlate strongly with human judgments (correlation 0.917), whereas FID on MS-COCO prompts correlates negatively (-0.900).
PickScore outperforms CLIP-H, an aesthetics predictor, and ImageReward/HPS baselines on the Pick-a-Pic evaluation.
Using prompts from Pick-a-Pic yields more human-aligned model evaluation than MS-COCO captions.
In image ranking experiments, PickScore-selected images are preferred by humans more often than selections by other scoring methods.
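
The correlation finding above compares how a metric orders a set of models with how humans order them. A minimal sketch of such a comparison follows. It assumes Spearman's rank correlation, which the review does not confirm as the coefficient used, and all model ratings and metric values are made-up toy numbers.

```python
import math

def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data for four text-to-image models.
human_rating = [1200, 1100, 1000, 900]  # higher = preferred by humans
metric_a = [0.80, 0.70, 0.75, 0.50]     # mostly agrees with the human ordering
metric_b = [0.10, 0.20, 0.30, 0.40]     # exactly reverses the human ordering

rho_a = spearman(metric_a, human_rating)
rho_b = spearman(metric_b, human_rating)
print(rho_a, rho_b)
```

In this toy setup `metric_a` swaps only two adjacent models and scores about 0.8, while `metric_b` reverses the human order entirely and scores -1.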

This review was generated by AI and checked by a human editor.