QUICK REVIEW

[Paper Review] JourneyDB: A Benchmark for Generative Image Understanding

Keqiang Sun, Junting Pan|arXiv (Cornell University)|Jul 3, 2023

Multimodal Machine Learning Applications | Cited by 11

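One-Line Summary

JourneyDB introduces a large-scale benchmark of generated images comprising 4M image-prompt pairs and four tasks (prompt inversion, style retrieval, image captioning, and VQA), plus an external subset drawn from 22 other text-to-image models, to evaluate and improve multimodal understanding of AI-generated content.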

ABSTRACT

While recent advancements in vision-language models have had a transformative impact on multi-modal comprehension, the extent to which these models possess the ability to comprehend generated images remains uncertain. Synthetic images, in comparison to real data, encompass a higher level of diversity in terms of both content and style, thereby presenting significant challenges for the models to fully grasp. In light of this challenge, we introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images within the context of multi-modal visual understanding. Our meticulously curated dataset comprises 4 million distinct and high-quality generated images, each paired with the corresponding text prompts that were employed in their creation. Furthermore, we additionally introduce an external subset with results of another 22 text-to-image generative models, which makes JourneyDB a comprehensive benchmark for evaluating the comprehension of generated images. On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension in relation to both content and style interpretation. These benchmarks encompass prompt inversion, style retrieval, image captioning, and visual question answering. Lastly, we evaluate the performance of state-of-the-art multi-modal models when applied to the JourneyDB dataset, providing a comprehensive analysis of their strengths and limitations in comprehending generated content. We anticipate that the proposed dataset and benchmarks will facilitate further research in the field of generative content understanding. The dataset is publicly available at https://journeydb.github.io.

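Motivation and Objectives

Create a large-scale dataset of generated images paired with their prompts to study the understanding of generated content.
Establish four benchmarks (prompt inversion, style retrieval, image captioning, and visual question answering) to evaluate content and style understanding.
Include an external subset from other text-to-image models to enable cross-dataset evaluation.
Evaluate state-of-the-art multimodal models on generated content and identify their strengths and limitations.
Provide a publicly accessible resource to advance research on generated content understanding.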

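Proposed Method

Collect generated images and their prompts by crawling Midjourney's Discord, and add outputs from 22 additional text-to-image models to diversify the content.
Use GPT-3.5 to split each prompt into style and content, generate captions, and create style- and content-related questions with answer options.
Cluster the large style space into 334 categories to make style retrieval tractable, and evaluate CLIP-based zero-shot retrieval within style subspaces.
Define and implement prompt inversion, style retrieval, image captioning, and zero-shot multiple-choice VQA (MC-VQA) to probe content and style understanding.
Run zero-shot and fine-tuned evaluations of recent multimodal models on JourneyDB, and analyze where they fall short and where they do well on generated content.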
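
The embedding-based zero-shot style retrieval step can be illustrated with a minimal sketch. It assumes image and style-label embeddings have already been produced by some encoder such as CLIP; the vectors below are made-up toy values, and the function names `cosine` and `retrieve_styles` are hypothetical, not taken from the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_styles(image_emb, style_embs, k=1):
    """Return the k style names whose embeddings are most similar to image_emb."""
    ranked = sorted(
        style_embs,
        key=lambda name: cosine(image_emb, style_embs[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings (assumed values for illustration only, not real CLIP outputs).
style_embs = {
    "watercolor": [0.9, 0.1, 0.0],
    "cyberpunk": [0.0, 0.2, 0.95],
    "oil painting": [0.7, 0.7, 0.1],
}
image_emb = [0.1, 0.1, 0.9]

top = retrieve_styles(image_emb, style_embs, k=2)
print(top)
```

Restricting `style_embs` to the labels of a single cluster would correspond to retrieval within one style subspace, which is a much smaller candidate set than the full style vocabulary.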

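Experimental Results

Research Questions

RQ1: Can the original text prompt be inferred from a generated image (prompt inversion)?
RQ2: How accurately can stylistic attributes be retrieved across generated images (style retrieval)?
RQ3: How effectively can models caption generated images and answer content- and style-related questions about them (captioning and VQA)?
RQ4: Do current multimodal models pretrained on real data generalize to generated content, and how does fine-tuning on JourneyDB affect performance?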

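Key Findings

State-of-the-art multimodal models perform worse on JourneyDB than on real-image benchmarks.
Fine-tuning on JourneyDB substantially improves task performance.
In captioning, the verbose GPT-3.5-generated ground-truth captions and their style-laden descriptions are difficult for existing models, so scores are lower than on real-image datasets.
Style retrieval benefits from organizing the large style vocabulary into categorized clusters, which improves retrieval within style subspaces.
MC-VQA accuracy shows that models have considerable difficulty with content- and style-related questions about generated content, highlighting a gap in current capabilities.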
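
As a rough illustration of how multiple-choice VQA accuracy can be reported separately for content and style questions, here is a minimal sketch. The record layout (`type`, `answer`, `predicted`) and the function name `mc_vqa_accuracy` are assumptions made for this example, and the records are toy data, not results from the paper.

```python
from collections import defaultdict

def mc_vqa_accuracy(records):
    """Return accuracy per question type for multiple-choice predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["type"]] += 1
        if r["predicted"] == r["answer"]:
            correct[r["type"]] += 1
    return {t: correct[t] / total[t] for t in total}

# Toy records (hypothetical; each holds the gold option and the model's choice).
records = [
    {"type": "content", "answer": "A", "predicted": "A"},
    {"type": "content", "answer": "C", "predicted": "B"},
    {"type": "style", "answer": "D", "predicted": "A"},
    {"type": "style", "answer": "B", "predicted": "C"},
]

scores = mc_vqa_accuracy(records)
print(scores)
```

Splitting accuracy by question type makes it possible to see whether a model struggles more with style than with content, rather than hiding both behind a single number.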

This review was generated by AI and checked by a human editor.