QUICK REVIEW

[Paper Review] JourneyDB: A Benchmark for Generative Image Understanding

Keqiang Sun, Junting Pan|arXiv (Cornell University)|2023. 07. 03.

Multimodal Machine Learning Applications | Citations: 11

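One-Line Summary

JourneyDB introduces a large-scale benchmark of generated images, comprising 4M image-prompt pairs, four tasks (prompt inversion, style retrieval, image captioning, and VQA), and an external subset drawn from other models, with the aim of evaluating and improving multimodal understanding of AI-generated content.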

ABSTRACT

While recent advancements in vision-language models have had a transformative impact on multi-modal comprehension, the extent to which these models possess the ability to comprehend generated images remains uncertain. Synthetic images, in comparison to real data, encompass a higher level of diversity in terms of both content and style, thereby presenting significant challenges for the models to fully grasp. In light of this challenge, we introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images within the context of multi-modal visual understanding. Our meticulously curated dataset comprises 4 million distinct and high-quality generated images, each paired with the corresponding text prompts that were employed in their creation. Furthermore, we additionally introduce an external subset with results of another 22 text-to-image generative models, which makes JourneyDB a comprehensive benchmark for evaluating the comprehension of generated images. On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension in relation to both content and style interpretation. These benchmarks encompass prompt inversion, style retrieval, image captioning, and visual question answering. Lastly, we evaluate the performance of state-of-the-art multi-modal models when applied to the JourneyDB dataset, providing a comprehensive analysis of their strengths and limitations in comprehending generated content. We anticipate that the proposed dataset and benchmarks will facilitate further research in the field of generative content understanding. The dataset is publicly available at https://journeydb.github.io.

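Research Motivation and Goals

Build a large-scale dataset of generated images paired with their prompts to support research on understanding generated content.
Establish four benchmarks (prompt inversion, style retrieval, image captioning, and visual question answering) to evaluate content and style understanding.
Include an external subset from other text-to-image models to enable cross-dataset evaluation.
Evaluate state-of-the-art multimodal models on generated content and identify their strengths and limitations.
Provide a publicly accessible resource to advance research on generated-content understanding.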

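Proposed Method

Collect generated images and prompts by crawling Midjourney prompts from Discord, and add outputs from 22 additional text-to-image models to increase content diversity.
Annotate the tasks with GPT-3.5: split each prompt into style and content parts, generate captions, and generate style- and content-related questions with answer choices.
To make style retrieval tractable, cluster the large style space into 334 categories and evaluate with CLIP-based zero-shot retrieval within the style subspace.
Define and implement four benchmarks to probe content and style understanding: prompt inversion, style retrieval, image captioning, and zero-shot VQA (MC-VQA).
Run zero-shot and fine-tuned evaluations of contemporary multimodal models on JourneyDB, and analyze the results to expose gaps and strengths in how they handle generated content.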
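
The zero-shot style retrieval step can be illustrated with a minimal sketch. It assumes that every image and every style label has already been embedded into a shared vector space (for example by a CLIP-like encoder), and it ranks style labels by cosine similarity to the image. The three-dimensional vectors, the style names, the function names, and the recall@k metric are all hypothetical placeholders chosen for illustration. The review does not state which retrieval metric the paper reports.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_styles(
    image_emb: Sequence[float],
    style_embs: Dict[str, Sequence[float]],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the top_k (style name, similarity) pairs, most similar first."""
    scored = [(name, cosine(image_emb, emb)) for name, emb in style_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def recall_at_k(
    queries: Sequence[Tuple[Sequence[float], str]],
    style_embs: Dict[str, Sequence[float]],
    k: int,
) -> float:
    """Fraction of queries whose true style appears among the top k retrieved styles."""
    if not queries:
        return 0.0
    hits = 0
    for image_emb, true_style in queries:
        retrieved = [name for name, _ in rank_styles(image_emb, style_embs, k)]
        if true_style in retrieved:
            hits += 1
    return hits / len(queries)


# Hypothetical toy embeddings standing in for encoder outputs (not real CLIP features).
STYLE_EMBS = {
    "watercolor": [1.0, 0.0, 0.0],
    "pixel art": [0.0, 1.0, 0.0],
    "cinematic lighting": [0.0, 0.0, 1.0],
}

# Each query is (image embedding, ground-truth style label).
QUERIES = [
    ([0.9, 0.1, 0.0], "watercolor"),
    ([0.1, 0.8, 0.3], "pixel art"),
    ([0.6, 0.0, 0.5], "cinematic lighting"),  # ambiguous: closer to "watercolor"
]

if __name__ == "__main__":
    for k in (1, 2):
        print(f"recall@{k} = {recall_at_k(QUERIES, STYLE_EMBS, k):.3f}")
```

In this toy setup the third query is deliberately ambiguous, so it is missed at k = 1 and recovered at k = 2. Swapping the hand-written vectors for real encoder outputs would leave the ranking logic unchanged.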

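Experimental Results

Research Questions

RQ1: Can models infer the original text prompt from a generated image (prompt inversion)?
RQ2: Can style attributes be retrieved across generated images (style retrieval)?
RQ3: How effectively can models caption generated images and answer content- and style-related questions about them (captioning and VQA)?
RQ4: Do current multimodal models pretrained on real data generalize to generated content, and how does fine-tuning on JourneyDB affect performance?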

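Key Results

State-of-the-art multimodal models perform worse on JourneyDB than on real-image benchmarks.
Fine-tuning on JourneyDB substantially improves task performance.
Captioning scores drop relative to real-image datasets because the long GPT-3.5-generated reference captions and style descriptions are difficult for existing models.
Style retrieval benefits from organizing the large style vocabulary into clustered categories, which improves retrieval within the style subspace.
MC-VQA accuracy shows that answering content- and style-related questions about generated content remains considerably difficult, highlighting a gap in current capabilities.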
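
To make the MC-VQA result concrete, here is a minimal sketch of how accuracy could be computed separately for content and style questions. It assumes MC-VQA denotes multiple-choice VQA, in line with the GPT-3.5-generated questions and answer choices described under the method. The record layout (`question_type`, `answer`, `prediction`) and the toy records are hypothetical and do not reflect the dataset's actual schema or any reported numbers.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def accuracy_by_type(items: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Compute exact-match accuracy per question type.

    Each item is a mapping with the hypothetical keys:
      "question_type": e.g. "content" or "style"
      "answer":        the ground-truth choice
      "prediction":    the choice selected by the model
    """
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for item in items:
        qtype = item["question_type"]
        total[qtype] += 1
        if item["prediction"] == item["answer"]:
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


# Hypothetical toy predictions, for illustration only.
ITEMS = [
    {"question_type": "content", "answer": "a fox", "prediction": "a fox"},
    {"question_type": "content", "answer": "a castle", "prediction": "a bridge"},
    {"question_type": "style", "answer": "watercolor", "prediction": "watercolor"},
]

if __name__ == "__main__":
    for qtype, acc in sorted(accuracy_by_type(ITEMS).items()):
        print(f"{qtype}: {acc:.2f}")
```

Reporting the two question types separately keeps a model that handles content well but misreads style from hiding behind a single averaged score.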

This review was generated by AI and reviewed by a human editor.