QUICK REVIEW

[Paper Review] Re-Imagen: Retrieval-Augmented Text-to-Image Generator

Wenhu Chen, Hexiang Hu | arXiv (Cornell University) | 2022-09-29

Multimodal Machine Learning Applications | Citations: 44

One-Line Summary

Re-Imagen grounds text-to-image diffusion in retrieved external multimodal references, improving fidelity for rare or unseen entities. It reports strong FID on standard benchmarks and introduces a new benchmark, EntityDrawBench.

ABSTRACT

Research on text-to-image generation has witnessed significant progress in generating diverse and photo-realistic images, driven by diffusion and auto-regressive models trained on large-scale image-text data. Though state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities, such as 'Chortai (dog)' or 'Picarones (food)'. To tackle this issue, we present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. With this retrieval step, Re-Imagen is augmented with the knowledge of high-level semantics and low-level visual details of the mentioned entities, and thus improves its accuracy in generating the entities' visual appearances. We train Re-Imagen on a constructed dataset containing (image, text, retrieval) triples to teach the model to ground on both text prompt and retrieval. Furthermore, we develop a new sampling strategy to interleave the classifier-free guidance for text and retrieval conditions to balance the text and retrieval alignment. Re-Imagen achieves significant gain on FID score over COCO and WikiImage. To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple object categories including dogs, foods, landmarks, birds, and characters. Human evaluation on EntityDrawBench shows that Re-Imagen can significantly improve the fidelity of generated images, especially on less frequent entities.

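Motivation and Objectives

Make text-to-image generation robust enough to stay faithful even for rare or unseen entities.
Ground visual appearance in external multimodal knowledge rather than relying on memorization.
Develop a training scheme and a sampling strategy that integrate text guidance and retrieval guidance.
Evaluate grounding and realism on standard benchmarks and on long-tail entity prompts.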

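Proposed Method

Uses a cascaded diffusion architecture with three generation stages (64×, 256×, 1024×) to produce high-resolution images.
Takes the input prompt as the query and retrieves the top-k image-text pairs from an external multimodal knowledge base (BM25 or CLIP-based similarity).
Encodes the retrieved <image, text> references and feeds them into the denoising U-Net through cross-attention.
Interleaves classifier-free guidance during sampling to balance text guidance against retrieval guidance (two adjusted epsilon predictions and a mixing ratio).
Trains on the KNN-ImageText dataset derived from ImageText data, using the top-k neighbors as retrieval results and randomly dropping conditions so that the model also learns marginalized denoising.
Evaluates zero-shot FID on COCO/WikiImages and measures faithfulness and photorealism through human evaluation on the new EntityDrawBench.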
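
The retrieval and condition-dropping steps listed above can be illustrated with a small sketch. It is a toy illustration under stated assumptions, not the paper's implementation: cosine similarity over plain float vectors stands in for CLIP-based similarity, and the names `cosine`, `top_k`, `drop_conditions`, `p_drop_text`, and `p_drop_neighbors` are hypothetical. The review does not state the drop probabilities, so they are left as parameters.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Embedding = Sequence[float]


def cosine(a: Embedding, b: Embedding) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: Embedding, keys: Sequence[Embedding], k: int) -> List[int]:
    """Indices of the k knowledge-base entries most similar to the query, best first."""
    ranked = sorted(range(len(keys)), key=lambda i: cosine(query, keys[i]), reverse=True)
    return ranked[:k]


def drop_conditions(
    text: str,
    neighbors: List[int],
    p_drop_text: float,       # hypothetical parameter, value not given in the review
    p_drop_neighbors: float,  # hypothetical parameter, value not given in the review
    rng: random.Random,
) -> Tuple[Optional[str], Optional[List[int]]]:
    """Independently replace each condition with None with the given probability.

    Training on such examples exposes the denoiser to inputs where the text,
    the retrieved neighbors, or both are absent.
    """
    kept_text = None if rng.random() < p_drop_text else text
    kept_neighbors = None if rng.random() < p_drop_neighbors else neighbors
    return kept_text, kept_neighbors


# Usage: a four-entry toy knowledge base with 2-D embeddings.
kb = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
neighbors = top_k([1.0, 0.0], kb, k=2)
example = drop_conditions("a Chortai dog", neighbors, 0.0, 0.0, random.Random(0))
```

In a real system the embeddings would come from a text or image encoder and the search would use an approximate nearest-neighbor index; the exhaustive sort here only keeps the example short.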

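Experimental Results

Research Questions

RQ1: Can retrieval-augmented conditioning improve faithfulness for rare or unseen entities in text-to-image generation?
RQ2: How does the use of external multimodal knowledge affect the standard image-quality metric (FID) and entity faithfulness?
RQ3: How do retrieval quality, the number of retrieved items, and the guidance balance affect results for common versus rare entities?
RQ4: Does interleaved guidance provide a better trade-off between text alignment and entity grounding?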

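Key Results

Retrieval-augmented generation yields a substantial FID improvement on COCO and WikiImages over strong baselines such as Imagen.
Grounding on retrieved references improves faithfulness to both the text prompt and the reference entity, most visibly for less frequent entities.
In the EntityDrawBench human evaluation, Re-Imagen achieves higher faithfulness than competing models across entity types (dogs, foods, landmarks, birds, characters).
Increasing the number of retrieved neighbors (K) brings larger gains on rare entities, suggesting that retrieval grounding is especially helpful for tail prompts.
Interleaved guidance provides a controllable trade-off between text alignment and entity faithfulness; a sweet spot around equal weighting, η ≈ 0.5, is suggested.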
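
One way to read the interleaved guidance and its mixing ratio η is sketched below. This is a hedged interpretation of "two adjusted epsilon predictions and a mixing ratio", not the paper's exact formulation: at each sampling step, one of two classifier-free-guided predictions is chosen at random, the text-enhanced one with probability η. The denoiser signature, `toy_denoiser`, and the weights `w_text` and `w_ret` are assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical denoiser signature: (x_t, text, references) -> predicted noise.
# Passing None for a condition stands for dropping that condition.
Denoiser = Callable[[Sequence[float], Optional[str], Optional[Sequence[str]]], Vector]


def guided_eps(eps_full: Sequence[float], eps_dropped: Sequence[float], weight: float) -> Vector:
    """Classifier-free guidance: extrapolate from the prediction with one
    condition dropped toward the fully conditioned prediction."""
    return [weight * f - (weight - 1.0) * d for f, d in zip(eps_full, eps_dropped)]


def interleaved_eps(
    denoiser: Denoiser,
    x_t: Sequence[float],
    text: str,
    refs: Sequence[str],
    w_text: float,
    w_ret: float,
    eta: float,
    rng: random.Random,
) -> Vector:
    """Return a text-enhanced prediction with probability eta,
    otherwise a retrieval-enhanced prediction."""
    eps_full = denoiser(x_t, text, refs)
    if rng.random() < eta:
        # Text-enhanced step: contrast against the prediction without text.
        return guided_eps(eps_full, denoiser(x_t, None, refs), w_text)
    # Retrieval-enhanced step: contrast against the prediction without references.
    return guided_eps(eps_full, denoiser(x_t, text, None), w_ret)


def toy_denoiser(x_t, text, refs):
    """Stand-in for a U-Net: each condition that is present shifts the output."""
    shift = (1.0 if text is not None else 0.0) + (2.0 if refs is not None else 0.0)
    return [v + shift for v in x_t]


# Usage: eta = 1.0 always takes the text-enhanced branch, eta = 0.0 never does.
rng = random.Random(0)
text_step = interleaved_eps(toy_denoiser, [0.0], "prompt", ["ref"], 2.0, 3.0, 1.0, rng)
ret_step = interleaved_eps(toy_denoiser, [0.0], "prompt", ["ref"], 2.0, 3.0, 0.0, rng)
```

Under this reading, η ≈ 0.5 means roughly half of the sampling steps emphasize the text prompt and half emphasize the retrieved references, which matches the equal-weighting sweet spot reported above.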

This review was generated by AI and reviewed by a human editor.