QUICK REVIEW

[Paper Review] Re-Imagen: Retrieval-Augmented Text-to-Image Generator

Wenhu Chen, Hexiang Hu|arXiv (Cornell University)|Sep 29, 2022

Multimodal Machine Learning Applications | Cited by 44

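One-Sentence Summary

Re-Imagen retrieves external multimodal references to anchor text-to-image diffusion, which improves fidelity on rare or unseen entities and yields strong FID and grounding on standard benchmarks and on a new entity-drawing benchmark.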

ABSTRACT

Research on text-to-image generation has witnessed significant progress in generating diverse and photo-realistic images, driven by diffusion and auto-regressive models trained on large-scale image-text data. Though state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities, such as 'Chortai (dog)' or 'Picarones (food)'. To tackle this issue, we present the Retrieval-Augmented Text-to-Image Generator (Re-Imagen), a generative model that uses retrieved information to produce high-fidelity and faithful images, even for rare or unseen entities. Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. With this retrieval step, Re-Imagen is augmented with the knowledge of high-level semantics and low-level visual details of the mentioned entities, and thus improves its accuracy in generating the entities' visual appearances. We train Re-Imagen on a constructed dataset containing (image, text, retrieval) triples to teach the model to ground on both text prompt and retrieval. Furthermore, we develop a new sampling strategy to interleave the classifier-free guidance for text and retrieval conditions to balance the text and retrieval alignment. Re-Imagen achieves significant gain on FID score over COCO and WikiImage. To further evaluate the capabilities of the model, we introduce EntityDrawBench, a new benchmark that evaluates image generation for diverse entities, from frequent to rare, across multiple object categories including dogs, foods, landmarks, birds, and characters. Human evaluation on EntityDrawBench shows that Re-Imagen can significantly improve the fidelity of generated images, especially on less frequent entities.

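Motivation and Goals

Advance robust text-to-image generation that stays faithful on rare or unseen entities.
Ground visual appearance in external multimodal knowledge rather than relying on memorization alone.
Develop a training scheme and a sampling strategy that integrate text guidance and retrieval guidance.
Evaluate grounding and realism on standard benchmarks and on long-tail entity prompts.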

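Proposed Method

Use a cascaded diffusion architecture that produces high-resolution images in three generation stages (64×64, 256×256, and 1024×1024).
Retrieve the top-k image-text pairs from an external multimodal knowledge base, using the input prompt as the query (BM25 or CLIP-based similarity).
Encode the retrieved (image, text) references and feed them into the denoising U-Net through cross-attention.
Apply interleaved classifier-free guidance during sampling to balance text guidance and retrieval guidance (two guided epsilon predictions and a mixing ratio).
Train on the KNN-ImageText dataset, derived from ImageText data, using each example's top-k neighbors as the retrieval and randomly dropping conditions so that the model also learns to denoise with a condition marginalized out.
Report zero-shot FID on COCO and WikiImages, and run human evaluation on the new EntityDrawBench to measure faithfulness and photorealism.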
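
The retrieval step above can be illustrated with a minimal sketch. It assumes the prompt and every knowledge-base entry have already been embedded as plain vectors; the hand-made 2-D vectors stand in for CLIP-style embeddings (BM25 would score the text instead). The names `KBEntry`, `cosine_similarity`, and `retrieve_top_k` are hypothetical and do not come from the paper.

```python
import math
from typing import List, NamedTuple, Sequence


class KBEntry(NamedTuple):
    """One (image, text) pair in a toy knowledge base (hypothetical layout)."""
    image_id: str
    caption: str
    embedding: Sequence[float]  # assumed precomputed, e.g. by a CLIP-style encoder


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(query: Sequence[float], kb: List[KBEntry], k: int) -> List[KBEntry]:
    """Return the k entries most similar to the query, best match first."""
    ranked = sorted(
        kb,
        key=lambda entry: cosine_similarity(query, entry.embedding),
        reverse=True,
    )
    return ranked[:k]


# Toy knowledge base with illustrative embeddings (not real data).
kb = [
    KBEntry("img0", "Chortai (dog)", [1.0, 0.0]),
    KBEntry("img1", "Picarones (food)", [0.0, 1.0]),
    KBEntry("img2", "Greyhound (dog)", [0.9, 0.1]),
]
neighbors = retrieve_top_k([1.0, 0.0], kb, k=2)
print([n.caption for n in neighbors])
```

In the model itself, the retrieved pairs would then be encoded and attended to by the U-Net; that part is omitted here.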

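Experimental Results

Research Questions

RQ1: Does retrieval-augmented conditioning improve fidelity on rare or unseen entities in text-to-image generation?
RQ2: How does grounding in external multimodal knowledge affect standard image-quality metrics (such as FID) and entity fidelity?
RQ3: How do retrieval quality, the number of retrieved items, and the guidance balance affect results on common versus rare entities?
RQ4: Does interleaved guidance offer a better trade-off between text alignment and entity grounding than conventional single-condition guidance?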

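Key Findings

Compared with strong baselines such as Imagen, retrieval-augmented generation yields significant FID improvements on COCO and WikiImages.
Grounding on the retrieved references improves faithfulness to both the text prompt and the referenced entities, especially for less common entities.
Human evaluation on EntityDrawBench shows that Re-Imagen achieves higher fidelity than competing models across diverse entity types (dogs, foods, landmarks, birds, characters).
Increasing the number of retrieved neighbors (K) yields larger gains on rare entities, indicating that retrieval grounding helps tail prompts in particular.
Interleaved guidance provides a controllable trade-off between text alignment and entity fidelity, with the suggested sweet spot at roughly equal weights (η ≈ 0.5).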
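
How the ratio η trades text alignment against entity fidelity can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it applies the standard classifier-free guidance formula twice, once against a prediction with the text dropped and once against a prediction with the retrieval dropped, and picks one of the two at each sampling step with probability η. The function names, the guidance weights, and the per-step random choice are assumptions made for this sketch.

```python
import random
from typing import List

Vector = List[float]


def guided_eps(eps_full: Vector, eps_dropped: Vector, weight: float) -> Vector:
    """Standard classifier-free guidance: extrapolate away from the
    prediction made with one condition dropped."""
    return [weight * f - (weight - 1.0) * d for f, d in zip(eps_full, eps_dropped)]


def interleaved_guidance(
    eps_full: Vector,          # prediction conditioned on text and retrieval
    eps_no_text: Vector,       # prediction with the text condition dropped
    eps_no_retrieval: Vector,  # prediction with the retrieval condition dropped
    w_text: float,
    w_retrieval: float,
    eta: float,
    rng: random.Random,
) -> Vector:
    """Pick the text-guided prediction with probability eta,
    otherwise the retrieval-guided one (assumed interleaving rule)."""
    if rng.random() < eta:
        return guided_eps(eps_full, eps_no_text, w_text)
    return guided_eps(eps_full, eps_no_retrieval, w_retrieval)


# Toy epsilon predictions (illustrative values, not model outputs).
eps_full = [1.0, 1.0]
eps_no_text = [0.0, 0.0]
eps_no_retrieval = [0.5, 0.5]

rng = random.Random(0)
steps = [
    interleaved_guidance(eps_full, eps_no_text, eps_no_retrieval,
                         w_text=3.0, w_retrieval=2.0, eta=0.5, rng=rng)
    for _ in range(1000)
]
# Fraction of sampling steps that used the text-guided prediction.
text_share = sum(1 for s in steps if s == [3.0, 3.0]) / len(steps)
print(text_share)
```

With η close to 1 almost every step strengthens the text condition, and with η close to 0 almost every step strengthens the retrieval condition; η ≈ 0.5 splits the steps roughly evenly between the two.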


This review was generated by AI and reviewed by human editors.