QUICK REVIEW

[Paper Digest] ReVersion: Diffusion-Based Relation Inversion from Images

Ziqi Huang, Tianxing Wu|arXiv (Cornell University)|Mar 23, 2023

Multimodal Machine Learning Applications | Cited by 8

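One-Sentence Summary

ReVersion learns a relation prompt from exemplar images by inverting a frozen text-to-image diffusion model, using a preposition prior and relation-focal importance sampling, so that new scenes can be generated in which objects interact through the extracted relation.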

ABSTRACT

Diffusion models gain increasing popularity for their generative capabilities. Recently, there have been surging needs to generate customized images by inverting diffusion models from exemplar images, and existing inversion methods mainly focus on capturing object appearances (i.e., the "look"). However, how to invert object relations, another important pillar in the visual world, remains unexplored. In this work, we propose the Relation Inversion task, which aims to learn a specific relation (represented as "relation prompt") from exemplar images. Specifically, we learn a relation prompt with a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. To tackle the Relation Inversion task, we propose the ReVersion Framework. Specifically, we propose a novel "relation-steering contrastive learning" scheme to steer the relation prompt towards relation-dense regions, and disentangle it away from object appearances. We further devise "relation-focal importance sampling" to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute the ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations. Our proposed task and method could be good inspirations for future research in various domains like generative inversion, few-shot learning, and visual relation detection.

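Research Motivation and Goals

Study a new problem, Relation Inversion: extracting the relation that a set of exemplar images has in common.
Learn a relation prompt in the text embedding space of a frozen pre-trained diffusion model.
Disentangle the relation prompt from object appearance to enable flexible, relation-driven image synthesis.
Introduce the ReVersion Benchmark for comprehensive evaluation of relation inversion.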

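Proposed Method

Introduce a preposition prior that steers the relation prompt toward a relation-dense subspace of the text embedding space.
Develop a relation-steering contrastive learning scheme that pulls the relation prompt toward basis prepositions and pushes it away from non-preposition words.
Use refined negative samples that include descriptions of the exemplar objects, to prevent appearance leakage.
Apply relation-focal importance sampling, which skews the sampled denoising timesteps toward higher noise levels to emphasize high-level interactions.
Optimize the relation prompt with a joint objective that combines the steering loss with a noise-robust denoising loss.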
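
The steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes an InfoNCE-style steering loss over cosine similarities and a cosine-skewed timestep density proportional to 1 - alpha * cos(pi * t / T). The function names, the temperature, `alpha`, and `steer_weight` are hypothetical placeholders chosen for illustration, since this digest does not give the exact formulas or hyperparameters.

```python
import numpy as np


def steering_loss(relation, positives, negatives, temperature=0.07):
    """InfoNCE-style contrastive loss with several positives (assumed form).

    relation:  (d,)   embedding of the learnable relation prompt
    positives: (P, d) embeddings of basis prepositions
    negatives: (N, d) embeddings of non-preposition words / object descriptions
    """
    def normalize(x):
        # Cosine similarity is an assumption of this sketch.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    r = normalize(relation)
    pos_logits = normalize(positives) @ r / temperature
    neg_logits = normalize(negatives) @ r / temperature
    all_logits = np.concatenate([pos_logits, neg_logits])

    # Log-sum-exp with a shared max for numerical stability.
    m = all_logits.max()
    log_pos = m + np.log(np.exp(pos_logits - m).sum())
    log_all = m + np.log(np.exp(all_logits - m).sum())
    # -log( sum_pos / (sum_pos + sum_neg) ), which is never negative.
    return float(log_all - log_pos)


def timestep_probs(num_steps=1000, alpha=0.5):
    """Probability of each timestep t = 1..num_steps, skewed toward large t."""
    t = np.arange(1, num_steps + 1)
    density = 1.0 - alpha * np.cos(np.pi * t / num_steps)
    return density / density.sum()


def sample_timesteps(rng, batch_size, num_steps=1000, alpha=0.5):
    """Draw training timesteps so that high-noise steps are visited more often."""
    probs = timestep_probs(num_steps, alpha)
    return rng.choice(np.arange(1, num_steps + 1), size=batch_size, p=probs)


def total_loss(denoise_loss, steer_loss, steer_weight=0.01):
    """Joint objective: denoising loss plus a weighted steering loss."""
    return denoise_loss + steer_weight * steer_loss
```

In a training loop under these assumptions, only the relation embedding would receive gradients while the diffusion model stays frozen: each step would draw `t` with `sample_timesteps`, compute the usual noise-prediction loss at that `t`, and add the steering term through `total_loss`. Setting `alpha=0` recovers uniform timestep sampling, which makes it easy to ablate the skew.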

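Experimental Results

Research Questions

RQ1: Can a relation prompt be learned from exemplar images that share a common relation, and then applied to generate new scenes with new objects?
RQ2: Does combining the preposition prior with contrastive steering improve extraction of the high-level relation while disentangling it from appearance?
RQ3: Does relation-focal importance sampling strengthen the focus on high-level interactions during diffusion inversion?
RQ4: How well does the learned relation prompt generalize to new entities and backgrounds?
RQ5: How do the proposed ReVersion components affect relation accuracy and entity accuracy in the generated images?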

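Key Findings

The framework learns a relation prompt that enables generating new scenes in which entities interact through the extracted relation.
The preposition prior and relation steering improve the disentanglement of relation from appearance and reduce appearance leakage from the exemplar entities.
Relation-focal importance sampling biases optimization toward high-level interactions, improving relation accuracy and entity fidelity.
Qualitative and quantitative evaluations show that ReVersion outperforms plain text-to-image generation and textual inversion baselines on the relation inversion task.
The dedicated ReVersion Benchmark provides diverse exemplar images and templates for evaluating relation inversion.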

This digest was generated by AI and reviewed by a human editor.