QUICK REVIEW

[Paper Review] AnyDoor: Zero-shot Object-level Image Customization

Xi Chen, Lianghua Huang|arXiv (Cornell University)|Jul 18, 2023

Generative Adversarial Networks and Image Synthesis|Cited by 11

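ONE-SENTENCE SUMMARY

AnyDoor is a diffusion-based system that teleports a target object to a user-specified location in a scene in a zero-shot manner, by encoding identity and detail features and injecting them into a pretrained diffusion model; it is trained on both video and image data for robust generalization.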

ABSTRACT

This work presents AnyDoor, a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations in a harmonious way. Instead of tuning parameters for each object, our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. To this end, we complement the commonly used identity feature with detail features, which are carefully designed to maintain texture details yet allow versatile local variations (e.g., lighting, orientation, posture, etc.), supporting the object in favorably blending with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Extensive experiments demonstrate the superiority of our approach over existing alternatives as well as its great potential in real-world applications, such as virtual try-on and object moving. Project page is https://damo-vilab.github.io/AnyDoor-Page/.

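MOTIVATION AND OBJECTIVES

Address the need for zero-shot, identity-preserving object relocation in images.
Represent the target object with identity tokens and a detail map that guide diffusion-based composition.
Train a robust, well-generalizing model using video-derived appearance variation together with large-scale image data.
Achieve high-fidelity, diverse object relocation at inference time without per-object fine-tuning.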

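PROPOSED METHOD

Represent the target object with identity tokens extracted by a self-supervised encoder (DINO-V2) after background removal.
Generate a high-frequency detail map with a Sobel-based high-pass filter, and use a collage-style approach that preserves texture while still allowing variation.
Inject the ID tokens and the detail map into Stable Diffusion as guidance, via cross-attention (ID) and feature concatenation (detail).
Train on paired video frames (the same object in different scenes) and on diverse images to capture variation in appearance and scene.
Use adaptive timestep sampling to balance early denoising steps (pose/structure), drawn from video data, against late steps (texture), drawn from image data.
At inference, crop and resize the scene region, and apply a zoom-in strategy to handle arbitrary aspect ratios and region sizes.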
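
The high-frequency detail map and the adaptive timestep sampling listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (high_frequency_map, sample_timestep), the gradient-magnitude form of the Sobel filter, the 1000-step schedule, and the bias weight are hypothetical choices made for this example. It also assumes that "early" denoising steps correspond to large timestep indices (high noise), as in standard diffusion schedules.

```python
import math
import random

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_response(gray, kernel, y, x):
    """Apply a 3x3 kernel at (y, x), clamping coordinates at the border."""
    h, w = len(gray), len(gray[0])
    total = 0.0
    for dy in range(3):
        for dx in range(3):
            yy = min(max(y + dy - 1, 0), h - 1)
            xx = min(max(x + dx - 1, 0), w - 1)
            total += kernel[dy][dx] * gray[yy][xx]
    return total


def high_frequency_map(gray, mask):
    """Sobel gradient magnitude, kept only where the object mask is nonzero.

    gray: 2-D list of intensities; mask: 2-D list of 0/1 values.
    Assumption: magnitude-then-mask is one plausible reading of
    "Sobel-based high-pass filter"; the paper's exact formula may differ.
    """
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                gx = sobel_response(gray, SOBEL_X, y, x)
                gy = sobel_response(gray, SOBEL_Y, y, x)
                out[y][x] = math.hypot(gx, gy)
    return out


def sample_timestep(source, rng, num_steps=1000, bias=0.5):
    """Draw a timestep, favouring one half of the schedule per data source.

    Hypothetical scheme: video samples up-weight large t (early denoising,
    pose/structure); image samples up-weight small t (late denoising, texture).
    """
    half = num_steps // 2
    if source == "video":
        weights = [1.0] * half + [1.0 + bias] * (num_steps - half)
    elif source == "image":
        weights = [1.0 + bias] * half + [1.0] * (num_steps - half)
    else:
        raise ValueError(f"unknown data source: {source!r}")
    return rng.choices(range(num_steps), weights=weights, k=1)[0]
```

Calling high_frequency_map on a grayscale image given as a list of rows returns a map that is nonzero only at intensity edges inside the mask. With the same random generator, sample_timestep("video", rng) lands in the upper half of the schedule more often than sample_timestep("image", rng) does.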

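EXPERIMENTAL RESULTS

Research questions

RQ1: Can zero-shot diffusion-based generation preserve object identity while allowing flexible placement in the scene?
RQ2: Does enriching the identity representation with detail features improve identity consistency and texture fidelity in local editing?
RQ3: Does combining video-derived appearance variation with image diversity improve generalization to unseen object-scene pairs?
RQ4: How can an adaptive training strategy over multimodal data improve the realism and consistency of object relocation?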

Main findings

Model	Quality	Fidelity	Diversity
Paint-by-Example [56]	2.71	2.10	3.04
Graphit [21]	2.65	2.11	2.84
AnyDoor (ours)	3.04	3.06	2.88

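In the user study, AnyDoor outperforms the reference-based baselines on fidelity (3.06 vs. 2.10 for Paint-by-Example and 2.11 for Graphit) and on quality (3.04 vs. 2.71 and 2.65), indicating stronger identity preservation.
Its diversity score (2.88) is competitive: above Graphit (2.84) but below Paint-by-Example (3.04).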
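Ablation studies show that DINO-V2, the high-frequency detail map, and adaptive timestep sampling each improve CLIP and DINO similarity to the target object.
AnyDoor supports multi-subject composition and practical applications such as virtual try-on and object moving/swapping, without per-object fine-tuning.
Qualitative and quantitative evaluation on a DreamBooth-based benchmark shows improved object identity preservation and scene harmony.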
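
The CLIP and DINO similarity scores mentioned in the ablation are typically computed as the cosine similarity between an encoder's embedding of the generated image and its embedding of the reference object. The sketch below shows only that final step and assumes the embeddings have already been extracted as plain lists of floats; the encoder calls are omitted, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be nonzero")
    return dot / (norm_a * norm_b)


def mean_similarity(generated, references):
    """Average similarity of one generated-image embedding to several references."""
    scores = [cosine_similarity(generated, ref) for ref in references]
    return sum(scores) / len(scores)
```

A higher score means the generated object sits closer to the reference in the encoder's feature space, which is how such metrics serve as a proxy for identity preservation.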

This review was generated by AI and checked by a human editor.