QUICK REVIEW

[Paper Review] PhotoVerse: Tuning-Free Image Customization with Text-to-Image Diffusion Models

Li Chen, Mengyi Zhao|arXiv (Cornell University)|Sep 11, 2023

Generative Adversarial Networks and Image Synthesis|Cited by 14

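ONE-SENTENCE SUMMARY

PhotoVerse performs tuning-free personalized text-to-image generation from a single reference image, using dual-branch text and visual conditioning together with a facial identity loss to preserve identity while still supporting editing. It generates images quickly (about 5 seconds), needs no test-time tuning, and supports diverse scenes and styles.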

ABSTRACT

Personalized text-to-image generation has emerged as a powerful and sought-after tool, empowering users to create customized images based on their specific concepts and prompts. However, existing approaches to personalization encounter multiple challenges, including long tuning times, large storage requirements, the necessity for multiple input images per identity, and limitations in preserving identity and editability. To address these obstacles, we present PhotoVerse, an innovative methodology that incorporates a dual-branch conditioning mechanism in both text and image domains, providing effective control over the image generation process. Furthermore, we introduce facial identity loss as a novel component to enhance the preservation of identity during training. Remarkably, our proposed PhotoVerse eliminates the need for test time tuning and relies solely on a single facial photo of the target identity, significantly reducing the resource cost associated with image generation. After a single training phase, our approach enables generating high-quality images within only a few seconds. Moreover, our method can produce diverse images that encompass various scenes and styles. The extensive evaluation demonstrates the superior performance of our approach, which achieves the dual objectives of preserving identity and facilitating editability. Project page: https://photoverse2d.github.io/

MOTIVATION AND OBJECTIVES

Enable faster, tuning-free personalization of text-to-image (T2I) models from a single reference image.
Preserve identity while enabling flexible editing and style variation.
Reduce resource cost and test-time tuning requirements compared to prior methods.

PROPOSED METHOD

Employ a dual-branch conditioning mechanism in text and image domains to inject concepts into a Stable Diffusion-based model.
Use a lightweight adapter-based architecture (text adapters via LoRA and visual adapters) to map reference image features to pseudo-words and image tokens.
Fine-tune only cross-attention weights with PEFT (LoRA) while keeping the rest of the model frozen.
Introduce a facial identity loss to enforce identity preservation via a cosine similarity objective on face features (ArcFace).
Fuse textual and visual conditioning in cross-attention with a random fusion strategy controlled by hyperparameters (gamma, sigma).
Train with regularizations on both textual and visual embeddings to promote sparsity and generalization.
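
The fusion and identity-loss items above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation, and it rests on several assumptions: the fused output is taken to be a weighted sum of a text-branch and a visual-branch attention output, scaled by gamma and sigma; the randomized part of the fusion strategy is omitted because this summary does not specify it; and the identity loss is written as one minus the cosine similarity of two face feature vectors (the summary only says "a cosine similarity objective"). The function names `fused_attention` and `face_identity_loss` are hypothetical, and the feature vectors stand in for ArcFace embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return outputs


def fused_attention(queries, text_kv, visual_kv, gamma, sigma):
    """Assumed fusion: gamma * text-branch output + sigma * visual-branch output."""
    text_out = attention(queries, *text_kv)
    visual_out = attention(queries, *visual_kv)
    return [
        [gamma * t + sigma * v for t, v in zip(t_row, v_row)]
        for t_row, v_row in zip(text_out, visual_out)
    ]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def face_identity_loss(generated_feat, reference_feat):
    """Assumed form: 1 - cosine similarity between two face feature vectors."""
    return 1.0 - cosine_similarity(generated_feat, reference_feat)


if __name__ == "__main__":
    # Toy inputs: one query, one key/value pair per branch.
    fused = fused_attention(
        [[1.0, 0.0]],
        ([[1.0, 0.0]], [[2.0, 4.0]]),    # text branch (keys, values)
        ([[0.0, 1.0]], [[10.0, 20.0]]),  # visual branch (keys, values)
        gamma=1.0,
        sigma=0.5,
    )
    print(fused)
    print(face_identity_loss([0.6, 0.8], [0.8, 0.6]))
```

Under this assumed form, setting sigma to 0 reduces the output to ordinary text-only cross-attention, which is one way to read the "no visual conditioning branch" ablation.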

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How can personalized text-to-image generation be made instantaneous and tuning-free using a single reference image?
RQ2: Does a dual-branch conditioning (text and image) improve identity preservation and editability over single-branch or optimization-based methods?
RQ3: What is the impact of facial identity loss and regularizations on identity preservation and generalization?
RQ4: How does the proposed method compare to DreamBooth, Textual Inversion, E4T, and ProFusion in terms of speed, data requirements, and output quality?

KEY FINDINGS

Method	Black	Brown	White	Yellow	All
w/o visual conditioning branch	0.561	0.563	0.584	0.556	0.556
w/o L^S_reg	0.566	0.573	0.589	0.550	0.569
w/o L^face	0.632	0.658	0.663	0.622	0.643
w/o L^T_reg	0.650	0.668	0.678	0.657	0.663
PhotoVerse	0.685	0.702	0.715	0.682	0.696

PhotoVerse achieves test-time tuning-free personalization, generating high-quality images within seconds using a single reference photo.
The dual-branch conditioning improves identity preservation and editability by leveraging both textual embeddings and visual features.
A facial identity loss contributes to identity preservation: removing it lowers overall identity similarity by about 0.05 (0.643 without it versus 0.696 with it).
Ablation shows the visual conditioning branch has the largest effect on identity similarity: the overall score falls from 0.696 to 0.556 when the branch is removed.
Qualitative results show sharper, more detailed outputs with better hair and facial feature preservation compared to several baselines (DreamBooth, Textual Inversion, E4T, ProFusion).
Quantitatively, the method achieves an overall identity similarity of 0.696 across the evaluated races, with the Brown (0.702) and White (0.715) groups exceeding 0.70.
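
The ablation figures quoted above can be checked with a short arithmetic sketch. The numbers are copied from the table in this section; the helper names `overall_drop` and `mean_gap` are hypothetical. The second helper compares each reported overall score with the unweighted mean of the four per-race scores. This is only a consistency check under the assumption of equal weighting, since the summary does not state how the overall column was aggregated. For the first row the unweighted mean is about 0.566 against a reported 0.556; the other rows agree to within 0.001.

```python
# Each entry: (per-race scores [Black, Brown, White, Yellow], reported overall score).
ROWS = {
    "w/o visual conditioning branch": ([0.561, 0.563, 0.584, 0.556], 0.556),
    "w/o L^S_reg": ([0.566, 0.573, 0.589, 0.550], 0.569),
    "w/o L^face": ([0.632, 0.658, 0.663, 0.622], 0.643),
    "w/o L^T_reg": ([0.650, 0.668, 0.678, 0.657], 0.663),
    "PhotoVerse": ([0.685, 0.702, 0.715, 0.682], 0.696),
}

FULL_OVERALL = ROWS["PhotoVerse"][1]


def overall_drop(name):
    """Drop in the overall score relative to the full PhotoVerse model."""
    return round(FULL_OVERALL - ROWS[name][1], 3)


def mean_gap(name):
    """Absolute gap between the unweighted per-race mean and the reported overall score."""
    per_race, reported = ROWS[name]
    return abs(sum(per_race) / len(per_race) - reported)


if __name__ == "__main__":
    for name in ROWS:
        print(f"{name}: drop={overall_drop(name):.3f}, mean_gap={mean_gap(name):.4f}")
```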


This review was generated by AI and checked by human editors.