QUICK REVIEW

[Paper Review] Subject-driven Text-to-Image Generation via Apprenticeship Learning

Wenhu Chen, Hexiang Hu | arXiv (Cornell University) | Apr 1, 2023

Multimodal Machine Learning Applications | Cited by 46

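One-Sentence Summary

SuTI trains a single apprentice diffusion model to imitate thousands of subject-specific expert models, enabling in-context subject-driven image generation without test-time fine-tuning. It performs well on both fidelity and speed, and surpasses DreamBooth on several metrics.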

ABSTRACT

Recent text-to-image generation models like DreamBooth have made remarkable progress in generating highly customized images of a target subject, by fine-tuning an "expert model" for a given subject from a few examples. However, this process is expensive, since a new expert model must be learned for each subject. In this paper, we present SuTI, a Subject-driven Text-to-Image generator that replaces subject-specific fine tuning with in-context learning. Given a few demonstrations of a new subject, SuTI can instantly generate novel renditions of the subject in different scenes, without any subject-specific optimization. SuTI is powered by apprenticeship learning, where a single apprentice model is learned from data generated by a massive number of subject-specific expert models. Specifically, we mine millions of image clusters from the Internet, each centered around a specific visual subject. We adopt these clusters to train a massive number of expert models, each specializing in a different subject. The apprentice model SuTI then learns to imitate the behavior of these fine-tuned experts. SuTI can generate high-quality and customized subject-specific images 20x faster than optimization-based SoTA methods. On the challenging DreamBench and DreamBench-v2, our human evaluation shows that SuTI significantly outperforms existing models like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt, Re-Imagen and DreamBooth, especially on the subject and text alignment aspects.

Motivation and Objectives

Make subject-driven image generation efficient and scalable by removing per-subject fine-tuning.
Leverage apprenticeship learning to imitate a large set of expert models with a single apprentice model.
Enable in-context generation for unseen subjects and compositions using few demonstrations.
Evaluate on DreamBench and DreamBench-v2 using both automatic and human metrics.

Proposed Method

Train many subject-specific expert diffusion models from mined image-text clusters.
Synthesize pseudo-targets from expert outputs to train a single apprentice diffusion model.
Use delta CLIP filtering to ensure high-quality expert outputs for apprentice training.
During inference, generate new images from 3-5 in-context demonstrations without optimization.
Scale training with distributed TPU-based parallel fine-tuning of experts and apprentice.
Compare with baselines using the automatic metrics DINO, CLIP-I, and CLIP-T, plus human evaluations.
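The delta CLIP filtering step in the list above can be illustrated with a minimal sketch. It assumes that CLIP embeddings have already been computed elsewhere, and it assumes one plausible definition of the score: the text alignment of the expert's output minus the best text alignment among the demonstration images. The exact formula and threshold used by the authors are not given in this summary, so the `Candidate` class, its field names, and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


@dataclass
class Candidate:
    """One expert-generated sample (hypothetical structure for illustration)."""
    demo_embs: List[np.ndarray]  # image embeddings of the demonstration images
    text_emb: np.ndarray         # text embedding of the target prompt
    output_emb: np.ndarray       # image embedding of the expert's output


def delta_clip(c: Candidate) -> float:
    # ASSUMED definition: how much better the expert output matches the
    # target text than the best-matching demonstration image does.
    output_score = cosine(c.output_emb, c.text_emb)
    demo_score = max(cosine(d, c.text_emb) for d in c.demo_embs)
    return output_score - demo_score


def filter_by_delta_clip(cands: List[Candidate], threshold: float) -> List[Candidate]:
    """Keep only candidates whose delta score reaches the threshold."""
    return [c for c in cands if delta_clip(c) >= threshold]
```

A stricter `threshold` keeps fewer but better-aligned expert outputs as pseudo-targets for the apprentice. This is the trade-off between data quality and training-set size that the filtering step controls.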

Experimental Results

Research Questions

RQ1: Can a single apprentice diffusion model generalize to unseen subjects and compositions without test-time fine-tuning?
RQ2: How does in-context demonstration count influence subject fidelity and text alignment in SuTI?
RQ3: What is the impact of data quality filtering (delta CLIP) on final generation performance?
RQ4: How does SuTI compare to DreamBooth and other subject-driven methods on DreamBench and DreamBench-v2?

Key Findings

SuTI achieves instant, in-context generation for unseen subjects with 3-5 demonstrations and no per-subject optimization.
On DreamBench, SuTI attains a DINO score of 0.741, CLIP-I of 0.819, and CLIP-T of 0.304, outperforming DreamBooth on DINO and matching CLIP-T.
Human evaluation on DreamBench-v2 shows SuTI surpasses DreamBooth by 5% overall and outperforms other baselines by at least 30%.
The delta CLIP filtering threshold strongly affects performance: stricter thresholds improve human evaluation scores even though they shrink the training set.
Dream-SuTI (SuTI further fine-tuned on subject images) improves quality again, achieving higher overall scores than both SuTI and DreamBooth.
Inference takes about 20 seconds per subject with 3-5 demonstrations, and SuTI has a smaller memory footprint than many fine-tuning approaches.
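The automatic metrics reported above (DINO, CLIP-I, CLIP-T) are all averages of cosine similarities between embeddings. DINO and CLIP-I compare generated images against real subject images, and CLIP-T compares generated images against the prompt text. The sketch below shows only that averaging step. It assumes the embeddings come from a DINO or CLIP encoder that is not shown, and the function name `mean_pairwise_cosine` is hypothetical.

```python
import numpy as np


def mean_pairwise_cosine(gen: np.ndarray, ref: np.ndarray) -> float:
    """Average cosine similarity over all (generated, reference) pairs.

    gen: array of shape (n, d), embeddings of generated images.
    ref: array of shape (m, d), embeddings of real subject images
         (for DINO / CLIP-I) or of prompt texts (for CLIP-T).
    """
    # Normalize each row to unit length so a dot product is a cosine.
    gen_unit = gen / np.linalg.norm(gen, axis=1, keepdims=True)
    ref_unit = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    # (n, m) matrix of cosine similarities, then a single average.
    return float((gen_unit @ ref_unit.T).mean())
```

Under this definition, a higher DINO or CLIP-I value indicates closer visual similarity to the subject, and a higher CLIP-T value indicates closer agreement with the prompt.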


This review was generated by AI and reviewed by human editors.