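[Paper Explained] Subject-driven Text-to-Image Generation via Apprenticeship Learning
SuTI trains a single apprentice diffusion model to imitate thousands of subject-specific expert models, enabling in-context, subject-driven image generation without test-time fine-tuning. It performs well in both fidelity and speed, and surpasses DreamBooth on several metrics.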
Recent text-to-image generation models like DreamBooth have made remarkable progress in generating highly customized images of a target subject, by fine-tuning an "expert model" for a given subject from a few examples. However, this process is expensive, since a new expert model must be learned for each subject. In this paper, we present SuTI, a Subject-driven Text-to-Image generator that replaces subject-specific fine-tuning with in-context learning. Given a few demonstrations of a new subject, SuTI can instantly generate novel renditions of the subject in different scenes, without any subject-specific optimization. SuTI is powered by apprenticeship learning, where a single apprentice model is learned from data generated by a massive number of subject-specific expert models. Specifically, we mine millions of image clusters from the Internet, each centered around a specific visual subject. We adopt these clusters to train a massive number of expert models, each specializing in a different subject. The apprentice model SuTI then learns to imitate the behavior of these fine-tuned experts. SuTI can generate high-quality and customized subject-specific images 20x faster than optimization-based SoTA methods. On the challenging DreamBench and DreamBench-v2, our human evaluation shows that SuTI significantly outperforms existing models like InstructPix2Pix, Textual Inversion, Imagic, Prompt2Prompt, Re-Imagen and DreamBooth, especially on the subject and text alignment aspects.
Motivation and Objectives
- Enable efficient, scalable subject-driven image generation without per-subject fine-tuning.
- Leverage apprenticeship learning to imitate a large set of expert models with a single apprentice model.
- Enable in-context generation for unseen subjects and compositions using few demonstrations.
- Evaluate against DreamBench and DreamBench-v2 with both automatic and human metrics.
Proposed Method
- Train many subject-specific expert diffusion models from mined image-text clusters.
- Synthesize pseudo-targets from expert outputs to train a single apprentice diffusion model.
- Use delta CLIP filtering to ensure high-quality expert outputs for apprentice training.
- During inference, generate new images from 3-5 in-context demonstrations without optimization.
- Scale training with distributed TPU-based parallel fine-tuning of experts and apprentice.
- Compare with baselines using automatic metrics (DINO, CLIP-I, CLIP-T) and human evaluations.
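The delta CLIP filtering step in the list above can be illustrated with a small sketch. The summary does not spell out the formula, so the definition used here is an assumption: the score is taken as the text alignment of the expert's output minus the best text alignment among the demonstration images, so that a sample is kept only when the expert actually followed the new prompt. The names `delta_clip` and `filter_samples`, the dictionary keys, and the threshold value are all hypothetical, and plain Python lists stand in for real CLIP embeddings.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def delta_clip(generated: Vector, demos: List[Vector], text: Vector) -> float:
    """ASSUMED definition: gain in text alignment of the expert output
    over the best-aligned demonstration image."""
    best_demo = max(cosine(d, text) for d in demos)
    return cosine(generated, text) - best_demo


def filter_samples(samples: List[Dict], threshold: float) -> List[Dict]:
    """Keep only samples whose delta CLIP score reaches the threshold."""
    return [
        s for s in samples
        if delta_clip(s["generated"], s["demos"], s["text"]) >= threshold
    ]


# Toy 2-D "embeddings" (hypothetical values, not real CLIP features).
text_emb = [1.0, 0.0]
demo_embs = [[0.0, 1.0], [0.6, 0.8]]
samples = [
    {"name": "follows_prompt", "generated": [1.0, 0.0], "demos": demo_embs, "text": text_emb},
    {"name": "copies_demo", "generated": [0.0, 1.0], "demos": demo_embs, "text": text_emb},
]

kept = filter_samples(samples, threshold=0.1)
print([s["name"] for s in kept])
```

Raising `threshold` in this sketch shrinks the retained set while keeping only outputs that align more strongly with the target text, which is the trade-off between data quality and data quantity that the filtering step controls.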
Experimental Results
Research Questions
- RQ1: Can a single apprentice diffusion model generalize to unseen subjects and compositions without test-time fine-tuning?
- RQ2: How does the number of in-context demonstrations influence subject fidelity and text alignment in SuTI?
- RQ3: What is the impact of data quality filtering (delta CLIP) on final generation performance?
- RQ4: How does SuTI compare to DreamBooth and other subject-driven methods on DreamBench and DreamBench-v2?
Key Findings
- SuTI achieves instant, in-context generation for unseen subjects with 3-5 demonstrations and no per-subject optimization.
- On DreamBench, SuTI attains a DINO score of 0.741, a CLIP-I score of 0.819, and a CLIP-T score of 0.304, outperforming DreamBooth on DINO and matching it on CLIP-T.
- Human evaluation on DreamBench-v2 shows SuTI surpasses DreamBooth by 5% overall and outperforms other baselines by at least 30%.
- The quality of delta CLIP filtering critically influences performance; higher thresholds improve human evaluation scores despite yielding smaller training sets.
- Dream-SuTI (SuTI fine-tuned on the subject images) further improves quality, achieving higher overall scores than both SuTI and DreamBooth.
- SuTI takes about 20 seconds per subject at inference time with 3-5 demonstrations, and has a smaller memory footprint than many fine-tuning approaches.
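For readers unfamiliar with the automatic metrics quoted above, the following is a minimal sketch of how such scores are commonly computed, assuming the convention from the DreamBooth evaluation: DINO and CLIP-I are the mean pairwise cosine similarity between embeddings of generated and real subject images (using a DINO or CLIP image encoder respectively), and CLIP-T is the mean cosine similarity between the prompt embedding and the generated image embeddings. The function names and the toy vectors are hypothetical; real scores would use actual encoder outputs.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def image_alignment(generated: List[Vector], real: List[Vector]) -> float:
    """Mean pairwise cosine similarity between generated and real image
    embeddings. With DINO features this plays the role of the DINO score,
    with CLIP image features the role of CLIP-I."""
    scores = [cosine(g, r) for g in generated for r in real]
    return sum(scores) / len(scores)


def text_alignment(generated: List[Vector], prompt: Vector) -> float:
    """Mean cosine similarity between the prompt embedding and each
    generated image embedding (the role of CLIP-T)."""
    scores = [cosine(g, prompt) for g in generated]
    return sum(scores) / len(scores)


# Toy embeddings (hypothetical values for illustration only).
generated_embs = [[1.0, 0.0], [0.0, 1.0]]
real_embs = [[1.0, 0.0]]
prompt_emb = [1.0, 0.0]

subject_score = image_alignment(generated_embs, real_embs)
text_score = text_alignment(generated_embs, prompt_emb)
print(subject_score, text_score)
```

The image-based scores reward fidelity to the subject, while the text-based score rewards following the prompt, which is why the findings report them separately.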
This explanation was generated by AI and reviewed by a human editor.