QUICK REVIEW

[Paper Review] SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Yanyu Li, Huan Wang|arXiv (Cornell University)|Jun 1, 2023

Generative Adversarial Networks and Image Synthesis|Cited by 35

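One-Sentence Summary

SnapFusion is a text-to-image diffusion model that runs locally on mobile devices. By optimizing the UNet architecture, the image decoder, and step distillation, it generates an image in under 2 seconds and reaches quality comparable to SD-v1.5 with far fewer denoising steps.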

ABSTRACT

Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds. We achieve so by introducing efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with 8 denoising steps achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models to the hands of users.

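Motivation and Objectives

Identify the bottlenecks of on-device diffusion models and quantify the sources of latency on mobile hardware.
Develop a UNet that supports architecture evolution, reducing computation without sacrificing image quality.
Compress and distill the image decoder to lower memory and compute requirements.
Preserve quality at fewer steps through step distillation regularized by classifier-free guidance.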

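Proposed Method

Profile Stable Diffusion v1.5 to locate latency bottlenecks across the Text Encoder, UNet, and VAE Decoder.
Propose an architecture-evolving UNet with robust training, so that the network tolerates block-level permutations and redundant blocks can be removed.
Compress and distill the image decoder using a data distillation pipeline driven by synthetic prompts.
Apply step distillation to cut inference steps from 50 to 8 while preserving quality.
Introduce CFG-aware step distillation, with a CFG-guided loss and a loss-mixing scheme, to balance FID against CLIP score.
Combine the CFG-aware distillation loss with the original loss, using a dynamic gamma to coordinate the distillation objectives.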
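
The CFG-aware distillation idea in the last two items can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the guidance combination is the standard classifier-free guidance formula, while the probabilistic mixing rule, the additive use of gamma, and every name here (`cfg_combine`, `distill_loss`, `total_loss`, `p_cfg`, `w`) are assumptions made for the example. Predictions are plain Python lists standing in for UNet outputs.

```python
import random


def cfg_combine(cond, uncond, w):
    """Standard classifier-free guidance: uncond + w * (cond - uncond)."""
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


def mse(a, b):
    """Mean squared error between two equal-length prediction vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distill_loss(s_cond, s_uncond, t_cond, t_uncond, w, p_cfg, rng):
    """Pick one of two distillation losses for a training sample.

    Assumed mixing rule: with probability p_cfg, compare student and teacher
    after guidance (CFG-aware loss); otherwise compare the conditional
    predictions directly (plain step-distillation loss).
    """
    if rng.random() < p_cfg:
        return mse(cfg_combine(s_cond, s_uncond, w),
                   cfg_combine(t_cond, t_uncond, w))
    return mse(s_cond, t_cond)


def total_loss(l_distill, l_original, gamma):
    """Assumed combination: distillation loss plus gamma-weighted original loss."""
    return l_distill + gamma * l_original


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy predictions: student and teacher agree on the conditional output
    # but disagree on the unconditional one.
    s_cond, s_uncond = [1.0, 1.0], [0.0, 0.0]
    t_cond, t_uncond = [1.0, 1.0], [1.0, 1.0]
    print(distill_loss(s_cond, s_uncond, t_cond, t_uncond, 3.0, 1.0, rng))
    print(distill_loss(s_cond, s_uncond, t_cond, t_uncond, 3.0, 0.0, rng))
```

In the toy inputs, the plain loss sees no difference between student and teacher, while the guided comparison does. That gap is the motivation for applying guidance before computing the loss; raising `p_cfg` shifts training toward the guided objective, which per the summary trades FID against CLIP score.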

Figure 1: Example generated images by using our efficient text-to-image diffusion model.

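Experimental Results

Research Questions

RQ1: How can redundancy in the UNet architecture be exploited to speed up on-device diffusion without degrading quality?
RQ2: Which training strategies enable robust evolution of the UNet architecture for mobile diffusion?
RQ3: Can a compressed image decoder maintain perceptual quality while reducing parameter count and MACs?
RQ4: How does reducing the number of denoising steps through step distillation affect FID and CLIP scores on mobile devices?
RQ5: In the low-step regime, does CFG-aware step distillation improve CLIP score while preserving FID?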

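Key Findings

An 8-step on-device UNet paired with the distilled image decoder generates images in under 2 seconds on mobile hardware, with quality competitive with SD-v1.5.
Architecture evolution with robust training preserves pretrained performance while allowing block-level pruning or removal for speed.
The efficient image decoder, obtained through 50% channel pruning, has 3.8× fewer parameters and runs 3.2× faster than the SD-v1.5 decoder.
CFG-aware step distillation raises CLIP score for low-step models while keeping FID reasonable; at similar FID, its CLIP score exceeds that of vanilla distillation.
Under the same inference budget, distilling directly from 16 steps to 8 steps outperforms progressive distillation on both FID and CLIP score.
On MS-COCO 2017 5K, the 8-step model achieves 24.2 FID and 0.30 CLIP score, outperforming several baselines.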
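
The sub-2-second claim can be read as a simple latency budget: one text-encoder pass, one UNet pass per denoising step, and one decoder pass. The sketch below only illustrates that arithmetic. The function names and the per-component timings in the usage example are hypothetical placeholders, not measurements from the paper, and the model assumes one UNet pass per step unless `passes_per_step` says otherwise.

```python
def pipeline_latency_ms(text_encoder_ms, unet_step_ms, decoder_ms,
                        steps, passes_per_step=1):
    """End-to-end latency: encoder + (steps * passes * UNet) + decoder."""
    return text_encoder_ms + steps * passes_per_step * unet_step_ms + decoder_ms


def max_unet_step_ms(budget_ms, text_encoder_ms, decoder_ms,
                     steps, passes_per_step=1):
    """Largest per-pass UNet latency that still fits the total budget."""
    return (budget_ms - text_encoder_ms - decoder_ms) / (steps * passes_per_step)


if __name__ == "__main__":
    # Hypothetical timings, chosen only to make the arithmetic easy to follow.
    text_encoder_ms, decoder_ms = 10.0, 190.0
    print(max_unet_step_ms(2000.0, text_encoder_ms, decoder_ms, steps=8))
    print(max_unet_step_ms(2000.0, text_encoder_ms, decoder_ms, steps=50))
    # Going from 50 to 8 steps cuts the number of UNet evaluations by 50 / 8.
    print(50 / 8)
```

With these placeholder numbers, an 8-step schedule leaves 225 ms per UNet pass inside a 2000 ms budget, while a 50-step schedule leaves only 36 ms. This is why the summary treats step reduction and a cheaper UNet as complementary: neither alone closes the gap.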

Figure 2: Latency (iPhone 14 Pro, ms) and parameter (M) analysis for cross-attention (CA) and ResNet blocks in the UNet of Stable Diffusion.

This review was generated by AI and checked by human editors.