QUICK REVIEW

[Paper Review] SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

Yanyu Li, Huan Wang | arXiv (Cornell University) | Jun 1, 2023

Generative Adversarial Networks and Image Synthesis | Cited by 35

ONE-LINE SUMMARY

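SnapFusion delivers a text-to-image diffusion model that runs on mobile devices in under 2 seconds by optimizing the UNet architecture, the image decoder, and step distillation, achieving quality competitive with SD-v1.5 while using far fewer denoising steps.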

ABSTRACT

Text-to-image diffusion models can create stunning images from natural language descriptions that rival the work of professional artists and photographers. However, these models are large, with complex network architectures and tens of denoising iterations, making them computationally expensive and slow to run. As a result, high-end GPUs and cloud-based inference are required to run diffusion models at scale. This is costly and has privacy implications, especially when user data is sent to a third party. To overcome these challenges, we present a generic approach that, for the first time, unlocks running text-to-image diffusion models on mobile devices in less than 2 seconds. We achieve so by introducing efficient network architecture and improving step distillation. Specifically, we propose an efficient UNet by identifying the redundancy of the original model and reducing the computation of the image decoder via data distillation. Further, we enhance the step distillation by exploring training strategies and introducing regularization from classifier-free guidance. Our extensive experiments on MS-COCO show that our model with 8 denoising steps achieves better FID and CLIP scores than Stable Diffusion v1.5 with 50 steps. Our work democratizes content creation by bringing powerful text-to-image diffusion models to the hands of users.

RESEARCH MOTIVATION AND OBJECTIVES

Identify bottlenecks in on-device diffusion models and quantify latency sources on mobile hardware.
Develop an architecture-evolving UNet to reduce computation without sacrificing image quality.
Compress and distill the image decoder to cut memory and compute demands.
Advance step distillation with classifier-free guidance regularization to preserve quality with fewer steps.

PROPOSED METHOD

Analyze Stable Diffusion v1.5 to locate latency bottlenecks in Text Encoder, UNet, and VAE Decoder.
Propose architecture-evolving UNet with robust training to tolerate block-level permutations and remove redundancy.
Compress the image decoder and distill it through a data distillation pipeline driven by synthetic prompts.
Apply step distillation to reduce inference steps from 50 to 8 while maintaining quality.
Introduce CFG-aware step distillation with a CFG-guided loss and a loss-mixing scheme to balance FID and CLIP.
Use CFG-aware distillation and original loss jointly, with dynamic gamma to harmonize distillation objectives.
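
The CFG-aware objective in the last two steps can be illustrated with a minimal Python sketch. It assumes the common classifier-free guidance form (the unconditional prediction plus a scaled step toward the conditional one), treats the "original loss" as an unguided match between student and teacher predictions, and assumes the two terms are combined as a weighted sum with weight gamma. The function names, the toy vectors, and the fixed gamma are hypothetical; the paper's actual training setup may differ.

```python
def cfg_combine(eps_cond, eps_uncond, w):
    # Classifier-free guidance: start from the unconditional prediction and
    # move w times the gap toward the conditional one (w = 1 returns eps_cond).
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def mse(a, b):
    # Mean squared error between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def joint_distill_loss(student, teacher, w, gamma):
    # student / teacher: dicts holding "cond" and "uncond" noise predictions.
    # Assumption: total loss = unguided term + gamma * guided (CFG-aware) term.
    unguided = mse(student["cond"], teacher["cond"])
    guided = mse(
        cfg_combine(student["cond"], student["uncond"], w),
        cfg_combine(teacher["cond"], teacher["uncond"], w),
    )
    return unguided + gamma * guided


# Toy predictions, purely illustrative.
teacher = {"cond": [1.0, 2.0], "uncond": [0.0, 0.0]}
student = {"cond": [1.0, 1.0], "uncond": [0.0, 0.0]}
loss = joint_distill_loss(student, teacher, w=2.0, gamma=0.5)
```

In this toy example the guided term is larger than the unguided one because the guidance scale amplifies the student-teacher gap, so gamma controls how strongly the student is pushed to match the teacher's guided output.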

Figure 1: Example generated images by using our efficient text-to-image diffusion model.

EXPERIMENTAL RESULTS

Research Questions

RQ1: How can UNet architecture redundancies be exploited to speed up on-device diffusion without degrading quality?
RQ2: What training strategies enable robust architecture evolution of UNet for mobile diffusion?
RQ3: Can a compressed image decoder maintain perceptual quality while reducing parameters and MACs?
RQ4: What is the impact of reducing denoising steps via step distillation on FID and CLIP scores on mobile devices?
RQ5: Does CFG-aware step distillation improve CLIP scores while preserving FID at low-step regimes?

Key Findings

An 8-step on-device UNet paired with a distilled image decoder generates images in under 2 seconds on mobile hardware, with quality competitive with SD-v1.5.
Architecture evolution with robust training preserves pre-trained performance while allowing block-level pruning/removal for speed.
The efficient image decoder, obtained via 50% channel pruning, has 3.8× fewer parameters and runs 3.2× faster than the SD-v1.5 decoder.
CFG-aware step distillation improves CLIP scores for low-step models while maintaining reasonable FID, outperforming vanilla distillation in CLIP at similar FID.
Direct 16→8-step distillation beats progressive distillation in both FID and CLIP under the same inference budget.
On MS-COCO 2017 5K, the 8-step model achieves 24.2 FID and 0.30 CLIP, outperforming several baselines.
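
The 3.8× parameter reduction reported for the decoder is consistent with simple arithmetic on convolution layers: halving both the input and output channels of a convolution cuts its weight count by about four. The sketch below checks this for a single hypothetical 3×3 layer; the channel widths are illustrative assumptions, not the decoder's actual configuration.

```python
def conv2d_params(c_in, c_out, kernel=3):
    # Weights (kernel * kernel * c_in * c_out) plus one bias per output channel.
    return kernel * kernel * c_in * c_out + c_out


full = conv2d_params(128, 128)   # hypothetical original layer
pruned = conv2d_params(64, 64)   # same layer after 50% channel pruning
ratio = full / pruned
print(f"{full} -> {pruned} parameters, {ratio:.2f}x fewer")
```

Layers whose input or output width is fixed (for example, a final layer that must emit 3 RGB channels) shrink only linearly under channel pruning, which may be part of why the reported overall figure sits slightly below 4×.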

Figure 2: Latency (iPhone 14 Pro, ms) and parameter (M) analysis for cross-attention (CA) and ResNet blocks in the UNet of Stable Diffusion.


This review was generated by AI and checked by a human editor.