QUICK REVIEW

[Paper Review] ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

Yingqing He, Shaoshu Yang|arXiv (Cornell University)|2023. 10. 11.

Generative Adversarial Networks and Image Synthesis | Citations: 14

One-Line Summary

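This paper presents a tuning-free method that dynamically adjusts the convolutional receptive field at inference time through re-dilation and dispersed convolution, adds noise-damped classifier-free guidance, and reaches higher fidelity at resolutions up to 4096×4096 without any retraining.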

ABSTRACT

In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.

Research Motivation and Objectives

Motivate image synthesis at resolutions above the training resolution without fine-tuning.
Identify the structural cause of object repetition when a diffusion model trained at low resolution is sampled directly at high resolution.
Propose a tuning-free re-dilation strategy to expand receptive fields during inference.
Introduce dispersed convolution and noise-damped classifier-free guidance to enable ultra-high-res generation.
Demonstrate effectiveness across multiple Stable Diffusion versions and a text-to-video model.

Proposed Method

Analyze U-Net components to identify receptive-field limitations as the main cause of repetition.
Introduce re-dilation to dynamically adjust convolutional perception fields during inference (including fractional and layer/timestep-aware schedules).
Propose dispersed convolution to enlarge kernels while preserving pre-trained behavior via structure- and pixel-level calibration.
Develop noise-damped classifier-free guidance to balance denoising ability with high-resolution content generation.
Compare against training-free baselines and a diffusion super-resolution model, showing quantitative gains in FID/KID and qualitative texture/detail improvements.
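
To make the re-dilation idea concrete, here is a minimal pure-Python sketch in one dimension. It is an illustration under stated assumptions, not the authors' implementation: `conv1d`, `receptive_field`, the kernel, and the toy signals are all hypothetical, and the method itself acts on the 2-D convolutions of the U-Net. The sketch only shows that keeping the weights frozen and widening the tap spacing at inference lets a kernel cover the same extent of content on a denser sampling grid.

```python
def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by one dilated kernel application."""
    return dilation * (kernel_size - 1) + 1


def conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D cross-correlation with taps spaced `dilation` samples apart."""
    span = receptive_field(len(kernel), dilation)
    return [
        sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


# Hypothetical stand-in for frozen pre-trained weights (a second-difference filter).
kernel = [1.0, -2.0, 1.0]

# The same curve y = x^2 sampled at the "training" density and twice as densely.
low_res = [float(x) ** 2 for x in range(6)]        # x = 0, 1, ..., 5
high_res = [(x / 2.0) ** 2 for x in range(11)]     # x = 0, 0.5, ..., 5

out_low = conv1d(low_res, kernel, dilation=1)
out_high_plain = conv1d(high_res, kernel, dilation=1)      # kernel sees half the extent
out_high_redilated = conv1d(high_res, kernel, dilation=2)  # same weights, wider spacing

print(out_low[0], out_high_plain[0], out_high_redilated[0])
```

In this toy setting the re-dilated response on the dense grid equals the low-resolution response, whereas the undilated kernel covers only half the extent and responds differently.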

Experimental Results

Research Questions

RQ1: Can a pre-trained diffusion model trained on low-resolution data generate plausible ultra-high-resolution images without additional training?
RQ2: Is the object repetition issue in high-resolution synthesis primarily due to limited convolutional receptive fields rather than attention token count?
RQ3: Can inference-time re-dilation and kernel dispersion enlarge the receptive field effectively without retraining?
RQ4: Does noise-damped classifier-free guidance improve quality and texture at ultra-high resolutions?
RQ5: How does the proposed method perform across different SD versions and in a text-to-video setting?

Key Results

Scale	Method	SD 1.5 FID_r	SD 1.5 KID_r	SD 1.5 FID_b	SD 1.5 KID_b	SD 2.1 FID_r	SD 2.1 KID_r	SD 2.1 FID_b	SD 2.1 KID_b	SD XL 1.0 FID_r	SD XL 1.0 KID_r	SD XL 1.0 FID_b	SD XL 1.0 KID_b
4×	Direct-Inf	38.50	0.014	29.30	0.008	29.89	0.010	24.21	0.007	67.71	0.029	45.55	0.014
4×	Attn-SF	38.59	0.013	29.30	0.008	28.95	0.010	22.75	0.007	68.93	0.028	46.07	0.013
4×	Ours	32.67	0.012	24.93	0.007	20.88	0.008	16.67	0.005	64.75	0.024	28.15	0.009
6.25×	Direct-Inf	55.47	0.020	48.54	0.015	52.58	0.018	48.13	0.014	93.91	0.041	54.90	0.020
6.25×	Attn-SF	55.96	0.020	49.03	0.015	50.62	0.017	45.57	0.014	93.92	0.042	54.89	0.019
6.25×	Ours	52.11	0.019	45.86	0.014	33.36	0.010	30.66	0.008	80.72	0.032	47.15	0.015
8×	Direct-Inf	74.52	0.032	68.98	0.027	69.89	0.029	55.48	0.020	122.41	0.062	82.51	0.037
8×	Attn-SF	74.42	0.032	68.81	0.027	68.97	0.029	53.97	0.020	122.21	0.062	82.35	0.037
8×	Ours	58.21	0.022	52.76	0.017	58.57	0.021	49.41	0.015	119.58	0.057	50.70	0.019
16×	Direct-Inf	111.34	0.046	106.70	0.042	104.70	0.043	104.10	0.040	153.33	0.070	144.99	0.061
16×	Attn-SF	110.10	0.046	105.42	0.042	104.34	0.043	103.61	0.041	153.68	0.070	144.84	0.061
16×	Ours	78.22	0.027	65.86	0.023	59.40	0.021	57.26	0.018	131.03	0.063	124.01	0.055
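
As a reading aid, the short Python sketch below turns a few of the table entries into relative FID reductions. The numbers are copied from the first three rows of the table above (the first FID column of each model); the helper name `relative_reduction` is hypothetical, and the percentages are simple arithmetic on the reported scores rather than figures stated in the paper.

```python
# First FID column of each model, taken from the first three table rows.
fid_first_group = {
    "SD 1.5": {"Direct-Inf": 38.50, "Ours": 32.67},
    "SD 2.1": {"Direct-Inf": 29.89, "Ours": 20.88},
    "SD XL 1.0": {"Direct-Inf": 67.71, "Ours": 64.75},
}


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline` (lower FID is better)."""
    return 100.0 * (baseline - ours) / baseline


for model, scores in fid_first_group.items():
    drop = relative_reduction(scores["Direct-Inf"], scores["Ours"])
    print(f"{model}: {drop:.1f}% lower FID than Direct-Inf")
```

On these rows the reduction is roughly 15% for SD 1.5, 30% for SD 2.1, and 4% for SD XL 1.0, so the size of the gain varies considerably by base model.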

Re-dilation, which enlarges the convolutional receptive field at inference time, effectively mitigates object repetition and improves structure at high resolutions.
Dispersed convolution with structure- and pixel-level calibration enlarges the effective receptive field without training, enabling higher resolutions.
Fractional/layer-timestep-aware re-dilation schedules yield better results than fixed dilation across all layers/steps.
Noise-damped classifier-free guidance preserves denoising while enabling high-frequency content, improving texture and detail.
Quantitatively, the method improves FID and KID over Direct-Inf and Attn-SF across SD 1.5, 2.1, and XL 1.0 at 4×, 6.25×, 8×, and 16× upscaling. Qualitatively, it improves texture and detail, and it also applies successfully to a text-to-video model.
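
The review does not spell out how noise-damped classifier-free guidance combines its terms, so the following is a sketch under an explicit assumption: the unconditional noise estimate comes from the original network (strong denoising), while the guidance direction comes from the network with re-dilated or dispersed convolutions (better structure). The function names and the toy vectors are hypothetical; real noise predictions are tensors produced by the U-Net.

```python
def standard_cfg(eps_cond, eps_uncond, scale):
    """Ordinary classifier-free guidance using a single network."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def noise_damped_cfg(eps_base_uncond, eps_mod_cond, eps_mod_uncond, scale):
    """Assumed form: base-model unconditional term plus a guidance
    direction taken from the modified (re-dilated) model."""
    return [
        b + scale * (c - u)
        for b, c, u in zip(eps_base_uncond, eps_mod_cond, eps_mod_uncond)
    ]


# Toy two-element "noise predictions" chosen only to make the arithmetic visible.
eps_base_uncond = [1.0, 2.0]   # original network, no text condition
eps_mod_cond = [0.5, 1.5]      # modified network, with text condition
eps_mod_uncond = [0.25, 1.0]   # modified network, no text condition

guided = noise_damped_cfg(eps_base_uncond, eps_mod_cond, eps_mod_uncond, scale=7.5)
print(guided)
```

When the base and modified networks give the same unconditional estimate, this assumed form reduces to ordinary classifier-free guidance, which is one way to sanity-check the sketch.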

This review was generated by AI and reviewed by a human editor.