QUICK REVIEW

[Paper Review] ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

Yingqing He, Shaoshu Yang | arXiv (Cornell University) | October 11, 2023
Generative Adversarial Networks and Image Synthesis | Citations: 14
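One-Line Summary

This paper presents a tuning-free method that dynamically adjusts the convolutional receptive field at inference time through re-dilation and dispersed convolution, adds noise-damped classifier-free guidance, and generates higher-fidelity images at resolutions up to 4096×4096 without any retraining.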

ABSTRACT

In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.

Motivation and Objectives

  • Motivate higher-resolution image synthesis beyond training resolution without fine-tuning.
  • Identify the structural cause of object repetition when sampling higher-resolution images from diffusion models trained at low resolution.
  • Propose a tuning-free re-dilation strategy to expand receptive fields during inference.
  • Introduce dispersed convolution and noise-damped classifier-free guidance to enable ultra-high-res generation.
  • Demonstrate effectiveness across multiple Stable Diffusion versions and a text-to-video model.

Proposed Method

  • Analyze U-Net components to identify receptive-field limitations as the main cause of repetition.
  • Introduce re-dilation to dynamically adjust convolutional perception fields during inference (including fractional and layer/timestep-aware schedules).
  • Propose dispersed convolution to enlarge kernels while preserving pre-trained behavior via structure- and pixel-level calibration.
  • Develop noise-damped classifier-free guidance to balance denoising ability with high-resolution content generation.
  • Compare against training-free baselines and a diffusion super-resolution model, showing quantitative gains in FID/KID and qualitative texture/detail improvements.
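
The re-dilation idea listed above can be illustrated with a minimal sketch. It is a 1D toy in plain Python, not the paper's implementation: the helper names `effective_kernel_size` and `dilated_conv1d` and the 3-tap weights are hypothetical stand-ins, and the actual method operates on the 2D convolutions of the U-Net with layer- and timestep-aware schedules. The point shown is only that the same pre-trained weights can be reused with a larger dilation, which widens the span each output sees from `k` to `d * (k - 1) + 1` positions.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Number of input positions spanned by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, weights, dilation=1):
    """'Valid' 1D cross-correlation that reuses `weights` at the given dilation."""
    span = effective_kernel_size(len(weights), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        # Taps are spaced `dilation` positions apart; the weights are unchanged.
        out.append(sum(w * signal[start + i * dilation]
                       for i, w in enumerate(weights)))
    return out


# Hypothetical stand-in for pre-trained 3-tap weights (not from the paper).
weights = [1.0, 0.0, -1.0]
signal = [float(x * x) for x in range(8)]  # 0, 1, 4, 9, 16, 25, 36, 49

base = dilated_conv1d(signal, weights, dilation=1)       # spans 3 positions
redilated = dilated_conv1d(signal, weights, dilation=2)  # spans 5 positions

print(effective_kernel_size(3, 1), effective_kernel_size(3, 2))
print(base)
print(redilated)
```

No weight is modified in this sketch; only the tap spacing changes, which is why the approach needs no training. Fractional dilation factors, mentioned in the list above, are not covered here.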

Experimental Results

Research Questions

  • RQ1: Can a pre-trained diffusion model trained on low-resolution data generate plausible ultra-high-resolution images without additional training?
  • RQ2: Is the object repetition issue in high-resolution synthesis primarily due to limited convolutional receptive fields rather than attention token count?
  • RQ3: Can inference-time re-dilation and kernel dispersion enlarge the receptive field effectively without retraining?
  • RQ4: Does noise-damped classifier-free guidance improve quality and texture at ultra-high resolutions?
  • RQ5: How does the proposed method perform across different SD versions and in a text-to-video setting?

Key Results

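| Scale | Method | SD 1.5 FID_r | SD 1.5 KID_r | SD 1.5 FID_b | SD 1.5 KID_b | SD 2.1 FID_r | SD 2.1 KID_r | SD 2.1 FID_b | SD 2.1 KID_b | SD XL 1.0 FID_r | SD XL 1.0 KID_r | SD XL 1.0 FID_b | SD XL 1.0 KID_b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4× | Direct-Inf | 38.50 | 0.014 | 29.30 | 0.008 | 29.89 | 0.010 | 24.21 | 0.007 | 67.71 | 0.029 | 45.55 | 0.014 |
| 4× | Attn-SF | 38.59 | 0.013 | 29.30 | 0.008 | 28.95 | 0.010 | 22.75 | 0.007 | 68.93 | 0.028 | 46.07 | 0.013 |
| 4× | Ours | 32.67 | 0.012 | 24.93 | 0.007 | 20.88 | 0.008 | 16.67 | 0.005 | 64.75 | 0.024 | 28.15 | 0.009 |
| 6.25× | Direct-Inf | 55.47 | 0.020 | 48.54 | 0.015 | 52.58 | 0.018 | 48.13 | 0.014 | 93.91 | 0.041 | 54.90 | 0.020 |
| 6.25× | Attn-SF | 55.96 | 0.020 | 49.03 | 0.015 | 50.62 | 0.017 | 45.57 | 0.014 | 93.92 | 0.042 | 54.89 | 0.019 |
| 6.25× | Ours | 52.11 | 0.019 | 45.86 | 0.014 | 33.36 | 0.010 | 30.66 | 0.008 | 80.72 | 0.032 | 47.15 | 0.015 |
| 8× | Direct-Inf | 74.52 | 0.032 | 68.98 | 0.027 | 69.89 | 0.029 | 55.48 | 0.020 | 122.41 | 0.062 | 82.51 | 0.037 |
| 8× | Attn-SF | 74.42 | 0.032 | 68.81 | 0.027 | 68.97 | 0.029 | 53.97 | 0.020 | 122.21 | 0.062 | 82.35 | 0.037 |
| 8× | Ours | 58.21 | 0.022 | 52.76 | 0.017 | 58.57 | 0.021 | 49.41 | 0.015 | 119.58 | 0.057 | 50.70 | 0.019 |
| 16× | Direct-Inf | 111.34 | 0.046 | 106.70 | 0.042 | 104.70 | 0.043 | 104.10 | 0.040 | 153.33 | 0.070 | 144.99 | 0.061 |
| 16× | Attn-SF | 110.10 | 0.046 | 105.42 | 0.042 | 104.34 | 0.043 | 103.61 | 0.041 | 153.68 | 0.070 | 144.84 | 0.061 |
| 16× | Ours | 78.22 | 0.027 | 65.86 | 0.023 | 59.40 | 0.021 | 57.26 | 0.018 | 131.03 | 0.063 | 124.01 | 0.055 |
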
  • Re-dilation, which targets the limited convolutional receptive field, effectively mitigates object repetition and improves structure at high resolutions.
  • Dispersed convolution with structure- and pixel-level calibration enlarges the effective receptive field without training, enabling higher resolutions.
  • Fractional/layer-timestep-aware re-dilation schedules yield better results than fixed dilation across all layers/steps.
  • Noise-damped classifier-free guidance preserves denoising while enabling high-frequency content, improving texture and detail.
  • Quantitative results show improved FID and KID over Direct-Inf and Attn-SF across SD 1.5, 2.1, and XL 1.0 for 4×, 6.25×, 8×, and 16× upscaling; qualitative gains in texture and detail; successful application to text-to-video.
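
The noise-damped classifier-free guidance mentioned above can be sketched as follows. This review does not state the formula, so the combination below is an assumption: the base term is taken from the original model's unconditional prediction (kept for its denoising ability), and the guidance direction is taken from the re-dilated model's conditional and unconditional predictions. The function names and the toy noise vectors are hypothetical, and plain Python lists stand in for noise tensors.

```python
def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard CFG: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def noise_damped_guidance(eps_base_uncond, eps_mod_uncond, eps_mod_cond, scale):
    """Assumed form: base term from the original model, guidance direction
    from the modified (re-dilated) model. Not confirmed by this review."""
    return [b + scale * (c - u)
            for b, u, c in zip(eps_base_uncond, eps_mod_uncond, eps_mod_cond)]


# Toy two-element "noise predictions" (made-up values for illustration only).
eps_base_uncond = [0.5, -0.25]  # original model, no prompt
eps_mod_uncond = [0.25, 0.0]    # re-dilated model, no prompt
eps_mod_cond = [0.75, 0.5]      # re-dilated model, with prompt

guided = noise_damped_guidance(eps_base_uncond, eps_mod_uncond, eps_mod_cond, 2.0)
print(guided)
```

If the two models produce identical unconditional predictions, this assumed form reduces to standard classifier-free guidance, which is a useful sanity check when experimenting with it.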

This review was generated by AI and reviewed by a human editor.