QUICK REVIEW

[Paper Review] ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

Yingqing He, Shaoshu Yang | arXiv (Cornell University) | 2023. 10. 11.
Generative Adversarial Networks and Image Synthesis | Citations: 14
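One-Line Summary

This paper presents a tuning-free method that dynamically adjusts the convolutional receptive field at inference time through re-dilation and dispersed convolution, and adds noise-damped classifier-free guidance, to generate ultra-high-resolution images with improved fidelity at up to 4096×4096 without any retraining.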

ABSTRACT

In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.

Research Motivation and Goals

  • Motivate higher-resolution image synthesis beyond training resolution without fine-tuning.
  • Identify the structural cause of object repetition when a diffusion model trained on low-resolution images generates directly at a higher resolution.
  • Propose a tuning-free re-dilation strategy to expand receptive fields during inference.
  • Introduce dispersed convolution and noise-damped classifier-free guidance to enable ultra-high-res generation.
  • Demonstrate effectiveness across multiple Stable Diffusion versions and a text-to-video model.

Proposed Method

  • Analyze U-Net components to identify receptive-field limitations as the main cause of repetition.
  • Introduce re-dilation to dynamically adjust convolutional perception fields during inference (including fractional and layer/timestep-aware schedules).
  • Propose dispersed convolution to enlarge kernels while preserving pre-trained behavior via structure- and pixel-level calibration.
  • Develop noise-damped classifier-free guidance to balance denoising ability with high-resolution content generation.
  • Compare against training-free baselines and a diffusion super-resolution model, showing quantitative gains in FID/KID and qualitative texture/detail improvements.
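
To make the re-dilation idea in the list above concrete, here is a minimal sketch, not the authors' implementation. It applies one fixed 1D kernel with two dilation factors and shows that the same weights, left untouched, cover a wider input span when dilation is raised at inference time. The function names `effective_kernel_size` and `dilated_conv1d`, the toy kernel, and the 1D pure-Python setting are all assumptions made for brevity.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Number of input positions spanned by a kernel with the given dilation."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D cross-correlation (no padding) with a dilated kernel.

    The kernel weights are never modified; only the spacing between
    the input samples they read is changed.
    """
    span = effective_kernel_size(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        acc = 0.0
        for j, weight in enumerate(kernel):
            acc += weight * signal[start + j * dilation]
        out.append(acc)
    return out


# Hypothetical stand-in for pre-trained weights (illustrative only).
kernel = [1.0, 0.0, -1.0]
signal = [float(i * i) for i in range(8)]  # 0, 1, 4, 9, 16, 25, 36, 49

base = dilated_conv1d(signal, kernel, dilation=1)       # spans 3 samples
redilated = dilated_conv1d(signal, kernel, dilation=2)  # spans 5 samples

print(effective_kernel_size(3, 1), base)
print(effective_kernel_size(3, 2), redilated)
```

With dilation 2, each output reads samples four positions apart instead of two, so the kernel "sees" a wider context without any weight update. The fractional and layer/timestep-aware schedules mentioned above are not covered by this sketch.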

Experimental Results

Research Questions

  • RQ1: Can a pre-trained diffusion model trained on low-resolution data generate plausible ultra-high-resolution images without additional training?
  • RQ2: Is the object repetition issue in high-resolution synthesis primarily due to limited convolutional receptive fields rather than attention token count?
  • RQ3: Can inference-time re-dilation and kernel dispersion enlarge the receptive field effectively without retraining?
  • RQ4: Does noise-damped classifier-free guidance improve quality and texture at ultra-high resolutions?
  • RQ5: How does the proposed method perform across different SD versions and in a text-to-video setting?

Key Results

  • Re-dilation, which enlarges the convolutional receptive field at inference time, effectively mitigates object repetition and improves structure at high resolutions.
  • Dispersed convolution with structure- and pixel-level calibration enlarges the effective receptive field without training, enabling higher resolutions.
  • Fractional and layer/timestep-aware re-dilation schedules yield better results than a fixed dilation applied across all layers and steps.
  • Noise-damped classifier-free guidance preserves denoising while enabling high-frequency content, improving texture and detail.
  • Quantitative results show improved FID and KID over Direct-Inf and Attn-SF across SD 1.5, 2.1, and XL 1.0 at 4×, 6.25×, 8×, and 16× the training resolution, with qualitative gains in texture and detail and a successful application to text-to-video.
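
The noise-damped guidance result listed above can be illustrated with a small sketch. The combination rule here is an assumption about the general shape of the technique, not a formula stated in this review: the base unconditional prediction is taken from the original network (stronger denoising), while the guidance direction is taken from the re-dilated or dispersed network (better structure). The names `noise_damped_cfg`, `eps_orig_uncond`, `eps_mod_cond`, `eps_mod_uncond`, and `guidance_scale` are hypothetical, and plain Python lists stand in for noise tensors.

```python
def noise_damped_cfg(eps_orig_uncond, eps_mod_cond, eps_mod_uncond, guidance_scale):
    """Assumed combination rule (element-wise):

        original_uncond + guidance_scale * (modified_cond - modified_uncond)

    'original' = unmodified pre-trained network,
    'modified' = network with re-dilated or dispersed convolutions.
    """
    return [
        base + guidance_scale * (cond - uncond)
        for base, cond, uncond in zip(eps_orig_uncond, eps_mod_cond, eps_mod_uncond)
    ]


# Toy noise predictions (illustrative values only).
eps_orig_uncond = [0.5, -1.0, 2.0]
eps_mod_cond = [1.0, 0.0, 2.0]
eps_mod_uncond = [0.0, 0.0, 1.0]

guided = noise_damped_cfg(eps_orig_uncond, eps_mod_cond, eps_mod_uncond, guidance_scale=2.0)
print(guided)
```

Under this assumed rule, a guidance scale of 0 falls back to the original network's unconditional prediction, which is one way to see how the denoising behavior of the unmodified model is retained.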

This review was generated by AI and reviewed by a human editor.