QUICK REVIEW

[Paper Review] ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

Yingqing He, Shaoshu Yang|arXiv (Cornell University)|Oct 11, 2023
Generative Adversarial Networks and Image Synthesis|14 citations
TL;DR

The paper presents a tuning-free method for generating ultra-high-resolution images from pre-trained diffusion models. It adjusts the convolutional receptive field at inference time through re-dilation and dispersed convolution, adds noise-damped classifier-free guidance, and reaches resolutions up to 4096×4096 with higher fidelity than training-free baselines, all without retraining.

ABSTRACT

In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.

Motivation & Objective

  • Motivate higher-resolution image synthesis beyond training resolution without fine-tuning.
  • Identify the structural cause of object repetition when sampling high-resolution images from diffusion models trained at low resolution.
  • Propose a tuning-free re-dilation strategy to expand receptive fields during inference.
  • Introduce dispersed convolution and noise-damped classifier-free guidance to enable ultra-high-res generation.
  • Demonstrate effectiveness across multiple Stable Diffusion versions and a text-to-video model.

Proposed method

  • Analyze U-Net components to identify receptive-field limitations as the main cause of repetition.
  • Introduce re-dilation to dynamically adjust convolutional perception fields during inference (including fractional and layer/timestep-aware schedules).
  • Propose dispersed convolution to enlarge kernels while preserving pre-trained behavior via structure- and pixel-level calibration.
  • Develop noise-damped classifier-free guidance to balance denoising ability with high-resolution content generation.
  • Compare against training-free baselines and a diffusion super-resolution model, showing quantitative gains in FID/KID and qualitative texture/detail improvements.
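
The re-dilation idea in the list above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the helper names `effective_kernel_size` and `dilated_conv2d` are hypothetical, the all-ones 3×3 kernel merely stands in for pre-trained weights, and a real U-Net would do this inside a deep-learning framework. What it shows is that spacing the same kernel taps `dilation` pixels apart widens the area a kernel sees without changing any weights.

```python
from typing import List

Matrix = List[List[float]]


def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a kernel whose taps sit `dilation` pixels apart."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv2d(image: Matrix, kernel: Matrix, dilation: int = 1) -> Matrix:
    """Zero-padded, same-size 2D cross-correlation with an odd-sized square kernel."""
    height, width = len(image), len(image[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    # Tap offsets are stretched by the dilation factor.
                    yy = y + (i - half) * dilation
                    xx = x + (j - half) * dilation
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


# Illustrative input: one bright pixel in the centre of a 7x7 image.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 1.0

# Hypothetical stand-in for pre-trained 3x3 weights (reused unchanged below).
kernel = [[1.0] * 3 for _ in range(3)]

base = dilated_conv2d(image, kernel, dilation=1)       # taps span 3x3
redilated = dilated_conv2d(image, kernel, dilation=2)  # same taps span 5x5

print(effective_kernel_size(3, 1), effective_kernel_size(3, 2))
```

With dilation 2, the same nine weights reach pixels two steps away from the centre, so each output location draws on a 5×5 window instead of a 3×3 one. The fractional and layer/timestep-aware schedules mentioned above would vary `dilation` per layer and per denoising step; that scheduling is not modelled here.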

Experimental results

Research questions

  • RQ1: Can a pre-trained diffusion model trained on low-resolution data generate plausible ultra-high-resolution images without additional training?
  • RQ2: Is the object repetition issue in high-resolution synthesis primarily due to limited convolutional receptive fields rather than attention token count?
  • RQ3: Can inference-time re-dilation and kernel dispersion enlarge the receptive field effectively without retraining?
  • RQ4: Does noise-damped classifier-free guidance improve quality and texture at ultra-high resolutions?
  • RQ5: How does the proposed method perform across different SD versions and in a text-to-video setting?

Key findings

  • Re-dilation enlarges the convolutional receptive field at inference time, which effectively mitigates object repetition and improves structure at high resolutions.
  • Dispersed convolution with structure- and pixel-level calibration enlarges the effective receptive field without training, enabling higher resolutions.
  • Fractional re-dilation and layer- and timestep-aware schedules yield better results than a fixed dilation applied to all layers and steps.
  • Noise-damped classifier-free guidance preserves denoising while enabling high-frequency content, improving texture and detail.
  • Quantitative results show improved FID and KID over Direct-Inf and Attn-SF across SD 1.5, 2.1, and XL 1.0 for 4×, 6.25×, 8×, and 16× upscaling; qualitative gains in texture and detail; successful application to text-to-video.
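
The noise-damped guidance finding can be sketched as a small combination rule. This review does not spell out the formula, so the form below is an assumption based on a reading of the paper: the unconditional prediction of the original network anchors the estimate (keeping its denoising ability), while the guidance direction comes from the re-dilated network. The function names and the toy two-element "noise predictions" are hypothetical; check the paper before relying on the exact expression.

```python
from typing import List, Sequence


def classifier_free_guidance(
    eps_uncond: Sequence[float], eps_cond: Sequence[float], scale: float
) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def noise_damped_guidance(
    base_uncond: Sequence[float],
    redilated_uncond: Sequence[float],
    redilated_cond: Sequence[float],
    scale: float,
) -> List[float]:
    """Assumed form: base_uncond + scale * (redilated_cond - redilated_uncond).

    base_uncond comes from the unmodified network; the two redilated_*
    predictions come from the network with re-dilated convolutions.
    """
    return [
        b + scale * (c - u)
        for b, u, c in zip(base_uncond, redilated_uncond, redilated_cond)
    ]


# Toy values standing in for per-pixel noise predictions (illustrative only).
base_uncond = [0.5, -1.0]
redilated_uncond = [1.0, 0.0]
redilated_cond = [2.0, 0.5]

guided = noise_damped_guidance(
    base_uncond, redilated_uncond, redilated_cond, scale=7.5
)
print(guided)
```

Under this assumed form, any error shared by the two re-dilated predictions cancels in the difference, and the rule reduces to standard classifier-free guidance when the base and re-dilated networks agree.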

This review was created by AI and reviewed by human editors.