[Paper Review] ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models
The paper presents a tuning-free method for generating ultra-high-resolution images from pre-trained diffusion models. It adjusts the convolutional receptive field at inference time through re-dilation and dispersed convolution, adds noise-damped classifier-free guidance, and reports generation at resolutions up to 4096×4096 without retraining.
In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
Motivation & Objective
- Motivate higher-resolution image synthesis beyond training resolution without fine-tuning.
- Identify the structural cause of object repetition when sampling high-resolution images from diffusion models trained at low resolution.
- Propose a tuning-free re-dilation strategy to expand receptive fields during inference.
- Introduce dispersed convolution and noise-damped classifier-free guidance to enable ultra-high-res generation.
- Demonstrate effectiveness across multiple Stable Diffusion versions and a text-to-video model.
Proposed method
- Analyze U-Net components to identify receptive-field limitations as the main cause of repetition.
- Introduce re-dilation to dynamically adjust convolutional perception fields during inference (including fractional and layer/timestep-aware schedules).
- Propose dispersed convolution to enlarge kernels while preserving pre-trained behavior via structure- and pixel-level calibration.
- Develop noise-damped classifier-free guidance to balance denoising ability with high-resolution content generation.
- Compare against training-free baselines and a diffusion super-resolution model, showing quantitative gains in FID/KID and qualitative texture/detail improvements.
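The receptive-field idea behind re-dilation can be sketched with a plain dilated convolution. This is a minimal illustration, not the paper's implementation: `dilated_conv2d` and `effective_extent` are hypothetical helper names, the kernel is a toy all-ones 3×3 filter, and the fractional and layer/timestep-aware schedules are not modeled. It only shows that the same pre-trained weights, applied with a larger dilation, cover a wider input window.

```python
def dilated_conv2d(image, kernel, dilation=1):
    """Zero-padded 'same' 2D cross-correlation with a square kernel.

    The kernel weights are unchanged; only the spacing between the
    sampled input positions grows with `dilation`.
    """
    k = len(kernel)
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy = y + (i - k // 2) * dilation
                    xx = x + (j - k // 2) * dilation
                    if 0 <= yy < h and 0 <= xx < w:  # zero padding outside
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def effective_extent(kernel_size, dilation):
    """Side length of the input window touched by a dilated kernel."""
    return dilation * (kernel_size - 1) + 1


# A single bright pixel in a 7x7 image, filtered by the same 3x3 kernel.
impulse = [[0.0] * 7 for _ in range(7)]
impulse[3][3] = 1.0
ones = [[1.0] * 3 for _ in range(3)]

hits_d1 = {(y, x) for y, row in enumerate(dilated_conv2d(impulse, ones, 1))
           for x, v in enumerate(row) if v != 0.0}
hits_d2 = {(y, x) for y, row in enumerate(dilated_conv2d(impulse, ones, 2))
           for x, v in enumerate(row) if v != 0.0}
```

With dilation 1 the impulse influences a 3×3 neighborhood; with dilation 2 the same nine weights reach across a 5×5 window, which is the sense in which re-dilation enlarges the perception field without touching the trained parameters.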
Experimental results
Research questions
- RQ1: Can a pre-trained diffusion model trained on low-resolution data generate plausible ultra-high-resolution images without additional training?
- RQ2: Is the object repetition issue in high-resolution synthesis primarily due to limited convolutional receptive fields rather than attention token count?
- RQ3: Can inference-time re-dilation and kernel dispersion enlarge the receptive field effectively without retraining?
- RQ4: Does noise-damped classifier-free guidance improve quality and texture at ultra-high resolutions?
- RQ5: How does the proposed method perform across different SD versions and in a text-to-video setting?
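For the guidance question, it may help to see how a noise-damped variant could differ from standard classifier-free guidance. The sketch below is an assumption, since this review does not state the formula: it takes the base (unconditional) noise prediction from the original model, which denoises well, and the guidance direction from the re-dilated model, which structures content better. The function names and the toy list-of-floats "predictions" are hypothetical.

```python
def cfg(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance: eps_u + w * (eps_c - eps_u)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def noise_damped_cfg(base_uncond, dilated_uncond, dilated_cond, w):
    """Assumed noise-damped form: base prediction from the original model,
    guidance direction (cond - uncond) from the re-dilated model."""
    return [b + w * (c - u)
            for b, u, c in zip(base_uncond, dilated_uncond, dilated_cond)]


# Toy two-element "noise predictions" (illustrative values only).
base_uncond = [1.0, 2.0]      # original model, unconditional
dilated_uncond = [0.5, 0.5]   # re-dilated model, unconditional
dilated_cond = [1.5, 0.0]     # re-dilated model, conditional

standard = cfg(dilated_uncond, dilated_cond, 2.0)
damped = noise_damped_cfg(base_uncond, dilated_uncond, dilated_cond, 2.0)
```

When the two models give the same unconditional prediction, the damped form reduces to standard guidance; otherwise only the base term changes while the guidance direction stays the same.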
Key findings
- Re-dilation, which enlarges the convolutional receptive field, effectively mitigates object repetition and improves structure at high resolutions.
- Dispersed convolution with structure- and pixel-level calibration enlarges the effective receptive field without training, enabling higher resolutions.
- Fractional and layer/timestep-aware re-dilation schedules yield better results than a fixed dilation applied across all layers and steps.
- Noise-damped classifier-free guidance preserves denoising while enabling high-frequency content, improving texture and detail.
- Quantitative results show improved FID and KID over Direct-Inf and Attn-SF across SD 1.5, 2.1, and XL 1.0 for 4×, 6.25×, 8×, and 16× upscaling; qualitative gains in texture and detail; successful application to text-to-video.
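A short arithmetic sketch shows why fractional re-dilation comes up. It assumes the upscaling factors above are pixel-count ratios (consistent with the 512 x 512 to 1024 x 1024 example in the abstract, but not stated explicitly in this review) and that the target is square; `pixel_ratio` and `per_side_scale` are hypothetical helper names.

```python
import math


def pixel_ratio(train_hw, target_hw):
    """Ratio of target pixel count to training pixel count."""
    return (target_hw[0] * target_hw[1]) / (train_hw[0] * train_hw[1])


def per_side_scale(ratio):
    """Per-side scale implied by a pixel-count ratio (square images assumed)."""
    return math.sqrt(ratio)


ratio_512_to_1024 = pixel_ratio((512, 512), (1024, 1024))
scales = {r: per_side_scale(r) for r in (4, 6.25, 8, 16)}
```

Under these assumptions, 4× and 16× correspond to integer per-side scales of 2 and 4, while 6.25× gives 2.5 and 8× gives roughly 2.83 (or a non-square target), so an integer dilation factor alone cannot match every setting.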
This review was created by AI and reviewed by human editors.