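[Paper Review] ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models
This paper presents a tuning-free method that dynamically adjusts the convolutional receptive field at inference time through re-dilation and dispersed convolution, and adds noise-damped classifier-free guidance. Without any retraining, it generates ultra-high-resolution images of up to 4096×4096 with higher fidelity.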
In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
Motivation and Objectives
- Motivate higher-resolution image synthesis beyond training resolution without fine-tuning.
- Identify the structural cause of object repetition when sampling high-resolution images from diffusion models trained at low resolution.
- Propose a tuning-free re-dilation strategy to expand receptive fields during inference.
- Introduce dispersed convolution and noise-damped classifier-free guidance to enable ultra-high-res generation.
- Demonstrate effectiveness across multiple Stable Diffusion versions and a text-to-video model.
Proposed Method
- Analyze U-Net components to identify receptive-field limitations as the main cause of repetition.
- Introduce re-dilation to dynamically adjust convolutional perception fields during inference (including fractional and layer/timestep-aware schedules).
- Propose dispersed convolution to enlarge kernels while preserving pre-trained behavior via structure- and pixel-level calibration.
- Develop noise-damped classifier-free guidance to balance denoising ability with high-resolution content generation.
- Compare against training-free baselines and a diffusion super-resolution model, showing quantitative gains in FID/KID and qualitative texture/detail improvements.
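The core idea of re-dilation in the list above can be illustrated with a minimal sketch: the same kernel weights are reused at inference, but the taps are spaced further apart so the kernel covers a wider window. This is a toy, pure-Python illustration with integer dilation only; the function names `effective_kernel_size` and `dilated_conv2d`, the averaging kernel, and the 7×7 test image are all assumptions made for this example and are not taken from the paper's implementation. Fractional dilation, layer/timestep schedules, and dispersed convolution are not covered.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span covered by a dilated kernel: d * (k - 1) + 1."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv2d(image, kernel, dilation=1):
    """Zero-padded, same-size 2D cross-correlation with an integer dilation.

    image  : list of rows (H x W) of floats
    kernel : list of rows (k x k), k odd; a stand-in for pre-trained weights
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    # Taps are spaced `dilation` pixels apart; weights are unchanged.
                    yy = y + (i - half) * dilation
                    xx = x + (j - half) * dilation
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


# Toy 3x3 averaging kernel (hypothetical placeholder for a pre-trained kernel).
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]

# A 7x7 image with a single bright pixel at the centre.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 9.0

base = dilated_conv2d(image, kernel, dilation=1)       # covers a 3x3 window
redilated = dilated_conv2d(image, kernel, dilation=2)  # covers a 5x5 window

print(effective_kernel_size(3, 1), effective_kernel_size(3, 2))
print(base[1][1], redilated[1][1])
```

With dilation 2, the same nine weights span a 5×5 window, so the output at position (1, 1) responds to the bright centre pixel that the undilated kernel cannot reach from there. No weights are changed, which is why this kind of adjustment needs no training.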
Experimental Results
Research Questions
- RQ1: Can a pre-trained diffusion model trained on low-resolution data generate plausible ultra-high-resolution images without additional training?
- RQ2: Is the object repetition issue in high-resolution synthesis primarily due to limited convolutional receptive fields rather than attention token count?
- RQ3: Can inference-time re-dilation and kernel dispersion enlarge the receptive field effectively without retraining?
- RQ4: Does noise-damped classifier-free guidance improve quality and texture at ultra-high resolutions?
- RQ5: How does the proposed method perform across different SD versions and in a text-to-video setting?
Key Results
| Resolution | Method | SD 1.5 FID_r | SD 1.5 KID_r | SD 1.5 FID_b | SD 1.5 KID_b | SD 2.1 FID_r | SD 2.1 KID_r | SD 2.1 FID_b | SD 2.1 KID_b | SD XL 1.0 FID_r | SD XL 1.0 KID_r | SD XL 1.0 FID_b | SD XL 1.0 KID_b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4× | Direct-Inf | 38.50 | 0.014 | 29.30 | 0.008 | 29.89 | 0.010 | 24.21 | 0.007 | 67.71 | 0.029 | 45.55 | 0.014 |
| 4× | Attn-SF | 38.59 | 0.013 | 29.30 | 0.008 | 28.95 | 0.010 | 22.75 | 0.007 | 68.93 | 0.028 | 46.07 | 0.013 |
| 4× | Ours | 32.67 | 0.012 | 24.93 | 0.007 | 20.88 | 0.008 | 16.67 | 0.005 | 64.75 | 0.024 | 28.15 | 0.009 |
| 6.25× | Direct-Inf | 55.47 | 0.020 | 48.54 | 0.015 | 52.58 | 0.018 | 48.13 | 0.014 | 93.91 | 0.041 | 54.90 | 0.020 |
| 6.25× | Attn-SF | 55.96 | 0.020 | 49.03 | 0.015 | 50.62 | 0.017 | 45.57 | 0.014 | 93.92 | 0.042 | 54.89 | 0.019 |
| 6.25× | Ours | 52.11 | 0.019 | 45.86 | 0.014 | 33.36 | 0.010 | 30.66 | 0.008 | 80.72 | 0.032 | 47.15 | 0.015 |
| 8× | Direct-Inf | 74.52 | 0.032 | 68.98 | 0.027 | 69.89 | 0.029 | 55.48 | 0.020 | 122.41 | 0.062 | 82.51 | 0.037 |
| 8× | Attn-SF | 74.42 | 0.032 | 68.81 | 0.027 | 68.97 | 0.029 | 53.97 | 0.020 | 122.21 | 0.062 | 82.35 | 0.037 |
| 8× | Ours | 58.21 | 0.022 | 52.76 | 0.017 | 58.57 | 0.021 | 49.41 | 0.015 | 119.58 | 0.057 | 50.70 | 0.019 |
| 16× | Direct-Inf | 111.34 | 0.046 | 106.70 | 0.042 | 104.70 | 0.043 | 104.10 | 0.040 | 153.33 | 0.070 | 144.99 | 0.061 |
| 16× | Attn-SF | 110.10 | 0.046 | 105.42 | 0.042 | 104.34 | 0.043 | 103.61 | 0.041 | 153.68 | 0.070 | 144.84 | 0.061 |
| 16× | Ours | 78.22 | 0.027 | 65.86 | 0.023 | 59.40 | 0.021 | 57.26 | 0.018 | 131.03 | 0.063 | 124.01 | 0.055 |
Lower is better for both FID and KID. The resolution factor is the increase in pixel count over the training resolution. The subscript r denotes comparison against real images, and b denotes comparison against the base model's samples at its training resolution.
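To make the size of the gains easier to read, the following sketch computes the relative FID reduction of the proposed method over Direct-Inf. It uses only numbers copied from the table: the first FID column of each model in the last block of three rows (the largest setting). The helper name `relative_reduction` and the dictionary layout are assumptions made for this example.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline` (lower FID is better)."""
    return 100.0 * (baseline - ours) / baseline


# First FID column per model, last block of the table: (Direct-Inf, Ours).
fid_largest_setting = {
    "SD 1.5": (111.34, 78.22),
    "SD 2.1": (104.70, 59.40),
    "SD XL 1.0": (153.33, 131.03),
}

reductions = {
    model: relative_reduction(baseline, ours)
    for model, (baseline, ours) in fid_largest_setting.items()
}

for model, pct in reductions.items():
    print(f"{model}: {pct:.1f}% lower FID than Direct-Inf")
```

On these rows the reduction is roughly 30% for SD 1.5, 43% for SD 2.1, and 15% for SD XL 1.0, so the improvement is present for all three models but varies considerably between them.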
- Re-dilation, which enlarges the convolutional receptive field at inference time, effectively mitigates object repetition and improves structure at high resolutions.
- Dispersed convolution with structure- and pixel-level calibration enlarges the effective receptive field without training, enabling higher resolutions.
- Fractional/layer-timestep-aware re-dilation schedules yield better results than fixed dilation across all layers/steps.
- Noise-damped classifier-free guidance preserves denoising while enabling high-frequency content, improving texture and detail.
- Quantitative results show improved FID and KID over Direct-Inf and Attn-SF across SD 1.5, 2.1, and XL 1.0 for 4×, 6.25×, 8×, and 16× upscaling; qualitative gains in texture and detail; successful application to text-to-video.
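The noise-damped classifier-free guidance mentioned above can be written as a small change to standard classifier-free guidance. The formulation below is a sketch under stated assumptions: the symbols are introduced here for illustration and do not appear in this review. $\epsilon_\theta$ is assumed to be the original pre-trained denoiser, $\tilde{\epsilon}_\theta$ the same network with re-dilated or dispersed convolutions, $x_t$ the noisy latent, $y$ the text condition, and $w$ the guidance scale.

```latex
% Standard classifier-free guidance (single model):
\hat{\epsilon}(x_t, y) = \epsilon_\theta(x_t) + w \,\bigl(\epsilon_\theta(x_t, y) - \epsilon_\theta(x_t)\bigr)

% Noise-damped variant (assumed form): the base term comes from the original
% model, the guidance direction from the modified model.
\hat{\epsilon}(x_t, y) = \epsilon_\theta(x_t) + w \,\bigl(\tilde{\epsilon}_\theta(x_t, y) - \tilde{\epsilon}_\theta(x_t)\bigr)
```

Under this reading, the original model supplies the denoising ability through the base term, while the difference term from the modified model supplies the high-resolution content structure, which matches the balance described in the method summary.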
This review was generated by AI and checked by a human editor.