[Paper Walkthrough] ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models
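The paper presents a tuning-free method for generating images from a pre-trained diffusion model at resolutions well above its training resolution. It adjusts the convolutional receptive field at inference time (re-dilation and dispersed convolution) and adds noise-damped classifier-free guidance, reaching resolutions up to 4096×4096 with improved fidelity and no retraining.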
In this work, we investigate the capability of generating images from pre-trained diffusion models at much higher resolutions than the training image sizes. In addition, the generated images should have arbitrary image aspect ratios. When generating images directly at a higher resolution, 1024 x 1024, with the pre-trained Stable Diffusion using training images of resolution 512 x 512, we observe persistent problems of object repetition and unreasonable object structures. Existing works for higher-resolution generation, such as attention-based and joint-diffusion approaches, cannot well address these issues. As a new perspective, we examine the structural components of the U-Net in diffusion models and identify the crucial cause as the limited perception field of convolutional kernels. Based on this key observation, we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference. We further propose the dispersed convolution and noise-damped classifier-free guidance, which can enable ultra-high-resolution image generation (e.g., 4096 x 4096). Notably, our approach does not require any training or optimization. Extensive experiments demonstrate that our approach can address the repetition issue well and achieve state-of-the-art performance on higher-resolution image synthesis, especially in texture details. Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
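Research Motivation and Goals
- Push image synthesis beyond the training resolution without any fine-tuning.
- Identify the structural cause of object repetition when a diffusion model trained at low resolution samples at high resolution.
- Propose a tuning-free re-dilation strategy that enlarges the receptive field at inference time.
- Introduce dispersed convolution and noise-damped classifier-free guidance to enable ultra-high-resolution generation.
- Demonstrate effectiveness across several Stable Diffusion versions and a text-to-video model.
Proposed Method
- Analyze the U-Net components and identify the limited receptive field of convolutional kernels as the main cause of repetition.
- Introduce re-dilation at inference time to dynamically adjust the convolutional receptive field, including fractional dilation factors and layer- and timestep-aware scheduling.
- Propose dispersed convolution, which enlarges the kernel without training through structure-level and pixel-level calibration while preserving the pre-trained behavior.
- Develop noise-damped classifier-free guidance to balance denoising ability against high-resolution content generation.
- Compare against training-free baselines and diffusion super-resolution models, reporting quantitative gains in FID/KID and qualitative improvements in texture and detail.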
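To make the re-dilation idea concrete, here is a minimal sketch in plain Python. It is a 1D toy, not the paper's implementation: the paper applies re-dilation to the 2D convolutions of the Stable Diffusion U-Net, whereas the names `receptive_field` and `dilated_conv1d`, the 3-tap kernel, and the input signal below are all hypothetical and chosen only for illustration. The point is that the same weights, reused with a larger dilation, cover a wider span of the input.

```python
def receptive_field(kernel_size: int, dilation: int) -> int:
    """Span (in input samples) covered by one dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1D cross-correlation with a dilated kernel (no padding)."""
    span = receptive_field(len(kernel), dilation)
    out = []
    for start in range(len(signal) - span + 1):
        acc = 0.0
        for j, w in enumerate(kernel):
            # Taps are spaced `dilation` samples apart.
            acc += w * signal[start + j * dilation]
        out.append(acc)
    return out


# Hypothetical "pre-trained" weights: they stay fixed, only the dilation changes.
kernel = [1.0, 0.0, -1.0]
signal = [float(i * i) for i in range(8)]  # 0, 1, 4, 9, 16, 25, 36, 49

base = dilated_conv1d(signal, kernel, dilation=1)       # each output sees 3 samples
redilated = dilated_conv1d(signal, kernel, dilation=2)  # each output sees 5 samples

print(receptive_field(3, 1), receptive_field(3, 2))
print(base)
print(redilated)
```

With dilation 2 the kernel spans 5 input samples instead of 3 while the number of weights is unchanged, which is why no retraining is involved. The fractional and layer- or timestep-dependent schedules mentioned above are not modeled in this sketch.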
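Experimental Results
Research Questions
- RQ1: Can a pre-trained diffusion model trained on low-resolution data generate plausible ultra-high-resolution images without additional training?
- RQ2: Is object repetition in high-resolution synthesis caused mainly by the limited convolutional receptive field rather than by the number of attention tokens?
- RQ3: Can inference-time re-dilation and kernel dispersion effectively enlarge the receptive field without retraining?
- RQ4: Does noise-damped classifier-free guidance improve quality and texture at ultra-high resolution?
- RQ5: How does the proposed method perform across SD versions and in the text-to-video setting?
Key Findings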
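Lower FID and KID are better. The rows form four groups of three (Direct-Inf, Attn-SF, Ours); they appear to correspond, in order, to the 4×, 6.25×, 8×, and 16× upscaling factors named in the findings after the table. In the paper, the r suffix marks metrics computed against real images and the b suffix marks metrics computed against images generated at the base training resolution.

| Method | SD 1.5 FID r | SD 1.5 KID r | SD 1.5 FID b | SD 1.5 KID b | SD 2.1 FID r | SD 2.1 KID r | SD 2.1 FID b | SD 2.1 KID b | SD XL 1.0 FID r | SD XL 1.0 KID r | SD XL 1.0 FID b | SD XL 1.0 KID b |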
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Direct-Inf | 38.50 | 0.014 | 29.30 | 0.008 | 29.89 | 0.010 | 24.21 | 0.007 | 67.71 | 0.029 | 45.55 | 0.014 |
| Attn-SF | 38.59 | 0.013 | 29.30 | 0.008 | 28.95 | 0.010 | 22.75 | 0.007 | 68.93 | 0.028 | 46.07 | 0.013 |
| Ours | 32.67 | 0.012 | 24.93 | 0.007 | 20.88 | 0.008 | 16.67 | 0.005 | 64.75 | 0.024 | 28.15 | 0.009 |
| Direct-Inf | 55.47 | 0.020 | 48.54 | 0.015 | 52.58 | 0.018 | 48.13 | 0.014 | 93.91 | 0.041 | 54.90 | 0.020 |
| Attn-SF | 55.96 | 0.020 | 49.03 | 0.015 | 50.62 | 0.017 | 45.57 | 0.014 | 93.92 | 0.042 | 54.89 | 0.019 |
| Ours | 52.11 | 0.019 | 45.86 | 0.014 | 33.36 | 0.010 | 30.66 | 0.008 | 80.72 | 0.032 | 47.15 | 0.015 |
| Direct-Inf | 74.52 | 0.032 | 68.98 | 0.027 | 69.89 | 0.029 | 55.48 | 0.020 | 122.41 | 0.062 | 82.51 | 0.037 |
| Attn-SF | 74.42 | 0.032 | 68.81 | 0.027 | 68.97 | 0.029 | 53.97 | 0.020 | 122.21 | 0.062 | 82.35 | 0.037 |
| Ours | 58.21 | 0.022 | 52.76 | 0.017 | 58.57 | 0.021 | 49.41 | 0.015 | 119.58 | 0.057 | 50.70 | 0.019 |
| Direct-Inf | 111.34 | 0.046 | 106.70 | 0.042 | 104.70 | 0.043 | 104.10 | 0.040 | 153.33 | 0.070 | 144.99 | 0.061 |
| Attn-SF | 110.10 | 0.046 | 105.42 | 0.042 | 104.34 | 0.043 | 103.61 | 0.041 | 153.68 | 0.070 | 144.84 | 0.061 |
| Ours | 78.22 | 0.027 | 65.86 | 0.023 | 59.40 | 0.021 | 57.26 | 0.018 | 131.03 | 0.063 | 124.01 | 0.055 |
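As a quick way to read the table, the sketch below computes the relative FID reduction of Ours over Direct-Inf for the first row group. The FID r values are copied from the table; the helper name `relative_reduction` is hypothetical, and this is only arithmetic on the reported numbers, not a re-evaluation.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# FID r from the first row group of the table: (Direct-Inf, Ours).
fid_r = {
    "SD 1.5": (38.50, 32.67),
    "SD 2.1": (29.89, 20.88),
    "SD XL 1.0": (67.71, 64.75),
}

reductions = {
    model: round(relative_reduction(direct, ours), 1)
    for model, (direct, ours) in fid_r.items()
}
print(reductions)
```

On these numbers the gain is largest for SD 2.1 (about 30%) and smallest for SD XL 1.0 (about 4%), so the size of the improvement depends noticeably on the base model.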
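- Re-dilation effectively resolves the object repetition caused by the limited convolutional receptive field and improves structure at high resolution.
- Dispersed convolution enlarges the effective receptive field without training through structure-level and pixel-level calibration, making even higher resolutions reachable.
- A fractional, layer- and timestep-aware re-dilation schedule yields better results than a fixed dilation applied to all layers and timesteps.
- Noise-damped classifier-free guidance preserves denoising while admitting high-frequency content, improving texture and detail.
- Quantitatively, at upscaling factors of 4×, 6.25×, 8×, and 16×, FID and KID improve over Direct-Inf and Attn-SF on SD 1.5, 2.1, and XL 1.0; texture and detail also improve qualitatively, and the method transfers to text-to-video generation.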
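The noise-damped guidance can be sketched as follows. This reflects one reading of the paper's formulation and should be treated as an assumption: the unconditional prediction of the original model serves as the base term (it denoises well), while the guidance direction comes from the model whose convolutions have been re-dilated or dispersed (it has better global structure). The function names and the toy two-element noise vectors are hypothetical.

```python
def standard_cfg(eps_cond, eps_uncond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def noise_damped_cfg(eps_base_uncond, eps_mod_cond, eps_mod_uncond, w):
    """Assumed form: base model supplies the anchor, modified model the direction."""
    return [
        b + w * (c - u)
        for b, c, u in zip(eps_base_uncond, eps_mod_cond, eps_mod_uncond)
    ]


# Toy noise predictions (hypothetical values, not model outputs).
eps_base_uncond = [0.5, -0.25]  # original U-Net, no prompt
eps_mod_cond = [1.0, 0.5]       # re-dilated U-Net, with prompt
eps_mod_uncond = [0.5, 0.0]     # re-dilated U-Net, no prompt

guided = noise_damped_cfg(eps_base_uncond, eps_mod_cond, eps_mod_uncond, w=2.0)
print(guided)
```

If the modified and original models coincide, this expression reduces to ordinary classifier-free guidance, which is a useful sanity check on the sketch.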
This walkthrough was generated by AI and reviewed by a human editor.