QUICK REVIEW

[Paper Review] Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal|arXiv (Cornell University)|Mar 5, 2024

Computer Graphics and Visualization Techniques | Cited by 84

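One-Sentence Summary

This paper improves rectified flow models for high-resolution image synthesis by introducing timestep sampling biased toward perceptually relevant noise scales, a multimodal text-image transformer backbone with separate weights per modality (MM-DiT), and a scaling study whose results are competitive with or better than current diffusion models.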

ABSTRACT

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.

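Research Motivation and Objectives

Improve the rectified flow formulation for high-resolution image synthesis by biasing noise-scale sampling toward perceptually relevant scales.
Develop a scalable multimodal transformer backbone that supports bidirectional information flow between text and image tokens to improve text-to-image generation.
Systematically compare diffusion and rectified flow variants across datasets and sampling settings to identify better training and sampling strategies.
Demonstrate the scaling behavior of the proposed models up to 8B parameters and assess the correlation between validation loss and image-text evaluation metrics.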

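Proposed Method

Reweight noise scales in rectified flow training so that perceptually relevant timesteps receive more emphasis, yielding a weighted noise-prediction objective (L_w) that stresses intermediate timesteps.
Compare variants including RF, EDM, and LDM-style schedules, using tailored SNR samplers such as the logit-normal, mode-based, and CosMap timestep distributions.
Introduce MM-DiT, a multimodal diffusion backbone with two independent sets of weights for the image and text modalities; the two token streams interact bidirectionally through a joint attention operation over the concatenated sequence, while the remaining per-token processing (such as the MLPs) stays modality-specific.
Pretrain and then finetune on high-resolution data, using QK-normalization to stabilize attention and allow training in bf16 precision, and expand the latent channels to d=16 for better reconstruction.
Use an improved autoencoder (latent space d=16), a 50/50 mix of CogVLM-generated synthetic captions and original captions, and a scalable per-modality diffusion backbone for text-conditional image generation.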
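To make the timestep-biasing idea concrete, here is a minimal NumPy sketch, not the authors' code. It assumes the usual rectified flow interpolation z_t = (1 - t) * x0 + t * eps with velocity target eps - x0, and it reads the two numbers in rf/lognorm(0.00, 1.00) as the location and scale of a logit-normal timestep density. The function names and the toy latent shape are hypothetical.

```python
import numpy as np

def sample_logit_normal(rng, n, loc=0.0, scale=1.0):
    """Draw t in (0, 1) by passing a normal sample through the logistic function."""
    u = rng.normal(loc, scale, size=n)
    return 1.0 / (1.0 + np.exp(-u))

def rectified_flow_pair(rng, x0, t):
    """Build the noisy input z_t and the velocity target for a batch x0.

    z_t = (1 - t) * x0 + t * eps is the straight line from data to noise;
    its time derivative, eps - x0, is the regression target (assumed form).
    """
    eps = rng.standard_normal(x0.shape)
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast over non-batch axes
    z_t = (1.0 - t) * x0 + t * eps
    target = eps - x0
    return z_t, target

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 16, 4, 4))  # toy batch of 16-channel latents
t = sample_logit_normal(rng, 8)          # loc=0, scale=1
z_t, target = rectified_flow_pair(rng, x0, t)
```

With loc=0 and scale=1 the samples concentrate around t = 0.5, so intermediate noise levels are visited more often than under uniform sampling; shifting loc moves that emphasis toward the data or the noise end.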

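Experimental Results

Research Questions

RQ1: Does biasing timestep sampling toward intermediate, perceptually relevant scales improve rectified flow performance in high-resolution image synthesis compared with conventional diffusion formulations?
RQ2: Does a multimodal diffusion backbone with separate image and text token streams (MM-DiT) outperform conventional backbones (DiT, CrossDiT, UViT) on text-to-image generation?
RQ3: How do scaling trends manifest in rectified-flow-based models, and how does lower validation loss translate into better text-to-image performance across automatic and human evaluations?
RQ4: How do data preprocessing and caption augmentation (synthetic plus original captions) affect GenEval-style metrics for large-scale T2I models?
RQ5: Which training stabilization techniques (QK-normalization, mixed-precision finetuning, positional encodings adapted to varying aspect ratios) are essential for high-resolution finetuning?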

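Key Findings

Noise sampling strategies that emphasize intermediate timesteps (for example rf/lognorm(0.00, 1.00)) perform well on CLIP and FID metrics and often match or outperform state-of-the-art diffusion formulations.
Rectified flow variants with targeted timestep sampling outperform the LDM-Linear and EDM baselines in several settings, especially at low sampling step counts.
The MM-DiT multimodal backbone, which uses separate weights for the text and image modalities, clearly outperforms vanilla DiT, CrossDiT, and UViT on CC12M in validation loss, CLIP score, and FID.
Increasing the autoencoder's latent channels to d=16 improves reconstruction metrics and supports better scaling; higher capacity correlates with better image quality.
A 50/50 mix of synthetic (CogVLM-generated) and original captions raises GenEval scores, indicating that synthetic captions can effectively augment the training data.
QK-normalization stabilizes training, enabling stable finetuning in bf16 mixed precision and facilitating high-resolution scaling with better-behaved attention.
Scaling experiments up to 8B parameters show that lower validation loss correlates with better text-to-image performance across automatic and human evaluations.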
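The MM-DiT and QK-normalization findings can be illustrated with a small NumPy sketch of one attention step. This is an assumption-laden toy, not the released architecture: it uses a single head, random weights, and an RMS normalization without a learned gain. The names joint_attention, rms_norm, and make_weights are hypothetical.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Scale each token to unit root-mean-square (no learned gain in this sketch)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def joint_attention(img, txt, w_img, w_txt):
    """Single-head attention over the concatenated image and text tokens.

    img: (n_img, d), txt: (n_txt, d); w_img and w_txt are dicts holding
    'q', 'k', 'v' matrices of shape (d, d), one set per modality.
    """
    q = np.concatenate([img @ w_img["q"], txt @ w_txt["q"]])
    k = np.concatenate([img @ w_img["k"], txt @ w_txt["k"]])
    v = np.concatenate([img @ w_img["v"], txt @ w_txt["v"]])
    q, k = rms_norm(q), rms_norm(k)  # QK-normalization keeps the logits bounded
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    out = attn @ v
    return out[: len(img)], out[len(img):], attn

rng = np.random.default_rng(0)
d, n_img, n_txt = 8, 6, 3

def make_weights():
    # Independent projection matrices for one modality.
    return {name: rng.standard_normal((d, d)) / np.sqrt(d) for name in "qkv"}

img = rng.standard_normal((n_img, d))
txt = rng.standard_normal((n_txt, d))
img_out, txt_out, attn = joint_attention(img, txt, make_weights(), make_weights())
```

Each modality keeps its own projections, yet every image token can attend to every text token and vice versa because the softmax runs over the joined sequence. Normalizing q and k caps the magnitude of the attention logits at roughly sqrt(d), which is the property that helps in low-precision training.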

This review was generated by AI and reviewed by human editors.