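[Paper Explained] Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Transfusion trains a single Transformer that handles both discrete text and continuous image data by jointly optimizing next-token prediction on text and diffusion on images. Compared with baselines that discretize images, it scales better across modalities and is more efficient.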
We introduce Transfusion, a recipe for training a multi-modal model over discrete and continuous data. Transfusion combines the language modeling loss function (next token prediction) with diffusion to train a single transformer over mixed-modality sequences. We pretrain multiple Transfusion models up to 7B parameters from scratch on a mixture of text and image data, establishing scaling laws with respect to a variety of uni- and cross-modal benchmarks. Our experiments show that Transfusion scales significantly better than quantizing images and training a language model over discrete image tokens. By introducing modality-specific encoding and decoding layers, we can further improve the performance of Transfusion models, and even compress each image to just 16 patches. We further demonstrate that scaling our Transfusion recipe to 7B parameters and 2T multi-modal tokens produces a model that can generate images and text on a par with similar scale diffusion models and language models, reaping the benefits of both worlds.
Research Motivation and Objectives
- Motivate a unified model that can process and generate both discrete (text) and continuous (image) modalities without information loss.
- Demonstrate that combining language modeling loss with diffusion loss in a single Transformer scales better than discretizing images.
- Show that modality-specific encoding/decoding layers and image compression via patches can improve performance and efficiency.
- Provide scaling laws and ablations to identify key components that drive multi-modal performance.
Proposed Method
- Represent text as discrete tokens and images as latent patches from a VAE.
- Train a single Transformer with two losses: the LM loss for text and the DDPM diffusion loss for image patches, combined as L_Transfusion = L_LM + λ·L_DDPM.
- Use modality-specific embedding/decoding layers, with either a linear encoder/decoder or U-Net blocks for images.
- Apply causal attention across the sequence with intra-image bidirectional attention among patches to enable patch-to-patch communication.
- During inference, switch between text generation (LM mode) and image diffusion (diffusion mode) when BOI/EOI tokens are encountered.
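The attention pattern and the combined objective described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `(start, end)` span convention (covering only patch positions, with BOI/EOI treated as ordinary text positions), the toy inputs, and the value of `lam` are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def transfusion_attention_mask(
    seq_len: int, image_spans: Sequence[Tuple[int, int]]
) -> List[List[bool]]:
    """Return mask where mask[i][j] is True if position i may attend to j.

    image_spans holds hypothetical (start, end) index pairs, end exclusive,
    one pair per image in the sequence.
    """
    # Causal baseline: every position sees itself and all earlier positions.
    mask = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    # Inside each image, patches additionally see later patches of that image.
    for start, end in image_spans:
        for i in range(start, end):
            for j in range(start, end):
                mask[i][j] = True
    return mask


def lm_loss(target_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the correct next tokens."""
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


def ddpm_loss(pred_noise: Sequence[float], true_noise: Sequence[float]) -> float:
    """Mean squared error between predicted and actual added noise."""
    squared = [(p - t) ** 2 for p, t in zip(pred_noise, true_noise)]
    return sum(squared) / len(squared)


def transfusion_loss(lm: float, ddpm: float, lam: float) -> float:
    """Combine the two objectives: L_LM + lam * L_DDPM."""
    return lm + lam * ddpm


# Toy sequence: 2 text tokens, 3 image patches (positions 2..4), 1 text token.
mask = transfusion_attention_mask(6, [(2, 5)])

# Toy loss values; lam=2.0 is an arbitrary illustrative weight.
total = transfusion_loss(
    lm_loss([0.5, 0.5]),
    ddpm_loss([1.0, 0.0], [0.0, 0.0]),
    lam=2.0,
)
```

In this toy mask, patch 2 can attend to patch 4 even though it comes later, while the text token at position 1 still cannot see any image patch that follows it, which is the mixed causal and bidirectional behavior the bullet describes.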

Experimental Results
Research Questions
- RQ1: Can a single Transformer learn to model and generate both text and images without discrete quantization of images?
- RQ2: How do LM and diffusion objectives interact in a unified multi-modal model, and what are the scaling properties across model sizes?
- RQ3: What architectural choices (patch encoding, intra-image attention, image noising) most strongly impact multi-modal performance?
- RQ4: How does Transfusion compare to Chameleon-style discretization baselines in terms of efficiency and quality across text and image tasks?
Key Findings
| Model | C4 PPL | Wiki PPL | Llama Eval Acc | MS-COCO CIDEr | MS-COCO FID | CLIP |
|---|---|---|---|---|---|---|
| Transfusion (7B) | 7.72 | 4.28 | 61.5 | 27.2 | 16.8 | 25.5 |
| Chameleon (7B) | 8.41 | 4.69 | 59.1 | 18.0 | 29.6 | 24.3 |
- Transfusion models scale better than Chameleon across text-only and image-related tasks at comparable data and compute.
- For text-to-image generation, Transfusion achieves parity with Chameleon at about 1/3 the compute and lower FID by roughly 2× when FLOPs are controlled.
- In image-to-text and text-to-text tasks, Transfusion matches the baselines' performance with substantially fewer FLOPs (e.g., 21.8% of the FLOPs for image-to-text captioning).
- Ablations show intra-image bidirectional attention is beneficial, and U-Net down/up blocks for image encoding/decoding enable larger image patch compression with modest loss.
- Scaling up to 7B parameters with 2T multi-modal tokens yields image and text generation capabilities on par with contemporary diffusion and language models at similar scale.

This explainer was generated by AI and reviewed by human editors.