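[Paper Explained] Making LLaMA SEE and Draw with SEED Tokenizer
SEED introduces a discrete image tokenizer with 1D causal tokens that lets an LLM both see and draw. The result is SEED-LLaMA, a multimodal large language model that is pretrained and instruction-tuned on interleaved visual and textual data with a single, unified next-word prediction objective.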
The great success of Large Language Models (LLMs) has expanded the potential of multimodality, contributing to the gradual evolution of General Artificial Intelligence (AGI). A true AGI agent should not only possess the capability to perform predefined multi-tasks but also exhibit emergent abilities in an open-world context. However, despite the considerable advancements made by recent multimodal LLMs, they still fall short in effectively unifying comprehension and generation tasks, let alone open-world emergent abilities. We contend that the key to overcoming the present impasse lies in enabling text and images to be represented and processed interchangeably within a unified autoregressive Transformer. To this end, we introduce SEED, an elaborate image tokenizer that empowers LLMs with the ability to SEE and Draw at the same time. We identify two crucial design principles: (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and be optimized for both discriminativeness and reconstruction during the tokenizer training phase. With SEED tokens, LLM is able to perform scalable multimodal autoregression under its original training recipe, i.e., next-word prediction. SEED-LLaMA is therefore produced by large-scale pretraining and instruction tuning on the interleaved textual and visual data, demonstrating impressive performance on a broad range of multimodal comprehension and generation tasks. More importantly, SEED-LLaMA has exhibited compositional emergent abilities such as multi-turn in-context multimodal generation, acting like your AI assistant.
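Research Motivation and Objectives
- Unify the representation of text and images so that both can be processed interchangeably within a single autoregressive Transformer.
- Design a visual tokenizer whose tokens are 1D causal, closely aligned with words in their level of semantic abstraction, and suitable as training targets for a large language model.
- Enable scalable multimodal pretraining and instruction tuning by adding discrete image tokens to the vocabulary of an existing LLM.
- Demonstrate emergent multimodal abilities, including multi-turn image-text generation and compositional image generation.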
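The "1D causal" property in the second goal means that each image token may depend only on the tokens before it, the same left-to-right constraint an LLM applies to words. A minimal sketch of that constraint as an attention mask follows; the function name `causal_mask` and the sequence length of 4 are illustrative choices, not details from the paper.

```python
def causal_mask(n):
    """Return an n x n mask where mask[i][j] is True if position i may attend to position j.

    A position may only look at itself and earlier positions (j <= i).
    """
    return [[j <= i for j in range(n)] for i in range(n)]


mask = causal_mask(4)
for row in mask:
    # 'x' = allowed, '.' = blocked
    print("".join("x" if allowed else "." for allowed in row))
# prints:
# x...
# xx..
# xxx.
# xxxx
```

A tokenizer tied to 2D patch positions has no such ordering, which is why the paper argues that patch-based tokens fit next-word prediction less naturally.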
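Proposed Method
- Propose SEED, a VQ-based image tokenizer consisting of a ViT encoder, a Causal Q-Former, a VQ codebook, an MLP, and a UNet decoder.
- Train the Causal Q-Former with contrastive learning against the paired text captions, so that it converts 2D ViT features into a sequence of 1D causal embeddings.
- Discretize the causal embeddings with the VQ codebook to produce 32 causal visual codes, then de-tokenize them with an MLP into an embedding aligned with the unCLIP-SD latent space for image generation.
- Pretrain SEED-LLaMA with a unified next-word prediction objective on image-text pairs, video-text pairs, and interleaved image-text data.
- Apply multimodal instruction tuning through supervised fine-tuning (LoRA-based first, then full fine-tuning) to align SEED-LLaMA with human instructions.
- Evaluate on multimodal comprehension and generation tasks, including image captioning, VQA, video question answering, and text-to-image generation, and demonstrate multi-turn in-context multimodal generation qualitatively.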
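The quantization and vocabulary-extension steps above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the codebook size, embedding width, text vocabulary size, the begin/end-of-image token ids, and the random vectors standing in for the Causal Q-Former output are all hypothetical. Only the count of 32 visual codes per image comes from the summary.

```python
import random

# Illustrative sizes (assumptions), except NUM_IMAGE_TOKENS, which the summary states.
TEXT_VOCAB_SIZE = 32000   # assumed size of the base LLM text vocabulary
CODEBOOK_SIZE = 64        # toy codebook size (hypothetical)
EMBED_DIM = 8             # toy embedding width (hypothetical)
NUM_IMAGE_TOKENS = 32     # causal visual codes per image

# Hypothetical special tokens placed after the visual codes in the vocabulary.
BOI_ID = TEXT_VOCAB_SIZE + CODEBOOK_SIZE      # begin-of-image
EOI_ID = TEXT_VOCAB_SIZE + CODEBOOK_SIZE + 1  # end-of-image


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embeddings, codebook):
    """Map each embedding to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(e, codebook[k]))
        for e in embeddings
    ]


def image_codes_to_token_ids(codes):
    """Shift codebook indices past the text vocabulary and wrap them in image markers."""
    return [BOI_ID] + [TEXT_VOCAB_SIZE + c for c in codes] + [EOI_ID]


def next_token_pairs(sequence):
    """Build (context, target) pairs for next-token prediction over a mixed sequence."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


rng = random.Random(0)
codebook = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(CODEBOOK_SIZE)]
# Stand-in for the Causal Q-Former output: 32 embeddings for one image.
causal_embeddings = [
    [rng.gauss(0, 1) for _ in range(EMBED_DIM)] for _ in range(NUM_IMAGE_TOKENS)
]

codes = quantize(causal_embeddings, codebook)
text_ids = [101, 2057, 345]  # placeholder text token ids
sequence = text_ids + image_codes_to_token_ids(codes)
pairs = next_token_pairs(sequence)
```

Because image codes and text tokens end up as ids in one shared vocabulary, the same next-token objective covers both comprehension (predict text after an image) and generation (predict image codes after text).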

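Experimental Results
Research Questions
- RQ1: Can a 1D causal discrete image tokenizer align semantically with word tokens and thereby enable a unified autoregressive multimodal model?
- RQ2: Does SEED enable scalable multimodal pretraining and instruction tuning under the original next-word prediction objective?
- RQ3: Which emergent multimodal abilities (such as multi-turn in-context generation and compositional image generation) does SEED-LLaMA exhibit?
- RQ4: How does SEED perform on visual comprehension and generation benchmarks compared with existing multimodal LLM approaches?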
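Key Findings
- The SEED tokenizer produces discrete causal visual codes that achieve competitive image-text retrieval performance and capture high-level semantics.
- Images reconstructed from SEED tokens with a frozen SD-UNet stay semantically consistent with the input image (CLIP-based similarity is close to the upper bound set by unCLIP-SD).
- SEED-LLaMA is competitive on multimodal comprehension and generation across image, video, and text tasks, and demonstrates multi-turn in-context multimodal generation.
- Instruction tuning and larger model size both improve performance on SEED-Bench and related benchmarks.
- SEED supports compositional zero-shot image generation, including stylized image generation, image blending, and multimodal composition, all of which can be guided by instructions.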

This explainer was generated by AI and reviewed by a human editor.