QUICK REVIEW

[Paper Summary] Generating Images with Multimodal Language Models

Jing Yu Koh, Daniel Fried|arXiv (Cornell University)|May 26, 2023

Multimodal Machine Learning Applications | Cited by 39

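One-Sentence Summary

GILL fuses a frozen text-only LLM with pre-trained image encoder and decoder models through mapping networks, so that it can process interleaved image-text inputs and perform text generation, image retrieval, and novel image generation in multimodal dialogue.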

ABSTRACT

We propose a method to fuse frozen text-only large language models (LLMs) with pre-trained image encoder and decoder models, by mapping between their embedding spaces. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue. Ours is the first approach capable of conditioning on arbitrarily interleaved image and text inputs to generate coherent image (and text) outputs. To achieve strong performance on image generation, we propose an efficient mapping network to ground the LLM to an off-the-shelf text-to-image generation model. This mapping network translates hidden representations of text into the embedding space of the visual models, enabling us to leverage the strong text representations of the LLM for visual outputs. Our approach outperforms baseline generation models on tasks with longer and more complex language. In addition to novel image generation, our model is also capable of image retrieval from a prespecified dataset, and decides whether to retrieve or generate at inference time. This is done with a learnt decision module which conditions on the hidden representations of the LLM. Our model exhibits a wider range of capabilities compared to prior multimodal language models. It can process image-and-text inputs, and produce retrieved images, generated images, and generated text -- outperforming non-LLM based generation models across several text-to-image tasks that measure context dependence.

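Research Motivation and Goals

Use a frozen text-only LLM for multimodal tasks without retraining the LLM.
Map LLM embeddings into the input space of a pre-trained image generator to synthesize novel images.
Handle arbitrarily interleaved image and text inputs and produce coherent multimodal outputs.
Develop a decision mechanism that chooses between image retrieval and image generation at inference time.
Demonstrate improved performance on long language contexts and in multimodal dialogue.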

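Proposed Method

Introduce GILL, a framework that keeps the LLM and the image generator frozen and trains only a small number of adapters and mappings.
Learn a captioning mapping W_cap that projects image features into the LLM embedding space for caption generation.
Add multiple [IMG] tokens, with their own embedding matrix E_img, to represent visual outputs within the LLM.
Develop GILLMapper, a lightweight Transformer that distills the LLM's [IMG] outputs into the input space of the image generator (Stable Diffusion).
Train the retrieval path with an InfoNCE loss, using the linear projections W_t2i and W_i2t to align images and captions.
Train a decision model that chooses between retrieval and generation based on the LLM hidden states (trained after the other components have converged).
Optimize a joint multi-task loss that combines the captioning, [IMG] token prediction, generation, and retrieval losses (l_c, l_p, l_g, l_r).
Train on CC3M with two examples packed per sequence; the LLM backbone is OPT-6.7B, the visual backbone is CLIP ViT-L, and the generation backbone is Stable Diffusion v1.5; 50M trainable parameters; two GPUs; 2 days of training.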
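
The retrieval objective above can be illustrated with a small, dependency-free sketch of a symmetric InfoNCE loss. This is not the authors' code: it assumes the text-side and image-side embeddings have already been projected into a shared space (the role the summary assigns to W_t2i and W_i2t), and the function names, the temperature value, and the choice to sum the two directions are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def log_softmax_at(logits, target):
    # Log-probability of index `target` under a softmax over `logits`
    # (max-subtraction keeps the exponentials numerically stable).
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return logits[target] - log_z

def info_nce(text_embs, image_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched (text, image) pairs.

    Pair i is the positive for row i; all other items in the batch act
    as negatives. The temperature is a hypothetical default.
    """
    t = [normalize(v) for v in text_embs]
    im = [normalize(v) for v in image_embs]
    n = len(t)
    sim = [[dot(t[i], im[j]) / temperature for j in range(n)] for i in range(n)]
    # Text-to-image: each text should pick out its own image.
    loss_t2i = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Image-to-text: each image should pick out its own caption.
    loss_i2t = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return loss_t2i + loss_i2t

# Toy batch: aligned pairs should yield a lower loss than swapped pairs.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]
swapped_images = [[0.0, 1.0], [1.0, 0.0]]
aligned_loss = info_nce(texts, aligned_images)
swapped_loss = info_nce(texts, swapped_images)
```

In this toy batch the loss is close to zero when each caption sits next to its own image and grows large when the pairs are swapped, which is the behavior the contrastive objective is meant to reward.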

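Experimental Results

Research Questions

RQ1: Can a frozen text-only LLM be effectively grounded to an image generator so that it produces novel images conditioned on interleaved image-text prompts?
RQ2: Can images be retrieved or generated, and interleaved with text outputs, in a coherent multimodal dialogue while using a minimal number of trainable components?
RQ3: Compared with standard text-to-image models, does grounding through GILLMapper improve image generation in long-context or multimodal settings?
RQ4: How do contextual multimodal inputs affect the decision between retrieval and generation?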

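Key Findings

Model	CLIP similarity, higher is better (1 caption)	CLIP similarity (5 captions)	CLIP similarity (5 captions, 4 images)	LPIPS, lower is better (1 caption)	LPIPS (5 captions)	LPIPS (5 captions, 4 images)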
GLIDE	0.582	0.591	-	0.753	0.745	-
Stable Diffusion	0.592 ±0.0007	0.598 ±0.0006	-	0.703 ±0.0003	0.704 ±0.0004	-
GILL (ours)	0.581 ±0.0005	0.612 ±0.0011	0.641 ±0.0011	0.702 ±0.0004	0.696 ±0.0008	0.693 ±0.0008
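
For readers unfamiliar with the CLIP similarity columns, the sketch below shows the underlying arithmetic, assuming the metric is a cosine similarity between the CLIP embedding of a generated image and that of its reference image. The vectors are made-up three-dimensional stand-ins; real CLIP ViT-L embeddings are much larger and would come from the CLIP image encoder, which is not invoked here.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings standing in for CLIP image features.
generated_embedding = [0.2, 0.9, 0.1]
reference_embedding = [0.1, 0.8, 0.3]
score = cosine_similarity(generated_embedding, reference_embedding)
```

A score near 1 means the two embeddings point in nearly the same direction. LPIPS, by contrast, is a learned perceptual distance, so lower values indicate a closer match.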

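GILL can perform text generation, image retrieval, and novel image generation conditioned on interleaved image-text inputs.
On the VIST dataset, GILL outperforms Stable Diffusion on CLIP similarity and LPIPS as the multimodal context grows longer, especially when the full multimodal context is provided.
On VisDial, GILL's image generation quality improves as the number of dialogue rounds increases, and it surpasses the text-only baseline at longer contexts.
GILLMapper significantly outperforms the baselines (linear / MLP / 4-layer encoder) at mapping LLM embeddings into the image generator's input space, achieving better FID and CLIP-based metrics.
The retrieve-or-generate decision model can be learned after the main training, achieving competitive retrieval performance while enabling generation when it is needed.
Using r = 4 [IMG] tokens offers a balance between generation quality and efficiency; increasing r improves performance until it plateaus.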
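
The retrieve-or-generate decision mentioned above can be pictured as a small classifier on top of features derived from the LLM. The sketch below is a hypothetical illustration only: the summary says just that the decision module conditions on LLM hidden states, so the two-element feature vector, the weights, the bias, and the 0.5 threshold are all invented for the example and are not learned values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def choose_action(features, weights, bias, threshold=0.5):
    """Pick "generate" or "retrieve" with a linear classifier.

    `features` stands in for a vector derived from LLM hidden states;
    `weights` and `bias` would normally be learned, not hand-set.
    """
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    p_generate = sigmoid(logit)
    action = "generate" if p_generate >= threshold else "retrieve"
    return action, p_generate

# Hand-set toy parameters, purely illustrative.
toy_weights = [1.0, -2.0]
toy_bias = 0.0

action_a, prob_a = choose_action([0.5, 0.9], toy_weights, toy_bias)
action_b, prob_b = choose_action([2.0, 0.1], toy_weights, toy_bias)
```

With these toy numbers the first input yields a negative logit and falls on the retrieval side, while the second yields a positive logit and falls on the generation side.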


This summary was generated by AI and reviewed by a human editor.