QUICK REVIEW

[Paper Digest] Language Models Can See: Plugging Visual Controls in Text Generation

Yixuan Su, Lü Tian|arXiv (Cornell University)|May 5, 2022

Multimodal Machine Learning Applications|Cited by 38

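One-Sentence Summary

MAGIC is a training-free decoding scheme that uses CLIP to apply visual control to GPT-2 text generation, enabling zero-shot image captioning and visually grounded story generation with state-of-the-art zero-shot captioning results and a roughly 27x decoding speedup over the prior state-of-the-art method.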

ABSTRACT

Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with remarkable quality. While they are designed for text-prompted generation, it remains an open question how the generation process could be guided by modalities beyond text such as images. In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP), for plugging in visual controls in the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC is a simple yet efficient plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation. During decoding, MAGIC influences the generation of the LM by introducing a CLIP-induced score, called magic score, which regularizes the generated result to be semantically related to a given image while being coherent to the previously generated context. Notably, the proposed decoding scheme does not involve any gradient update operation, therefore being computationally efficient. On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup. MAGIC is a flexible framework and is theoretically compatible with any text generation tasks that incorporate image grounding. In the experiments, we showcase that it is also capable of performing visually grounded story generation given both an image and a text prompt.

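Motivation and Objectives

Explain how modalities beyond text, images in particular, can guide language model generation.
Propose a training-free decoding framework (MAGIC) that grounds text generation in visual content.
Demonstrate zero-shot performance on image captioning and visually grounded story generation.
Show that MAGIC outperforms the baselines and decodes substantially faster than gradient-based methods.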

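Proposed Method

Introduces MAGIC Search, a decoding scheme that adds a CLIP-induced magic score to guide token selection.
Defines the magic score as a softmax over the CLIP image-text similarities of the top-k candidate tokens (Eq. 5).
Combines model confidence and a degeneration penalty with the magic score in the token-selection objective (Eq. 4).
Fine-tunes the language model on the task-specific text corpus with an added contrastive learning objective (L_MLE + L_CL) to calibrate its token representations.
Requires no gradient updates during decoding, which makes training-free multimodal generation more efficient.
Is compatible, in principle, with any text generation task that involves image grounding.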
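
The token-selection rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the precomputed `clip_scores` list (standing in for CLIP similarities between the image and each candidate continuation), the (1 - alpha) weighting on model confidence, and the `alpha`/`beta` values are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def magic_search_step(candidates, prev_hidden, clip_scores, alpha=0.1, beta=2.0):
    """Pick the next token among the top-k candidates.

    candidates:  list of (token, lm_probability, hidden_vector), one per top-k token
    prev_hidden: hidden vectors of the tokens generated so far
    clip_scores: hypothetical precomputed image-text similarities, one per
                 candidate (in practice these would come from CLIP)
    alpha, beta: illustrative weights for the degeneration penalty and magic score
    """
    # Magic score: softmax of the image-text similarities over the top-k set.
    magic = softmax(clip_scores)
    best_token, best_score = None, -math.inf
    for (token, prob, hidden), f_magic in zip(candidates, magic):
        # Degeneration penalty: highest similarity to any earlier token state.
        penalty = max((cosine(hidden, h) for h in prev_hidden), default=0.0)
        score = (1 - alpha) * prob - alpha * penalty + beta * f_magic
        if score > best_score:
            best_token, best_score = token, score
    return best_token, best_score


# Toy example: the LM prefers "dog", but the image matches "cat" better.
candidates = [("dog", 0.6, [1.0, 0.0]), ("cat", 0.3, [0.0, 1.0])]
prev_hidden = [[0.6, 0.8]]
clip_scores = [0.0, 2.0]
print(magic_search_step(candidates, prev_hidden, clip_scores, beta=0.0)[0])
print(magic_search_step(candidates, prev_hidden, clip_scores, beta=2.0)[0])
```

With `beta=0.0` the visual term is switched off and the language model's preferred token wins; with `beta=2.0` the magic score outweighs it and the image-consistent token is chosen. Because the step only evaluates scores for k candidates, no gradients are computed at any point.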

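Experimental Results

Research Questions

RQ1: Can a training-free decoding strategy effectively introduce visual grounding into a pretrained language model?
RQ2: How does CLIP-grounded decoding affect zero-shot image captioning quality and speed compared with gradient-based methods?
RQ3: Can MAGIC handle multimodal generation tasks beyond captioning, such as visually grounded story generation?

Key Findings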

Model	MS-COCO B@1	MS-COCO B@4	MS-COCO M	MS-COCO R-L	MS-COCO CIDEr	MS-COCO SPICE	Flickr30k B@1	Flickr30k B@4	Flickr30k M	Flickr30k R-L	Flickr30k CIDEr	Flickr30k SPICE	Speed
Supervised Approach	77.2	36.2	27.0	56.4	113.5	20.3	-	27.3	21.7	-	56.6	16.0	-
GVD	-	-	-	-	-	-	66.9	27.3	22.5	-	62.3	16.5	-
UniVLP	-	36.5	28.4	-	116.9	21.2	-	30.1	23.0	-	67.4	17.0	-
ClipCap	-	33.5	27.5	-	113.1	21.1	-	-	-	-	-	-	-
Oscar	-	36.5	30.3	-	123.7	23.1	-	-	-	-	-	-	-
LEMON	-	40.3	30.2	-	133.3	23.3	-	-	-	-	-	-	-
Weakly Supervised Approach - UIC	41.0	5.6	12.4	28.7	28.6	8.1	-	-	-	-	-	-	-
IC-SME	-	6.5	12.9	35.1	22.7	-	-	7.9	13.0	32.8	9.9	-	-
S2S-SS	49.5	6.3	14.0	34.5	31.9	8.6	-	-	-	-	-	-	-
S2S-GCC	50.4	7.6	13.5	37.3	31.8	8.4	-	-	-	-	-	-	-
Unsupervised - Top-k	33.6	2.4	8.3	25.6	3.8	1.7	34.0	2.9	9.0	24.4	3.3	2.7	69.9x
Unsupervised - Nucleus	32.6	2.3	7.8	24.8	3.1	1.4	32.6	2.4	8.1	23.4	2.5	2.4	72.5x
Unsupervised - Contrastive	39.5	3.0	10.8	30.8	7.7	2.9	37.6	4.3	9.8	25.7	8.9	4.6	1.0x
CLIPRe	39.5	4.9	11.4	29.0	13.6	5.3	38.5	5.2	11.6	27.6	10.0	5.7	-
ZeroCap	49.8	7.0	15.4	31.8	34.5	9.2	44.7	5.4	11.8	27.3	16.8	6.2	1.0x
MAGIC	56.8	12.9	17.4	39.9	49.3	11.3	44.5	6.4	13.1	31.6	20.4	7.1	26.6x
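
To make the gap between the last two rows of the table concrete, the relative change of MAGIC over ZeroCap can be computed directly from the listed scores. A small sketch; the dictionary layout and the `relative_gain` helper are illustrative, and only four of the twelve metrics are copied from the table:

```python
# Scores copied from the ZeroCap and MAGIC rows of the table above.
zerocap = {
    "MS-COCO B@4": 7.0,
    "MS-COCO CIDEr": 34.5,
    "Flickr30k B@1": 44.7,
    "Flickr30k CIDEr": 16.8,
}
magic = {
    "MS-COCO B@4": 12.9,
    "MS-COCO CIDEr": 49.3,
    "Flickr30k B@1": 44.5,
    "Flickr30k CIDEr": 20.4,
}


def relative_gain(new, old):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


gains = {m: round(relative_gain(magic[m], zerocap[m]), 1) for m in magic}
for metric, gain in gains.items():
    print(f"{metric}: {gain:+.1f}%")
```

A negative value marks a metric where ZeroCap remains ahead, which in this selection happens only for Flickr30k B@1.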

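MAGIC achieves state-of-the-art zero-shot image captioning results on MS-COCO and Flickr30k across nearly all reported metrics; the one exception in the table is Flickr30k B@1, where ZeroCap scores 44.7 against MAGIC's 44.5.
MAGIC decodes about 27x faster than the gradient-based ZeroCap method (26.6x vs. 1.0x in the table).
MAGIC is robust across domains, outperforming the baselines in cross-domain evaluation.
MAGIC extends to visually grounded story generation, scoring higher than the baselines in both automatic and human evaluation.
Decoding remains training-free; the only training involved is a brief, lightweight fine-tuning step on the task's text corpus.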
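
The fine-tuning objective listed under the proposed method (L_MLE + L_CL) pairs the usual maximum-likelihood loss with a contrastive term over token representations. A minimal sketch, assuming the margin-based form commonly used alongside contrastive search, cosine similarity as the similarity function, and an illustrative margin of 0.5; the function names are hypothetical, and real training would operate on LM hidden states rather than toy vectors:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mle_loss(token_probs):
    """Average negative log-likelihood of the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def contrastive_loss(hidden, margin=0.5):
    """Margin loss that pushes representations of distinct tokens apart.

    hidden: one vector per token in the sequence
    margin: illustrative value, an assumption for this sketch
    """
    n = len(hidden)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                # s(h_i, h_i) is 1 for cosine similarity; kept explicit for clarity.
                self_sim = cosine(hidden[i], hidden[i])
                total += max(0.0, margin - self_sim + cosine(hidden[i], hidden[j]))
    return total / (n * (n - 1))


def total_loss(token_probs, hidden, margin=0.5):
    """Combined objective: L_MLE + L_CL."""
    return mle_loss(token_probs) + contrastive_loss(hidden, margin)
```

The contrastive term is zero once every pair of distinct token representations is at least `margin` less similar than a token is to itself, so it only penalizes representations that crowd together. That separation is what makes the degeneration penalty at decoding time informative.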

This digest was generated by AI and reviewed by human editors.