QUICK REVIEW

[Paper Digest] Grounding Language Models to Images for Multimodal Inputs and Outputs

Jing Yu Koh, Ruslan Salakhutdinov | arXiv (Cornell University) | Jan 31, 2023

Multimodal Machine Learning Applications | Cited by 25

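ONE-SENTENCE SUMMARY

FROMAGe grounds a frozen, text-only LLM to the visual domain using linear mappings and a retrieval token, which enables interleaved image-and-text inputs and outputs with strong zero-shot multimodal capabilities.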

ABSTRACT

We propose an efficient method to ground pretrained text-only language models to the visual domain, enabling them to process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images. Our method leverages the abilities of language models learnt from large scale text-only pretraining, such as in-context learning and free-form text generation. We keep the language model frozen, and finetune input and output linear layers to enable cross-modality interactions. This allows our model to process arbitrarily interleaved image-and-text inputs, and generate free-form text interleaved with retrieved images. We achieve strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue, and showcase compelling interactive abilities. Our approach works with any off-the-shelf language model and paves the way towards an effective, general solution for leveraging pretrained language models in visually grounded settings.

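Research Motivation and Goals

Use large-scale text-only LLMs for multimodal understanding without fully finetuning the model.
Process arbitrarily interleaved image and text inputs.
Generate free-form text interleaved with retrieved images.
Achieve strong zero-shot performance on grounded tasks and multimodal dialogue.
Provide a model-agnostic approach that can scale as larger LLMs become available.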

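Proposed Method

Keep the language model and the visual encoder frozen during training.
Learn lightweight translation layers that map the image embedding space into the text embedding space, and vice versa.
Introduce a [RET] token and train its embedding to support text-to-image retrieval.
Train with a multi-task objective: image captioning, and image-text retrieval via contrastive learning.
Use linear mappings to project visual embeddings into the text space (and back) to enable cross-modality interactions.
Train on Conceptual Captions (CC3M) with a single 6.7B OPT backbone and a CLIP-based visual backbone.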
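
The linear mapping and the contrastive retrieval objective listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the digest only says "contrastive learning", so the symmetric InfoNCE form, the temperature value, and every function name here (linear_map, info_nce, symmetric_contrastive_loss) are assumptions chosen for illustration, and plain Python lists stand in for tensors.

```python
import math

def linear_map(x, W, b):
    """Project vector x with weight rows W and bias b (one trainable linear layer)."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]

def l2_normalize(v):
    """Scale v to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy where queries[i] should match keys[i] within the batch.

    The temperature is an assumed hyperparameter, not a value from the digest.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [dot(qi, kj) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)

def symmetric_contrastive_loss(ret_embs, image_embs, temperature=0.07):
    """Average of the text-to-image and image-to-text InfoNCE losses."""
    return 0.5 * (info_nce(ret_embs, image_embs, temperature)
                  + info_nce(image_embs, ret_embs, temperature))

if __name__ == "__main__":
    # Hypothetical 2-d features: [RET] hidden states and visual features,
    # each passed through its own (here identity) linear layer.
    identity, zero = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
    ret_states = [linear_map(h, identity, zero) for h in ([1.0, 0.1], [0.2, 1.0])]
    img_feats = [linear_map(v, identity, zero) for v in ([0.9, 0.0], [0.0, 1.1])]
    print(symmetric_contrastive_loss(ret_states, img_feats))
```

In this sketch only W and b (plus the [RET] embedding that produces the query states) would receive gradients; the language model and visual encoder that produce the input features stay frozen, as the list above describes.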

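Experimental Results

Research Questions

RQ1: Can a frozen language model be effectively grounded to the visual domain so that it can process interleaved image-text data?
RQ2: Does adding a dedicated [RET] token improve text-to-image retrieval in an autoregressive LLM?
RQ3: How does multimodal context (multiple captions and images) affect retrieval and generation performance?
RQ4: To what extent can a lightweight, model-agnostic grounding method preserve the LLM's in-context learning ability while also enabling multimodal outputs?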

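Key Findings

FROMAGe achieves strong zero-shot performance on contextual image retrieval and multimodal dialogue.
Training is efficient (about 1 GPU-day): 97% of the parameters stay frozen, and only the linear mappings and the [RET] token are updated.
Adding a [RET] token significantly improves retrieval performance (especially R@1 on VIST).
Multimodal context (captions and images) substantially improves retrieval compared with text-only context, which indicates that the model is sensitive to interleaved inputs.
FROMAGe outperforms CLIP in many multimodal-context settings and produces interleaved image-text outputs that some earlier models do not support.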
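
For readers unfamiliar with the R@1 metric mentioned above, the following is a small sketch of how Recall@k is commonly computed from a query-by-candidate similarity matrix. It assumes that the correct candidate for query i sits at index i; the function name recall_at_k and the toy scores are hypothetical and are not taken from the paper's evaluation code or results.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score of query i against candidate j, and the
    correct candidate for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

if __name__ == "__main__":
    # Toy scores for three queries against three candidate images.
    toy = [
        [0.9, 0.1, 0.0],
        [0.2, 0.3, 0.8],
        [0.1, 0.2, 0.7],
    ]
    print(recall_at_k(toy, 1), recall_at_k(toy, 2))
```

R@1 is the strictest setting: a query counts as a hit only when the correct image is the single top-ranked candidate.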


This digest was generated by AI and reviewed by human editors.