QUICK REVIEW

[Paper Review] Linearly Mapping from Image to Text Space

Jack Merullo, Louis Castricato|arXiv (Cornell University)|Sep 30, 2022

Multimodal Machine Learning Applications | Cited by 25

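One-Sentence Summary

This paper shows that a single linear projection is enough to map image representations into the input space of a frozen language model, which can then generate captions and answer questions, performing competitively on a range of vision-language tasks without fine-tuning either the LM or the image encoder. Performance depends on how much linguistic supervision the image encoder received during pretraining.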

ABSTRACT

The extent to which text-only language models (LMs) learn to represent features of the non-linguistic world is an open question. Prior work has shown that pretrained LMs can be taught to caption images when a vision model's parameters are optimized to encode images in the language space. We test a stronger hypothesis: that the conceptual representations learned by frozen text-only models and vision-only models are similar enough that this can be achieved with a linear map. We show that the image representations from vision models can be transferred as continuous prompts to frozen LMs by training only a single linear projection. Using these to prompt the LM achieves competitive performance on captioning and visual question answering tasks compared to models that tune both the image encoder and text decoder (such as the MAGMA model). We compare three image encoders with increasing amounts of linguistic supervision seen during pretraining: BEIT (no linguistic information), NF-ResNET (lexical category information), and CLIP (full natural language descriptions). We find that all three encoders perform equally well at transferring visual property information to the language model (e.g., whether an animal is large or small), but that image encoders pretrained with linguistic supervision more saliently encode category information (e.g., distinguishing hippo vs. elephant) and thus perform significantly better on benchmark language-and-vision tasks. Our results indicate that LMs encode conceptual information structurally similarly to vision-based models, even those that are solely trained on images. Code is available here: https://github.com/jmerullo/limber

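Motivation and Objectives

Test whether a frozen text-only LM can describe images when given linearly mapped image representations (soft prompts).
Study how image encoders pretrained with differing amounts of linguistic supervision transfer visual concepts to the LM.
Compare against jointly fine-tuned multimodal baselines to assess whether end-to-end fine-tuning is necessary.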

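Proposed Method

Train a single linear projection P from the image encoder's representation space to the LM's input space to create image prompts.
Keep the image encoder E and the language model LM frozen while training P with an image captioning objective.
Evaluate transfer by prompting the LM to generate captions and answer questions on VL tasks, without fine-tuning the LM or the encoder.
Test several encoders with differing amounts of linguistic supervision during pretraining: CLIP RN50x16, NF-ResNet50, and BEIT-Large (and variants).
Train on Conceptual Captions 3M; compare against the MAGMA baseline and NFRN50 variants.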
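
The setup above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the sizes, the random stand-ins for the frozen encoder output and LM embeddings, the linear `lm_head`, the mean pooling, and the single-token toy loss are all assumptions made for illustration. It shows only the two ideas in the list: P maps encoder features to prompt vectors that are prepended to the text embeddings, and P is the only parameter that receives gradient updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes; the real models are far larger.
K, D_IMG, D_LM, T, VOCAB = 4, 8, 16, 5, 10

# Frozen stand-ins (assumptions): encoder output for one image, caption token
# embeddings, and a linear "LM head". None of these are updated below.
image_feats = rng.normal(size=(K, D_IMG))
text_embeds = rng.normal(size=(T, D_LM))
lm_head = rng.normal(scale=0.1, size=(D_LM, VOCAB))
lm_head_before = lm_head.copy()

# The only trainable parameter: the linear projection P.
P = rng.normal(scale=0.02, size=(D_IMG, D_LM))


def image_prompt(feats, proj):
    """Map (K, D_IMG) encoder features to (K, D_LM) soft-prompt vectors."""
    return feats @ proj


def lm_inputs(feats, proj, text):
    """Prepend the projected image prompt to the text token embeddings."""
    return np.concatenate([image_prompt(feats, proj), text], axis=0)


def toy_loss_and_grad(feats, proj, target):
    """Toy cross-entropy for one target token, with the gradient w.r.t. P only."""
    pooled = feats.mean(axis=0)                 # (D_IMG,)
    logits = (pooled @ proj) @ lm_head          # (VOCAB,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    loss = -np.log(probs[target])
    dlogits = probs.copy()
    dlogits[target] -= 1.0                      # softmax minus one-hot
    grad_proj = np.outer(pooled, lm_head @ dlogits)  # (D_IMG, D_LM)
    return loss, grad_proj


target = 3  # hypothetical id of a caption token
loss_start, _ = toy_loss_and_grad(image_feats, P, target)
for _ in range(100):
    _, grad = toy_loss_and_grad(image_feats, P, target)
    P -= 0.1 * grad                             # update P; E and the LM stay fixed
loss_end, _ = toy_loss_and_grad(image_feats, P, target)

inputs = lm_inputs(image_feats, P, text_embeds)
print(inputs.shape)                             # (9, 16): K prompt rows + T text rows
print(loss_end < loss_start)
```

In a real pipeline the toy loss would be replaced by the captioning loss computed through the frozen LM, with gradients flowing back to P alone.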

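Experimental Results

Research Questions

RQ1: Can a linear mapping between an image encoder and a frozen LM support accurate image captioning and VQA without updating the weights of either the LM or the encoder?
RQ2: How does the amount of linguistic supervision in image encoder pretraining affect transfer to the LM?
RQ3: Are the vision and language representation spaces similar enough to enable effective zero-shot transfer across encoders?
RQ4: What errors and limitations arise when visual information from different encoders is passed to the LM through linear prompts?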

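Key Findings

A linear projection can pass visual information to a frozen LM, which then generates captions and answers questions with competitive performance.
Performance tracks the linguistic supervision of the image encoder: CLIP and NF-ResNet50 outperform BEIT on many VL tasks, although BEIT still conveys coarse perceptual information.
Compared with training only the linear projection (the LiMBeR baseline), fine-tuning the image encoder or the LM does not consistently help.
Encoders with linguistic supervision transfer lexical category concepts, whereas vision-only encoders mainly convey coarse perceptual information.
BEIT prompts tend to produce vaguer captions and struggle with precise lexical categorization, but they still convey perceptual similarity to the LM.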


This review was generated by AI and checked by a human editor.