QUICK REVIEW

[Paper Review] SimVLM: Simple Visual Language Model Pretraining with Weak Supervision

Zirui Wang, Jiahui Yu | arXiv (Cornell University) | Aug 24, 2021

Multimodal Machine Learning Applications | References: 54 | Citations: 342

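One-Sentence Summary

SimVLM is pretrained end-to-end on weakly aligned image-text data with a single Prefix Language Modeling (PrefixLM) objective, achieving state-of-the-art results on vision-language (VL) benchmarks and showing strong zero-shot transfer ability.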

ABSTRACT

With recent progress in joint modeling of visual and textual representations, Vision-Language Pretraining (VLP) has achieved impressive performance on many multimodal downstream tasks. However, the requirement for expensive annotations including clean image captions and regional labels limits the scalability of existing approaches, and complicates the pretraining procedure with the introduction of multiple dataset-specific objectives. In this work, we relax these constraints and present a minimalist pretraining framework, named Simple Visual Language Model (SimVLM). Unlike prior work, SimVLM reduces the training complexity by exploiting large-scale weak supervision, and is trained end-to-end with a single prefix language modeling objective. Without utilizing extra data or task-specific customization, the resulting model significantly outperforms previous pretraining methods and achieves new state-of-the-art results on a wide range of discriminative and generative vision-language benchmarks, including VQA (+3.74% vqa-score), NLVR2 (+1.17% accuracy), SNLI-VE (+1.37% accuracy) and image captioning tasks (+10.1% average CIDEr score). Furthermore, we demonstrate that SimVLM acquires strong generalization and transfer ability, enabling zero-shot behavior including open-ended visual question answering and cross-modality transfer.

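Motivation and Objectives

Motivate a simple, scalable vision-language pretraining framework that reduces reliance on expensive annotations and complex objectives.
Show that end-to-end prefix language modeling on raw images and text can match or surpass MLM-based VLP methods.
Demonstrate strong zero-shot generalization and cross-modal transfer under large-scale weak supervision.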

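Proposed Method

Use a Transformer backbone that processes raw image patches and text tokens, with no object detector.
Adopt Prefix Language Modeling (PrefixLM): the prefix is encoded bidirectionally, and the remaining tokens are generated autoregressively.
Pretrain from scratch on large-scale weakly aligned image-text data and text-only data, using a single LM loss.
Contextualize the image patches with a convolutional (Conv) stage before patch embedding, and apply 2D relative attention to the image tokens.
Fine-tune on standard VL benchmarks in a single-stage pretrain-then-finetune pipeline.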
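
The PrefixLM idea in the method list can be illustrated with a small attention-mask sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `prefix_lm_mask` and `suffix_nll` are hypothetical, the mask convention (True means "may attend") is an assumption, and the prefix is assumed to occupy the first `prefix_len` positions (in SimVLM this would presumably hold the image tokens and any leading text tokens). Prefix positions attend to each other bidirectionally, the remaining positions attend causally, and the language modeling loss is averaged over the non-prefix tokens only.

```python
import numpy as np


def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Boolean (seq_len, seq_len) mask; mask[i, j] is True when
    position i may attend to position j (assumed convention)."""
    if not 0 <= prefix_len <= seq_len:
        raise ValueError("prefix_len must lie in [0, seq_len]")
    # Causal baseline: each position sees itself and earlier positions.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Prefix positions additionally see every other prefix position,
    # which makes attention inside the prefix bidirectional.
    mask[:prefix_len, :prefix_len] = True
    return mask


def suffix_nll(token_log_probs: np.ndarray, prefix_len: int) -> float:
    """Mean negative log-likelihood over the non-prefix tokens.

    token_log_probs[t] is the log-probability a model assigned to the
    true token at position t; prefix positions are excluded from the loss.
    """
    suffix = token_log_probs[prefix_len:]
    if suffix.size == 0:
        raise ValueError("sequence has no tokens after the prefix")
    return float(-suffix.mean())


if __name__ == "__main__":
    # 5 tokens, of which the first 3 form the bidirectional prefix.
    print(prefix_lm_mask(5, 3).astype(int))
```

Setting `prefix_len=0` reduces the mask to an ordinary causal language modeling mask, and `prefix_len=seq_len` gives fully bidirectional attention, which is why a prefix of intermediate length sits between the two regimes.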

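Experimental Results

Research Questions

RQ1: Can a minimal, generative vision-language pretraining framework trained with only a language modeling objective reach SOTA on VL benchmarks?
RQ2: Does PrefixLM enable effective zero-shot and cross-modal transfer without task-specific losses or object detectors?
RQ3: How does pretraining on weakly labeled image-text data (plus text-only data) compare with detection-based pretraining on VL tasks?
RQ4: How do architectural choices (image patches, the Conv stage, positional encoding) affect VL performance?
RQ5: Can the model perform open-ended VQA and cross-modal transfer in the zero-shot setting?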

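Key Findings

SimVLM achieves state-of-the-art results on six VL benchmarks without extra data or task-specific losses.
On VQA, SimVLM_base, SimVLM_large, and SimVLM_huge all surpass prior methods, with the Huge model scoring above 80%.
On NLVR2 and SNLI-VE, SimVLM reaches new SOTA or near-SOTA accuracy across model sizes.
Image captioning, NoCaps, and Multi30k show substantial gains, including an average CIDEr improvement of about 10 points.
Zero-shot cross-modal transfer and open-ended VQA abilities emerge with scale and weak supervision.
Cross-modal transfer (fine-tuning on text-only data, then evaluating on VL tasks) achieves results competitive with supervised baselines.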


This review was generated by AI and reviewed by human editors.