QUICK REVIEW

[Paper Review] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Jiasen Lu, Dhruv Batra|arXiv (Cornell University)|Aug 6, 2019
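Multimodal Machine Learning Applications | References: 30 | Cited by: 1,673
One-Sentence Summary

ViLBERT introduces a co-attentional two-stream model that pretrains visiolinguistic representations on large-scale caption data and transfers them to diverse vision-and-language tasks, achieving state-of-the-art results.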

ABSTRACT

We present ViLBERT (short for Vision-and-Language BERT), a model for learning task-agnostic joint representations of image content and natural language. We extend the popular BERT architecture to a multi-modal two-stream model, processing both visual and textual inputs in separate streams that interact through co-attentional transformer layers. We pretrain our model through two proxy tasks on the large, automatically collected Conceptual Captions dataset and then transfer it to multiple established vision-and-language tasks -- visual question answering, visual commonsense reasoning, referring expressions, and caption-based image retrieval -- by making only minor additions to the base architecture. We observe significant improvements across tasks compared to existing task-specific models -- achieving state-of-the-art on all four tasks. Our work represents a shift away from learning groundings between vision and language only as part of task training and towards treating visual grounding as a pretrainable and transferable capability.

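Research Motivation and Goals

  • Provide a single, task-agnostic pretraining approach for vision-and-language (V+L) tasks that transfers to diverse downstream tasks.
  • Develop a co-attentional two-stream architecture that effectively fuses visual and textual information.
  • Show that pretraining on large-scale caption-like data benefits VQA, VCR, and referring-expression tasks, not only caption-based retrieval.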

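Proposed Method

  • Introduce a co-attentional two-stream model that processes visual and textual inputs with cross-modal attention.
  • Pretrain the model on large-scale vision-and-language data to learn base representations that generalize across V+L tasks.
  • Ablate pretraining components (e.g., masking loss, alignment loss, co-attention) to assess their effect on downstream tasks.
  • Compare against baselines and discuss transfer from caption-style data to non-caption tasks (VQA, VCR, RefCOCO+).
  • Visualize attention patterns to analyze grounding behavior across layers and attention directions.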
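
The co-attention idea in the list above can be illustrated with a minimal sketch: each stream forms its queries from its own features but takes keys and values from the other stream. This is a simplification under stated assumptions, not the authors' implementation. It uses a single attention head, no learned projections, and no residual, normalization, or feed-forward sublayers. The function names (`cross_attention`, `co_attention`) and the toy feature vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over keys/values."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs

def co_attention(visual, textual):
    """One simplified co-attention exchange between the two streams.

    The visual stream queries the textual keys/values, and the textual
    stream queries the visual keys/values (assumption: no projections).
    """
    new_visual = cross_attention(visual, textual, textual)
    new_textual = cross_attention(textual, visual, visual)
    return new_visual, new_textual

# Toy inputs (hypothetical): 3 image-region features and 2 token features, dim 4.
regions = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0]]
tokens = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
new_regions, new_tokens = co_attention(regions, tokens)
```

Each output row is a weighted average of the other modality's features, so the sequence length of each stream is preserved while its content is conditioned on the other stream.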

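Experimental Results

Research Questions

  • RQ1: Can a single vision-and-language pretraining objective yield representations that transfer effectively to multiple V+L tasks without task-specific heads?
  • RQ2: Does the co-attentional two-stream architecture outperform extensions of unimodal models (e.g., BERT) to vision-and-language tasks?
  • RQ3: How do the different pretraining components (masking, alignment, co-attention) affect downstream V+L performance?
  • RQ4: How does pretraining on large-scale caption data (Conceptual Captions) affect performance on VQA, RefCOCO+, and VCR compared with no pretraining or pretraining without grounding?
  • RQ5: What grounding and attention behaviors does the model exhibit across layers and modalities?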

Key Findings

  • Improved performance across vision-and-language tasks and reportedly surpassed a recent VQA challenge winner, consistent with the abstract's claim of state-of-the-art results on all four transfer tasks.
  • Ablations show that removing masking, alignment, or co-attention degrades downstream task performance, with masking loss being particularly critical.
  • Pretraining on Conceptual Captions enables transfer to V+L tasks beyond caption-based retrieval, despite domain differences between CC and downstream tasks.
  • Visualizations suggest that image-to-text co-attention grounds mainly in early layers, while text-to-image co-attention grounds in early layers and attends more broadly in later layers.
  • The fully pretrained model shows notable gains on VQA and RefCOCO+ over the configuration without pretraining, supporting the effectiveness of visiolinguistic pretraining.
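
To make the ablation finding concrete, the sketch below shows one plausible way a masking loss and an alignment loss could be combined, with flags that switch either term off. It is an illustration under stated assumptions rather than the paper's training code. The equal weighting of the two terms, the function names, and the toy probabilities are all hypothetical, and the sketch covers masked text tokens only, omitting masked image regions.

```python
import math

def masked_token_loss(predicted_probs, target_index):
    """Cross-entropy for one masked token: negative log-probability of the true token."""
    return -math.log(predicted_probs[target_index])

def alignment_loss(aligned_prob, is_aligned):
    """Binary cross-entropy for whether the image and caption belong together."""
    p = aligned_prob if is_aligned else 1.0 - aligned_prob
    return -math.log(p)

def pretraining_loss(masked_predictions, aligned_prob, is_aligned,
                     use_masking=True, use_alignment=True):
    """Sum the enabled proxy-task losses; the flags mimic an ablation switch."""
    total = 0.0
    if use_masking:
        losses = [masked_token_loss(p, t) for p, t in masked_predictions]
        total += sum(losses) / len(losses)  # mean over masked positions
    if use_alignment:
        total += alignment_loss(aligned_prob, is_aligned)
    return total

# Hypothetical model outputs: two masked positions over a 3-word vocabulary,
# each given as (predicted distribution, index of the true token).
predictions = [([0.7, 0.2, 0.1], 0), ([0.1, 0.8, 0.1], 1)]
full = pretraining_loss(predictions, aligned_prob=0.9, is_aligned=True)
no_mask = pretraining_loss(predictions, 0.9, True, use_masking=False)
```

Disabling a flag removes that term's training signal entirely, which is the kind of configuration an ablation compares against the full objective.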


This review was generated by AI and reviewed by human editors.