QUICK REVIEW

[Paper Review] VinVL: Revisiting Visual Representations in Vision-Language Models

Pengchuan Zhang, Xiujun Li | arXiv (Cornell University) | Jan 2, 2021

Multimodal Machine Learning Applications | References: 42 | Cited by: 60

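One-Sentence Summary

The authors build a large object-centric visual detector, train it on multiple datasets to produce richer visual features, and pair it with an enhanced Oscar+ VL pre-training pipeline, setting new state-of-the-art results on seven vision-language tasks.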

ABSTRACT

This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (Anderson et al., 2018), the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model Oscar (Li et al., 2020), and utilize an improved approach Oscar+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to public.

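Motivation and Goals

Show that richer visual features have a significant effect on vision-language performance.
Develop a large-scale object detection model for VL tasks that covers a diverse set of objects and attributes.
Pre-train and fine-tune a unified vision-language model (Oscar+) on the enhanced visual features to improve results on multiple VL benchmarks.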

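Proposed Method

Pre-train a large object detector on a unified corpus combining COCO, OpenImages, Objects365, and Visual Genome, with a merged vocabulary of 1848 object classes.
Add an attribute branch and fine-tune on Visual Genome, which supplies the 524 attribute classes, to strengthen object-attribute detection.
Use an efficient region feature extractor to speed up feature extraction for VL tasks.
Pre-train Oscar+ with a three-way contrastive loss that aligns captions and question-answer pairs with image tags and regions.
Fine-tune Oscar+ on seven VL tasks: VQA, GQA, NLVR2, image captioning, NoCaps, image retrieval, and text retrieval.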
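The three-way contrastive objective listed above can be illustrated with a minimal sketch. It assumes each training example is a triplet (w, q, v) of text, image tags (or answer), and region features, and that a classifier predicts one of three labels: matched, polluted w, or polluted q. The function names, the label constants, and the 50/25/25 sampling split are assumptions made for illustration, not details confirmed by this summary.

```python
import math
import random

# Hypothetical label ids for the 3-way classifier (illustrative only).
MATCHED, POLLUTED_W, POLLUTED_Q = 0, 1, 2


def make_triplet(index, dataset, rng, p_matched=0.5):
    """Build one training triplet and its 3-way label.

    dataset is a list of (w, q, v) tuples. With probability p_matched the
    triplet is left intact; otherwise w or q (equal chance) is swapped with
    the corresponding field of a different example. The split is an assumption.
    """
    w, q, v = dataset[index]
    r = rng.random()
    if r < p_matched or len(dataset) < 2:
        return (w, q, v), MATCHED
    # Borrow the polluting field from any other example.
    other = rng.choice([i for i in range(len(dataset)) if i != index])
    p_each = (1.0 - p_matched) / 2.0
    if r < p_matched + p_each:
        return (dataset[other][0], q, v), POLLUTED_W
    return (w, dataset[other][1], v), POLLUTED_Q


def three_way_cross_entropy(logits, label):
    """Cross-entropy for one triplet, given three raw classifier scores."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]
```

In a real pipeline the logits would come from the Transformer's pooled output, and this loss would be combined with a masked-token loss; both are outside the scope of this sketch.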

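Experimental Results

Research Questions

RQ1: Does improving the quality and diversity of visual features raise performance on vision-language tasks?
RQ2: When combined with a Transformer-based VL fusion model, does a larger and more diverse object-centric detector improve downstream VL understanding and generation tasks?
RQ3: Which design choices in data, model architecture, and pre-training objectives contribute most to the VL gains?
RQ4: How do the new visual features affect performance on understanding tasks (VQA, GQA, NLVR2) versus generation and retrieval tasks (image captioning, NoCaps, retrieval)?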

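Key Findings

Replacing the previous OD features with VinVL's richer region features yields consistent state-of-the-art gains across seven VL tasks.
The gains are substantial; the analysis attributes about 95% of the overall improvement to the stronger visual features.
The new object detector covers more semantically meaningful regions and enriches the set of object concepts and attributes.
Oscar+ with VinVL features sets new SOTA on VQA, GQA, NLVR2, NoCaps, and retrieval, and achieves competitive results on image captioning.
Efficient region feature extraction and the added attributes make inference faster without sacrificing accuracy.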
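This summary does not say how region feature extraction is made efficient. One common way to cut the cost of region selection is class-agnostic non-maximum suppression (NMS), which runs a single suppression pass over all boxes instead of one pass per class. The sketch below shows that technique in plain Python; treating it as the mechanism used here is an assumption, and the function names and the 0.5 threshold are illustrative.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first.

    Class labels are ignored: a box is dropped if it overlaps any
    already-kept box by more than iou_threshold, whatever its class.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

With 1848 object classes, a per-class variant would repeat the suppression loop once per class, so a single pass avoids that multiplier; the trade-off is that overlapping boxes of different classes can suppress each other.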


This review was generated by AI and reviewed by a human editor.