QUICK REVIEW

[Paper Review] VinVL: Making Visual Representations Matter in Vision-Language Models

Pengchuan Zhang, Xiujun Li|arXiv (Cornell University)|Jan 2, 2021

Multimodal Machine Learning Applications|Cited by 79

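One-Sentence Summary

This paper presents VinVL, a larger and better-designed object detection model, pre-trained on a broad collection of public datasets, that produces richer visual representations for vision-language (VL) tasks. Feeding these improved features into a Transformer-based VL fusion model pre-trained with the improved OSCAR+ approach yields state-of-the-art results on seven public benchmarks, showing that high-quality visual features significantly improve VL model performance.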

ABSTRACT

This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model [2], the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model OSCAR [21], and utilize an improved approach OSCAR+ to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to public.

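Motivation and Objectives

Improve the quality of visual representations for vision-language tasks by developing a more comprehensive object detection model.
Address the tendency of prior VL research to concentrate on the fusion model while leaving visual feature extraction unimproved.
Pre-train the object detection model on a larger and richer corpus of annotated object detection datasets to cover a wider range of visual concepts.
Show that better visual features alone can significantly improve downstream VL model performance.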

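Proposed Method

Design and train a larger, more robust object detection model optimized for vision-language tasks.
Pre-train the detector on a combined corpus of multiple public object detection datasets to improve the quality of its visual representations.
Extract object-centric visual features from the new detector and feed them into the Transformer-based VL fusion model OSCAR.
Pre-train the VL model with the improved OSCAR+ approach, then fine-tune it on a wide range of downstream VL tasks.
Use the improved visual features to raise performance on multiple vision-language benchmarks.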
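
The two-stage flow described above (detector first, fusion model second) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Region`, `detect_regions`, and `build_fusion_input` are hypothetical names, the detector is a stub that returns random vectors, and appending box coordinates to each region feature is an assumption made for the demo.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Region:
    feature: List[float]  # object-centric feature vector from the detector
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized to [0, 1]


def detect_regions(image_id: str, num_regions: int = 3, dim: int = 4) -> List[Region]:
    """Hypothetical detector stub: returns random features instead of running a model."""
    rng = random.Random(image_id)  # seeded so the demo is repeatable
    regions = []
    for _ in range(num_regions):
        x1, y1 = rng.uniform(0.0, 0.5), rng.uniform(0.0, 0.5)
        x2, y2 = rng.uniform(x1, 1.0), rng.uniform(y1, 1.0)
        feature = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        regions.append(Region(feature, (x1, y1, x2, y2)))
    return regions


def build_fusion_input(caption: str, regions: List[Region]) -> List[Tuple[str, object]]:
    """Concatenate text tokens and region vectors into one sequence for a fusion model."""
    sequence: List[Tuple[str, object]] = [
        ("text", token) for token in caption.lower().split()
    ]
    for region in regions:
        # Assumption for this sketch: box coordinates are appended to the feature
        # so the fusion model can see where each region sits in the image.
        sequence.append(("region", region.feature + list(region.box)))
    return sequence


seq = build_fusion_input("A dog on a sofa", detect_regions("img_001"))
print(len(seq), [kind for kind, _ in seq])
```

In this sketch the fusion model only ever sees the detector's output, which is why a better detector can raise downstream performance without any change to the fusion model itself.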

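Experimental Results

Research Questions

RQ1: Can a larger, better-designed object detection model significantly improve the quality of visual representations for vision-language tasks?
RQ2: Does improving visual feature quality yield measurable gains in VL model performance, independent of improvements to the fusion model?
RQ3: To what extent do the richer visual representations obtained through large-scale pre-training improve performance on downstream VL benchmarks?
RQ4: Can a single, unified visual feature extractor achieve state-of-the-art performance across diverse vision-language tasks?

Key Findings

The new visual features produced by the VinVL object detector significantly improve performance on all evaluated vision-language tasks.
The approach sets new state-of-the-art results on seven public vision-language benchmarks, with consistent gains across them.
The gains are attributed mainly to the higher quality and diversity of visual representations obtained through large-scale pre-training.
The results confirm that visual feature quality is a key factor in VL model performance, one that earlier work often overlooked.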


This review was generated by AI and reviewed by a human editor.