QUICK REVIEW

[Paper Digest] Are we pretraining it right? Digging deeper into visio-linguistic pretraining

Amanpreet Singh, Vedanuj Goswami|arXiv (Cornell University)|Apr 19, 2020

Multimodal Machine Learning Applications | References: 58 | Cited by: 30

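ONE-SENTENCE SUMMARY

This paper systematically studies how the domain of the pretraining data (visual and textual) affects visio-linguistic transfer. It shows that in-domain or synthetic data can outperform standard datasets that are large but mismatched, and that simple pretraining choices can reach near state-of-the-art results without any architectural changes.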

ABSTRACT

Numerous recent works have proposed pretraining generic visio-linguistic representations and then finetuning them for downstream vision and language tasks. While architecture and objective function design choices have received attention, the choice of pretraining datasets has received little attention. In this work, we question some of the default choices made in literature. For instance, we systematically study how varying similarity between the pretraining dataset domain (textual and visual) and the downstream domain affects performance. Surprisingly, we show that automatically generated data in a domain closer to the downstream task (e.g., VQA v2) is a better choice for pretraining than "natural" data but of a slightly different domain (e.g., Conceptual Captions). On the other hand, some seemingly reasonable choices of pretraining datasets were found to be entirely ineffective for some downstream tasks. This suggests that despite the numerous recent efforts, vision & language pretraining does not quite work "out of the box" yet. Overall, as a by-product of our study, we find that simple design choices in pretraining can help us achieve close to state-of-art results on downstream tasks without any architectural changes.

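MOTIVATION AND GOALS

Study how domain similarity between the pretraining data and the downstream task affects the performance of vision-and-language models.
Evaluate whether a larger pretraining dataset always helps, and how data quality and domain affect transfer.
Explore when pretraining helps low-resource downstream tasks and when it may be unnecessary or harmful.
Examine the transferability of pretrained representations by comparing a frozen model base against full finetuning.
Propose synthetic in-domain data generation as a scalable alternative for improving pretraining when annotated data is scarce.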

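PROPOSED APPROACH

Compare two major vision-and-language architectures, VisualBERT and ViLBERT, trained with MLM and MMM objectives on three pretraining datasets.
Evaluate on four downstream tasks (VQA-D, VizWiz, SNLI-VE, MM-IMDB) and compare results across different degrees of domain match with the pretraining data.
Systematically vary the pretraining dataset size (including CC-Small and CC-full) and measure the effect on downstream performance.
Freeze the pretrained base to quantify the transferability of the learned representations.
Build an in-domain dataset (CCG) by generating captions for in-domain images, and compare its effectiveness with natural in-domain and out-of-domain data.
Analyze representation drift using the L1 and angular distances between self-attention weights.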
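
As a rough illustration of the drift analysis in the last item, the sketch below computes an L1 distance and an angular distance between two flattened weight vectors, for example the same self-attention matrix before and after finetuning. This is a minimal sketch and not the authors' code: the function names, the toy weight values, and the choice to normalize the angle by pi so that it lies in [0, 1] are all assumptions made for illustration.

```python
import math


def l1_distance(a, b):
    """Sum of absolute element-wise differences between two flat weight vectors."""
    if len(a) != len(b):
        raise ValueError("weight vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def angular_distance(a, b):
    """Angle between two flat weight vectors, divided by pi (assumed normalization)."""
    if len(a) != len(b):
        raise ValueError("weight vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("angular distance is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against floating-point overshoot before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine) / math.pi


# Hypothetical flattened self-attention weights (toy values, not from the paper).
pretrained_weights = [1.0, 0.0, 2.0]
finetuned_weights = [1.0, 1.0, 2.0]

drift_l1 = l1_distance(pretrained_weights, finetuned_weights)
drift_angle = angular_distance(pretrained_weights, finetuned_weights)
print(f"L1 drift: {drift_l1:.4f}, angular drift: {drift_angle:.4f}")
```

The two measures respond to different kinds of change: rescaling every weight by the same positive factor changes the L1 distance but leaves the angular distance at zero, which is one reason it can be useful to report both.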

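EXPERIMENTAL RESULTS

Research Questions

RQ1: How does the similarity between the pretraining data and the downstream task, in both the visual and textual domains, affect visio-linguistic transfer?
RQ2: Is using the largest pretraining dataset always beneficial, or do domain and quality matter more?
RQ3: Does pretraining transfer better when the downstream task is low-resource, and which factors influence this?
RQ4: Can synthetic in-domain data narrow the domain gap and improve pretraining without architectural changes?
RQ5: Under different pretraining conditions, which architecture (VisualBERT vs. ViLBERT) yields more transferable representations?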

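Key Findings

For both VisualBERT and ViLBERT, pretraining on a domain aligned with the downstream task outperforms cross-domain pretraining.
Pretraining on VQA-P and COCO generally outperforms CC-small and CC-full, especially when both the visual and textual domains match the downstream task; MM-IMDB rarely benefits from pretraining.
Compared with random initialization, pretraining improves transfer, but the gains vary by task; for some low-resource tasks the gains are small or even negative.
The generated in-domain dataset (CCG) can outperform out-of-domain CC and approach in-domain performance, suggesting a scalable way to improve pretraining when annotated data is scarce.
In the reported settings, VisualBERT generally outperforms ViLBERT; with a frozen base, ViLBERT shows limited transferability on some tasks.
The pretrained model that is best for finetuning is not always the most transferable one when the base is frozen; finetuning can recover the best downstream performance.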


This digest was generated by AI and reviewed by a human editor.