[Paper Review] 12-in-1: Multi-Task Vision and Language Representation Learning

Jiasen Lu, Vedanuj Goswami | arXiv (Cornell University) | Dec 5, 2019

Multimodal Machine Learning Applications | References: 62 | Cited by: 36

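One-Sentence Summary

This paper presents a single ViLBERT-based model, trained jointly on 12 vision-and-language datasets spanning four task groups, that achieves competitive or better results with far fewer parameters and serves as effective multi-task pretraining for downstream single-task fine-tuning.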

ABSTRACT

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth analysis of the effect of joint training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.

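Research Motivation and Goals

Advance unified learning across diverse vision-and-language tasks to exploit shared grounding and reasoning skills.
Develop a scalable multi-task training scheme that copes with differences in dataset size and difficulty.
Show that joint training can match or exceed independently trained single-task models while using far fewer parameters.
Demonstrate that multi-task pretraining benefits downstream single-task fine-tuning and reaches state-of-the-art results on several tasks.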

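Proposed Method

Use ViLBERT as a shared trunk with task-specific heads covering 12 datasets in four task groups.
Introduce a task token (one per dataset) that conditions the model on the current task during multi-task training.
Use round-robin batch sampling with dynamic stop-and-go (DSG) to manage training across tasks that differ in size and difficulty.
Pretrain on Conceptual Captions with an improved masking strategy that reduces information leakage and the noise from negative (mismatched) samples.
Fine-tune the multi-task model on individual tasks and compare against fully task-specific baselines.
Provide ablations on task-token granularity and training schedules to validate the design choices.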
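
The round-robin sampling with dynamic stop-and-go listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `TaskState`, `update_mode`, and `round_robin_schedule`, and the thresholds `patience`, `min_gain`, `max_drop`, and `stop_gap`, are all assumptions chosen for the example. The idea is that a task whose validation score has plateaued switches to a "stop" mode in which it receives batches only occasionally, and returns to "go" mode if its score degrades.

```python
from dataclasses import dataclass, field
from itertools import cycle


@dataclass
class TaskState:
    """Per-task bookkeeping for the scheduler (all names are illustrative)."""
    name: str
    mode: str = "go"                      # "go": train every turn; "stop": train rarely
    best_score: float = float("-inf")     # best validation score seen so far
    history: list = field(default_factory=list)
    skipped: int = 0                      # turns skipped since the last update in stop mode


def update_mode(task, val_score, patience=2, min_gain=0.001, max_drop=0.005):
    """Switch a task between go and stop from its validation scores.

    The thresholds are assumed values for illustration only.
    """
    task.history.append(val_score)
    task.best_score = max(task.best_score, val_score)
    if task.mode == "go" and len(task.history) > patience:
        # Gain over the last `patience` evaluations; a small gain means a plateau.
        gain = val_score - task.history[-1 - patience]
        if gain < min_gain:
            task.mode = "stop"
    elif task.mode == "stop" and task.best_score - val_score > max_drop:
        # The task is being forgotten, so resume regular training.
        task.mode = "go"
        task.skipped = 0
    return task.mode


def round_robin_schedule(tasks, num_turns, stop_gap=4):
    """Visit tasks in round-robin order and return the names that get a batch.

    A task in stop mode is trained only on every `stop_gap`-th of its turns.
    """
    trained = []
    order = cycle(tasks)
    for _ in range(num_turns):
        task = next(order)
        if task.mode == "stop":
            task.skipped += 1
            if task.skipped < stop_gap:
                continue
            task.skipped = 0
        trained.append(task.name)
    return trained
```

In a real training loop, each name returned by the scheduler would correspond to drawing one batch from that dataset and taking one optimizer step, with `update_mode` called after each validation pass.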

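Experimental Results

Research Questions

RQ1: Can a single model trained on multiple vision-and-language tasks match or outperform independently trained task-specific models?
RQ2: Does joint multi-task training help as a pretraining step for downstream single-task models?
RQ3: Which dataset-level and task-level factors drive positive or negative transfer between vision-and-language tasks?
RQ4: How should multi-task training be scheduled to handle differences in dataset size and to prevent overfitting or forgetting?
RQ5: Does task-token design affect cross-task generalization and grounding consistency?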

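Key Findings

A single model trained on 12 datasets matches or exceeds independently trained single-task models on 11 tasks, improving the average score by 2.05 points while cutting parameters from roughly 3 billion to 270 million.
Multi-task pretraining followed by single-task fine-tuning yields further gains, reaching state-of-the-art results on several tasks.
Multi-task training acts as effective pretraining: after fine-tuning, the model scores higher on grounding-aware metrics, which suggests more consistent alignment across tasks.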
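
The size of the parameter reduction reported above can be checked with simple arithmetic. The sketch below uses only the two rounded figures from the abstract (approximately 3 billion and 270 million); the helper name `reduction_stats` is hypothetical.

```python
def reduction_stats(before, after):
    """Return (fold reduction, fractional reduction) between two parameter counts."""
    return before / after, 1 - after / before


# Rounded figures from the abstract: ~3B parameters across single-task models
# versus 270M for the single multi-task model.
fold, frac = reduction_stats(3_000_000_000, 270_000_000)
print(f"{fold:.1f}x fewer parameters ({frac:.0%} reduction)")
# prints "11.1x fewer parameters (91% reduction)"
```

Because both inputs are approximations, the result is best read as "roughly an order of magnitude" rather than an exact ratio.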

This review was generated by AI and checked by human editors.