QUICK REVIEW

[Paper Review] 12-in-1: Multi-Task Vision and Language Representation Learning

Jiasen Lu, Vedanuj Goswami | arXiv (Cornell University) | Dec 5, 2019

Multimodal Machine Learning Applications | 62 references | 36 citations

TL;DR

The paper presents a single ViLBERT-based model trained jointly on 12 vision-and-language datasets across four task groups, achieving competitive or superior results while reducing parameters and enabling effective multi-task pretraining for downstream single-task finetuning.

ABSTRACT

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth analysis of the effect of joint training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.

Motivation & Objective

Motivate unified learning for diverse vision-and-language tasks to leverage shared grounding and reasoning capabilities.
Develop a scalable multi-task training regime that copes with dataset size and difficulty disparities.
Demonstrate that joint training yields competitive or better performance than independent single-task models while drastically reducing parameters.
Show that multi-task pretraining benefits downstream single-task finetuning and can achieve state-of-the-art results on several tasks.

Proposed method

Adopt ViLBERT as a shared trunk with task-specific heads for 12 datasets across four task groups.
Introduce a task token (per dataset) to condition the model on the current task during multi-task training.
Use a round-robin batch sampling scheme with dynamic stop-and-go (DSG) to manage training across tasks of varying size and difficulty.
Pretrain on Conceptual Captions with improved masking strategies to reduce leakage and noise in negative samples.
Finetune the multi-task model on individual tasks and compare against fully task-specific baselines.
Provide ablations on task token granularity and training schedules to validate design choices.
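
The round-robin sampling with dynamic stop-and-go described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the class and method names are hypothetical, and the thresholds (`iter_gap`, `converge_tol`, `degrade_tol`) are assumed values chosen for the example, since this review does not state the paper's exact criteria. The idea shown is that a task whose validation score has plateaued moves to a "stop" mode and is trained only occasionally, and returns to "go" mode if its score degrades.

```python
from dataclasses import dataclass


@dataclass
class TaskState:
    """Per-task bookkeeping for a stop-and-go schedule (hypothetical helper)."""
    name: str
    mode: str = "go"                      # "go": train every round; "stop": train sparsely
    best_score: float = float("-inf")     # best validation score seen so far
    last_score: float = float("-inf")     # most recent validation score


class StopAndGoScheduler:
    """Illustrative round-robin scheduler with stop/go modes.

    All thresholds are assumptions for this sketch; validation scores are
    assumed to be positive and higher-is-better.
    """

    def __init__(self, task_names, iter_gap=4, converge_tol=0.001, degrade_tol=0.005):
        self.tasks = {n: TaskState(n) for n in task_names}
        self.iter_gap = iter_gap          # stopped tasks train every iter_gap-th iteration
        self.converge_tol = converge_tol  # relative gain below this counts as converged
        self.degrade_tol = degrade_tol    # relative drop from best that reactivates a task

    def tasks_for_iteration(self, iteration):
        """Visit tasks in fixed order; skip stopped tasks except on gap iterations."""
        return [
            t.name
            for t in self.tasks.values()
            if t.mode == "go" or iteration % self.iter_gap == 0
        ]

    def report_validation(self, name, score):
        """Update a task's mode from its latest validation score."""
        t = self.tasks[name]
        if t.mode == "go":
            if t.last_score != float("-inf"):
                gain = (score - t.last_score) / max(abs(t.last_score), 1e-12)
                if gain < self.converge_tol:
                    t.mode = "stop"       # plateaued: train only sparsely from now on
        elif score < t.best_score * (1.0 - self.degrade_tol):
            t.mode = "go"                 # forgetting detected: resume full training
        t.best_score = max(t.best_score, score)
        t.last_score = score
```

In a training loop, one would draw a batch for each name returned by `tasks_for_iteration(i)` and call `report_validation` after each evaluation pass. Training stopped tasks sparsely, rather than dropping them, is what lets small datasets avoid overfitting without being forgotten while larger tasks continue to train.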

Experimental results

Research questions

RQ1: Can a single model trained on multiple vision-and-language tasks outperform or match independently trained task-specific models?
RQ2: Does joint multi-task training provide benefits as a pretraining step for downstream single-task models?
RQ3: What data- and task-level factors influence positive or negative transfer between V&L tasks?
RQ4: How should multi-task training be scheduled to handle dataset size disparities and prevent overfitting or forgetting?
RQ5: Does task-token design impact cross-task generalization and grounding consistency?

Key findings

A single model trained on 12 datasets outperforms or matches independently trained single-task models on 11 of 12 tasks and increases the average score by 2.05 points while reducing parameters from ~3B to 270M.
Multi-task pretraining followed by single-task finetuning yields further gains, reaching performance at or above the state of the art on several tasks.
Multi-task training acts as effective pretraining and improves cross-task grounding consistency, as shown by higher grounding-aware metrics when finetuned.

This review was created by AI and reviewed by human editors.