[Paper Review] 12-in-1: Multi-Task Vision and Language Representation Learning
The paper presents a single ViLBERT-based model trained jointly on 12 vision-and-language datasets across four task groups, achieving competitive or superior results while reducing parameters and enabling effective multi-task pretraining for downstream single-task finetuning.
Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly. In this work, we investigate these relationships between vision-and-language tasks by developing a large-scale, multi-task training regime. Our approach culminates in a single model on 12 datasets from four broad categories of task including visual question answering, caption-based image retrieval, grounding referring expressions, and multi-modal verification. Compared to independently trained single-task models, this represents a reduction from approximately 3 billion parameters to 270 million while simultaneously improving performance by 2.05 points on average across tasks. We use our multi-task framework to perform in-depth analysis of the effect of joint training diverse tasks. Further, we show that finetuning task-specific models from our single multi-task model can lead to further improvements, achieving performance at or above the state-of-the-art.
Motivation & Objective
- Motivate unified learning for diverse vision-and-language tasks to leverage shared grounding and reasoning capabilities.
- Develop a scalable multi-task training regime that copes with dataset size and difficulty disparities.
- Demonstrate that joint training yields competitive or better performance than independent single-task models while drastically reducing parameters.
- Show that multi-task pretraining benefits downstream single-task finetuning and can achieve state-of-the-art results on several tasks.
Proposed method
- Adopt ViLBERT as a shared trunk with task-specific heads for 12 datasets across four task groups.
- Introduce a task token (per dataset) to condition the model on the current task during multi-task training.
- Use a round-robin batch sampling scheme with dynamic stop-and-go (DSG) to manage training across tasks of varying size and difficulty.
- Pretrain on Conceptual Captions with improved masking strategies that reduce information leakage and noise from negative samples.
- Finetune the multi-task model on individual tasks and compare against fully task-specific baselines.
- Provide ablations on task token granularity and training schedules to validate design choices.
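The scheduling idea in the list above (round-robin batch sampling with dynamic stop-and-go) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `TaskState` class, the function names, and the `converge_tol`, `diverge_tol`, `patience`, and `iter_gap` values are all assumptions chosen for the example. The sketch assumes a task is moved to "stop" when its validation score plateaus, receives only occasional batches while stopped, and returns to "go" if its score falls noticeably below its best.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    """Per-task bookkeeping for the scheduler (all fields are illustrative)."""
    name: str
    mode: str = "go"                  # "go" = batch every iteration, "stop" = sparse batches
    best_score: float = float("-inf")
    history: list = field(default_factory=list)  # one validation score per check


def update_mode(task, val_score, converge_tol=0.001, diverge_tol=0.005, patience=2):
    """Switch a task between "go" and "stop" from its latest validation score.

    The thresholds are hypothetical defaults, not values taken from the review.
    """
    task.history.append(val_score)
    task.best_score = max(task.best_score, val_score)
    if task.mode == "go" and len(task.history) > patience:
        # Relative improvement compared with the score `patience` checks ago.
        old = task.history[-1 - patience]
        if old > 0 and (val_score - old) / old < converge_tol:
            task.mode = "stop"        # plateaued: treat as converged
    elif task.mode == "stop":
        # Resume full training if the score dropped too far below the best seen.
        drop = (task.best_score - val_score) / task.best_score
        if task.best_score > 0 and drop > diverge_tol:
            task.mode = "go"
    return task.mode


def tasks_for_iteration(tasks, iteration, iter_gap=4):
    """Round-robin pass: "go" tasks always get a batch, "stop" tasks
    only on every `iter_gap`-th iteration to limit forgetting."""
    return [t.name for t in tasks
            if t.mode == "go" or iteration % iter_gap == 0]
```

In a training loop, `tasks_for_iteration` would be called once per iteration to decide which task heads receive a batch, and `update_mode` once per validation check for each task. Keeping stopped tasks on a sparse schedule rather than dropping them entirely is what lets small datasets avoid overfitting without being forgotten.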
Experimental results
Research questions
- RQ1: Can a single model trained on multiple vision-and-language tasks outperform or match independently trained task-specific models?
- RQ2: Does joint multi-task training provide benefits as a pretraining step for downstream single-task models?
- RQ3: What data- and task-level factors influence positive or negative transfer between V&L tasks?
- RQ4: How should multi-task training be scheduled to handle dataset size disparities and prevent overfitting or forgetting?
- RQ5: Does task-token design impact cross-task generalization and grounding consistency?
Key findings
- A single model trained on 12 datasets matches or outperforms independently trained single-task models on 11 of 12 tasks, improving the average score by 2.05 points while reducing total parameters from ~3B to 270M.
- Multi-task pretraining followed by single-task finetuning yields further gains, reaching performance at or above the state-of-the-art on several tasks.
- Multi-task training acts as effective pretraining and improves cross-task grounding consistency, as shown by higher grounding-aware metrics when finetuned.
This review was created by AI and reviewed by human editors.