[Paper Review] One Model To Learn Them All
The paper introduces MultiModel, a single deep model trained jointly on eight diverse tasks across vision, language, speech, and parsing, showing transfer and cross-domain benefits.
Deep learning yields great results across many fields, from speech recognition, image classification, to translation. But for each problem, getting a deep model to work well involves research into the architecture and a long period of tuning. We present a single model that yields good results on a number of problems spanning multiple domains. In particular, this single model is trained concurrently on ImageNet, multiple translation tasks, image captioning (COCO dataset), a speech recognition corpus, and an English parsing task. Our model architecture incorporates building blocks from multiple domains. It contains convolutional layers, an attention mechanism, and sparsely-gated layers. Each of these computational blocks is crucial for a subset of the tasks we train on. Interestingly, even if a block is not crucial for a task, we observe that adding it never hurts performance and in most cases improves it on all tasks. We also show that tasks with less data benefit largely from joint training with other tasks, while performance on large tasks degrades only slightly if at all.
Motivation & Objective
- Motivate the creation of a unified deep learning model that can handle tasks across multiple domains without task-specific architectures.
- Propose a MultiModel architecture that combines modality-specific nets with a shared body using diverse computational blocks.
- Demonstrate learning across eight corpora and analyze how shared blocks transfer across tasks and data scales.
- Investigate the impact of joint training versus single-task training and the necessity of attention and mixture-of-experts blocks.
Proposed method
- Introduce modality nets to map inputs from different modalities into a shared representation space.
- Use a body comprising convolutional, attention, and sparsely-gated mixture-of-experts blocks to process and generate outputs.
- Employ an autoregressive, fully convolutional encoder–mixer–decoder framework similar to ByteNet/WaveNet but with cross-domain blocks.
- Train the model jointly on eight corpora: WSJ speech, ImageNet, COCO captions, WSJ parsing, and WMT EN-DE, DE-EN, EN-FR, FR-DE translations.
- Share parameters across tasks within each modality, which encourages generalization and allows new tasks to be added on the fly.
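The sparsely-gated mixture-of-experts block named above can be illustrated with a small sketch. This is not the paper's implementation: it assumes a plain linear gate and top-k selection with a softmax over the selected logits, and it omits the noise term and load-balancing loss used in published sparsely-gated layers. The names `top_k_gate`, `moe_layer`, `gate_weights`, and the toy scaling experts are hypothetical.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k largest logits; weights sum to 1."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return dict(zip(top, weights))

def moe_layer(x, gate_weights, experts, k=2):
    """Mix the outputs of the k selected experts for one input vector x."""
    # Assumed linear gate: one dot product per expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    gates = top_k_gate(logits, k)
    out = [0.0] * len(x)
    for idx, g in gates.items():
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, gates

# Toy usage: 8 experts, each scaling its input by a different constant.
random.seed(0)
dim, n_experts = 4, 8
gate_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(n_experts)]
x = [0.5, -1.0, 2.0, 0.1]
y, gates = moe_layer(x, gate_weights, experts, k=2)
```

Because only k experts run per input, the layer can hold many more parameters than it uses on any single example, which is the property that makes such blocks attractive for a model shared across tasks.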
Experimental results
Research questions
- RQ1: How close can a single model trained on eight diverse tasks come to state-of-the-art results on individual tasks?
- RQ2: How does joint eight-task training compare to training on each task separately with similar compute?
- RQ3: Which computational blocks (attention, mixture-of-experts) contribute to performance across different tasks?
- RQ4: Does cross-task transfer occur between seemingly unrelated domains (e.g., ImageNet and parsing) when trained jointly?
Key findings
- MultiModel with eight tasks achieves competitive results but is not state-of-the-art yet (ImageNet top-5 86%, WMT EN→DE 21.2 BLEU, WMT EN→FR 30.5 BLEU).
- Joint eight-task training performs similarly to single-task training on large tasks and can outperform it on data-scarce tasks such as parsing.
- Including mixture-of-experts and attention blocks helps, or at least does not hurt, performance across tasks; removing either block leaves results about the same or slightly worse, even on tasks that would not seem to need it.
- Cross-domain transfer is observed; training parsing with ImageNet or with eight tasks yields improvements over training on parsing alone.
- Tasks benefit from shared modality nets and a unified representation, enabling on-the-fly addition of new tasks and positive transfer from data-rich to data-poor tasks.
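To make the idea of joint training concrete, the sketch below alternates updates on one shared model across several tasks. The review does not say how batches from different tasks are interleaved, so simple round-robin scheduling is an assumption here, and `round_robin_schedule`, `joint_train`, `train_step`, and the three task labels are hypothetical placeholders rather than anything taken from the paper.

```python
from collections import Counter
from itertools import cycle

def round_robin_schedule(task_names, num_steps):
    """Yield one task name per training step, cycling through tasks in order."""
    order = cycle(task_names)
    for _ in range(num_steps):
        yield next(order)

def joint_train(task_names, num_steps, train_step):
    """Run num_steps updates on one shared model, alternating between tasks."""
    seen = Counter()
    for task in round_robin_schedule(task_names, num_steps):
        train_step(task)  # hypothetical: one gradient update on a batch of `task`
        seen[task] += 1
    return seen

# Toy usage: record which task each step touched instead of training anything.
tasks = ["parsing", "imagenet", "translation"]
log = []
counts = joint_train(tasks, num_steps=9, train_step=log.append)
```

Under this schedule a data-poor task such as parsing receives as many updates as a data-rich one, while every update also moves the shared parameters; that shared movement is the mechanism through which transfer to the smaller task could occur.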
This review was created by AI and reviewed by human editors.