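[Paper Explained] OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework
OFA proposes a task-agnostic and modality-agnostic Seq2Seq framework that unifies architectures, tasks, and modalities across multimodal and unimodal pretraining. It reaches SOTA on several vision-language tasks, and its code is publicly released.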
In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows the instruction-based learning in both pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with the recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performances on uni-modal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.
Research Motivation and Goals
- Pursue a unified, task-agnostic, and modality-agnostic multimodal pretraining paradigm.
- Eliminate task-specific heads and adapters to enable zero-shot and cross-domain transfer.
- Unify a broad spectrum of tasks (generation and understanding) across vision, language, and cross-modality in a single framework.
- Demonstrate competitive or state-of-the-art performance on cross-modal and unimodal benchmarks with relatively modest data.
Proposed Method
- Represent diverse modalities in a shared, token-based vocabulary using image codes, region tokens, and BPE text tokens.
- Use a Transformer encoder-decoder as a single architecture for pretraining, finetuning, and inference across all tasks.
- Formulate all pretraining and downstream tasks as sequence-to-sequence generation, with handcrafted instructions providing the task guidance.
- Pretrain on 20M public image-text pairs with multitask objectives including visual grounding, grounded captioning, image-text matching, image captioning, VQA, object detection, and image infilling, plus language text infilling for pure NLP tasks.
- Introduce a Trie-based decoding strategy to improve efficiency and accuracy for classification-like outputs.
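Two of the ideas above can be sketched in a few lines of Python: turning a bounding box into discrete region tokens that share a vocabulary with text, and restricting decoding to a fixed candidate set with a trie. This is a minimal illustration, not the authors' implementation. The `<loc_i>` token format, the bin count of 1000, the `<eos>` marker, and the dictionary-returning `score_fn` interface are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

EOS = "<eos>"    # assumed end-of-sequence marker (hypothetical name)
NUM_BINS = 1000  # assumed number of location bins (illustrative only)


def box_to_region_tokens(box: Tuple[float, float, float, float],
                         width: float, height: float,
                         num_bins: int = NUM_BINS) -> List[str]:
    """Quantize a pixel box (x1, y1, x2, y2) into four discrete tokens."""
    x1, y1, x2, y2 = box
    tokens = []
    for value, size in ((x1, width), (y1, height), (x2, width), (y2, height)):
        # Map the coordinate to an integer bin in [0, num_bins - 1].
        bin_id = min(num_bins - 1, max(0, int(value / size * num_bins)))
        tokens.append(f"<loc_{bin_id}>")
    return tokens


def build_trie(candidates: Sequence[Sequence[str]]) -> dict:
    """Build a prefix tree (nested dicts) over tokenized candidate answers."""
    root: dict = {}
    for candidate in candidates:
        node = root
        for token in candidate:
            node = node.setdefault(token, {})
        node[EOS] = {}  # mark that a complete candidate ends here
    return root


def trie_constrained_greedy(score_fn: Callable[[List[str]], Dict[str, float]],
                            trie: dict) -> List[str]:
    """Greedy decoding that only considers tokens allowed by the trie.

    Assumes the trie was built from at least one candidate.
    """
    node, output = trie, []
    while True:
        scores = score_fn(output)
        # Only children of the current trie node are valid next tokens.
        best = max(node, key=lambda tok: scores.get(tok, float("-inf")))
        if best == EOS:
            return output
        output.append(best)
        node = node[best]


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a model: it prefers 'maybe', which is not a valid label."""
    return {"maybe": 5.0, "no": 2.0, "yes": 1.0, "not": 0.5, "sure": 0.1, EOS: 0.0}


region = box_to_region_tokens((0, 50, 100, 200), width=200, height=400)
answer = trie_constrained_greedy(toy_scores, build_trie([["yes"], ["no"], ["not", "sure"]]))
print(region, answer)
```

In the toy run, the unconstrained top-scoring token `maybe` is never emitted because it is not a child of the trie root, so the decoder falls back to the best valid label. This is the sense in which a trie can help classification-like outputs: every generated sequence is guaranteed to be one of the candidates, and tokens outside the candidate set never need to be considered.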
Experimental Results
Research Questions
- RQ1: Can a single Seq2Seq model with a unified instruction representation handle both unimodal and multimodal tasks across vision and language?
- RQ2: Does removing task-specific heads/adapters and enforcing modality-agnostic representations enable effective zero-shot and cross-domain transfer?
- RQ3: How does multitask pretraining with diverse vision-language tasks impact downstream performance on VQA, captioning, grounding, and unimodal benchmarks?
- RQ4: What are the trade-offs of a smaller versus larger OFA model in terms of cross-modal and unimodal performance?
- RQ5: To what extent can OFA transfer to unseen tasks/domains without finetuning?
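Key Findings
- OFA reaches 82.0 on VQA test-std and 91.0/91.2 on SNLI-VE (dev/test), surpassing the previous SOTA on cross-modal understanding tasks.
- On MSCOCO image captioning (Karpathy split), OFA reaches a CIDEr of 154.9 with CIDEr optimization, ahead of previous SOTA methods such as SimVLM Huge and LEMON.
- On referring expression comprehension, OFA improves markedly: 90.67 on RefCOCO testA, 87.68 on RefCOCO+ testA, and 88.78 on RefCOCOg test-u, surpassing the previous SOTA by several points.
- On text-to-image generation, OFA achieves FID 10.5, CLIPSIM 34.4, and IS 31.1, outperforming CogView and NÜWA with a smaller sampling size.
- Unimodal performance is competitive: on GLUE (SST-2, RTE, MRPC, QQP, QNLI, MNLI) and Gigaword summarization, OFA approaches or surpasses several SOTA modality-specific baselines, and OFA Large reaches 85.6% fine-tuned accuracy on ImageNet-1K.
- Zero-shot learning is competitive on GLUE tasks and SNLI-VE, and OFA transfers notably well to unseen tasks such as grounded QA and to VQA on out-of-domain images.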
This summary was generated by AI and reviewed by human editors.