QUICK REVIEW

[Paper Review] OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

Peng Wang, Yang An|arXiv (Cornell University)|Feb 7, 2022

Multimodal Machine Learning Applications|Cited by 258

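One-Sentence Summary

OFA proposes a task-agnostic and modality-agnostic Seq2Seq framework that unifies architectures, tasks, and modalities across multimodal and unimodal pretraining, achieves SOTA results on several vision-language tasks, and releases its code publicly.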

ABSTRACT

In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task/modality-specific customization. We propose OFA, a Task-Agnostic and Modality-Agnostic framework that supports Task Comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows the instruction-based learning in both pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with the recent state-of-the-art vision & language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new SOTAs in a series of cross-modal tasks while attaining highly competitive performances on uni-modal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.

Motivation and Objectives

Pursue a unified, task-agnostic and modality-agnostic multimodal pretraining paradigm.
Eliminate task-specific heads and adapters to enable zero-shot and cross-domain transfer.
Unify a broad spectrum of tasks (generation and understanding) across vision, language, and cross-modality in a single framework.
Demonstrate competitive or state-of-the-art performance on cross-modal and unimodal benchmarks with relatively modest data.

Proposed Method

Represent diverse modalities in a shared, token-based vocabulary using image codes, region tokens, and BPE text tokens.
Use a Transformer encoder-decoder as a single architecture for pretraining, finetuning, and inference across all tasks.
Formulate all pretraining and downstream tasks as sequence-to-sequence generation, with handcrafted instructions providing task guidance.
Pretrain on 20M public image-text pairs with multitask objectives: visual grounding, grounded captioning, image-text matching, image captioning, VQA, object detection, and image infilling, plus text infilling for language-only tasks.
Introduce a Trie-based decoding strategy to improve efficiency and accuracy for classification-like outputs.
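
The region tokens and Trie-based decoding listed above can be illustrated with a minimal Python sketch. It is not taken from the OFA codebase: the `<bin_k>` token format, the `num_bins` default, the `END` marker, and the helper names `box_to_tokens`, `build_trie`, and `constrained_greedy_decode` are all assumptions made for illustration, and `toy_scores` stands in for a real decoder's next-token scores.

```python
from typing import Callable, Dict, List, Sequence, Tuple

END = "<eos>"  # hypothetical end-of-sequence marker (assumption)


def box_to_tokens(box: Tuple[float, float, float, float],
                  width: float, height: float,
                  num_bins: int = 1000) -> List[str]:
    """Quantize pixel coordinates (x1, y1, x2, y2) into discrete location tokens."""
    def quantize(value: float, size: float) -> int:
        index = int(value / size * (num_bins - 1))
        return min(max(index, 0), num_bins - 1)  # clamp to the valid bin range

    x1, y1, x2, y2 = box
    bins = [quantize(x1, width), quantize(y1, height),
            quantize(x2, width), quantize(y2, height)]
    return [f"<bin_{b}>" for b in bins]  # illustrative token format


def build_trie(candidates: Sequence[Sequence[str]]) -> dict:
    """Build a nested-dict trie over candidate label token sequences."""
    root: dict = {}
    for tokens in candidates:
        node = root
        for token in list(tokens) + [END]:
            node = node.setdefault(token, {})
    return root


def constrained_greedy_decode(score_fn: Callable[[List[str]], Dict[str, float]],
                              trie: dict, max_len: int = 16) -> List[str]:
    """Greedy decoding where each step may only pick a child of the current trie node."""
    prefix: List[str] = []
    node = trie
    while node and len(prefix) < max_len:
        scores = score_fn(prefix)
        # Tokens outside the trie are never considered, whatever their score.
        best = max(node, key=lambda tok: scores.get(tok, float("-inf")))
        if best == END:
            break
        prefix.append(best)
        node = node[best]
    return prefix


def toy_scores(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a decoder: favors an invalid label first, then ends the sequence."""
    if not prefix:
        return {"maybe": 0.9, "no": 0.5, "yes": 0.3, "not": 0.1, END: 0.0}
    return {END: 1.0}


print(box_to_tokens((0, 0, 100, 50), width=100, height=50))
# ['<bin_0>', '<bin_0>', '<bin_999>', '<bin_999>']
labels = build_trie([["yes"], ["no"], ["not", "sure"]])
print(constrained_greedy_decode(toy_scores, labels))
# ['no']  ("maybe" scores highest but is not a candidate label)
```

Restricting each step to the children of the current trie node means the output can only be one of the predefined candidate labels, which is the idea behind using a trie for classification-like outputs. A real system would typically mask logits over the full vocabulary rather than look scores up in a dictionary.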

Experimental Results

Research Questions

RQ1: Can a single Seq2Seq model with a unified instruction representation handle both unimodal and multimodal tasks across vision and language?
RQ2: Does removing task-specific heads/adapters and enforcing modality-agnostic representations enable effective zero-shot and cross-domain transfer?
RQ3: How does multitask pretraining with diverse vision-language tasks impact downstream performance on VQA, captioning, grounding, and unimodal benchmarks?
RQ4: What are the trade-offs of a smaller versus larger OFA model in terms of cross-modal and unimodal performance?
RQ5: To what extent can OFA transfer to unseen tasks/domains without finetuning?

Key Findings

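OFA reaches 82.0 on VQA test-std and 91.0/91.2 on SNLI-VE dev/test, surpassing the previous SOTA on cross-modal understanding tasks.
On MSCOCO image captioning (Karpathy split), OFA reaches a CIDEr of 154.9 (with CIDEr optimization), exceeding previous SOTA methods such as SimVLM Huge and LEMON.
On referring expression comprehension, OFA obtains notable gains: 90.67 on RefCOCO testA, 87.68 on RefCOCO+ testA, and 88.78 on RefCOCOg test-u, surpassing the previous SOTA by several points.
On text-to-image generation, OFA achieves FID 10.5, CLIPSIM 34.4, and IS 31.1, outperforming CogView and NÜWA with a smaller sampling size.
Unimodal performance is competitive: on GLUE (SST-2, RTE, MRPC, QQP, QNLI, MNLI) and Gigaword summarization, OFA approaches or exceeds several SOTA modality-specific baselines, and OFA Large reaches 85.6% finetuned accuracy on ImageNet-1K.
Zero-shot learning is competitive on GLUE tasks and SNLI-VE, and OFA transfers notably well to unseen tasks such as grounded QA and to VQA on out-of-domain images.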


This review was generated by AI and reviewed by human editors.