QUICK REVIEW

[Paper Review] Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Jiasen Lu, Christopher M. Clark|arXiv (Cornell University)|Jun 17, 2022
Multimodal Machine Learning Applications | Cited by 110
One-line summary

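TL;DR: Unified-IO is a single Transformer encoder-decoder trained jointly on 95 datasets spanning vision, language, and vision-and-language tasks. By converting every input and output into a sequence of discrete tokens, it needs no task-specific heads and achieves state-of-the-art results, including coverage of all 7 GRIT tasks.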

ABSTRACT

We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 90 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning. Code and demos for Unified-IO are available at: https://unified-io.allenai.org.

Motivation and objectives

  • Motivate building a single unified model across vision, language, and multi-modal tasks to enable broad capability and transfer.
  • Propose a token-level, modality-agnostic representation to allow a single transformer to handle diverse outputs like boxes, masks, depth maps, and text.
  • Demonstrate that joint multi-task training on 95 datasets can yield strong performance across 7 GRIT tasks and 16 benchmarks without task-specific fine-tuning.
  • Show ablations to understand how task groups influence learning and transfer across concepts.

Proposed method

  • Represent all inputs and outputs as discrete tokens within a unified vocabulary (text tokens, 1000 location tokens, 16384 vision tokens).
  • Encode dense outputs (images, depth, segmentation) as VQ-GAN tokens; encode sparse outputs (boxes, joints) as coordinate tokens; encode language with SentencePiece and prompts.
  • Use an encoder-decoder Transformer akin to T5, with 2D relative embeddings and absolute position embeddings to handle images.
  • Pre-train with text span denoising and masked image denoising on a mix of vision, language, and V&L data.
  • Train a single model jointly on 95 datasets (62 sources) across 8 groups and 22 tasks with balanced sampling within groups.
  • No task-specific heads; train in two stages: pre-training and large-scale multi-task training; evaluate on GRIT and 16 other benchmarks.
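
The coordinate-token idea in the list above can be sketched as follows. This is a minimal illustration, not the authors' code: it only assumes the 1000 location tokens mentioned in the review. The token spelling `<loc_k>`, the `(x1, y1, x2, y2)` ordering, and the helper names `coord_to_bin`, `box_to_tokens`, and `tokens_to_box` are hypothetical choices made for the example.

```python
from typing import List, Tuple

# Number of location tokens quoted in the review.
NUM_LOCATION_TOKENS = 1000

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels; ordering is an assumption


def coord_to_bin(value: float, extent: float, num_bins: int = NUM_LOCATION_TOKENS) -> int:
    """Map a pixel coordinate in [0, extent] to an integer bin in [0, num_bins - 1]."""
    frac = min(max(value / extent, 0.0), 1.0)  # clamp to the image
    return min(int(frac * num_bins), num_bins - 1)


def box_to_tokens(box: Box, width: float, height: float) -> List[str]:
    """Quantize a bounding box into four location tokens (hypothetical '<loc_k>' spelling)."""
    x1, y1, x2, y2 = box
    bins = [
        coord_to_bin(x1, width),
        coord_to_bin(y1, height),
        coord_to_bin(x2, width),
        coord_to_bin(y2, height),
    ]
    return [f"<loc_{b}>" for b in bins]


def tokens_to_box(tokens: List[str], width: float, height: float) -> Box:
    """Invert box_to_tokens, returning the center of each bin in pixels."""
    bins = [int(t[len("<loc_"):-1]) for t in tokens]
    extents = [width, height, width, height]
    x1, y1, x2, y2 = [
        (b + 0.5) / NUM_LOCATION_TOKENS * e for b, e in zip(bins, extents)
    ]
    return (x1, y1, x2, y2)


if __name__ == "__main__":
    tokens = box_to_tokens((320.0, 240.0, 640.0, 480.0), width=640, height=480)
    print(tokens)
    print(tokens_to_box(tokens, width=640, height=480))
```

Because the box becomes an ordinary token sequence, the same decoder that emits text can emit it. The cost is a quantization error of at most one bin width per coordinate (`extent / 1000` pixels in this sketch).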

Experimental results

Research questions

  • RQ1: Can a single Seq2Seq model learn a broad set of vision, language, and multi-modal tasks without task-specific heads?
  • RQ2: How well does a massively multi-task trained model generalize to new concepts and unseen datasets?
  • RQ3: What is the impact of including or excluding task groups on overall performance and transfer?
  • RQ4: How does prompt design affect performance on referring expressions?

Key findings

  • Unified-IO achieves the top average score of 64.3 on the seven-task GRIT benchmark, outperforming the prior SOTA by a large margin.
  • On GRIT, the XL variant outperforms prior models on localization, segmentation, and other tasks, with strong cross-task transfer.
  • On generalization to new concepts, Unified-IO degrades less between the same-concept and new-concept splits than other models.
  • Across 16 additional benchmarks (NYUv2, ImageNet, VQA2.0, OK-VQA, VizWiz, Swig, BoolQ, SciTail, etc.), Unified-IO demonstrates competitive or strong performance without task-specific fine-tuning.
  • Ablation studies indicate removing task groups does not drastically hurt most tasks, highlighting robustness of the unified approach.
  • A prompt generalization case study shows that referring-expression prompts can be paraphrased, with varying effectiveness.


This review was generated by AI and checked by a human editor.