QUICK REVIEW

[Paper Review] Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks

Jiasen Lu, Christopher M. Clark|arXiv (Cornell University)|Jun 17, 2022

Multimodal Machine Learning Applications | Citations: 110

One-Sentence Summary

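tldr: Unified-IO is a single Transformer encoder-decoder trained jointly on 95 datasets spanning vision, language, and vision-and-language tasks. By converting every input and output into a sequence of discrete tokens, it achieves state-of-the-art results, such as covering all 7 GRIT tasks, without task-specific heads.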

ABSTRACT

We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 90 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning. Code and demos for Unified-IO are available at: https://unified-io.allenai.org.

Motivation and Objectives

Build a single unified model across vision, language, and multi-modal tasks to enable broad capability and transfer.
Propose a token-level, modality-agnostic representation to allow a single transformer to handle diverse outputs like boxes, masks, depth maps, and text.
Demonstrate that joint multi-task training on 95 datasets can yield strong performance across 7 GRIT tasks and 16 benchmarks without task-specific fine-tuning.
Run ablations to understand how task groups influence learning and transfer across concepts.

Proposed Method

Represent all inputs and outputs as discrete tokens within a unified vocabulary (text tokens, 1000 location tokens, 16384 vision tokens).
Encode dense outputs (images, depth, segmentation) as VQ-GAN tokens; encode sparse outputs (boxes, joints) as coordinate tokens; encode language with SentencePiece and prompts.
Use a T5-style encoder-decoder Transformer, with 2D relative position embeddings and absolute position embeddings to handle images.
Pre-train with text span denoising and masked image denoising on a mix of vision, language, and V&L data.
Train a single model jointly on 95 datasets (62 sources) across 8 groups and 22 tasks with balanced sampling within groups.
No task-specific heads; train in two stages: pre-training and large-scale multi-task training; evaluate on GRIT and 16 other benchmarks.
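
The unified vocabulary and the coordinate tokens described above can be illustrated with a minimal sketch. It is not the authors' implementation: the text vocabulary size (`NUM_TEXT_TOKENS`), the ordering of the three ID ranges, the `(x1, y1, x2, y2)` box order, and all function names are assumptions made for illustration. Only the counts of 1000 location tokens and 16384 vision tokens come from the review.

```python
from typing import List, Tuple

# Counts taken from the review text.
NUM_LOCATION_TOKENS = 1000
NUM_VISION_TOKENS = 16384
# Hypothetical text vocabulary size, chosen only for this sketch.
NUM_TEXT_TOKENS = 32000

# Assumed layout of one shared ID space: [text | location | vision].
LOCATION_OFFSET = NUM_TEXT_TOKENS
VISION_OFFSET = LOCATION_OFFSET + NUM_LOCATION_TOKENS
VOCAB_SIZE = VISION_OFFSET + NUM_VISION_TOKENS


def coord_to_location_id(value: float, extent: float) -> int:
    """Quantize a pixel coordinate in [0, extent] into one of the location bins."""
    if extent <= 0:
        raise ValueError("extent must be positive")
    frac = min(max(value / extent, 0.0), 1.0)  # clamp to [0, 1]
    bin_index = min(int(frac * NUM_LOCATION_TOKENS), NUM_LOCATION_TOKENS - 1)
    return LOCATION_OFFSET + bin_index


def location_id_to_coord(token_id: int, extent: float) -> float:
    """Map a location token back to the center of its bin, in pixels."""
    bin_index = token_id - LOCATION_OFFSET
    if not 0 <= bin_index < NUM_LOCATION_TOKENS:
        raise ValueError("not a location token")
    return (bin_index + 0.5) / NUM_LOCATION_TOKENS * extent


def box_to_location_ids(
    box: Tuple[float, float, float, float], width: float, height: float
) -> List[int]:
    """Encode an (x1, y1, x2, y2) box as four location tokens (assumed order)."""
    x1, y1, x2, y2 = box
    return [
        coord_to_location_id(x1, width),
        coord_to_location_id(y1, height),
        coord_to_location_id(x2, width),
        coord_to_location_id(y2, height),
    ]


def vision_code_to_id(code: int) -> int:
    """Shift a VQ-GAN codebook index into the shared ID space."""
    if not 0 <= code < NUM_VISION_TOKENS:
        raise ValueError("codebook index out of range")
    return VISION_OFFSET + code


if __name__ == "__main__":
    ids = box_to_location_ids((0, 0, 320, 240), width=640, height=480)
    print(ids)
    print([location_id_to_coord(i, 640) for i in ids[::2]])
```

Because sparse and dense outputs land in disjoint ranges of one ID space, a single decoder softmax over `VOCAB_SIZE` entries can emit text, boxes, or image codes. Decoding a location token recovers the coordinate only up to one bin width, which is the price of discretization.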

Experimental Results

Research Questions

RQ1: Can a single Seq2Seq model learn a broad set of vision, language, and multi-modal tasks without task-specific heads?
RQ2: How well does a massively multi-task trained model generalize to new concepts and unseen datasets?
RQ3: What is the impact of including or excluding task groups on overall performance and transfer?
RQ4: How does prompt design affect performance on referring expressions?

Key Findings

Unified-IO achieves the top average score of 64.3 on the seven-task GRIT benchmark, outperforming the prior state of the art by a large margin.
On GRIT, the XL variant outperforms prior models on localization, segmentation, and other tasks, with strong cross-task transfer.
On GRIT's new-concept splits, Unified-IO degrades less relative to the same-concept splits than other models do.
Across 16 additional benchmarks (NYUv2, ImageNet, VQA2.0, OK-VQA, VizWiz, Swig, BoolQ, SciTail, etc.), Unified-IO demonstrates competitive or strong performance without task-specific fine-tuning.
Ablation studies indicate removing task groups does not drastically hurt most tasks, highlighting robustness of the unified approach.
A prompt-generalization case study shows that referring-expression prompts can be paraphrased, with varying effectiveness.


This review was generated by AI and checked by a human editor.