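[Paper Review] Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
tldr: Unified-IO is a single Transformer encoder-decoder trained jointly on 95 datasets spanning vision, language, and vision-and-language tasks. By converting every input and output into a sequence of discrete tokens, it needs no task-specific heads, is the first model to cover all 7 GRIT tasks, and reports strong results across 16 further benchmarks.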
We propose Unified-IO, a model that performs a large variety of AI tasks spanning classical computer vision tasks, including pose estimation, object detection, depth estimation and image generation, vision-and-language tasks such as region captioning and referring expression, to natural language processing tasks such as question answering and paraphrasing. Developing a single unified model for such a large variety of tasks poses unique challenges due to the heterogeneous inputs and outputs pertaining to each task, including RGB images, per-pixel maps, binary masks, bounding boxes, and language. We achieve this unification by homogenizing every supported input and output into a sequence of discrete vocabulary tokens. This common representation across all tasks allows us to train a single transformer-based architecture, jointly on over 90 diverse datasets in the vision and language fields. Unified-IO is the first model capable of performing all 7 tasks on the GRIT benchmark and produces strong results across 16 diverse benchmarks like NYUv2-Depth, ImageNet, VQA2.0, OK-VQA, Swig, VizWizGround, BoolQ, and SciTail, with no task-specific fine-tuning. Code and demos for Unified-IO are available at: https://unified-io.allenai.org.
Research Motivation and Objectives
- Motivate building a single unified model across vision, language, and multi-modal tasks to enable broad capability and transfer.
- Propose a token-level, modality-agnostic representation to allow a single transformer to handle diverse outputs like boxes, masks, depth maps, and text.
- Demonstrate that joint multi-task training on 95 datasets can yield strong performance across 7 GRIT tasks and 16 benchmarks without task-specific fine-tuning.
- Show ablations to understand how task groups influence learning and transfer across concepts.
Proposed Method
- Represent all inputs and outputs as discrete tokens within a unified vocabulary (text tokens, 1000 location tokens, 16384 vision tokens).
- Encode dense outputs (images, depth, segmentation) as VQ-GAN tokens; encode sparse outputs (boxes, joints) as coordinate tokens; encode language with SentencePiece and prompts.
- Use an encoder-decoder Transformer akin to T5, with 2D relative embeddings and absolute position embeddings to handle images.
- Pre-train with text span denoising and masked image denoising on a mix of vision, language, and V&L data.
- Train a single model jointly on 95 datasets (62 sources) across 8 groups and 22 tasks with balanced sampling within groups.
- No task-specific heads; train in two stages: pre-training and large-scale multi-task training; evaluate on GRIT and 16 other benchmarks.
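To make the sparse-output encoding above more concrete, here is a minimal sketch of how a bounding box could be quantized into location tokens. Only the bin count (1000) comes from the review; the `<loc_i>` token spelling, the (x1, y1, x2, y2) coordinate order, the normalization by image size, and all function names are assumptions made for illustration and are not taken from the Unified-IO code.

```python
NUM_LOCATION_BINS = 1000  # from the review: 1000 location tokens


def coord_to_bin(value, extent, num_bins=NUM_LOCATION_BINS):
    """Map a pixel coordinate in [0, extent] to an integer bin in [0, num_bins - 1]."""
    if extent <= 0:
        raise ValueError("extent must be positive")
    # Normalize to [0, 1] and clamp so out-of-image coordinates stay valid.
    frac = min(max(value / extent, 0.0), 1.0)
    # frac == 1.0 would land on num_bins, so clamp to the last bin.
    return min(int(frac * num_bins), num_bins - 1)


def bin_to_coord(bin_index, extent, num_bins=NUM_LOCATION_BINS):
    """Approximate inverse: return the centre of the bin in pixel units."""
    return (bin_index + 0.5) / num_bins * extent


def box_to_tokens(box, width, height):
    """Encode an (x1, y1, x2, y2) pixel box as four hypothetical <loc_i> tokens."""
    x1, y1, x2, y2 = box
    bins = [
        coord_to_bin(x1, width),
        coord_to_bin(y1, height),
        coord_to_bin(x2, width),
        coord_to_bin(y2, height),
    ]
    return [f"<loc_{b}>" for b in bins]


def tokens_to_box(tokens, width, height):
    """Decode four <loc_i> tokens back to approximate pixel coordinates."""
    bins = [int(t[len("<loc_"):-1]) for t in tokens]
    extents = [width, height, width, height]
    return [bin_to_coord(b, e) for b, e in zip(bins, extents)]


if __name__ == "__main__":
    tokens = box_to_tokens((160, 120, 480, 360), width=640, height=480)
    print(tokens)  # ['<loc_250>', '<loc_250>', '<loc_750>', '<loc_750>']
    print(tokens_to_box(tokens, width=640, height=480))
```

Because the tokens are ordinary vocabulary entries, a decoder can emit them alongside text tokens, which is what allows boxes and keypoints to share an output space with language. In this sketch the quantization error is bounded by half a bin width, that is, `extent / 2000` pixels per coordinate.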
Experimental Results
Research Questions
- RQ1: Can a single Seq2Seq model learn a broad set of vision, language, and multi-modal tasks without task-specific heads?
- RQ2: How well does a massively multi-task trained model generalize to new concepts and unseen datasets?
- RQ3: What is the impact of including or excluding task groups on overall performance and transfer?
- RQ4: How does prompt design affect performance on referring expressions?
Key Findings
- Unified-IO achieves the top average score of 64.3 on the seven-task GRIT benchmark, outperforming the prior state of the art by a large margin.
- On GRIT, the XL variant outperforms prior models on localization, segmentation, and other tasks, with strong cross-task transfer.
- Generalization to new concepts shows Unified-IO has smaller degradation between same and new concept splits compared to other models.
- Across 16 additional benchmarks (NYUv2, ImageNet, VQA2.0, OK-VQA, VizWiz, Swig, BoolQ, SciTail, etc.), Unified-IO demonstrates competitive or strong performance without task-specific fine-tuning.
- Ablation studies indicate removing task groups does not drastically hurt most tasks, highlighting robustness of the unified approach.
- A prompt generalization case study shows that paraphrased referring-expression prompts still work, but with varying effectiveness.
This review was generated by AI and checked by a human editor.