QUICK REVIEW

[Paper Review] Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Zi-Yi Dou, Aishwarya Kamath|arXiv (Cornell University)|Jun 15, 2022

Multimodal Machine Learning Applications|Cited by 67

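One-Sentence Summary

FIBER inserts cross-attention into the vision and text backbones so that the modalities are fused inside the backbones, and uses a two-stage coarse-to-fine pre-training scheme that supports both image-level and region-level vision-language tasks.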

ABSTRACT

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones, bringing gains in terms of memory and performance. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is available at https://github.com/microsoft/FIBER.

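Motivation and Objectives

Motivate a unified vision-language pre-training framework that handles both image-level and region-level tasks.
Develop an architecture that embeds multimodal fusion inside the backbones to improve efficiency and performance.
Propose a two-stage pre-training strategy that first exploits image-text data and then uses image-text-box data for grounding and detection.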

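Proposed Method

Insert cross-attention blocks with learnable fusion gates into both the image and text backbones to enable deep multimodal fusion.
Run a coarse-grained pre-training stage on low-resolution images, using the ITC objective with fusion switched off (dual-encoder mode) and the ITM and MLM objectives with fusion switched on.
Run a fine-grained pre-training stage with high-resolution images, the Swin backbone, and an object detection head (Dynamic Head) to learn region-level grounding and detection.
Adopt a two-stage pre-training paradigm that shares parameters across stages so that both kinds of data are used efficiently.
Adapt FIBER to downstream tasks by switching cross-attention on or off (fusion encoder versus dual encoder), and use the OD head for grounding and detection.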
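
The gated, switchable fusion described above can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: it assumes the fusion gate is a single scalar `alpha` that scales a cross-attention output added residually to each token, and it uses single-head attention with identity projections in place of learned query/key/value weights. The names `cross_attention`, `gated_fusion`, and `fusion_on` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Simplification (assumption): identity projections, so the other
    modality's tokens serve as both keys and values.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values))
             for d in range(dim)]
        )
    return attended


def gated_fusion(own_tokens, other_tokens, alpha, fusion_on=True):
    """Add alpha-scaled cross-attention over the other modality.

    With fusion_on=False the tokens pass through unchanged, which
    mimics the dual-encoder mode; alpha=0.0 has the same effect.
    """
    if not fusion_on:
        return [list(tok) for tok in own_tokens]
    attended = cross_attention(own_tokens, other_tokens)
    return [
        [x + alpha * a for x, a in zip(tok, att)]
        for tok, att in zip(own_tokens, attended)
    ]


# Toy inputs: two text tokens and three image tokens, both of width 2.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]

dual = gated_fusion(text_tokens, image_tokens, alpha=0.5, fusion_on=False)
fused = gated_fusion(text_tokens, image_tokens, alpha=0.5)
```

In this sketch a gate of zero makes the fused path identical to the dual-encoder path, so a model initialised that way would start from the behaviour of its uni-modal backbones and learn how much cross-modal signal to admit.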

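Experimental Results

Research Questions

RQ1: Can a single model architecture effectively handle both image-level tasks (VQA, captioning, retrieval) and region-level tasks (grounding, object detection)?
RQ2: Does fusing modalities inside the backbones, rather than in dedicated post-fusion layers, bring memory and performance benefits across VL tasks?
RQ3: Does the two-stage coarse-to-fine pre-training paradigm improve data efficiency and performance for grounding and detection while still supporting captioning and VQA?
RQ4: How does FIBER perform on standard VL benchmarks compared with prior end-to-end and region-oriented models?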

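Key Findings

FIBER achieves consistent improvements across VL tasks, including VQA, captioning, retrieval, grounding, and object detection.
The fusion-in-the-backbone design uses about 26M additional parameters for fusion, requires fewer FLOPs than several baselines, and enables more efficient training.
In direct comparisons, FIBER often outperforms methods trained with far more data or with larger models.
The two-stage coarse-to-fine scheme, building on coarse-grained pre-training, enables strong performance on fine-grained tasks even when bounding-box data is limited.
FIBER achieves competitive or superior image-text retrieval results both with and without fusion-based re-ranking, and obtains strong grounding and detection results on standard benchmarks.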
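
For readers unfamiliar with fusion-based re-ranking in retrieval, the general pattern is sketched below. It is a generic illustration under stated assumptions rather than the paper's evaluation code: candidates are first ranked by a cheap dot product between independently encoded embeddings (dual-encoder mode), and only the top `top_k` are re-scored by a more expensive fusion scorer. The function `retrieve_and_rerank`, the `fusion_score` callable, and all numbers are hypothetical.

```python
def retrieve_and_rerank(query_vec, image_vecs, fusion_score, top_k=2):
    """Rank images by dot product, then reorder the top_k by fusion_score.

    fusion_score is a hypothetical callable mapping a candidate index to
    a matching score, standing in for a fusion-encoder forward pass.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Stage 1: cheap dual-encoder ranking over every candidate.
    coarse = sorted(
        range(len(image_vecs)),
        key=lambda i: dot(query_vec, image_vecs[i]),
        reverse=True,
    )
    # Stage 2: expensive scoring only for the shortlist.
    head, tail = coarse[:top_k], coarse[top_k:]
    head = sorted(head, key=fusion_score, reverse=True)
    return head + tail


# Made-up embeddings and fusion scores, for illustration only.
query = [1.0, 0.0]
images = [[0.9, 0.1], [0.8, 0.0], [0.1, 1.0]]
toy_fusion_scores = {0: 0.2, 1: 0.7, 2: 0.99}

ranking = retrieve_and_rerank(query, images, toy_fusion_scores.get, top_k=2)
```

Because the shortlist is fixed by the first stage, a candidate outside the top `top_k` cannot be promoted by the fusion scorer, which is the usual trade-off between cost and recall in this kind of pipeline.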

This review was generated by AI and checked by a human editor.