QUICK REVIEW

[Paper Review] VL-BERT: Pre-training of Generic Visual-Linguistic Representations

Weijie Su, Xizhou Zhu|arXiv (Cornell University)|Aug 22, 2019

Multimodal Machine Learning Applications|References: 45|Citations: 782

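One-line summary

VL-BERT introduces a unified visual-linguistic Transformer pre-trained on image-caption data and a text-only corpus. Fine-tuned end to end as a single model, it achieves strong results on VCR, VQA, and referring expression comprehension, including first place among single models on the VCR leaderboard.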

ABSTRACT

We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either of a word from the input sentence, or a region-of-interest (RoI) from the input image. It is designed to fit for most of the visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions dataset, together with text-only corpus. Extensive empirical analysis demonstrates that the pre-training procedure can better align the visual-linguistic clues and benefit the downstream tasks, such as visual commonsense reasoning, visual question answering and referring expression comprehension. It is worth noting that VL-BERT achieved the first place of single model on the leaderboard of the VCR benchmark. Code is released at https://github.com/jackroos/VL-BERT.

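Motivation and objectives

Develop a generic, pre-trainable visual-linguistic representation that can be fine-tuned for multiple downstream tasks.
Unify visual RoI features and linguistic input in a single Transformer backbone with flexible cross-modal attention.
Pre-train at scale on visual-linguistic data and a text-only corpus to align visual and linguistic clues and improve generalization.
Demonstrate state-of-the-art performance on VCR, VQA, and referring expression comprehension with a single model.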

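Proposed method

Extend the Transformer architecture to process both words and RoIs as elements of a single input sequence.
Represent each input element with token, visual feature, segment, and position embeddings, including a new visual feature embedding for RoIs.
Pre-train on visual-linguistic data with two tasks: Masked Language Modeling with visual clues and Masked RoI Classification with linguistic clues.
Pre-train on Conceptual Captions (visual-linguistic) and BooksCorpus/Wikipedia (text-only), sampled at a 1:1 ratio.
For downstream tasks, fine-tune end to end with task-specific input/output formats (e.g., <Question, Answer, Image>, <Caption, Image>).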
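
As a rough illustration of the input representation described above, the following sketch lays out a <Caption, Image> example as one sequence and builds each element's vector from its token, visual feature, segment, and position embeddings. Several details are assumptions made for illustration and are not stated in this review: the special tokens `[CLS]`, `[SEP]`, and `[IMG]`, the segment labels `A` and `C`, the use of a whole-image feature for text elements, plain index positions, and combining the four embeddings by element-wise sum (as BERT does). The random tables stand in for learned parameters, and all function names are hypothetical.

```python
import random

DIM = 8  # hypothetical embedding width, kept tiny for illustration


def make_table(keys, rng):
    """Random lookup table standing in for a learned embedding matrix."""
    return {k: [rng.uniform(-1, 1) for _ in range(DIM)] for k in keys}


def build_inputs(caption_tokens, num_rois):
    """Lay out a <Caption, Image> example as one sequence of elements."""
    # Assumed layout: [CLS] words [SEP] followed by one [IMG] slot per RoI.
    tokens = ["[CLS]"] + caption_tokens + ["[SEP]"] + ["[IMG]"] * num_rois
    n_text = len(caption_tokens) + 2  # [CLS] + words + [SEP]
    # Assumed segment labels: "A" for text elements, "C" for visual elements.
    segments = ["A"] * n_text + ["C"] * num_rois
    # Visual feature source: an RoI index for [IMG], the whole image otherwise.
    visual_src = ["whole"] * n_text + list(range(num_rois))
    # Simplification: plain running index as the position of each element.
    positions = list(range(len(tokens)))
    return tokens, segments, visual_src, positions


def embed(inputs, tables, roi_feats, whole_feat):
    """Sum the four embeddings element-wise for every sequence element."""
    out = []
    for tok, seg, src, pos in zip(*inputs):
        vis = whole_feat if src == "whole" else roi_feats[src]
        parts = (tables["token"][tok], vis,
                 tables["segment"][seg], tables["position"][pos])
        out.append([sum(vals) for vals in zip(*parts)])
    return out


rng = random.Random(0)
inputs = build_inputs(["a", "dog", "runs"], num_rois=2)
tokens = inputs[0]
tables = {
    "token": make_table(sorted(set(tokens)), rng),
    "segment": make_table(["A", "C"], rng),
    "position": make_table(range(len(tokens)), rng),
}
# Random vectors standing in for RoI features from an object detector.
roi_feats = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(2)]
whole_feat = [rng.uniform(-1, 1) for _ in range(DIM)]

sequence = embed(inputs, tables, roi_feats, whole_feat)
print(len(sequence), len(sequence[0]))  # 7 8
```

The point of the sketch is that words and RoIs end up as same-sized vectors in one sequence, so an unmodified Transformer encoder can attend across both modalities.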

Experimental results

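Research questions

RQ1: Can a single unified Transformer-based model effectively learn and align visual and linguistic representations across multiple tasks?
RQ2: Does joint pre-training on visual-linguistic data and text-only data improve performance on downstream visual-linguistic tasks compared with single-domain pre-training?
RQ3: How do incorporating visual clues into MLM and adding RoI classification affect downstream tasks such as VCR, VQA, and RefCOCO+?
RQ4: Can the pre-trained VL-BERT model achieve state-of-the-art results on diverse benchmarks with a single-model architecture?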

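Key findings

VL-BERT achieves strong performance on multiple visual-linguistic tasks with a single unified model.
Pre-training on visual-linguistic data improves the final VCR task (Q→AR) by about 1.0 point over the baseline without pre-training.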
VL-BERT LARGE achieves competitive results: VCR val Q→A 75.5, QA→R 77.9; test Q→A 75.8, QA→R 78.4; RefCOCO+ (ground-truth regions) val 80.31, testA 83.62, testB 75.45; VQA test-dev 71.79, test-std 72.22.
On VQA, VL-BERT BASE and LARGE outperform the baseline without pre-training and surpass some concurrent methods in the single-model setting (e.g., LARGE reaches 71.79 on test-dev and 72.22 on test-std).
On RefCOCO+ with detected regions, VL-BERT LARGE also shows strong results (testA 78.57, testB 62.30).
At the time of publication, VL-BERT showed state-of-the-art performance on visual commonsense reasoning (VCR) among single-model approaches.
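
To make the three VCR figures easier to read, the sketch below shows how Q→A, QA→R, and the joint Q→AR accuracy relate. It assumes the usual VCR convention, which this review does not spell out: QA→R is scored with the correct answer given, and a Q→AR prediction counts only when both the answer and the rationale are right. The function name and the toy records are hypothetical and are not results from the paper.

```python
def vcr_accuracies(records):
    """Compute (Q->A, QA->R, Q->AR) accuracy.

    records: iterable of (answer_correct, rationale_correct) booleans,
    one pair per question.
    """
    records = list(records)
    n = len(records)
    q_a = sum(a for a, _ in records) / n
    qa_r = sum(r for _, r in records) / n
    # Joint metric: both the answer and the rationale must be correct.
    q_ar = sum(a and r for a, r in records) / n
    return q_a, qa_r, q_ar


# Hypothetical toy predictions, for illustration only.
toy = [(True, True), (True, False), (False, True), (True, True)]
q_a, qa_r, q_ar = vcr_accuracies(toy)
print(q_a, qa_r, q_ar)  # 0.75 0.75 0.5
```

Because the joint metric needs both parts to be right, Q→AR is always at most the smaller of the other two, which is why it sits well below the Q→A and QA→R scores.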

This review was generated by AI and checked by a human editor.