QUICK REVIEW

[Paper Review] Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

Gen Li, Nan Duan|arXiv (Cornell University)|Aug 16, 2019

Multimodal Machine Learning Applications | References: 40 | Citations: 117

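One-Line Summary

Unicoder-VL pre-trains a multi-layer Transformer with three cross-modal objectives to learn joint vision-language representations. After fine-tuning, it achieves strong image-text retrieval results and competitive visual commonsense reasoning results.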

ABSTRACT

We propose Unicoder-VL, a universal encoder that aims to learn joint representations of vision and language in a pre-training manner. Borrow ideas from cross-lingual pre-trained models, such as XLM and Unicoder, both visual and linguistic contents are fed into a multi-layer Transformer for the cross-modal pre-training, where three pre-trained tasks are employed, including Masked Language Modeling (MLM), Masked Object Classification (MOC) and Visual-linguistic Matching (VLM). The first two tasks learn context-aware representations for input tokens based on linguistic and visual contents jointly. The last task tries to predict whether an image and a text describe each other. After pretraining on large-scale image-caption pairs, we transfer Unicoder-VL to caption-based image-text retrieval and visual commonsense reasoning, with just one additional output layer. We achieve state-of-the-art or comparable results on both two tasks and show the powerful ability of the cross-modal pre-training.

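Motivation and Objectives

Motivate a universal cross-modal encoder that can handle long language sequences with visual context.
Leverage large-scale image-caption data to learn joint representations through cross-modal pre-training.
Design and evaluate three cross-modal pre-training tasks that align the vision and language modalities.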

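Proposed Method

A multi-layer Transformer initialized from BERT integrates visual region features and word tokens.
Image region embeddings and position features are injected and encoded jointly with the text tokens.
Pre-training uses three objectives: Masked Language Modeling (MLM), Masked Object Classification (MOC), and Visual-linguistic Matching (VLM).
MLM predicts masked words from the surrounding text and all image regions.
MOC predicts the object category of masked visual regions.
VLM trains a binary classifier that judges whether an image and a text describe each other.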
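
The three objectives listed above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the 15% masking rate, the `[MASK]` symbol, the use of detector labels as stand-ins for region features, and the unweighted sum of the three losses are all assumptions made for illustration, and every function name is hypothetical.

```python
import math
import random

MASK = "[MASK]"  # BERT-style mask symbol (assumed for illustration)


def mask_items(items, prob, rng):
    """Replace each item with MASK with probability `prob`.

    Returns the corrupted list and {position: original item} for the
    masked positions, which serve as prediction targets.
    """
    corrupted, targets = [], {}
    for i, item in enumerate(items):
        if rng.random() < prob:
            corrupted.append(MASK)
            targets[i] = item
        else:
            corrupted.append(item)
    return corrupted, targets


def cross_entropy(logits, target):
    """Negative log-softmax probability of class `target` (stable form)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mean_ce(pairs):
    """Average cross-entropy over (logits, target) pairs; 0.0 if empty."""
    if not pairs:
        return 0.0
    return sum(cross_entropy(l, t) for l, t in pairs) / len(pairs)


def pretraining_loss(mlm_pairs, moc_pairs, vlm_pair):
    """Sum of the MLM, MOC and VLM losses (equal weighting is an assumption)."""
    return mean_ce(mlm_pairs) + mean_ce(moc_pairs) + cross_entropy(*vlm_pair)


# Hypothetical usage: mask caption words and region labels independently.
rng = random.Random(0)
words, word_targets = mask_items(["a", "dog", "on", "grass"], 0.15, rng)
regions, region_targets = mask_items(["dog", "grass"], 0.15, rng)
```

In a real model the logits would come from the Transformer's output at each masked position, and the VLM logits from a binary head on the pooled representation; here they are plain lists so the arithmetic of the combined loss is easy to follow.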

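Experimental Results

Research Questions

RQ1: Can a single Transformer-based encoder learn robust cross-modal representations from image-caption data?
RQ2: Do the cross-modal pre-training objectives improve downstream image-text retrieval and visual commonsense reasoning?
RQ3: How do model scale and the amount of pre-training data affect cross-modal transfer performance?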

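Key Findings

After fine-tuning on MSCOCO and Flickr30K, the pre-trained Unicoder-VL achieves state-of-the-art results on image-text retrieval benchmarks.
Zero-shot retrieval with Unicoder-VL demonstrates general cross-modal grounding without task-specific fine-tuning.
Unicoder-VL shows competitive results on Visual Commonsense Reasoning (VCR), suggesting that cross-modal pre-training benefits cognition-level tasks.
Model performance improves as Transformer depth and the amount of pre-training data increase.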
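
To make the retrieval findings more concrete, the sketch below shows one common way to turn pairwise image-text match scores into a retrieval result and a Recall@K figure. This review does not state which metric the paper reports, so Recall@K, the score matrix, and the function names are assumptions chosen for illustration.

```python
def rank_candidates(scores):
    """Return candidate indices sorted by descending match score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def recall_at_k(score_matrix, gold, k):
    """Fraction of queries whose gold candidate appears in the top k.

    score_matrix[q][c] is a (hypothetical) match score between query q
    and candidate c; gold[q] is the index of the correct candidate.
    """
    hits = 0
    for query_idx, scores in enumerate(score_matrix):
        if gold[query_idx] in rank_candidates(scores)[:k]:
            hits += 1
    return hits / len(score_matrix)


# Hypothetical scores for 3 text queries over 3 candidate images.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.1, 0.8],
    [0.4, 0.5, 0.3],
]
gold = [0, 1, 2]
recall_1 = recall_at_k(scores, gold, 1)
recall_3 = recall_at_k(scores, gold, 3)
```

Scoring every query-candidate pair with a matching head is expensive for large candidate pools, which is one reason such evaluations are usually run on fixed, moderately sized test splits.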


This review was generated by AI and checked by a human editor.