QUICK REVIEW

[Paper Review] InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

Junyang Lin, Yang An|arXiv (Cornell University)|Mar 30, 2020

Multimodal Machine Learning Applications | References: 60 | Citations: 56

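One-line summary

InterBERT introduces a single-stream interaction module and a two-stream extraction module for vision-and-language pretraining, and combines the MGM and ITM-hn pretraining tasks. It outperforms baselines on image retrieval and VCR while maintaining strong single-modal performance, and it has led to a deployment on Taobao.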

ABSTRACT

Multi-modal pretraining for learning high-level multi-modal representation is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model owns strong capability of modeling interaction between the information flows of different modalities. The single-stream interaction module is capable of effectively processing information of multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance downgrade in single-modal tasks. We pretrain the model with three pretraining tasks, including masked segment modeling (MSM), masked region modeling (MRM) and image-text matching (ITM); and finetune the model on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods, and the analysis shows that MSM and MRM are effective for pretraining and our method can achieve performances comparable to BERT in single-modal tasks. Besides, we propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT which is the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from the mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and recently we deployed the model online for topic-based recommendation.

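Motivation and Objectives

Motivate robust multi-modal representation learning that goes beyond plain MLM/MOM by enabling strong cross-modal interaction.
Design a single-stream interaction module and a two-stream extraction module so that the independence of each modality is preserved.
Introduce two pretraining tasks to strengthen cross-modal understanding: Masked Group Modeling (MSM for text, MRM for images) and Image-Text Matching with Hard Negatives (ITM-hn).
Evaluate on downstream tasks (caption-based image retrieval, zero-shot retrieval, and Visual Commonsense Reasoning (VCR)), and analyze single-modal transfer and the effect of initialization.
Demonstrate deployability through an online Taobao deployment and A/B testing.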

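Proposed Method

Use single-stream all-attention to fuse the image and text embeddings.
Implement a two-stream extraction module that produces modality-specific representations for downstream use.
Pretrain with Masked Group Modeling (MSM for text, MRM for images) and Image-Text Matching with Hard Negatives (ITM-hn).
MSM masks contiguous spans of text; MRM masks image regions that have a high IoU with an anchor region.
ITM-hn builds challenging image-text pairs from hard negatives retrieved with TF-IDF.
Finetune on downstream tasks such as caption-based image retrieval, zero-shot retrieval, and VCR.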
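
The masking and hard-negative steps in the list above can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names, the single-span masking policy, the `iou_threshold=0.4` default, the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for the example, and the sketch retrieves a hard negative by caption similarity only.

```python
import math
import random
from collections import Counter


def mask_text_span(tokens, span_len, rng, mask_token="[MASK]"):
    """MSM-style sketch: replace one contiguous span of tokens with a mask token."""
    span_len = min(span_len, len(tokens))
    start = rng.randrange(len(tokens) - span_len + 1)
    masked = list(tokens)
    masked[start:start + span_len] = [mask_token] * span_len
    return masked, (start, start + span_len)


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mask_region_group(boxes, anchor_idx, iou_threshold=0.4):
    """MRM-style sketch: indices of regions whose IoU with the anchor is high.

    The threshold value is an assumption made for this example.
    """
    anchor = boxes[anchor_idx]
    return [i for i, box in enumerate(boxes) if iou(anchor, box) >= iou_threshold]


def tfidf_vectors(captions):
    """Sparse TF-IDF vectors over whitespace tokens, with a smoothed IDF."""
    docs = [Counter(c.lower().split()) for c in captions]
    df = Counter(term for doc in docs for term in doc)
    n = len(docs)
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in doc.items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hardest_negative(captions, idx):
    """ITM-hn-style sketch: index of the most TF-IDF-similar *other* caption."""
    vecs = tfidf_vectors(captions)
    candidates = [j for j in range(len(captions)) if j != idx]
    return max(candidates, key=lambda j: cosine(vecs[idx], vecs[j]))
```

For example, `mask_region_group([(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)], 0)` groups the first two boxes (their IoU is 0.81) and leaves the distant third box unmasked. Masking a whole group of overlapping regions, rather than a single region, keeps the model from recovering the masked content from a near-duplicate neighbour.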

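Experimental Results

Research Questions

RQ1: Can a multi-modal pretrained model benefit from unified all-attention interaction while still preserving modality independence?
RQ2: Do the MGM and ITM-hn pretraining tasks improve cross-modal understanding and downstream performance?
RQ3: How well does InterBERT transfer to single-modal NLP tasks compared with BERT?
RQ4: How does BERT initialization affect multi-modal pretraining performance?
RQ5: How does InterBERT perform against VilBERT and VL-BERT on standard vision-and-language benchmarks (IR, zero-shot IR, VCR)?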

Key Findings

Model	IR R@1	IR R@5	IR R@10	Zero-shot R@1	Zero-shot R@5	Zero-shot R@10	VCR Q→A	VCR QA→R	VCR Q→AR
SCAN (Lee et al., 2018)	48.6	77.7	85.2	-	-	-	-	-	-
R2C (Zellers et al., 2019)	-	-	-	-	-	-	63.8	67.2	43.1
VisualBERT (Li et al., 2019b)	-	-	-	-	-	-	70.8	73.2	52.2
VilBERT (Lu et al., 2019a)	58.2	84.9	91.5	31.9	61.1	72.8	72.4	74.5	54.0
VL-BERT (Su et al., 2019)	-	-	-	-	-	-	73.8	74.4	54.2
InterBERT (w/o pt)	53.1	80.6	87.9	-	-	-	63.6	63.1	40.3
InterBERT	61.9	87.1	92.7	49.2	77.6	86.0	73.1	74.8	54.9

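InterBERT outperforms strong baselines on image retrieval and VCR, with a marked gain on zero-shot image retrieval (49.2 vs. 31.9 R@1 for VilBERT).
On Flickr30K image retrieval (IR), InterBERT reaches 61.9% R@1, 87.1% R@5, and 92.7% R@10.
On zero-shot image retrieval, InterBERT reaches 49.2% R@1, 77.6% R@5, and 86.0% R@10.
On VCR, InterBERT reaches 73.1% on Q→A, 74.8% on QA→R, and 54.9% on Q→AR, ahead of the R2C and VilBERT baselines on all three; VL-BERT remains higher on Q→A (73.8%).
InterBERT without pretraining (w/o pt) trails the pretrained model (for example, 53.1 vs. 61.9 IR R@1), which indicates that multi-modal pretraining is effective.
On GLUE-style evaluations, InterBERT performs on par with or better than BERT-base on NLP tasks, so its single-modal capability stays close to that of BERT-base.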
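
As a reading aid for the table columns, the sketch below shows how Recall@K and the three VCR accuracies are conventionally computed. The function names and toy inputs are hypothetical, and the evaluation code behind the reported numbers may differ in detail. The sketch assumes one gold image per query, and it counts Q→AR as correct only when both the answer and the rationale are correct.

```python
def recall_at_k(rankings, gold_ids, k):
    """Fraction of queries whose gold item appears among the top-k ranked items."""
    hits = sum(1 for ranking, gold in zip(rankings, gold_ids) if gold in ranking[:k])
    return hits / len(gold_ids)


def vcr_accuracies(answer_correct, rationale_correct):
    """Return (Q->A, QA->R, Q->AR) accuracies from per-example correctness flags."""
    n = len(answer_correct)
    q_a = sum(answer_correct) / n
    qa_r = sum(rationale_correct) / n
    # Q->AR requires the answer AND the rationale to be right for the same example.
    q_ar = sum(a and r for a, r in zip(answer_correct, rationale_correct)) / n
    return q_a, qa_r, q_ar


# Toy data: three caption queries, each with a ranked list of image ids.
rankings = [
    ["img3", "img1", "img2"],
    ["img2", "img1", "img3"],
    ["img1", "img3", "img2"],
]
gold = ["img1", "img2", "img2"]
print(recall_at_k(rankings, gold, 1))  # 1 of 3 queries has the gold image at rank 1
print(recall_at_k(rankings, gold, 2))  # 2 of 3 queries have it within the top 2
print(vcr_accuracies([True, True, False, True], [True, False, True, True]))
```

Because Q→AR needs both sub-tasks to succeed on the same example, it can never exceed the lower of the Q→A and QA→R scores, which is why the Q→AR column in the table is much lower than the other two.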


This review was generated by AI and checked by a human editor.