QUICK REVIEW

[Paper Review] UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning

Wei Li, Can Gao|arXiv (Cornell University)|Dec 31, 2020

Multimodal Machine Learning Applications | References: 40 | Citations: 32

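One-line summary

UNIMO introduces a unified-modal pre-training paradigm that learns from text, images, and image-text pairs. Through cross-modal contrastive learning and multi-level text rewriting, it achieves strong performance on both single-modal and multimodal tasks.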

ABSTRACT

Existed pre-training methods either focus on single-modal tasks or multi-modal tasks, and cannot effectively adapt to each other. They can only utilize single-modal data (i.e. text or image) or limited multi-modal data (i.e. image-text pairs). In this work, we propose a unified-modal pre-training architecture, namely UNIMO, which can effectively adapt to both single-modal and multi-modal understanding and generation tasks. Large scale of free text corpus and image collections can be utilized to improve the capability of visual and textual understanding, and cross-modal contrastive learning (CMCL) is leveraged to align the textual and visual information into a unified semantic space over a corpus of image-text pairs. As the non-paired single-modal data is very rich, our model can utilize much larger scale of data to learn more generalizable representations. Moreover, the textual knowledge and visual knowledge can enhance each other in the unified semantic space. The experimental results show that UNIMO significantly improves the performance of several single-modal and multi-modal downstream tasks. Our code and pre-trained models are public at the UNIMO project page https://unimo-ptm.github.io/

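Motivation and objectives

Propose a unified-modal pre-training approach that can exploit large-scale unpaired text and image data.
Learn representations that align the visual and textual modalities in a shared semantic space.
Achieve strong performance on both single-modal language tasks and multimodal vision-language tasks.
Show that textual knowledge and visual knowledge enhance each other in cross-modal learning.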

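Proposed method

Use a unified-modal Transformer that processes text, image regions, and image-text pairs.
Apply cross-modal contrastive learning (CMCL) with text rewriting to create diverse positives and hard negatives for each image-text pair.
Augment each image-text pair with related texts and images retrieved from single-modal data.
Pre-train on images with masked visual feature reconstruction, combining a feature regression objective with a region classification objective.
Train a unified encoder-decoder for language modeling with both bidirectional prediction and Seq2Seq generation, sharing context across modalities.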
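
The contrastive step in the list above can be illustrated with a small sketch. This review does not give the loss formula, so the code below assumes a generic InfoNCE-style objective with multiple positives: the rewritten or retrieved positives compete against the hard negatives inside a temperature-scaled softmax. The function name `cmcl_loss`, the precomputed similarity scores, and the temperature value are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def cmcl_loss(pos_scores, neg_scores, temperature=0.1):
    """InfoNCE-style loss for one anchor image-text pair (illustrative sketch).

    pos_scores: similarity scores of positive image-text pairs
                (e.g. the original pair plus rewritten/retrieved positives).
    neg_scores: similarity scores of negative pairs (e.g. rewritten hard negatives).
    temperature: softmax temperature; 0.1 is an assumed value, not from the paper.
    """
    pos = np.asarray(pos_scores, dtype=float) / temperature
    neg = np.asarray(neg_scores, dtype=float) / temperature
    all_scores = np.concatenate([pos, neg])

    # Log-sum-exp with max subtraction for numerical stability.
    m = all_scores.max()
    log_numerator = m + np.log(np.exp(pos - m).sum())
    log_denominator = m + np.log(np.exp(all_scores - m).sum())

    # -log( sum_pos exp(s / t) / sum_all exp(s / t) ), always >= 0.
    return float(log_denominator - log_numerator)


if __name__ == "__main__":
    # Well-separated scores give a small loss; confusable ones give a larger loss.
    print(cmcl_loss([0.9, 0.8], [0.1, 0.0]))
    print(cmcl_loss([0.5, 0.4], [0.5, 0.4]))
```

In this form, adding more hard negatives enlarges the denominator, so the model is pushed to score the true pair clearly above near-miss rewrites rather than only above unrelated texts.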

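Experimental results

Research questions

RQ1: Can a single model learn effectively from text, images, and image-text pairs, and support both single-modal and multimodal tasks?
RQ2: Does cross-modal contrastive learning with multi-granularity text rewriting improve alignment in the unified semantic space?
RQ3: To what extent can textual knowledge and visual knowledge enhance each other during joint training?
RQ4: How does UNIMO perform on downstream tasks compared with previous single-modal PLMs and multimodal pre-training methods?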

Key findings

Model	Flickr30k-IR (R@1 / R@5 / R@10)	Flickr30k-TR (R@1 / R@5 / R@10)	SNLI-VE (Val / Test)	VQA (test-dev / test-std)	COCO Caption (BLEU4 / CIDEr)
UNIMO-base	74.66 / 93.40 / 96.08	89.70 / 98.40 / 99.10	80.00	73.79	38.8 / 124.4
UNIMO-large	78.04 / 94.24 / 97.12	89.40 / 98.90 / 99.80	81.11 / 80.63	75.06 / 75.27	39.6 / 127.7

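UNIMO-base and UNIMO-large achieve state-of-the-art results on multimodal tasks including image-text retrieval, visual reasoning, VQA, and image captioning. UNIMO-large outperforms the previous best model, ERNIE-ViL-large, by roughly 1.30 to 1.34 R@1 points on image and text retrieval.
UNIMO also performs strongly on single-modal language tasks, outperforming several PLMs and surpassing UniLM on many benchmarks.
Ablation experiments show that removing the text data (w/o texts) degrades multimodal tasks, while removing the visual data (w/o pairs&images) degrades single-modal tasks, which indicates mutual enhancement between the modalities.
Compared with a model trained on image-text pairs alone, using unpaired text and image data together with image-text pairs yields richer representations and better cross-modal alignment.
CMCL with text rewriting (at the sentence, phrase, and word levels) and retrieved augmentation improves cross-modal semantic alignment substantially over a naive image-text matching approach.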
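
For readers unfamiliar with the R@1 / R@5 / R@10 columns in the table above, the following is a minimal sketch of how Recall@K is typically computed from a query-by-candidate similarity matrix. It assumes a simplified one-to-one setup in which query i has exactly one correct candidate, at index i; real benchmarks such as Flickr30k pair each image with several captions, so this is an illustration of the metric rather than the evaluation protocol used in the paper. The function name `recall_at_k` is hypothetical.

```python
import numpy as np


def recall_at_k(similarity, k):
    """Recall@K in percent, assuming the correct candidate for query i is candidate i.

    similarity[i, j] is the model score of query i against candidate j.
    """
    sim = np.asarray(similarity, dtype=float)
    # Score the model assigns to each query's ground-truth candidate.
    true_scores = np.diag(sim)[:, None]
    # Zero-based rank of the ground truth: how many candidates score strictly higher.
    ranks = (sim > true_scores).sum(axis=1)
    # A query counts as a hit if its ground truth is among the top k candidates.
    return float((ranks < k).mean() * 100.0)


if __name__ == "__main__":
    # Toy 3x3 example: queries 0 and 2 rank their match first, query 1 ranks it last.
    sim = [
        [0.9, 0.1, 0.2],
        [0.8, 0.3, 0.5],
        [0.1, 0.2, 0.7],
    ]
    for k in (1, 2, 3):
        print(k, recall_at_k(sim, k))
```

Image retrieval (IR) uses texts as queries and images as candidates, and text retrieval (TR) swaps the roles, which is why the table reports the two directions separately.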

This review was generated by AI and checked by a human editor.