QUICK REVIEW

[Paper Review] InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

Junyang Lin, Yang An|arXiv (Cornell University)|Mar 30, 2020

Multimodal Machine Learning Applications|References: 60|Cited by: 56

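One-Sentence Summary

InterBERT introduces a single-stream interaction module and a two-stream extraction module for vision-and-language pretraining, together with two pretraining tasks, Masked Group Modeling (MGM) and Image-Text Matching with Hard Negatives (ITM-hn). It outperforms baselines on image retrieval and VCR while retaining strong single-modal performance, and it has been deployed online at Taobao.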

ABSTRACT

Multi-modal pretraining for learning high-level multi-modal representation is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model owns strong capability of modeling interaction between the information flows of different modalities. The single-stream interaction module is capable of effectively processing information of multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance downgrade in single-modal tasks. We pretrain the model with three pretraining tasks, including masked segment modeling (MSM), masked region modeling (MRM) and image-text matching (ITM); and finetune the model on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods, and the analysis shows that MSM and MRM are effective for pretraining and our method can achieve performances comparable to BERT in single-modal tasks. Besides, we propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT which is the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from the mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and recently we deployed the model online for topic-based recommendation.

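Research Motivation and Objectives

Advance robust multi-modal representation learning beyond simple MLM/MOM objectives by enabling strong cross-modal interaction.
Design InterBERT with a single-stream full-interaction module and a two-stream extraction module that preserves the independence of each modality.
Introduce two pretraining tasks, Masked Group Modeling and Hard-Negative Image-Text Matching, to improve cross-modal understanding.
Evaluate on downstream vision-and-language tasks (caption-based retrieval and VCR), and analyze single-modal transfer and the effect of initialization.
Demonstrate deployment potential through an online Taobao deployment with A/B testing.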

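Proposed Method

Fuse image and text embeddings with a single-stream, full-attention interaction module.
Apply a two-stream extraction module on top to produce modality-specific representations for downstream use.
Pretrain with Masked Group Modeling (MSM for text, MRM for images) and Image-Text Matching with Hard Negatives (ITM-hn).
MSM masks contiguous spans of text; MRM masks image regions that have a high IoU with an anchor region.
ITM-hn uses hard negatives retrieved with TF-IDF to build challenging image-text pairs.
Finetune on downstream tasks such as caption-based image retrieval, zero-shot retrieval, and Visual Commonsense Reasoning (VCR).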
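
The two masking schemes above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the span length, and the IoU threshold of 0.5 are hypothetical choices made for illustration, and the review does not state how the paper samples spans or anchors.

```python
import random

MASK = "[MASK]"  # placeholder token; the actual vocabulary entry is an assumption

def mask_text_span(tokens, span_len, rng):
    """MSM-style masking: replace one contiguous span of tokens with MASK."""
    span_len = min(span_len, len(tokens))
    start = rng.randrange(len(tokens) - span_len + 1)
    masked = list(tokens)
    for i in range(start, start + span_len):
        masked[i] = MASK
    return masked, (start, start + span_len)

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mask_region_group(boxes, iou_threshold, rng):
    """MRM-style masking: pick a random anchor region and return the indices of
    the anchor plus every region whose IoU with it reaches the threshold."""
    anchor = rng.randrange(len(boxes))
    return sorted(
        i for i, box in enumerate(boxes)
        if i == anchor or iou(boxes[anchor], box) >= iou_threshold
    )

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
tokens = "a dog runs across the green field".split()
masked, (s, e) = mask_text_span(tokens, 3, rng)

# Hypothetical region boxes: the first two overlap heavily, the third is far away.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
group = mask_region_group(boxes, 0.5, rng)
```

Masking a whole group at once means a masked token or region cannot be trivially recovered from its immediate neighbor or from a near-duplicate overlapping box, which is the intuition behind grouping in this sketch.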
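
Hard-negative mining for ITM-hn can be sketched in a similar way. The review only states that hard negatives are retrieved with TF-IDF; retrieving the most similar caption by cosine similarity, the smoothed IDF formula, and all names below are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(captions):
    """Build one sparse TF-IDF vector (dict: term -> weight) per caption."""
    docs = [Counter(c.lower().split()) for c in captions]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency per term
    # Smoothed IDF (an assumption) so terms found in every caption keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [
        {t: (cnt / sum(d.values())) * idf[t] for t, cnt in d.items()}
        for d in docs
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def hard_negative_caption(index, captions):
    """Index of the caption most similar to captions[index], excluding itself."""
    vecs = tfidf_vectors(captions)
    candidates = [j for j in range(len(captions)) if j != index]
    return max(candidates, key=lambda j: cosine(vecs[index], vecs[j]))

captions = [
    "a brown dog runs on the beach",
    "a brown dog sleeps on the sofa",
    "two people ride bicycles in the city",
]
neg = hard_negative_caption(0, captions)

# (image_id, caption_id, is_match): the true pair and its hard negative.
pairs = [(0, 0, 1), (0, neg, 0)]
```

In this toy corpus the second caption is selected as the negative for the first image, because it shares most of its words with the true caption. A mismatched pair of this kind is harder to reject than a randomly sampled caption.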

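Experimental Results

Research Questions

RQ1: Can a multi-modal pretrained model benefit from unified full-attention interaction while preserving modality independence?
RQ2: Do the MGM and ITM-hn pretraining tasks improve cross-modal understanding and downstream performance?
RQ3: How well does InterBERT transfer to single-modal NLP tasks compared with BERT?
RQ4: How does BERT initialization affect multi-modal pretraining performance?
RQ5: How does InterBERT compare with VilBERT and VL-BERT on standard vision-and-language benchmarks (IR, zero-shot IR, VCR)?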

Key Findings

Model	IR R@1	IR R@5	IR R@10	Zero-shot R@1	Zero-shot R@5	Zero-shot R@10	VCR Q→A	VCR QA→R	VCR Q→AR
SCAN (Lee et al., 2018)	48.6	77.7	85.2	-	-	-	-	-	-
R2C (Zellers et al., 2019)	-	-	-	-	-	-	63.8	67.2	43.1
VisualBERT (Li et al., 2019b)	-	-	-	-	-	-	70.8	73.2	52.2
VilBERT (Lu et al., 2019a)	58.2	84.9	91.5	31.9	61.1	72.8	72.4	74.5	54.0
VL-BERT (Su et al., 2019)	-	-	-	-	-	-	73.8	74.4	54.2
InterBERT (w/o pt)	53.1	80.6	87.9	-	-	-	63.6	63.1	40.3
InterBERT	61.9	87.1	92.7	49.2	77.6	86.0	73.1	74.8	54.9

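InterBERT outperforms strong baselines on image retrieval and on most VCR metrics, with an especially large gain in zero-shot image retrieval (R@1 of 49.2% vs. 31.9% for VilBERT).
On Flickr30K image retrieval (IR), InterBERT reaches 61.9% R@1, 87.1% R@5, and 92.7% R@10.
In zero-shot image retrieval, InterBERT reaches 49.2% R@1, 77.6% R@5, and 86.0% R@10.
On VCR, InterBERT reaches 73.1% on Q→A, 74.8% on QA→R, and 54.9% on Q→AR, ahead of the R2C and VilBERT baselines; VL-BERT is slightly higher on Q→A only (73.8%).
Without pretraining, InterBERT is clearly weaker (IR R@1 of 53.1% vs. 61.9%; VCR Q→AR of 40.3% vs. 54.9%), which shows the benefit of multi-modal pretraining.
GLUE-style results show that InterBERT is on par with BERT-base on NLP tasks, so its single-modal capability is preserved.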
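
For readers unfamiliar with the metrics in the table, the sketch below shows how Recall@K and the joint VCR score are commonly computed. It assumes the standard definitions (R@K is the share of queries whose correct image appears in the top K results; Q→AR counts a sample only when both the answer and the rationale are correct). The function names and toy data are hypothetical and are not taken from the paper.

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Fraction of queries whose correct item is among the top-k ranked candidates."""
    hits = sum(
        1 for ranked, gt in zip(ranked_lists, ground_truth) if gt in ranked[:k]
    )
    return hits / len(ground_truth)

def vcr_joint_accuracy(answer_correct, rationale_correct):
    """Q->AR accuracy: a sample counts only if answer and rationale are both correct."""
    both = sum(1 for a, r in zip(answer_correct, rationale_correct) if a and r)
    return both / len(answer_correct)

# Toy retrieval results: one ranked list of image ids per caption query.
ranked = [
    ["img3", "img1", "img7"],
    ["img2", "img5", "img9"],
    ["img4", "img8", "img6"],
]
truth = ["img1", "img2", "img0"]

r1 = recall_at_k(ranked, truth, 1)  # only the second query hits at rank 1: 1/3
r3 = recall_at_k(ranked, truth, 3)  # the first two queries hit within rank 3: 2/3

# Toy VCR outcomes for four samples.
joint = vcr_joint_accuracy([True, True, False, True], [True, False, False, True])  # 0.5
```

Because Q→AR requires both steps to be right, it is always at most the lower of the Q→A and QA→R accuracies, which is consistent with the much lower Q→AR column in the table.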


This review was generated by AI and reviewed by a human editor.