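[Paper Explained] InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining
InterBERT introduces a single-stream interaction module and a two-stream extraction module for vision-and-language pretraining, together with the MGM and ITM-hn pretraining tasks. It outperforms baselines on image retrieval and VCR while retaining strong single-modal performance, and it has been deployed online at Taobao.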
Multi-modal pretraining for learning high-level multi-modal representation is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model owns strong capability of modeling interaction between the information flows of different modalities. The single-stream interaction module is capable of effectively processing information of multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance downgrade in single-modal tasks. We pretrain the model with three pretraining tasks, including masked segment modeling (MSM), masked region modeling (MRM) and image-text matching (ITM); and finetune the model on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods, and the analysis shows that MSM and MRM are effective for pretraining and our method can achieve performances comparable to BERT in single-modal tasks. Besides, we propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT which is the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from the mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and recently we deployed the model online for topic-based recommendation.
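Motivation and Objectives
- Advance robust multi-modal representation learning beyond simple MLM/MOM by enabling strong cross-modal interaction.
- Design InterBERT with a single-stream all-attention interaction module and a two-stream extraction module that preserves the independence of each modality.
- Introduce two pretraining tasks, Masked Group Modeling (MGM) and Image-Text Matching with Hard Negatives (ITM-hn), to improve cross-modal understanding.
- Evaluate on downstream vision-and-language tasks (caption-based image retrieval and VCR), and analyze single-modal transfer and the effect of initialization.
- Demonstrate deployment potential through an online Taobao deployment and A/B testing.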
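Proposed Method
- Fuse image and text embeddings with a single-stream all-attention interaction module.
- Apply a two-stream extraction module to produce modality-specific representations for downstream tasks.
- Pretrain with Masked Group Modeling (MSM for text, MRM for images) and Image-Text Matching with Hard Negatives (ITM-hn).
- MSM masks contiguous text segments; MRM masks image regions that have a high IoU with an anchor region.
- ITM-hn uses hard negatives retrieved with TF-IDF to build challenging image-text pairs.
- Finetune on downstream tasks such as caption-based image retrieval, zero-shot retrieval, and Visual Commonsense Reasoning (VCR).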
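The two masking schemes behind Masked Group Modeling can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed span length, and the IoU threshold of 0.5 are assumptions chosen for illustration, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
import random

MASK = "[MASK]"

def mask_segment(tokens, span_len, rng):
    """MSM-style masking: replace one contiguous span of tokens with [MASK].

    span_len is an illustrative assumption, not a value from the paper.
    """
    span_len = min(span_len, len(tokens))
    start = rng.randrange(len(tokens) - span_len + 1)
    masked = list(tokens)
    for i in range(start, start + span_len):
        masked[i] = MASK
    return masked, (start, start + span_len)

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mask_region_group(boxes, anchor_idx, iou_threshold=0.5):
    """MRM-style masking: indices of the anchor and of every box whose IoU
    with the anchor is at least iou_threshold (threshold is an assumption)."""
    anchor = boxes[anchor_idx]
    return [i for i, box in enumerate(boxes) if iou(anchor, box) >= iou_threshold]

# Text side: mask a 3-token span of a caption.
tokens = "a dog runs across the green field".split()
masked_tokens, span = mask_segment(tokens, 3, random.Random(0))

# Image side: the second box overlaps the anchor heavily, the third not at all.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
masked_regions = mask_region_group(boxes, anchor_idx=0)  # [0, 1]
```

Masking the overlapping regions together with the anchor keeps the model from recovering a masked region simply by copying features from a near-duplicate box, which is the usual motivation for group-level masking.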
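Hard-negative selection for ITM-hn can be sketched in the same spirit. The summary only states that negatives are retrieved with TF-IDF; the sketch below assumes that the negative for an image is the caption most similar to its true caption under TF-IDF cosine similarity. The helper names and the smoothed IDF formula are illustrative choices, not details from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(captions):
    """Build a sparse TF-IDF vector (dict: term -> weight) for each caption."""
    docs = [Counter(c.lower().split()) for c in captions]
    n = len(docs)
    # Document frequency: iterating a Counter yields each distinct term once.
    df = Counter(term for doc in docs for term in doc)
    # Smoothed IDF (an assumption) so that ubiquitous terms keep a positive weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def hardest_negative(captions, idx):
    """Index of the caption most similar to captions[idx], excluding idx itself.

    Requires at least two captions.
    """
    vecs = tfidf_vectors(captions)
    candidates = [j for j in range(len(captions)) if j != idx]
    return max(candidates, key=lambda j: cosine(vecs[idx], vecs[j]))

captions = [
    "a brown dog runs on the grass",
    "a brown dog sleeps on the sofa",
    "two men play chess indoors",
]
# Pairing image 0 with caption 1 yields a mismatched but lexically close pair.
neg_idx = hardest_negative(captions, 0)  # 1
```

A negative chosen this way shares most of its words with the true caption, so the matching head cannot rely on coarse topic cues and has to attend to finer image-text alignment.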
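Experimental Results
Research Questions
- RQ1: Can a multi-modal pretrained model benefit from unified all-attention interaction while preserving the independence of each modality?
- RQ2: Do the MGM and ITM-hn pretraining tasks improve cross-modal understanding and downstream performance?
- RQ3: How well does InterBERT transfer to single-modal NLP tasks compared with BERT?
- RQ4: How does BERT initialization affect multi-modal pretraining performance?
- RQ5: How does InterBERT perform on standard vision-and-language benchmarks (IR, zero-shot IR, VCR) relative to VilBERT and VL-BERT?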
Key Findings
| Model | IR R@1 | IR R@5 | IR R@10 | Zero-shot IR R@1 | Zero-shot IR R@5 | Zero-shot IR R@10 | VCR Q→A | VCR QA→R | VCR Q→AR |
|---|---|---|---|---|---|---|---|---|---|
| SCAN (Lee et al., 2018) | 48.6 | 77.7 | 85.2 | - | - | - | - | - | - |
| R2C (Zellers et al., 2019) | - | - | - | - | - | - | 63.8 | 67.2 | 43.1 |
| VisualBERT (Li et al., 2019b) | - | - | - | - | - | - | 70.8 | 73.2 | 52.2 |
| VilBERT (Lu et al., 2019a) | 58.2 | 84.9 | 91.5 | 31.9 | 61.1 | 72.8 | 72.4 | 74.5 | 54.0 |
| VL-BERT (Su et al., 2019) | - | - | - | - | - | - | 73.8 | 74.4 | 54.2 |
| InterBERT (w/o pt) | 53.1 | 80.6 | 87.9 | - | - | - | 63.6 | 63.1 | 40.3 |
| InterBERT | 61.9 | 87.1 | 92.7 | 49.2 | 77.6 | 86.0 | 73.1 | 74.8 | 54.9 |
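- InterBERT outperforms strong baselines on image retrieval and on most VCR metrics (VL-BERT is higher on Q→A, 73.8 vs. 73.1), with an especially large gain in zero-shot image retrieval.
- On Flickr30K image retrieval (IR), InterBERT reaches 61.9% R@1, 87.1% R@5, and 92.7% R@10.
- In zero-shot image retrieval, InterBERT reaches 49.2% R@1, 77.6% R@5, and 86.0% R@10.
- On VCR, InterBERT reaches 73.1% on Q→A, 74.8% on QA→R, and 54.9% on Q→AR, ahead of the R2C and VilBERT baselines.
- InterBERT without pretraining (w/o pt) scores lower than the pretrained model on every reported metric, which indicates that multi-modal pretraining is effective.
- GLUE-style results show that InterBERT is comparable to BERT-base on NLP tasks, so it retains its single-modal capability.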
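For readers unfamiliar with the R@K columns, the metric can be sketched as follows. This is a generic Recall@K computation, assuming one gold image per text query; the function name and the toy identifiers are hypothetical and unrelated to the paper's evaluation code.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold image appears among the top-k results.

    ranked_ids: one ranked list of image ids per query (best match first).
    gold_ids:   the single correct image id for each query.
    """
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)

# Three toy queries, each with a ranking over three candidate images.
ranked = [
    ["img3", "img1", "img2"],  # gold img1 is at rank 2
    ["img2", "img3", "img1"],  # gold img2 is at rank 1
    ["img1", "img2", "img3"],  # gold img3 is at rank 3
]
gold = ["img1", "img2", "img3"]

r_at_1 = recall_at_k(ranked, gold, 1)  # 1 of 3 queries hit
r_at_2 = recall_at_k(ranked, gold, 2)  # 2 of 3 queries hit
r_at_3 = recall_at_k(ranked, gold, 3)  # all queries hit
```

Under this definition, the reported 61.9% R@1 means the correct image is ranked first for 61.9% of the caption queries.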
This explainer was generated by AI and reviewed by a human editor.