QUICK REVIEW

[Paper Review] InterBERT: An Effective Multi-Modal Pretraining Approach via Vision-and-Language Interaction

Junyang Lin, Yang An | arXiv (Cornell University) | Mar 30, 2020

Multimodal Machine Learning Applications | References: 64 | Cited by: 11

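One-Sentence Summary

InterBERT is a multi-modal pretraining framework that strengthens the interaction between vision and language through a single-stream interaction module and a two-stream extraction module, achieving effective cross-modal understanding while preserving single-modal performance. It introduces masked group modeling (MGM) and outperforms strong baselines on vision-and-language tasks such as image retrieval and visual commonsense reasoning.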

ABSTRACT

We propose a novel method for multi-modal pretraining, namely InterBERT (BERT for Interaction). The proposed architecture owns a strong capability of modeling interaction between the information flows of different modalities. The single-stream interaction module is capable of effectively processing information of multiple modalities, and the two-stream extraction module on top preserves the independence of each modality to avoid significant performance downgrade in single-modal tasks. The proposed pretraining task called masked group modeling (MGM) includes masked segment modeling and masked region modeling. It encourages the model to model a span or region instead of a single word or object, and it requires the model to learn from the general context. We pretrain the model with MGM and the conventional image-text matching, and finetune it on a series of vision-and-language downstream tasks, including caption-based image retrieval, zero-shot image retrieval, and visual commonsense reasoning. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods. The analysis shows that the proposed MGM is effective for pretraining, and our method for multi-modal pretraining can adapt to single-modal tasks without significant performance decrease in comparison with the BERT-base model.

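Motivation and Objectives

Improve multi-modal representation learning by explicitly modeling the interaction between the vision and language modalities.
Maintain strong performance on single-modal tasks without significant degradation, unlike earlier multi-modal models.
Develop a pretraining objective that captures contextual relations across spans or regions rather than single tokens or objects.
Evaluate how effectively masked group modeling (MGM) improves cross-modal understanding on downstream vision-and-language tasks.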

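Proposed Method

InterBERT fuses visual and textual features in a single-stream interaction module, which lets the two modalities attend to and interact with each other.
A two-stream extraction module on top preserves modality-specific representations, keeping the model robust on single-modal downstream tasks.
The proposed masked group modeling (MGM) task masks contiguous word spans in the text or contiguous regions in the image, and requires the model to reconstruct the masked parts from context.
MGM comprises masked segment modeling for text and masked region modeling for visual features, encouraging contextual reasoning across modalities.
The model is pretrained on large-scale image-text pairs with the MGM and image-text matching objectives.
The pretrained model is then fine-tuned on downstream tasks such as caption-based image retrieval, zero-shot image retrieval, and visual commonsense reasoning.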
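
The masking step behind MGM can be illustrated with a short sketch. It is not the authors' implementation: the function names `sample_spans` and `mask_group`, the masking ratio, the maximum span length, the `[MASK]` placeholder, the zero-vector replacement for regions, and the choice to treat image regions as an ordered list in which neighbouring indices form a group are all assumptions made here for illustration. The point it shows is only that whole groups of adjacent items are hidden, so a model cannot recover a masked item from its immediate neighbours alone.

```python
import random

MASK_TOKEN = "[MASK]"  # assumed placeholder symbol, not taken from the paper


def sample_spans(length, mask_ratio=0.15, max_span=3, rng=None):
    """Return sorted positions covered by randomly placed contiguous spans.

    At most round(length * mask_ratio) positions (and at least one, when
    length > 0) are selected. Spans never overlap, but two spans may end up
    adjacent and thus form one longer masked group.
    """
    if length == 0:
        return []
    rng = rng or random.Random()
    budget = min(length, max(1, round(length * mask_ratio)))
    masked = set()
    attempts = 0
    while len(masked) < budget and attempts < 10 * length:
        attempts += 1
        span_len = min(rng.randint(1, max_span), budget - len(masked))
        start = rng.randrange(0, length - span_len + 1)
        span = range(start, start + span_len)
        if any(i in masked for i in span):
            continue  # reject spans that overlap an earlier one
        masked.update(span)
    return sorted(masked)


def mask_group(items, mask_value, **kwargs):
    """Mask contiguous groups in `items`.

    Returns the masked copy and a dict {position: original item} holding the
    reconstruction targets.
    """
    positions = sample_spans(len(items), **kwargs)
    masked = list(items)
    targets = {}
    for i in positions:
        targets[i] = items[i]
        masked[i] = mask_value
    return masked, targets


# Masked segment modeling on a toy caption.
rng = random.Random(0)
tokens = "a man rides a brown horse on the beach".split()
masked_tokens, token_targets = mask_group(
    tokens, MASK_TOKEN, mask_ratio=0.3, rng=rng
)

# Masked region modeling on toy 4-dimensional region features;
# masked regions are replaced by a zero vector (an assumption).
regions = [[0.1 * i] * 4 for i in range(6)]
masked_regions, region_targets = mask_group(
    regions, [0.0] * 4, mask_ratio=0.3, rng=rng
)
```

In a real pretraining pipeline the targets would feed a reconstruction loss (token prediction for text, some feature-level or class-level objective for regions); which loss is used is outside what this summary states, so the sketch stops at producing the masked inputs and their targets.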

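Experimental Results

Research Questions

RQ1: How does masked group modeling (MGM) improve multi-modal pretraining compared with standard masked language modeling?
RQ2: Can a unified multi-modal architecture maintain high performance on single-modal tasks without fine-tuning?
RQ3: To what extent does modeling the interaction between modalities improve performance on downstream vision-and-language reasoning tasks?
RQ4: How does InterBERT compare with state-of-the-art multi-modal pretraining methods on benchmark tasks?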

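Key Findings

InterBERT outperforms strong baselines, including recent multi-modal pretraining models, on caption-based image retrieval and zero-shot image retrieval.
The masked group modeling (MGM) objective improves the model's ability to learn contextual representations across modalities.
The two-stream extraction module keeps InterBERT's performance on single-modal tasks comparable to BERT-base, avoiding a significant drop.
The analysis indicates that MGM pushes the model to attend to the general context, which leads to better generalization on vision-and-language tasks.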

This review was generated by AI and reviewed by human editors.