QUICK REVIEW

[Paper Review] InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining

Junyang Lin, Yang An|arXiv (Cornell University)|2020. 03. 30.

Multimodal Machine Learning Applications | References: 60 | Citations: 56

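One-Line Summary

InterBERT introduces a single-stream interaction mechanism for vision-language pretraining, combined with a two-stream extraction module and the MGM and ITM-hn pretraining tasks. It outperforms baselines on image retrieval and VCR, retains single-modal performance, and has been deployed on Taobao.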

ABSTRACT

Multi-modal pretraining for learning high-level multi-modal representation is a further step towards deep learning and artificial intelligence. In this work, we propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6 (MultiModality-to-MultiModality Multitask Mega-transformer). The model owns strong capability of modeling interaction between the information flows of different modalities. The single-stream interaction module is capable of effectively processing information of multiple modalities, and the two-stream module on top preserves the independence of each modality to avoid performance downgrade in single-modal tasks. We pretrain the model with three pretraining tasks, including masked segment modeling (MSM), masked region modeling (MRM) and image-text matching (ITM); and finetune the model on a series of vision-and-language downstream tasks. Experimental results demonstrate that InterBERT outperforms a series of strong baselines, including the most recent multi-modal pretraining methods, and the analysis shows that MSM and MRM are effective for pretraining and our method can achieve performances comparable to BERT in single-modal tasks. Besides, we propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT which is the first Chinese multi-modal pretrained model. We pretrain the Chinese InterBERT on our proposed dataset of 3.1M image-text pairs from the mobile Taobao, the largest Chinese e-commerce platform. We finetune the model for text-based image retrieval, and recently we deployed the model online for topic-based recommendation.

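Research Motivation and Goals

Motivate stronger multi-modal representation learning by enabling cross-modal interaction that goes beyond simple MLM/MOM.
Present a design that pairs a single-stream interaction module with a two-stream extraction module to preserve modality independence.
Introduce Masked Group Modeling and Hard-Negative Image-Text Matching to improve cross-modal understanding.
Evaluate on downstream vision-language tasks (caption-based retrieval and VCR) and analyze single-modal transferability and the effect of initialization.
Demonstrate deployability through an online deployment on Taobao with A/B testing.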

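Proposed Method

Uses a single-stream, all-attention interaction module to fuse image and text embeddings.
Implements a two-stream extraction module that produces modality-specific representations for downstream use.
Pretrains with Masked Group Modeling (MSM for text, MRM for images) and Image-Text Matching with Hard Negatives (ITM-hn).
MSM masks contiguous text segments; MRM masks image regions that have a high IoU with a chosen anchor region.
ITM-hn uses hard negatives retrieved via TF-IDF to build challenging image-text pairs.
Finetunes on downstream tasks including caption-based image retrieval, zero-shot retrieval, and Visual Commonsense Reasoning (VCR).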
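
The three pretraining signals described above can be illustrated with a small standard-library sketch. This is not the authors' implementation: the function names (mask_text_span, mask_region_group, hardest_negative), the span length, the IoU threshold, the (x1, y1, x2, y2) box format, the whitespace tokenization, and the smoothed IDF formula are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

MASK = "[MASK]"

def mask_text_span(tokens, span_len, rng):
    """MSM-style masking: replace one contiguous span of tokens with [MASK]."""
    span_len = min(span_len, len(tokens))
    start = rng.randrange(len(tokens) - span_len + 1)
    masked = list(tokens)
    masked[start:start + span_len] = [MASK] * span_len
    return masked, (start, start + span_len)

def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format (assumed)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def mask_region_group(boxes, anchor_idx, iou_threshold):
    """MRM-style masking: indices of regions whose IoU with the anchor meets
    the threshold. The anchor itself is included (IoU of 1 with itself)."""
    anchor = boxes[anchor_idx]
    return [i for i, box in enumerate(boxes) if iou(anchor, box) >= iou_threshold]

def tfidf_vectors(captions):
    """Sparse TF-IDF vectors over whitespace tokens, with a smoothed IDF (assumed)."""
    docs = [Counter(c.lower().split()) for c in captions]
    df = Counter(term for doc in docs for term in doc)  # document frequency
    n = len(docs)
    return [
        {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in doc.items()}
        for doc in docs
    ]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def hardest_negative(captions, query_idx):
    """ITM-hn-style sampling: index of the most TF-IDF-similar *other* caption.
    Assumes at least two captions."""
    vecs = tfidf_vectors(captions)
    scored = [(cosine(vecs[query_idx], v), i) for i, v in enumerate(vecs) if i != query_idx]
    return max(scored)[1]
```

Under these assumptions, pairing an image with the caption returned by hardest_negative gives a mismatched pair that shares much of its vocabulary with the true caption, which is what makes the matching task harder than one built on random negatives.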

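Experimental Results

Research Questions

RQ1: Can a multi-modal pretrained model benefit from all-attention interaction while preserving modality independence?
RQ2: Do the MGM and ITM-hn pretraining tasks improve cross-modal understanding and downstream performance?
RQ3: How well does InterBERT transfer to single-modal NLP tasks compared with BERT?
RQ4: What effect does BERT initialization have on multi-modal pretraining performance?
RQ5: How does InterBERT perform on standard vision-language benchmarks compared with ViLBERT and VL-BERT?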

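Key Results

InterBERT outperforms strong baselines on image retrieval and VCR, with notable gains in zero-shot image retrieval.
On Flickr30K image retrieval (IR), InterBERT achieves 61.9% R@1, 87.1% R@5, and 92.7% R@10.
On zero-shot image retrieval, InterBERT achieves 49.2% R@1, 77.6% R@5, and 86.0% R@10.
On VCR, InterBERT reaches 73.1% on Q→A, 74.8% on QA→R, and 54.9% on Q→AR, outperforming the R2C and ViLBERT baselines.
InterBERT without pretraining underperforms the pretrained version, which demonstrates the benefit of multi-modal pretraining.
On GLUE-style NLP tasks, InterBERT comes close to BERT-base, indicating that its single-modal capability is largely preserved.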
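
For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how Recall@K and the joint VCR score (Q→AR) are typically computed. The function names and the toy image IDs are hypothetical, and the sketch assumes one gold image per text query.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of queries whose gold image appears in the top-k ranked results.
    ranked_ids[i] is the ranked image list for query i; gold_ids[i] is its gold image."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)

def vcr_joint_accuracy(answer_correct, rationale_correct):
    """Q->AR accuracy: an example counts only if both the answer (Q->A)
    and the rationale (QA->R) are predicted correctly."""
    both = sum(1 for a, r in zip(answer_correct, rationale_correct) if a and r)
    return both / len(answer_correct)

# Toy data (hypothetical): three text queries, each with a ranked list of image IDs.
ranked = [["img3", "img1", "img7"], ["img2", "img5", "img9"], ["img4", "img8", "img6"]]
gold = ["img1", "img9", "img0"]
r_at_1 = recall_at_k(ranked, gold, 1)  # no gold image is ranked first
r_at_3 = recall_at_k(ranked, gold, 3)  # two of the three gold images are in the top 3
```

Because Q→AR requires both predictions to be right for the same example, it can never exceed the lower of the Q→A and QA→R scores.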

This review was generated by AI and reviewed by a human editor.