QUICK REVIEW

[Paper Review] WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training

Yuqi Huo, Manli Zhang | arXiv (Cornell University) | Mar 11, 2021

Multimodal Machine Learning Applications | 38 references | 85 citations

TL;DR

WenLan presents BriVL, a two-tower cross-modal contrastive pre-training model that uses MoCo-inspired queue-based dictionaries to supply many negative samples. Pre-trained on a Chinese dataset of 30M image-text pairs, it outperforms UNITER and CLIP on downstream vision-language tasks.

ABSTRACT

Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP that adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples in limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.

Motivation & Objective

Build robust multi-modal understanding under the weak image-text correlation that is common in web data.
Propose a two-tower cross-modal pre-training framework (BriVL) leveraging MoCo-inspired contrastive learning.
Construct a large Chinese multi-source image-text dataset (RUC-CAS-WenLan) for pre-training.
Demonstrate BriVL's effectiveness on image-text retrieval and image captioning, and show that it is practical to deploy.

Proposed method

Use a two-tower architecture with separate image and text encoders.
Adopt cross-modal contrastive learning with an InfoNCE loss to align image-text embeddings.
Incorporate a large momentum-updated dictionary (MoCo-style queues) to provide many negative samples.
Pre-train a 1B-parameter BriVL model on RUC-CAS-WenLan (30M image-text pairs), with plans to scale to 10B parameters.
Allow the encoders to be swapped for larger single-modal backbones, and support downstream tasks (retrieval, generation, visual dialog).
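
The contrastive objective and the queue-based dictionary described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the temperature of 0.07, the momentum of 0.99, and the toy embedding and queue sizes are all assumptions chosen for illustration, and only the image-to-text direction is shown.

```python
import math
import random
from collections import deque

def l2_normalize(v):
    """Scale a vector to unit length so dot products act as cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query: negative log-softmax of the positive logit.

    The temperature value is an assumed default, not taken from the review.
    """
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, k) / temperature for k in negative_keys]
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[0]

def momentum_update(momentum_params, query_params, m=0.99):
    """MoCo-style update: the momentum encoder slowly tracks the query encoder."""
    return [m * pk + (1.0 - m) * pq
            for pk, pq in zip(momentum_params, query_params)]

# Toy setup: a fixed-size queue of text embeddings acts as the negative dictionary.
random.seed(0)
dim, queue_size = 8, 16
text_queue = deque(maxlen=queue_size)
for _ in range(queue_size):
    text_queue.append(l2_normalize([random.gauss(0, 1) for _ in range(dim)]))

# One image embedding and a (noisy) matching text embedding as the positive key.
image_emb = l2_normalize([random.gauss(0, 1) for _ in range(dim)])
matching_text = l2_normalize([x + 0.05 * random.gauss(0, 1) for x in image_emb])

# Image-to-text loss: the image is the query, queued texts are the negatives.
loss_i2t = info_nce(image_emb, matching_text, list(text_queue))

# Enqueue the newest key; deque(maxlen=...) drops the oldest automatically.
text_queue.append(matching_text)
```

Because the negatives come from the queue rather than the current batch, the number of negatives is decoupled from the batch size, which is how a MoCo-style dictionary can offer many negatives under limited GPU memory. A symmetric text-to-image term would use a second queue of image embeddings.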

Experimental results

Research questions

RQ1: Can a two-tower, cross-modal contrastive framework with large negative dictionaries outperform single-tower models on noisy web image-text data?
RQ2: Does implicit (weak) cross-modal correlation modeling suffice for strong downstream performance in vision-language tasks?
RQ3: What is the impact of scaling BriVL (parameters, data) on retrieval and captioning benchmarks in Chinese multi-modal settings?
RQ4: How does BriVL compare with OpenAI CLIP and UNITER on Chinese multi-source data and related downstream tasks?

Key findings

BriVL outperforms CLIP and UNITER on image-text retrieval on the AIC-ICC validation set (image-to-text R@1: 20.3 vs. 13.4 for CLIP and 14.8 for UNITER; text-to-image R@1: 14.4 vs. 7.8 for CLIP and 9.8 for UNITER).
BriVL achieves the best image captioning results among the compared methods on AIC-ICC (CIDEr 220.7; BLEU 66.1; METEOR 41.1; ROUGE-L 71.9).
On the WenLan test set, BriVL yields substantial gains in retrieval (Image-to-Text R@1 36.1; Text-to-Image R@1 36.0) over CLIP and UNITER.
A user study corroborates BriVL’s superior retrieval quality versus CLIP, with further gains when BriVL is combined with UNITER.
BriVL offers fast inference (about the same speed as CLIP and roughly 20x faster than UNITER), which makes it feasible to serve through cloud APIs and to use for downstream tasks such as image-to-text generation.
The model was trained on 128 GPUs for 7 days; the authors plan a future 10B-parameter version trained on 500M image-text pairs.
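
The R@1 figures above are Recall@K scores. As a rough illustration of how such a metric is computed from a query-by-candidate similarity matrix, here is a small sketch. It assumes exactly one ground-truth candidate per query, located at the same index as the query; benchmarks with several captions per image need a slightly different matching rule. The function name and the toy matrix are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item (same index) ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumes one correct candidate per query, at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy 3x3 similarity matrix: queries 0 and 2 rank their match first,
# query 1 ranks its match second.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.1, 0.6],
]
r1 = recall_at_k(sim, 1)  # 2 of 3 queries hit at rank 1
r2 = recall_at_k(sim, 2)  # all 3 queries hit within the top 2
```

Papers usually report the value as a percentage, so a result of 0.203 from this function would correspond to an R@1 of 20.3.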

This review was created by AI and reviewed by human editors.