[Paper Walkthrough] WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training
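WenLan proposes BriVL, a MoCo-inspired two-tower model for cross-modal contrastive pre-training. It is trained on a Chinese dataset of 30M image-text pairs and aims to outperform UNITER and CLIP on downstream vision-language tasks.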
Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP that adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples in limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
Motivation and Objectives
- Motivate robust multi-modal understanding under weak image-text correlations common in web data.
- Propose a two-tower cross-modal pre-training framework (BriVL) leveraging MoCo-inspired contrastive learning.
- Construct a large Chinese multi-source image-text dataset (RUC-CAS-WenLan) for pre-training.
- Demonstrate BriVL's effectiveness on image-text retrieval and image captioning, and highlight its practical benefits for deployment.
Proposed Method
- Use a two-tower architecture with separate image and text encoders.
- Adopt cross-modal contrastive learning with an InfoNCE loss to align image-text embeddings.
- Incorporate a large momentum-updated dictionary (MoCo-style queues) to provide many negative samples.
- Pre-train on RUC-CAS-WenLan (30M image-text pairs) with a 1B-parameter BriVL model; plan to scale to 10B parameters.
- Enable easy replacement of the encoders with larger single-modal backbones, and support downstream tasks (retrieval, generation, visual dialog).
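The contrastive objective and the queue-based dictionary listed above can be sketched in a few lines. This is a minimal, framework-free illustration, not the authors' implementation. The function names, the temperature of 0.07, the momentum of 0.999, and the toy two-dimensional embeddings are all assumptions chosen for illustration. It shows one direction only (an image query scored against text keys); a symmetric setup would apply the same function with the roles of the two modalities swapped.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query: -log softmax of the positive logit.

    query         -- embedding from one tower (e.g. the image encoder)
    positive_key  -- embedding of the paired sample from the other tower
    negative_keys -- embeddings held in the queue (treated as negatives)
    temperature   -- assumed value, not taken from the paper
    """
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(k)) / temperature for k in negative_keys]
    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]


def momentum_update(key_weights, query_weights, m=0.999):
    """MoCo-style update: key encoder weights slowly track the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_weights, query_weights)]


def enqueue_dequeue(queue, new_keys, max_size):
    """FIFO dictionary: append the newest keys, drop the oldest beyond max_size."""
    return (queue + new_keys)[-max_size:]


# Toy example with hypothetical 2-D embeddings.
image_query = [1.0, 0.0]
matching_text_key = [0.9, 0.1]
text_queue = [[0.0, 1.0], [-1.0, 0.0]]

print("loss, matching pair:  ", info_nce(image_query, matching_text_key, text_queue))
print("loss, mismatched pair:", info_nce(image_query, [0.0, 1.0], text_queue))
```

Because the queue is decoupled from the mini-batch, the number of negatives is set by `max_size` rather than by how many pairs fit on the GPUs at once, which is the property the method relies on.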
Experimental Results
Research Questions
- RQ1: Can a two-tower, cross-modal contrastive framework with large negative dictionaries outperform single-tower models on noisy web image-text data?
- RQ2: Does implicit (weak) cross-modal correlation modeling suffice for strong downstream performance in vision-language tasks?
- RQ3: What is the impact of scaling BriVL (parameters, data) on retrieval and captioning benchmarks in Chinese multi-modal settings?
- RQ4: How does BriVL compare with OpenAI CLIP and UNITER on Chinese multi-source data and related downstream tasks?
Key Findings
- BriVL outperforms CLIP and UNITER on image-text retrieval on the AIC-ICC validation set (Image-to-Text: R@1 20.3 vs CLIP 13.4 and UNITER 14.8; Text-to-Image: R@1 14.4 vs CLIP 7.8 and UNITER 9.8).
- BriVL achieves best results on image captioning among the compared methods on AIC-ICC (CIDEr 220.7; BLEU 66.1; METEOR 41.1; ROUGE-L 71.9).
- On the WenLan test set, BriVL yields substantial gains in retrieval (Image-to-Text R@1 36.1; Text-to-Image R@1 36.0) over CLIP and UNITER.
- A user study corroborates BriVL’s superior retrieval quality versus CLIP, with further gains when BriVL is combined with UNITER.
- BriVL offers fast inference (comparable to CLIP and roughly 20x faster than UNITER), which makes it practical for cloud APIs and for downstream tasks such as image-to-text generation.
- The model was trained on 128 GPUs for 7 days; a future 10B-parameter version trained on 500M image-text pairs is planned.
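For readers unfamiliar with the R@1 figures quoted above, the following is a minimal sketch of how Recall@K is commonly computed from a query-by-candidate similarity matrix. It assumes exactly one ground-truth candidate per query, stored at the same index; benchmarks with several captions per image need a different ground-truth mapping. The function name and the toy scores are hypothetical and are not taken from the paper's evaluation code.

```python
def recall_at_k(similarity, k):
    """Percentage of queries whose ground-truth candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy scores: rows are image queries, columns are text candidates.
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.6],
]

print("R@1:", recall_at_k(toy_similarity, 1))
print("R@2:", recall_at_k(toy_similarity, 2))
```

In a two-tower model the similarity matrix comes from dot products between independently computed image and text embeddings, so candidate embeddings can be computed once and reused across queries.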
This walkthrough was generated by AI and reviewed by a human editor.