QUICK REVIEW

[Paper Review] Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm

Yangguang Li, Feng Liang|arXiv (Cornell University)|Oct 11, 2021

Multimodal Machine Learning Applications | References: 39 | Cited by: 127

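One-Sentence Summary

DeCLIP adds self-supervision, multi-view supervision, and nearest-neighbor supervision to contrastive language-image pre-training, improving its data efficiency and achieving strong zero-shot and transfer performance with significantly less data than CLIP.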

ABSTRACT

Recently, large-scale Contrastive Language-Image Pre-training (CLIP) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, thereby restricting its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, our DeCLIP can learn generic visual features more efficiently. Instead of using the single image-text contrastive supervision, we fully exploit data potential through the use of (1) self-supervision within each modality; (2) multi-view supervision across modalities; (3) nearest-neighbor supervision from other similar pairs. Benefiting from intrinsic supervision, our DeCLIP-ResNet50 can achieve 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above the CLIP-ResNet50 while using 7.1× less data. Our DeCLIP-ResNet50 outperforms its counterpart in 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and compute also works well in our framework. Our code, dataset and models are released at: https://github.com/Sense-GVT/DeCLIP

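Research Motivation and Goals

Learn data-efficient visual features from image-text pairs without relying on very large datasets.
Exploit the intrinsic supervision within each modality and across modalities to learn robust representations.
Introduce nearest-neighbor supervision to make use of similar descriptions that appear in other pairs.
Demonstrate data efficiency and transferability across multiple architectures and datasets.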

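Proposed Method

Build a two-tower setup, with one image encoder and one text encoder, on top of the CLIP framework.
Add self-supervision within each modality: SimSiam for images and masked language modeling (MLM) for text.
Introduce multi-view supervision by contrasting the 2×2 image-text pairs formed from augmented views of the image and the text.
Propose nearest-neighbor supervision: the nearest text embedding is sampled from a FIFO queue of embeddings and used as additional supervision.
Combine the losses as L_DeCLIP = (1-α-β-γ)L_CLIP + α(L_ISS + L_TSS) + βL_MVS + γL_NNS, where L_ISS and L_TSS are the image and text self-supervision losses, L_MVS is the multi-view loss, and L_NNS is the nearest-neighbor loss.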
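
The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function and class names (`info_nce`, `multi_view_loss`, `TextQueue`, `declip_loss`), the temperature of 0.07, and the choice to count only the three extra view pairs in the multi-view term are all hypothetical. The self-supervision terms L_ISS and L_TSS are passed in as precomputed numbers, since SimSiam and MLM are out of scope here.

```python
import math
from collections import deque


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]


def info_nce(img, txt, temperature=0.07):
    """Symmetric contrastive loss; row i of img is the positive for row i of txt."""
    img = [normalize(v) for v in img]
    txt = [normalize(v) for v in txt]
    logits = [[dot(i, t) / temperature for t in txt] for i in img]

    def cross_entropy(rows):
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            lse = m + math.log(sum(math.exp(x - m) for x in row))
            total += lse - row[k]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


def multi_view_loss(img, img_aug, txt, txt_aug):
    """Average loss over the three view pairs beyond the original (img, txt).

    Assumption: the original pair is already covered by the L_CLIP term.
    """
    pairs = [(img, txt_aug), (img_aug, txt), (img_aug, txt_aug)]
    return sum(info_nce(i, t) for i, t in pairs) / len(pairs)


class TextQueue:
    """FIFO queue of text embeddings; the oldest entries are dropped first."""

    def __init__(self, maxlen):
        self.items = deque(maxlen=maxlen)

    def push(self, batch):
        self.items.extend(normalize(v) for v in batch)

    def nearest(self, query):
        # Nearest neighbor by cosine similarity to the query text embedding.
        q = normalize(query)
        return max(self.items, key=lambda v: dot(q, v))


def declip_loss(l_clip, l_iss, l_tss, l_mvs, l_nns, alpha, beta, gamma):
    """Weighted sum of the five terms, following the formula in the list above."""
    return ((1 - alpha - beta - gamma) * l_clip
            + alpha * (l_iss + l_tss)
            + beta * l_mvs
            + gamma * l_nns)
```

In a training step under these assumptions, `TextQueue.nearest` would return a neighbor embedding for each text in the batch, and `info_nce` between the image embeddings and those neighbors would give the nearest-neighbor term. The weights α, β, γ are left as arguments because the summary does not state their values.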

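Experimental Results

Research Questions

RQ1: Can the intrinsic supervision in multimodal data improve the data efficiency of language-image pre-training?
RQ2: How do the self-supervision, multi-view, and nearest-neighbor signals affect zero-shot and transfer performance?
RQ3: How data-efficient and scalable is DeCLIP across different encoder architectures and dataset sizes?
RQ4: With less pre-training data, does DeCLIP maintain competitive or better performance on downstream tasks?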

Key Findings

DeCLIP-ResNet50 reaches 60.4% zero-shot top-1 accuracy on ImageNet, 0.8% above CLIP-ResNet50, while using 7.1× less data (roughly 56M image-text pairs, given CLIP's 400M).
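With an 88M data budget, DeCLIP-ResNet50 and DeCLIP-ViT-B/32 reach zero-shot accuracies of 62.5% and 66.2%, respectively, outperforming the corresponding CLIP models.
Scaling up to a larger model (RegNetY-64GF + BERT) reaches 73.7% zero-shot accuracy with 88M data, close to CLIP-R50×64, while consuming fewer resources.
Compared with CLIP, DeCLIP improves transfer performance on 8 of 11 downstream datasets (average gain of about 0.8%).
Ablation experiments show that the self-supervision, multi-view, and nearest-neighbor signals each contribute to the gains, with nearest-neighbor supervision bringing a notable improvement.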

This review was generated by AI and reviewed by a human editor.