QUICK REVIEW

[Paper Review] Selfie: Self-supervised Pretraining for Image Embedding

Trieu H. Trinh, Minh-Thang Luong|arXiv (Cornell University)|Jun 7, 2019

Multimodal Machine Learning Applications|References 42|Cited by 76

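One-Sentence Summary

Selfie pretrains an image encoder on a masked-patch prediction task in which the model must pick the correct patch from among distractor patches drawn from the same image; this improves downstream accuracy and training stability, especially when labeled data is limited.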

ABSTRACT

We introduce a pretraining technique called Selfie, which stands for SELFie supervised Image Embedding. Selfie generalizes the concept of masked language modeling of BERT (Devlin et al., 2019) to continuous data, such as images, by making use of the Contrastive Predictive Coding loss (Oord et al., 2018). Given masked-out patches in an input image, our method learns to select the correct patch, among other "distractor" patches sampled from the same image, to fill in the masked location. This classification objective sidesteps the need for predicting exact pixel values of the target patches. The pretraining architecture of Selfie includes a network of convolutional blocks to process patches followed by an attention pooling network to summarize the content of unmasked patches before predicting masked ones. During finetuning, we reuse the convolutional weights found by pretraining. We evaluate Selfie on three benchmarks (CIFAR-10, ImageNet 32 x 32, and ImageNet 224 x 224) with varying amounts of labeled data, from 5% to 100% of the training sets. Our pretraining method provides consistent improvements to ResNet-50 across all settings compared to the standard supervised training of the same network. Notably, on ImageNet 224 x 224 with 60 examples per class (5%), our method improves the mean accuracy of ResNet-50 from 35.6% to 46.7%, an improvement of 11.1 points in absolute accuracy. Our pretraining method also improves ResNet-50 training stability, especially on low data regime, by significantly lowering the standard deviation of test accuracies across different runs.

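Motivation and Goals

Reduce the labeled-data requirements of image models by exploiting unlabeled data.
Extend the idea of masked language modeling to continuous image data.
Propose a patch-based encoder-decoder architecture that uses contrastive classification to fill in masked regions.
Keep pretraining efficient by reusing part of the network during finetuning.
Demonstrate gains on CIFAR-10, ImageNet-32×32, and ImageNet-224×224 when labels are scarce.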

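Proposed Method

Encode image patches with a patch processing network P (the first 3 blocks of ResNet-50).
Aggregate the patch representations with an attention pooling network A (Transformer-based).
Mask a subset of the image patches and have the decoder identify the correct patch among distractor patches from the same image, trained with a cross-entropy loss.
Train the encoder and decoder jointly end to end; for finetuning, reuse the pretrained P and finetune the full ResNet-50 end to end on whole images.
Use positional embeddings for patches (dependent on image size) and partial parameter sharing to reduce computation.
During pretraining, have the decoder predict multiple correct patches at once so that the encoder computation is reused.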
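The patch-selection objective described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the identity "embedding" stands in for P, mean pooling stands in for the attention pooling network A, the position vectors are random and untrained, and scoring candidates by dot product with a query formed as "summary plus position embedding" is an assumption of this sketch. All function and variable names are hypothetical.

```python
import math
import random


def split_into_patches(image, size):
    """Cut a square image (list of rows) into flattened size x size patches, row-major."""
    n = len(image)
    return [
        [image[r + i][c + j] for i in range(size) for j in range(size)]
        for r in range(0, n, size)
        for c in range(0, n, size)
    ]


def mean_pool(vectors):
    """Stand-in for attention pooling (assumption): average the visible patch embeddings."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def patch_selection_loss(query, candidates, target_index):
    """Softmax cross-entropy for picking the true patch among the candidates."""
    logits = [sum(q * c for q, c in zip(query, cand)) for cand in candidates]
    top = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_norm = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_norm - logits[target_index]


rng = random.Random(0)
image = [[rng.random() for _ in range(4)] for _ in range(4)]  # toy 4x4 "image"
patches = split_into_patches(image, 2)  # 4 patches, 4 values each
embeddings = patches  # hypothetical identity stand-in for P
position = [[rng.gauss(0, 1) for _ in range(4)] for _ in patches]  # hypothetical, untrained

masked = [1, 3]  # patch positions hidden from the encoder (arbitrary choice)
visible = [i for i in range(len(patches)) if i not in masked]

# The encoder summary is computed once and reused for every masked position.
summary = mean_pool([embeddings[i] for i in visible])
# Each masked patch serves as a distractor for the other masked positions.
candidates = [embeddings[i] for i in masked]

total = 0.0
for slot, pos in enumerate(masked):
    query = [s + p for s, p in zip(summary, position[pos])]
    total += patch_selection_loss(query, candidates, slot)
print("mean loss over masked positions:", total / len(masked))
```

Because the loss only asks which candidate belongs at a position, no pixel values are ever regressed. With untrained components, as here, the printed value carries no meaning beyond showing that the pieces fit together.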

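Experimental Results

Research Questions

RQ1: Does self-supervised pretraining with patch-level masking and distractors improve image representations for downstream tasks?
RQ2: How does Selfie compare with a fully supervised baseline under different amounts of labeled data?
RQ3: How does pretraining affect training stability and run-to-run variability?
RQ4: What is the effect of using attention pooling and a hybrid convolution-attention architecture during finetuning?
RQ5: How does the abundance of unlabeled data relative to labeled data affect the gains from Selfie?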

Main Findings

Dataset	Labeled data	Supervised (%)	Selfie pretraining (%)	Δ (Selfie - supervised, points)
CIFAR-10	5%	75.9 ± 0.7	75.9 ± 0.4	0.0
CIFAR-10	8%	79.3 ± 1.0	80.3 ± 0.3	+1.0
CIFAR-10	20%	88.3 ± 0.3	89.1 ± 0.5	+0.8
CIFAR-10	100%	95.5 ± 0.2	95.7 ± 0.1	+0.2
ImageNet-32×32	5%	13.1 ± 0.8	18.3 ± 0.1	+5.2
ImageNet-32×32	10%	25.9 ± 0.5	30.2 ± 0.5	+4.3
ImageNet-32×32	20%	32.7 ± 0.4	33.5 ± 0.2	+0.8
ImageNet-32×32	100%	55.7 ± 0.6	56.4 ± 0.6	+0.7
ImageNet-224×224	5%	35.6 ± 0.7	46.7 ± 0.4	+11.1
ImageNet-224×224	10%	59.6 ± 0.2	61.9 ± 0.2	+2.3
ImageNet-224×224	20%	65.7 ± 0.2	67.1 ± 0.2	+1.4
ImageNet-224×224	100%	76.9 ± 0.2	77.0 ± 0.1	+0.1
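As a quick arithmetic check, the Δ column can be recomputed from the two mean-accuracy columns. The snippet below is only a sketch: the means are copied from the table above, and the `rows` list and `gains` helper are names introduced here for illustration.

```python
# (dataset, labeled fraction, supervised mean, Selfie mean), copied from the table above
rows = [
    ("CIFAR-10", "5%", 75.9, 75.9),
    ("CIFAR-10", "8%", 79.3, 80.3),
    ("CIFAR-10", "20%", 88.3, 89.1),
    ("CIFAR-10", "100%", 95.5, 95.7),
    ("ImageNet-32×32", "5%", 13.1, 18.3),
    ("ImageNet-32×32", "10%", 25.9, 30.2),
    ("ImageNet-32×32", "20%", 32.7, 33.5),
    ("ImageNet-32×32", "100%", 55.7, 56.4),
    ("ImageNet-224×224", "5%", 35.6, 46.7),
    ("ImageNet-224×224", "10%", 59.6, 61.9),
    ("ImageNet-224×224", "20%", 65.7, 67.1),
    ("ImageNet-224×224", "100%", 76.9, 77.0),
]


def gains(table):
    """Return (dataset, fraction, gain in accuracy points) for each row."""
    # Round to one decimal to match the table's precision and hide float noise.
    return [(d, f, round(selfie - sup, 1)) for d, f, sup, selfie in table]


for dataset, fraction, delta in gains(rows):
    print(f"{dataset:<18} {fraction:>5} {delta:+.1f}")

largest = max(gains(rows), key=lambda item: item[2])
print("largest gain:", largest)
```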

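Selfie matches or beats the supervised baseline on CIFAR-10, ImageNet-32×32, and ImageNet-224×224 at every labeled-data fraction tested, with generally larger gains as labeled data shrinks (CIFAR-10 at 5%, where the two tie, is the exception).
On ImageNet-224×224 with 5% of the labels, accuracy rises from 35.6% (supervised) to 46.7% (Selfie), a gain of 11.1 points.
Pretraining lowers the variability of test accuracy across runs and improves training stability, especially in low-data settings.
That 5% setting corresponds to 60 labeled examples per class; the gain shrinks as labeled data grows, to +2.3 points at 10%, +1.4 at 20%, and +0.1 at 100%.
In some low-data settings, a hybrid architecture (ResNet-36 plus attention pooling) finetuned from Selfie can outperform ResNet-50.
Selfie outperforms earlier unsupervised pretraining results on ImageNet, which suggests that it uses unlabeled data effectively for representation learning.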
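The stability finding can be checked against the ± values reported in the results table. The sketch below simply counts the settings in which the Selfie run-to-run standard deviation is lower than, equal to, or higher than the supervised one; the numbers are copied from the table, and the `stds` list and `compare_spread` helper are hypothetical names used only for this illustration.

```python
# (dataset, labeled fraction, supervised std, Selfie std), copied from the results table
stds = [
    ("CIFAR-10", "5%", 0.7, 0.4),
    ("CIFAR-10", "8%", 1.0, 0.3),
    ("CIFAR-10", "20%", 0.3, 0.5),
    ("CIFAR-10", "100%", 0.2, 0.1),
    ("ImageNet-32×32", "5%", 0.8, 0.1),
    ("ImageNet-32×32", "10%", 0.5, 0.5),
    ("ImageNet-32×32", "20%", 0.4, 0.2),
    ("ImageNet-32×32", "100%", 0.6, 0.6),
    ("ImageNet-224×224", "5%", 0.7, 0.4),
    ("ImageNet-224×224", "10%", 0.2, 0.2),
    ("ImageNet-224×224", "20%", 0.2, 0.2),
    ("ImageNet-224×224", "100%", 0.2, 0.1),
]


def compare_spread(table):
    """Count settings where the Selfie std is lower, equal, or higher than supervised."""
    counts = {"lower": 0, "equal": 0, "higher": 0}
    for _, _, supervised, selfie in table:
        if selfie < supervised:
            counts["lower"] += 1
        elif selfie > supervised:
            counts["higher"] += 1
        else:
            counts["equal"] += 1
    return counts


print("all settings:", compare_spread(stds))
print("5% settings: ", compare_spread([row for row in stds if row[1] == "5%"]))
```

On the reported numbers, the spread is lower with Selfie in all three 5% settings, which is consistent with the claim that the stability benefit is clearest when labeled data is scarce.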

This review was generated by AI and reviewed by human editors.