QUICK REVIEW

[Paper Review] Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey

Longlong Jing, Yingli Tian|arXiv (Cornell University)|Feb 16, 2019

Advanced Image and Video Retrieval Techniques | References: 164 | Cited by: 177

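One-Sentence Summary

This paper provides a comprehensive survey of self-supervised visual feature learning with deep ConvNets, covering architectures, pretext tasks, datasets, evaluation, and future directions.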

ABSTRACT

Large-scale labeled data are generally required to train deep neural networks in order to obtain better performance in visual feature learning from images or videos for computer vision applications. To avoid the extensive cost of collecting and annotating large-scale datasets, as a subset of unsupervised learning methods, self-supervised learning methods are proposed to learn general image and video features from large-scale unlabeled data without using any human-annotated labels. This paper provides an extensive review of deep learning-based self-supervised general visual feature learning methods from images or videos. First, the motivation, general pipeline, and terminologies of this field are described. Then the common deep neural network architectures that are used for self-supervised learning are summarized. Next, the main components and evaluation metrics of self-supervised learning methods are reviewed, followed by the commonly used image and video datasets and the existing self-supervised visual feature learning methods. Finally, quantitative performance comparisons of the reviewed methods on benchmark datasets are summarized and discussed for both image and video feature learning. The paper concludes by listing a set of promising future directions for self-supervised visual feature learning.

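Research Motivation and Objectives

Motivate the use of self-supervised learning to learn visual features from large-scale unlabeled data.
Review the network architectures and common pretext tasks used for self-supervised visual feature learning.
Summarize the datasets, evaluation protocols, and downstream tasks used to evaluate the learned features.
Provide quantitative performance comparisons and discuss promising future directions.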

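Proposed Method

Describes a general self-supervised learning pipeline in which a ConvNet is trained on a pretext task with automatically generated pseudo labels and then transferred to downstream tasks.
Categorizes learning paradigms (supervised, semi-supervised, weakly supervised, and unsupervised, with a focus on self-supervised) and formalizes their loss objectives.
Groups pretext tasks into four categories according to the data attribute used for supervision: generation-based, context-based, free semantic label-based, and cross-modal-based.
Outlines common image and video architectures (AlexNet, VGG, GoogLeNet, ResNet, DenseNet; 2D/3D ConvNets; LSTM-based models) and their role in feature learning.
Explains evaluation through downstream tasks (such as image classification, semantic segmentation, object detection, and human action recognition) and through qualitative visualization.
Summarizes commonly used image and video datasets and discusses how pretext tasks drive the quality of the learned features.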
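
The pipeline described above can be illustrated with a minimal sketch of pseudo-label generation. Rotation prediction is used as the pretext task purely for illustration; the names `rotate90` and `make_pretext_batch` and the toy nested-list image format are assumptions made here, not code from the survey. The pseudo label is simply the index of the rotation that was applied, so no human annotation is involved.

```python
import random
from typing import List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values (assumed format)


def rotate90(image: Image, k: int) -> Image:
    """Rotate a 2D image counter-clockwise by k * 90 degrees."""
    out = [row[:] for row in image]
    for _ in range(k % 4):
        # Transpose, then reverse the row order: one 90-degree counter-clockwise turn.
        out = [list(col) for col in zip(*out)][::-1]
    return out


def make_pretext_batch(images: List[Image], seed: int = 0) -> Tuple[List[Image], List[int]]:
    """Build (input, pseudo label) pairs for a rotation-prediction pretext task.

    Each image is rotated by a randomly chosen multiple of 90 degrees, and the
    pseudo label is the rotation index in {0, 1, 2, 3}.
    """
    rng = random.Random(seed)
    inputs, pseudo_labels = [], []
    for image in images:
        k = rng.randrange(4)
        inputs.append(rotate90(image, k))
        pseudo_labels.append(k)
    return inputs, pseudo_labels


if __name__ == "__main__":
    batch = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    inputs, pseudo_labels = make_pretext_batch(batch)
    print(inputs, pseudo_labels)
```

In a full pipeline, a ConvNet would be trained to predict `pseudo_labels` from `inputs`, and the learned weights would then be transferred to a downstream task.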

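Experimental Results

Research Questions

RQ1: Which pretext tasks and architecture choices yield transferable, high-quality visual features in self-supervised learning?
RQ2: How does the performance of self-supervised features differ across downstream tasks such as image classification, segmentation, detection, and action recognition?
RQ3: What strategies are effective for evaluating and benchmarking self-supervised visual feature learning methods?
RQ4: Which future directions can narrow the gap between self-supervised and supervised learning on vision tasks?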

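Key Findings

Self-supervised methods can learn transferable visual features from large-scale unlabeled data without human annotation.
Pretext tasks fall into four categories (generation-based, context-based, free semantic label-based, and cross-modal-based), each of which guides feature learning in its own way.
Common downstream evaluations include image classification, semantic segmentation, object detection, and action recognition, which are used to assess how well the features generalize.
Self-supervised models pre-trained on large-scale data can speed up training and improve downstream performance, narrowing the gap with supervised methods.
The paper provides quantitative performance comparisons across methods and datasets, highlighting trends and areas that need improvement.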
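
One simple way to check how well frozen features generalize is nearest-neighbor classification on a labeled downstream set. The sketch below is an illustrative assumption, not a protocol taken from the survey: the function names and the use of cosine similarity are chosen here, and the feature vectors are assumed to come from an already pre-trained network.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbor_accuracy(
    train_feats: List[Sequence[float]],
    train_labels: List[int],
    test_feats: List[Sequence[float]],
    test_labels: List[int],
) -> float:
    """Top-1 accuracy when each test feature takes the label of its most similar training feature."""
    correct = 0
    for feat, label in zip(test_feats, test_labels):
        sims = [cosine_similarity(feat, t) for t in train_feats]
        best = max(range(len(sims)), key=sims.__getitem__)
        correct += int(train_labels[best] == label)
    return correct / len(test_labels)


if __name__ == "__main__":
    # Toy 2-D "features"; real ones would be extracted by the pre-trained ConvNet.
    train_feats = [[1.0, 0.0], [0.0, 1.0]]
    train_labels = [0, 1]
    test_feats = [[0.9, 0.1], [0.2, 0.8]]
    test_labels = [0, 1]
    print(nearest_neighbor_accuracy(train_feats, train_labels, test_feats, test_labels))
```

Because the features stay fixed, a score like this reflects the quality of the learned representation rather than the capacity of a fine-tuned classifier.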


This review was generated by AI and reviewed by human editors.