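[Paper Explained] Small Data Challenges in Big Data Era: A Survey of Recent Progress on Unsupervised and Semi-Supervised Methods
This survey reviews recent progress in unsupervised and semi-supervised representation learning for tackling small-data challenges in the big data era. It brings together the principles of transformation equivariance, disentangled representations, and self-supervision, and integrates them into generative models (autoencoders, generative adversarial networks, flow-based models, Transformers). Self-supervised regularization is used to improve generalization when very little labeled data is available, with state-of-the-art performance reported on downstream tasks.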
Representation learning with small labeled data has emerged in many problems, since the success of deep neural networks often relies on the availability of a huge amount of labeled data that is expensive to collect. To address this, many efforts have been made to train sophisticated models with few labeled data in an unsupervised and semi-supervised fashion. In this paper, we review recent progress on these two major categories of methods. A wide spectrum of models is categorized in a big picture, where we show how they interplay with each other to motivate explorations of new ideas. We review the principles of learning transformation equivariant, disentangled, self-supervised and semi-supervised representations, all of which underpin the foundation of recent progress. Many implementations of unsupervised and semi-supervised generative models have been developed on the basis of these criteria, greatly expanding the territory of existing autoencoders, generative adversarial nets (GANs) and other deep networks by exploring the distribution of unlabeled data for more powerful representations. We discuss emerging topics by revealing the intrinsic connections between unsupervised and semi-supervised learning, and propose future directions to bridge the algorithmic and theoretical gap between transformation equivariance for unsupervised learning and supervised invariance for supervised learning, and to unify unsupervised pretraining and supervised finetuning. We also provide a broader outlook of future directions to unify transformation and instance equivariances for representation learning, connect unsupervised and semi-supervised augmentations, and explore the role of self-supervised regularization for many learning problems.
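Research Motivation and Goals
- Address the challenge of training deep models when unlabeled data is abundant but labeled data is limited.
- Systematically categorize recent progress in unsupervised and semi-supervised representation learning.
- Integrate core principles such as transformation equivariance, disentanglement, and self-supervision into generative models in a unified way.
- Explore the connections between unsupervised and semi-supervised learning in order to narrow the gap between pretraining and finetuning.
- Propose future directions, at both the theoretical and algorithmic level, for applying self-supervised regularization across tasks.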
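Proposed Methods
- Categorize unsupervised and semi-supervised methods according to principles such as transformation equivariance and disentanglement.
- Review generative models including autoencoders, generative adversarial networks, flow-based networks, autoregressive models, and Transformers.
- In semi-supervised training, add self-supervised losses such as rotation prediction, jigsaw puzzles, and instance-transformation prediction as regularization terms.
- Apply self-supervised regularization to domain adaptation by aligning the source and target domains through shared representation learning.
- Propose a unified framework for transformation equivariance and instance equivariance in representation learning.
- Use teacher-student models and contrastive learning to improve feature generalization without task-specific labels.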
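To make the idea of a self-supervised loss acting as a regularizer more concrete, here is a minimal sketch in plain Python. It is not the survey's implementation: the function names (`rotate90`, `rotation_task`, `semi_supervised_loss`), the default `weight=0.5`, and the assumption that some model has already produced the logits are all illustrative choices. The sketch shows how rotation labels come for free from unlabeled images, and how the resulting rotation-prediction loss is added to the supervised loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])

def rotate90(image, k):
    """Rotate an image (list of rows) by k * 90 degrees counter-clockwise."""
    for _ in range(k % 4):
        # Transpose, then reverse the row order.
        image = [list(row) for row in zip(*image)][::-1]
    return image

def rotation_task(image):
    """Build the pretext task: four rotated copies, each labeled by its rotation index."""
    return [(rotate90(image, k), k) for k in range(4)]

def semi_supervised_loss(class_logits, class_labels, rot_logits, rot_labels, weight=0.5):
    """Supervised loss on labeled data plus a weighted rotation-prediction loss.

    The logits are assumed to come from a shared encoder with two heads
    (hypothetical here); only the loss combination is shown.
    """
    sup = sum(cross_entropy(l, y) for l, y in zip(class_logits, class_labels)) / len(class_labels)
    ssl = sum(cross_entropy(l, y) for l, y in zip(rot_logits, rot_labels)) / len(rot_labels)
    return sup + weight * ssl, sup, ssl

if __name__ == "__main__":
    unlabeled = [[1, 2], [3, 4]]
    for rotated, label in rotation_task(unlabeled):
        print(label, rotated)
```

The weight on the self-supervised term is a hyperparameter; in practice it would be tuned alongside the model, and the same pattern applies to other pretext tasks such as jigsaw prediction.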
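Experimental Results
Research Questions
- RQ1: How can transformation equivariance be used to improve unsupervised representation learning?
- RQ2: What role does disentangled representation learning play in improving interpretability and generalization?
- RQ3: In semi-supervised learning, how can self-supervised losses be combined effectively with supervised signals?
- RQ4: In what ways does self-supervised regularization improve domain adaptation across data distributions?
- RQ5: At the theoretical and algorithmic level, which connections can unify unsupervised pretraining and supervised finetuning?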
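Key Findings
- Self-supervised losses such as rotation prediction and jigsaw prediction markedly improve generalization in semi-supervised learning, reaching state-of-the-art performance on benchmark datasets.
- Transformation equivariant representations obtained through self-supervised learning enable robust feature learning even when labeled data is extremely scarce.
- Generative models such as generative adversarial networks and flow-based networks benefit from self-supervised regularization, with improved sample diversity and fidelity.
- Self-supervised domain adaptation methods narrow the domain gap by aligning representations across domains, without requiring labeled data in the target domain.
- Combining self-supervised regularization with supervised learning reduces the dependence on large-scale labeled data while maintaining high accuracy.
- A unified framework for transformation equivariance and instance equivariance is proposed as a future direction for improving robustness and generalization across tasks.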
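The distinction between equivariance and invariance that runs through these findings can be illustrated with a toy example. The sketch below is an assumption-laden illustration, not a method from the survey: it uses a 1-D signal, cyclic shifts as the transformation, a circular convolution as a stand-in "encoder", and sum pooling as a stand-in "classifier head". All function names are hypothetical. Shifting the input shifts the features in the same way (equivariance), while the pooled output does not change at all (invariance).

```python
def cyclic_shift(signal, steps):
    """Shift a 1-D signal to the right by `steps`, wrapping around."""
    n = len(signal)
    return [signal[(i - steps) % n] for i in range(n)]

def circular_conv(signal, kernel):
    """Circular convolution: a toy encoder that is equivariant to cyclic shifts."""
    n = len(signal)
    return [
        sum(kernel[j] * signal[(i - j) % n] for j in range(len(kernel)))
        for i in range(n)
    ]

def sum_pool(features):
    """Global sum pooling: discards position, so the output is shift-invariant."""
    return sum(features)

def is_equivariant(encoder, signal, steps):
    """Check encoder(shift(x)) == shift(encoder(x)) for one signal and one shift."""
    return encoder(cyclic_shift(signal, steps)) == cyclic_shift(encoder(signal), steps)

if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    kernel = [1, -1, 2]
    encoder = lambda s: circular_conv(s, kernel)

    # Features move with the input: equivariance.
    print(is_equivariant(encoder, x, steps=2))
    # Pooled output ignores the shift: invariance.
    print(sum_pool(encoder(cyclic_shift(x, 2))) == sum_pool(encoder(x)))
```

Integer inputs are used so the equality checks are exact. An equivariant representation keeps the information about which transformation was applied, which is what pretext tasks such as rotation prediction rely on, whereas an invariant output deliberately throws it away.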
This summary was generated by AI and reviewed by a human editor.