QUICK REVIEW

[论文解读] Incremental Self-training for Semi-supervised Learning

Jifeng Guo, Zhulin Liu|arXiv (Cornell University)|Apr 14, 2024

Domain Adaptation and Few-Shot Learning被引用 5

一句话总结

增量自训练（IST）在簇和序列批次中处理未标记数据，优先考虑高置信样本以提升准确性和训练速度，在图像分类任务上超越基线和部分SOTA。

ABSTRACT

Semi-supervised learning provides a solution to reduce the dependency of machine learning on labeled data. As one of the efficient semi-supervised techniques, self-training (ST) has received increasing attention. Several advancements have emerged to address challenges associated with noisy pseudo-labels. Previous works on self-training acknowledge the importance of unlabeled data but have not delved into their efficient utilization, nor have they paid attention to the problem of high time consumption caused by iterative learning. This paper proposes Incremental Self-training (IST) for semi-supervised learning to fill these gaps. Unlike ST, which processes all data indiscriminately, IST processes data in batches and priority assigns pseudo-labels to unlabeled samples with high certainty. Then, it processes the data around the decision boundary after the model is stabilized, enhancing classifier performance. Our IST is simple yet effective and fits existing self-training-based semi-supervised learning methods. We verify the proposed IST on five datasets and two types of backbone, effectively improving the recognition accuracy and learning speed. Significantly, it outperforms state-of-the-art competitors on three challenging image classification tasks.

研究动机与目标

推动在SSL中减少标注数据依赖并高效利用未标注数据。
提出IST，通过聚类依据置信度来区分未标注数据。
实现批量化、序列化数据处理以加速收敛。
证明IST与现有自训练骨干网络和数据集的兼容性。

提出的方法

在初始化时对未标注数据进行聚类，以创建基于置信度的查询清单。
以高置信样本优先的顺序分配伪标签。
以序列批次处理未标注数据以更新分类器。
使用多种聚类方法研究对性能和速度的影响。
比较迭代式和非迭代式骨干网络以展示IST的普适性。

实验结果

研究问题

RQ1增量、簇化和分批伪标签是否能比标准自训练在SSL的准确性和收敛速度上带来提升？
RQ2聚类方法的选择如何影响IST的性能和训练时间？
RQ3IST在迭代式与非迭代式骨干设置下是否保持或提升鲁棒性？

主要发现

方法	聚类与列表	准确度(%)	时间(s)
ST		89.30	57321.65
IST	w/ K-Means	93.17	44796.71
IST	w/ MiniBMean	93.76	44076.97
IST	w/ Meanshift	94.25	157669.75
ST		86.87	156.63
IST	w/ BIRCH	88.97	99.40
IST	w/ K-Means	90.60	97.85
IST	w/ MeanShift	93.28	91.94

在测试设定下，IST在平均准确率上比标准自训练提升6.41%。
在具有挑战性的图像数据集上，IST比SOTA方法提升4%。
相较于跨簇的标准自训练，IST将学习时间大约降低40-50%。
不同聚类方法在准确性与时间之间产生不同的权衡；MeanShift可以提升准确性，但可能大幅增加聚类时间。
IST有效提高准确性与收敛速度，并且可以避免在某些骨干网络中ST出现的部分准确率下降。

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。