QUICK REVIEW

[Paper Review] Learning from Noisy Labels with Distillation

Yuncheng Li, Shuicheng Yan|arXiv (Cornell University)|Mar 7, 2017

Machine Learning and Data Classification | References: 18 | Cited by: 58

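One-Sentence Summary

Proposes a distillation-based framework for learning from a large noisy dataset when only a small clean set is available, uses a knowledge graph to guide the distillation, and introduces real-world noisy-label benchmarks.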

ABSTRACT

The ability of learning from noisy labels is very useful in many visual recognition tasks, as a vast amount of data with noisy labels are relatively easy to obtain. Traditionally, the label noises have been treated as statistical outliers, and approaches such as importance re-weighting and bootstrap have been proposed to alleviate the problem. According to our observation, the real-world noisy labels exhibit multi-mode characteristics as the true labels, rather than behaving like independent random outliers. In this work, we propose a unified distillation framework to use side information, including a small clean dataset and label relations in knowledge graph, to "hedge the risk" of learning from noisy labels. Furthermore, unlike the traditional approaches evaluated based on simulated label noises, we propose a suite of new benchmark datasets, in Sports, Species and Artifacts domains, to evaluate the task of learning from noisy labels in the practical setting. The empirical study demonstrates the effectiveness of our proposed method in all the domains.

Motivation and Objectives

Motivate learning from large-scale noisy datasets where clean labels are scarce.
Propose a distillation framework that leverages a small clean dataset to guide learning from noisy data.
Integrate a knowledge graph to propagate label confidence and reduce model variance.
Create real-world benchmark datasets with label noise to evaluate practicality.

Proposed Method

Train an auxiliary model on a small clean dataset.
Train a primary model on the full noisy dataset using a distillation loss that combines noisy labels with auxiliary predictions (pseudo labels).
Introduce a knowledge-graph-guided soft label (GSi) to further guide training.
Derive theoretical insight showing the pseudo-label risk can be reduced compared to using only noisy or clean labels.
Construct real-world benchmark datasets from YFCC100M across Sports, Species, and Artifacts with partial clean data.
Evaluate against baselines including training on clean, training on noisy, bootstrapping, label smoothing, and various reweighting methods.
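The training steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a multi-label setting with sigmoid outputs, a convex mix of noisy labels and auxiliary predictions controlled by a hypothetical weight `lam`, and a simple row-normalized label-relation matrix standing in for the knowledge graph. The function names, `lam`, and `relation` are all illustrative; only the temperature `T` is named in this review.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))


def soft_labels(aux_logits, temperature=1.0):
    """Temperature-scaled predictions of the auxiliary model (trained on clean data)."""
    return sigmoid(np.asarray(aux_logits, dtype=float) / temperature)


def graph_guided(soft, relation):
    """Spread soft-label confidence to related labels.

    Assumption: `relation` is a non-negative (labels x labels) matrix whose
    row i weights the labels that support label i. This propagation rule is
    hypothetical and only illustrates the idea of graph guidance.
    """
    relation = np.asarray(relation, dtype=float)
    weights = relation / relation.sum(axis=1, keepdims=True)  # row-normalize
    return np.asarray(soft, dtype=float) @ weights.T


def pseudo_labels(noisy, soft, lam=0.5):
    """Convex mix of noisy labels and soft labels (lam is an assumed weight)."""
    return lam * np.asarray(noisy, dtype=float) + (1.0 - lam) * np.asarray(soft, dtype=float)


def distillation_loss(student_logits, targets, eps=1e-7):
    """Mean binary cross-entropy of the primary model against the pseudo labels."""
    p = np.clip(sigmoid(student_logits), eps, 1.0 - eps)
    t = np.asarray(targets, dtype=float)
    return float(np.mean(-(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))))


# Toy usage: one image, three labels, label 0 and label 1 related in the graph.
noisy = np.array([[1.0, 0.0, 0.0]])
aux_logits = np.array([[2.0, 1.0, -3.0]])
relation = np.array([[1.0, 0.5, 0.0],
                     [0.5, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])

soft = graph_guided(soft_labels(aux_logits, temperature=2.0), relation)
targets = pseudo_labels(noisy, soft, lam=0.5)
loss = distillation_loss(np.zeros((1, 3)), targets)
```

Setting `lam=1.0` recovers plain training on the noisy labels, and `lam=0.0` trains purely on the auxiliary predictions, so the weight controls how much the small clean set is trusted over the noisy annotations.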

Experimental Results

Research Questions

RQ1: Can distillation from a small clean dataset improve learning on large noisy datasets?
RQ2: Does incorporating a knowledge graph to guide distillation further hedge against noisy labels?
RQ3: How do the proposed methods perform on real-world noisy datasets across diverse domains?
RQ4: How close can distillation with a knowledge graph get to an upper bound set by fully clean labels?

Key Findings

Sports	Species-Y	Species-I	Artifacts
44.0	18.1	22.0	19.2
50.7	23.7	38.5	22.0
52.2	25.1	39.1	26.9
50.6	23.6	38.8	23.4
51.9	25.1	41.4	22.9
50.8	22.2	37.5	19.7
50.8	23.7	38.5	22.0
50.8	23.7	41.6	24.8
53.5	26.1	41.6	26.0
53.7	25.2	42.3	26.0
54.1	27.4	-	-

Distillation outperforms baselines across all four datasets (Sports, Species-Y, Species-I, Artifacts).
Semantic/knowledge-graph guided distillation yields additional gains over standard distillation.
The proposed methods approach the upper bound (fully clean data) on several datasets, reducing the gap caused by noise.
The learned pseudo labels improve ranking of true positives and reduce false positives compared to noisy labels.
Performance remains stable across values of the temperature parameter T.
Real-world noisy benchmarks emphasize practical relevance beyond synthetic noise.


This review was generated by AI and reviewed by a human editor.