QUICK REVIEW

[Paper Review] Distillation $\approx$ Early Stopping? Harvesting Dark Knowledge Utilizing Anisotropic Information Retrieval For Overparameterized Neural Network

Bin Dong, Jikai Hou | arXiv (Cornell University) | Oct 2, 2019

Advanced Neural Network Applications | References: 53 | Cited by: 27

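One-Sentence Summary

This paper argues that, in overparameterized neural networks, knowledge distillation works mainly through early stopping, which lets the teacher network capture "dark knowledge" (informative patterns) before it starts fitting noise. By introducing Anisotropic Information Retrieval (AIR) and a self-distillation algorithm that transfers knowledge dynamically across training epochs, the method achieves better generalization and label recovery without early stopping, and provably converges to the ground-truth labels in the ℓ₂ norm.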

ABSTRACT

Distillation is a method to transfer knowledge from one model to another and often achieves higher accuracy with the same capacity. In this paper, we aim to provide a theoretical understanding of what mainly helps with distillation. Our answer is "early stopping". Assuming that the teacher network is overparameterized, we argue that the teacher network is essentially harvesting dark knowledge from the data via early stopping. This can be justified by a new concept, Anisotropic Information Retrieval (AIR), which means that the neural network tends to fit the informative information first and the non-informative information (including noise) later. Motivated by the recent development on theoretically analyzing overparameterized neural networks, we can characterize AIR by the eigenspace of the Neural Tangent Kernel (NTK). AIR facilitates a new understanding of distillation. With that, we further utilize distillation to refine noisy labels. We propose a self-distillation algorithm to sequentially distill knowledge from the network in the previous training epoch to avoid memorizing the wrong labels. We also demonstrate, both theoretically and empirically, that self-distillation can benefit from more than just early stopping. Theoretically, we prove convergence of the proposed algorithm to the ground truth labels for randomly initialized overparameterized neural networks in terms of $\ell_2$ distance, while the previous result was on convergence in $0$-$1$ loss. The theoretical result ensures the learned neural network enjoys a margin on the training data, which leads to better generalization. Empirically, we achieve better testing accuracy and entirely avoid early stopping, which makes the algorithm more user-friendly.
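
The AIR idea in the abstract can be illustrated with a toy example. The sketch below is not the paper's code: it assumes the linearized (kernel) training regime, uses an RBF Gram matrix as a stand-in for the NTK, and all names (`residual_coefficients`, `K`, `lr`, `steps`) are hypothetical. Under these assumptions, gradient descent shrinks the residual along an eigenvector with eigenvalue λ by a factor (1 - lr·λ) per step, so directions with large eigenvalues are fitted first and directions with small eigenvalues (where much of the label noise lies) are fitted much later.

```python
import numpy as np

def residual_coefficients(K, y, lr, steps):
    """Run kernel gradient descent u <- u - lr * K @ (u - y) from u = 0.

    Returns K's eigenvalues (ascending) and the residual y - u expressed in
    K's eigenbasis, before training and after `steps` updates.
    """
    eigvals, eigvecs = np.linalg.eigh(K)
    u = np.zeros_like(y)
    for _ in range(steps):
        u = u - lr * K @ (u - y)
    return eigvals, eigvecs.T @ y, eigvecs.T @ (y - u)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
# Assumption: an RBF Gram matrix stands in for the NTK Gram matrix.
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.5)
# Smooth signal plus label noise.
y = np.sin(np.pi * x) + 0.3 * rng.standard_normal(x.size)

steps = 50
lr = 0.5 / np.linalg.eigvalsh(K).max()
eigvals, coef_start, coef_end = residual_coefficients(K, y, lr, steps)

# Closed-form fraction of the residual left in each eigen-direction.
decay = (1.0 - lr * eigvals) ** steps
print("fraction left, smallest-eigenvalue direction:", decay[0])
print("fraction left, largest-eigenvalue direction:", decay[-1])
```

Stopping such a run early keeps the quickly fitted components and leaves the slowly fitted ones mostly untouched, which is the intuition behind linking early stopping to dark knowledge.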

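Motivation and Objectives

Understand theoretically why knowledge distillation improves model performance, particularly in overparameterized networks.
Investigate whether the effectiveness of knowledge distillation stems from early stopping rather than from soft-label guidance.
Develop a self-distillation algorithm that uses Anisotropic Information Retrieval (AIR) to avoid overfitting to noisy labels.
Prove that the proposed algorithm converges to the ground-truth labels in ℓ₂ distance, a stronger guarantee for generalization than convergence in 0-1 loss.
Show that the method generalizes better without requiring early stopping, which makes it more user-friendly.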

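Proposed Method

Introduces Anisotropic Information Retrieval (AIR): a neural network fits the informative patterns in the data before it fits noise, a property characterized by the eigenspace of the Neural Tangent Kernel (NTK).
Proposes a self-distillation algorithm that uses the network's outputs from the previous training epoch as soft targets to supervise training in the current epoch.
Dynamically adjusts the supervision strength across training epochs to prevent memorization of wrong labels.
Theoretical analysis shows that, for overparameterized networks, the method converges to the ground-truth labels in the ℓ₂ norm, which ensures a classification margin on the training data.
Validated empirically on Fashion-MNIST and CIFAR-10; the results indicate state-of-the-art (SOTA) performance and robustness to noisy labels.
Uses the ℓ₂ loss with respect to the clean labels to ensure margin-based generalization, in contrast to earlier work that focused on 0-1 loss.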
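
The self-distillation loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: it assumes a linear model trained with ℓ₂ loss, a convex mix of the noisy labels and the previous epoch's outputs as the target, and a linearly decaying mixing weight. The function names, the `alpha_end` parameter, and the schedule are all hypothetical choices made for the example.

```python
import numpy as np

def blend_targets(noisy_labels, prev_outputs, alpha):
    """Convex mix of the given (possibly noisy) labels and last epoch's outputs."""
    return alpha * noisy_labels + (1.0 - alpha) * prev_outputs

def train_self_distillation(X, noisy_labels, epochs=20, steps_per_epoch=10,
                            lr=0.1, alpha_end=0.3):
    """Fit a linear model with l2 loss, refreshing the soft targets every epoch."""
    n, d = X.shape
    W = np.zeros((d, noisy_labels.shape[1]))
    prev_outputs = noisy_labels.copy()  # epoch 0 sees only the given labels
    for epoch in range(epochs):
        # Hypothetical schedule: weight on the noisy labels decays from 1 to alpha_end.
        alpha = 1.0 + (alpha_end - 1.0) * epoch / max(epochs - 1, 1)
        targets = blend_targets(noisy_labels, prev_outputs, alpha)
        for _ in range(steps_per_epoch):
            # Gradient step on the mean squared error against the blended targets.
            W -= lr * X.T @ (X @ W - targets) / n
        prev_outputs = X @ W  # becomes the teacher signal for the next epoch
    return W

# Synthetic data: 3 classes, roughly 20% of labels resampled at random.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
clean = np.argmax(X @ rng.standard_normal((5, 3)), axis=1)
noisy = clean.copy()
flip = rng.random(200) < 0.2
noisy[flip] = rng.integers(0, 3, size=flip.sum())
noisy_onehot = np.eye(3)[noisy]

W = train_self_distillation(X, noisy_onehot)
refined = np.argmax(X @ W, axis=1)
print("agreement with clean labels:", np.mean(refined == clean))
```

The design point the sketch tries to capture is that the targets are refreshed from the model itself each epoch, so the weight given to the original noisy labels can shrink over time instead of training being cut off by early stopping.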

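Experimental Results

Research Questions

RQ1: In overparameterized networks, does knowledge distillation work mainly through early stopping rather than through soft-label distillation?
RQ2: Can Anisotropic Information Retrieval (AIR) explain why overparameterized networks capture "dark knowledge" before memorizing noise?
RQ3: Can a self-distillation algorithm that transfers knowledge across training epochs recover the correct labels without early stopping?
RQ4: Does the ℓ₂-based convergence of the proposed method improve on convergence in 0-1 loss and yield better generalization?
RQ5: Can the method effectively refine noisy labels while maintaining high test accuracy?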

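Key Findings

Theoretical analysis proves that, for overparameterized neural networks, the self-distillation algorithm converges to the ground-truth labels in ℓ₂ distance, which ensures a classification margin on the training data.
On Fashion-MNIST and CIFAR-10 with noisy labels, the method achieves SOTA test accuracy and outperforms prior methods.
Empirical results show that the algorithm avoids overfitting to noise without early stopping, which markedly improves user-friendliness.
The information gain from self-distillation, measured as the increase in the projection onto the top NTK eigenspace, keeps rising over 1,500 iterations, indicating progressive learning of the clean signal.
Convergence is guaranteed when the learning rate, supervision strength, and network width satisfy specific conditions, and an explicit upper bound on the required number of training steps is given.
The ℓ₂ convergence result ensures better generalization than earlier results based on convergence in 0-1 loss, because it implies a classification margin on the training data.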
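
The step from ℓ₂ convergence to a margin can be made concrete with a short calculation. This is an illustrative argument under the assumption of one-hot labels, not a restatement of the paper's theorem; the symbols $f$, $d$, $\delta$, and $e_c$ are introduced here only for the example.

```latex
% Assume the clean label of x_i is the one-hot vector e_c and the network
% output is f(x_i) = e_c + d with \|d\|_2 \le \delta.
% For any other class j \ne c:
\begin{align*}
f_c(x_i) - f_j(x_i)
  &= 1 + d_c - d_j \\
  &\ge 1 - \bigl(|d_c| + |d_j|\bigr) \\
  &\ge 1 - \sqrt{2}\,\sqrt{d_c^2 + d_j^2} \\
  &\ge 1 - \sqrt{2}\,\|d\|_2
   \;\ge\; 1 - \sqrt{2}\,\delta .
\end{align*}
% Hence the margin is at least 1 - \sqrt{2}\,\delta, which is positive
% whenever \delta < 1/\sqrt{2}.
```

By comparison, a 0-1 loss guarantee only states that $f_c(x_i) > f_j(x_i)$ and says nothing about how large the gap is.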

This review was generated by AI and reviewed by human editors.