QUICK REVIEW

[Paper Review] Noisy Softmax: Improving the Generalization Ability of DCNN via Postponing the Early Softmax Saturation

Binghui Chen, Weihong Deng|arXiv (Cornell University)|Aug 12, 2017

Advanced Neural Network Applications | References: 36 | Cited by: 30

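One-Sentence Summary

This paper proposes Noisy Softmax, a technique that injects annealed noise into the softmax layer during training to delay early saturation, which keeps gradients flowing and improves the generalization ability of deep convolutional neural networks (DCNNs). Experiments show state-of-the-art or competitive performance on the MNIST, CIFAR, LFW, FGLFW, and YTF benchmarks, with markedly better robustness and less overfitting.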

ABSTRACT

Over the past few years, softmax and SGD have become a commonly used component and the default training strategy in CNN frameworks, respectively. However, when optimizing CNNs with SGD, the saturation behavior behind softmax always gives us an illusion of training well and then is omitted. In this paper, we first emphasize that the early saturation behavior of softmax will impede the exploration of SGD, which sometimes is a reason for model converging at a bad local-minima, then propose Noisy Softmax to mitigating this early saturation issue by injecting annealed noise in softmax during each iteration. This operation based on noise injection aims at postponing the early saturation and further bringing continuous gradients propagation so as to significantly encourage SGD solver to be more exploratory and help to find a better local-minima. This paper empirically verifies the superiority of the early softmax desaturation, and our method indeed improves the generalization ability of CNN model by regularization. We experimentally find that this early desaturation helps optimization in many tasks, yielding state-of-the-art or competitive results on several popular benchmark datasets.

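Motivation and Objectives

Address early softmax saturation in DCNNs, which limits gradient flow and keeps SGD from exploring the parameter space effectively.
Improve generalization by postponing saturation so that parameters keep being updated during backpropagation.
Provide a simple, plug-and-play method that improves training dynamics without changing the network architecture.
Verify empirically that early desaturation leads to better convergence and less overfitting.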
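
The saturation argument can be made concrete with the standard gradient of the softmax cross-entropy loss. The notation here (logits f_j, probabilities p_j, ground-truth label y) is chosen for illustration and is not taken from the review itself.

```latex
p_j = \frac{e^{f_j}}{\sum_{k} e^{f_k}}, \qquad
\mathcal{L} = -\log p_y, \qquad
\frac{\partial \mathcal{L}}{\partial f_j} = p_j - \mathbb{1}[j = y]
```

When p_y approaches 1 early in training, every component of this gradient approaches 0, so the sample contributes almost nothing to further parameter updates. This is the behavior that postponing saturation is meant to avoid.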

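Proposed Method

Inject annealed noise directly into the input of the softmax layer at every training iteration.
Use a noise schedule that decays over time (the annealing mechanism) to stabilize training and avoid disturbing the later stages.
Stay compatible with standard SGD and backpropagation by modifying only the softmax layer.
Serve as a plug-and-play replacement for standard softmax that can be applied to any DCNN framework.
Introduce a hyperparameter α² that controls the noise magnitude and can be tuned for best performance.
Combine with existing techniques such as data augmentation and contrastive loss for better results.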
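
The points above could be sketched in NumPy roughly as follows. This review does not state the exact noise formula, so the form used here is an assumption: a non-negative term α·‖W_y‖·‖x‖·(1 − cos θ_y)·|ξ| with ξ ~ N(0, 1), subtracted from the ground-truth logit only. Under that assumption the noise fades as a sample aligns with its class weight, which is one possible way to realize the annealing. The function names `noisy_logits` and `softmax_cross_entropy` are hypothetical, and the final layer is assumed to have no bias term.

```python
import numpy as np

def noisy_logits(X, W, y, alpha, rng):
    """Subtract a non-negative noise term from each sample's ground-truth
    logit. The noise form is an assumption, not taken from the review."""
    logits = X @ W.T                               # (N, C) clean scores, no bias
    idx = np.arange(X.shape[0])
    scale = np.linalg.norm(X, axis=1) * np.linalg.norm(W[y], axis=1)
    # Cosine of the angle between each sample and its class weight vector.
    cos_theta = np.clip(logits[idx, y] / np.maximum(scale, 1e-12), -1.0, 1.0)
    xi = rng.standard_normal(X.shape[0])           # xi ~ N(0, 1)
    noise = alpha * scale * (1.0 - cos_theta) * np.abs(xi)
    out = logits.copy()
    out[idx, y] -= noise                           # only the true-class logit changes
    return out

def softmax_cross_entropy(logits, y):
    """Mean cross-entropy computed with a numerically stable log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                    # toy features
W = rng.standard_normal((3, 8))                    # toy class weights
y = np.array([0, 2, 1, 0])                         # toy labels

clean = X @ W.T
noisy = noisy_logits(X, W, y, alpha=np.sqrt(0.1), rng=rng)   # alpha^2 = 0.1
loss_clean = softmax_cross_entropy(clean, y)
loss_noisy = softmax_cross_entropy(noisy, y)
```

Because the noise only lowers the true-class score, the noisy loss is never smaller than the clean loss, so the gradient stays away from zero for longer. In this sketch the noise would be applied during training only, with the clean logits used at inference.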

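Experimental Results

Research Questions

RQ1: Does early softmax saturation keep SGD from exploring the parameter space effectively?
RQ2: Does injecting annealed noise into the softmax input delay saturation and improve gradient propagation?
RQ3: Does Noisy Softmax improve the generalization ability of DCNNs and reduce overfitting?
RQ4: Can Noisy Softmax reach state-of-the-art performance on standard benchmarks without changing the network architecture?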

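Key Findings

With α² = 0.05, Noisy Softmax reaches a 7.39% error rate on CIFAR-10, better than standard softmax (8.11%) and other state-of-the-art methods.
On LFW, Noisy Softmax (α² = 0.1) reaches 99.18% accuracy, exceeding the baseline and matching state-of-the-art performance.
On YTF, Noisy Softmax (α² = 0.1) reaches 94.88% accuracy, better than the standard softmax baseline (94.22%).
An ensemble of two Noisy Softmax models reaches 99.31% on LFW, 94.43% on FGLFW, and 95.37% on YTF, showing strong generalization.
The method consistently improves performance across multiple datasets, including MNIST and CIFAR-100, which confirms its broad effectiveness.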

This review was generated by AI and checked by a human editor.