QUICK REVIEW

[Paper Review] Implicit Self-Regularization in Deep Neural Networks: Evidence from Random Matrix Theory and Implications for Learning

Charles H. Martin, Michael W. Mahoney|arXiv (Cornell University)|Oct 2, 2018

Statistical Mechanics and Entropy | References: 67 | Cited by: 74

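ONE-SENTENCE SUMMARY

In brief: the paper uses Random Matrix Theory to analyze DNN weight matrices, shows that training induces Implicit Self-Regularization, and identifies a 5+1 phase taxonomy (including a Heavy-Tailed phase) that accounts for the generalization gap and the effect of batch size.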

ABSTRACT

Random Matrix Theory (RMT) is applied to analyze weight matrices of Deep Neural Networks (DNNs), including both production quality, pre-trained models such as AlexNet and Inception, and smaller models trained from scratch, such as LeNet5 and a miniature-AlexNet. Empirical and theoretical results clearly indicate that the DNN training process itself implicitly implements a form of Self-Regularization. The empirical spectral density (ESD) of DNN layer matrices displays signatures of traditionally-regularized statistical models, even in the absence of exogenously specifying traditional forms of explicit regularization. Building on relatively recent results in RMT, most notably its extension to Universality classes of Heavy-Tailed matrices, we develop a theory to identify 5+1 Phases of Training, corresponding to increasing amounts of Implicit Self-Regularization. These phases can be observed during the training process as well as in the final learned DNNs. For smaller and/or older DNNs, this Implicit Self-Regularization is like traditional Tikhonov regularization, in that there is a "size scale" separating signal from noise. For state-of-the-art DNNs, however, we identify a novel form of Heavy-Tailed Self-Regularization, similar to the self-organization seen in the statistical physics of disordered systems. This results from correlations arising at all size scales, which arises implicitly due to the training process itself. This implicit Self-Regularization can depend strongly on the many knobs of the training process. By exploiting the generalization gap phenomena, we demonstrate that we can cause a small model to exhibit all 5+1 phases of training simply by changing the batch size. This demonstrates that---all else being equal---DNN optimization with larger batch sizes leads to less-well implicitly-regularized models, and it provides an explanation for the generalization gap phenomena.

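RESEARCH MOTIVATION AND OBJECTIVES

Provide a practical theoretical motivation for regularization in deep learning that goes beyond explicit techniques such as dropout or weight-norm penalties.
Characterize the energy landscape of DNNs by analyzing layer weight matrices and RMT-derived metrics.
Introduce operationally defined phases of training that map onto increasing amounts of self-regularization.
Show how training knobs such as batch size drive phase transitions and affect generalization.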

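PROPOSED METHOD

Model each DNN layer weight matrix W as W = W_rand + Δsig, separating a random component from a signal component.
Analyze the empirical spectral density (ESD) of X = (1/N) W^T W, and fit it with Marchenko-Pastur (MP) theory and Heavy-Tailed universality classes.
Define and compute capacity metrics from the spectrum: Hard Rank, Matrix Entropy, Stable Rank, and MP Soft Rank.
Propose and validate a 5+1 phase taxonomy (Random-like, Bleeding-out, Bulk+Spikes, Bulk-decay, Heavy-Tailed, Rank-collapse) corresponding to levels of implicit regularization.
Demonstrate phase transitions by varying training knobs (such as batch size) on smaller models, and compare the results with large pre-trained models.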
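
The spectral quantities listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function names (esd, mp_bulk_edges, capacity_metrics) are hypothetical, W is assumed to be N x M with N >= M (it is transposed otherwise), and the noise scale sigma is assumed known, whereas in practice it would be fitted to the bulk of the ESD. The MP bulk edges use the standard form lambda_± = sigma^2 (1 ± 1/sqrt(Q))^2 with Q = N/M.

```python
import numpy as np


def esd(W):
    """Eigenvalues of X = (1/N) W^T W, in ascending order."""
    W = np.asarray(W, dtype=float)
    if W.shape[0] < W.shape[1]:  # assumption: orient W so that N >= M
        W = W.T
    N = W.shape[0]
    return np.linalg.eigvalsh(W.T @ W / N)


def mp_bulk_edges(Q, sigma=1.0):
    """Marchenko-Pastur bulk edges for aspect ratio Q = N/M >= 1."""
    lam_minus = sigma**2 * (1.0 - 1.0 / np.sqrt(Q)) ** 2
    lam_plus = sigma**2 * (1.0 + 1.0 / np.sqrt(Q)) ** 2
    return lam_minus, lam_plus


def capacity_metrics(W, sigma=1.0):
    """Hard Rank, Matrix Entropy, Stable Rank and MP Soft Rank of W.

    sigma is an assumed, user-supplied noise scale (not fitted here).
    """
    W = np.asarray(W, dtype=float)
    if W.shape[0] < W.shape[1]:
        W = W.T
    N, M = W.shape

    lam = np.clip(esd(W), 0.0, None)  # remove tiny negative round-off
    lam_max = lam[-1]

    # Hard Rank: number of numerically non-zero singular values.
    hard_rank = int(np.linalg.matrix_rank(W))

    # Matrix Entropy: entropy of the normalized squared singular values,
    # scaled by log(Hard Rank).
    p = lam / lam.sum()
    p = p[p > 0]
    if hard_rank > 1:
        entropy = float(-(p * np.log(p)).sum() / np.log(hard_rank))
    else:
        entropy = 0.0

    # Stable Rank: ||W||_F^2 / ||W||_2^2 (the 1/N factor cancels).
    stable_rank = float(lam.sum() / lam_max)

    # MP Soft Rank: MP bulk edge divided by the largest eigenvalue.
    _, lam_plus = mp_bulk_edges(N / M, sigma)
    mp_soft_rank = float(lam_plus / lam_max)

    return {
        "hard_rank": hard_rank,
        "matrix_entropy": entropy,
        "stable_rank": stable_rank,
        "mp_soft_rank": mp_soft_rank,
    }
```

For a pure Gaussian matrix the largest eigenvalue sits near the MP bulk edge, so the MP Soft Rank is close to 1. Adding a strong low-rank term (a toy stand-in for Δsig) pushes an eigenvalue out of the bulk and lowers both the MP Soft Rank and the Stable Rank.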

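EXPERIMENTAL RESULTS

Research Questions

RQ1: Can Random Matrix Theory explain how DNN training produces regularization without an explicit penalty term?
RQ2: Which features of a weight matrix's spectrum (ESD) reflect different levels of Implicit Self-Regularization?
RQ3: How do training knobs, batch size in particular, drive transitions between the identified phases and affect generalization?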

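Key Findings

Older and smaller models exhibit weak, Tikhonov-like implicit regularization, with a signal-noise separation that can be described in MP terms.
Modern large models exhibit Heavy-Tailed Self-Regularization, with no clear signal-noise separation and finite spectral support.
The phases are observable both during training and in the final models; as implicit regularization increases, the MP Soft Rank decreases, and so does the Stable Rank.
Decreasing the batch size can drive a small model through all 5+1 phases, linking the generalization gap to implicit regularization.
Explicit regularization may induce the Rank-collapse phase, illustrating how regularization strength shapes the spectrum and capacity.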
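
One rough way to quantify how heavy-tailed an ESD is involves fitting a power law rho(lambda) ~ lambda^(-alpha) to its upper tail. The sketch below uses the standard continuous power-law maximum-likelihood estimator, alpha = 1 + n / sum(ln(x_i / xmin)). It is an illustration under stated assumptions, not necessarily the fitting procedure used in the paper: the function name tail_exponent is hypothetical, and the cutoff xmin is assumed to be chosen by the user.

```python
import numpy as np


def tail_exponent(eigs, xmin):
    """Maximum-likelihood power-law exponent for eigenvalues >= xmin.

    xmin is an assumed, user-chosen tail cutoff; this sketch does not
    select it automatically.
    """
    if xmin <= 0:
        raise ValueError("xmin must be positive")
    x = np.asarray(eigs, dtype=float)
    tail = x[x >= xmin]
    if tail.size < 2:
        raise ValueError("need at least two eigenvalues above xmin")
    log_sum = np.log(tail / xmin).sum()
    if log_sum == 0.0:
        raise ValueError("all tail values equal xmin; exponent undefined")
    return 1.0 + tail.size / log_sum
```

A smaller estimated alpha indicates a heavier tail. The estimate is sensitive to the choice of xmin, so it is best read alongside a log-log plot of the ESD rather than on its own.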

This review was generated by AI and checked by a human editor.