QUICK REVIEW

[Paper Review] Theory II: Landscape of the Empirical Risk in Deep Learning

Qianli Liao, Tomaso Poggio|arXiv (Cornell University)|Mar 28, 2017

Topic: Domain Adaptation and Few-Shot Learning | References: 7 | Cited by: 47

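One-Sentence Summary

This paper studies the loss surface of overparametrized deep convolutional neural networks (DCNNs) and argues that the empirical risk landscape consists of a large number of degenerate global minima with zero training error. The existence of such minima is established by a theoretical analysis based on Bezout's theorem, which is rigorous when the ReLUs are replaced by a polynomial approximation. Multidimensional scaling (MDS) visualizations and perturbation experiments on CIFAR-10 indicate that SGD still converges to flat, robust global minima after small weight perturbations. Together these results suggest that the loss surface is simpler than commonly believed: a collection of high-dimensional, fairly regular basins in which local minima do not appear in practice.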
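The counting argument behind the Bezout-based claim can be sketched as follows. This is a simplified reading, not the paper's own derivation: the symbols N, K, and d_i are notation assumed here, the network output is assumed to be polynomial in the weights (ReLU replaced by a polynomial), and the statements hold over the complex numbers, so they are only suggestive for real-valued weights.

```latex
% Notation (assumed here, not taken from the paper): N training pairs (x_i, y_i),
% K weights w, and a network output f(x; w) that is polynomial in w.
\begin{align*}
  &f(x_i; w) - y_i = 0, \qquad i = 1, \dots, N, \quad w \in \mathbb{C}^K,
     \quad \deg_w\bigl(f(x_i; w) - y_i\bigr) = d_i \\
  &K = N: \quad \#\{\text{isolated solutions}\} \;\le\; \prod_{i=1}^{N} d_i
     \qquad \text{(Bezout bound)} \\
  &K > N \text{ and the system is consistent}: \quad
     \dim\{\, w : f(x_i; w) = y_i \ \forall i \,\} \;\ge\; K - N \;>\; 0
\end{align*}
```

In the overparametrized case (more weights than training equations), any zero-error solution therefore lies on a continuous set of solutions rather than at an isolated point, which is the sense in which the minimizers are called degenerate.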

ABSTRACT

Previous theoretical work on deep learning and neural network optimization tend to focus on avoiding saddle points and local minima. However, the practical observation is that, at least in the case of the most successful Deep Convolutional Neural Networks (DCNNs), practitioners can always increase the network size to fit the training data (an extreme example would be [1]). The most successful DCNNs such as VGG and ResNets are best used with a degree of "overparametrization". In this work, we characterize with a mix of theory and experiments, the landscape of the empirical risk of overparametrized DCNNs. We first prove in the regression framework the existence of a large number of degenerate global minimizers with zero empirical error (modulo inconsistent equations). The argument that relies on the use of Bezout theorem is rigorous when the RELUs are replaced by a polynomial nonlinearity (which empirically works as well). As described in our Theory III [2] paper, the same minimizers are degenerate and thus very likely to be found by SGD that will furthermore select with higher probability the most robust zero-minimizer. We further experimentally explored and visualized the landscape of empirical risk of a DCNN on CIFAR-10 during the entire training process and especially the global minima. Finally, based on our theoretical and experimental results, we propose an intuitive model of the landscape of DCNN's empirical loss surface, which might not be as complicated as people commonly believe.

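Motivation and Objectives

Understand the structure of the empirical risk landscape in overparametrized deep networks, particularly for successful DCNNs such as VGG and ResNets.
Investigate why stochastic gradient descent (SGD) still generalizes well despite heavy overparametrization.
Challenge the common view that the loss surface is highly complex and riddled with local minima and saddle points.
Propose a simplified baseline model of the loss surface, grounded in theoretical and empirical evidence.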

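Proposed Method

The theoretical analysis uses Bezout's theorem to prove, in the regression framework, that a large number of zero-error global minimizers exist when the ReLU is approximated by a polynomial or a Legendre expansion.
The analysis is extended to classification, where zero error implies a margin, which in turn indicates a flat region around each global minimum.
Multidimensional scaling (MDS) is used to visualize the full SGD training trajectory on CIFAR-10 and the evolution of the loss surface during training.
Perturbation experiments add small Gaussian noise to a trained zero-error model and retrain it, to assess robustness and convergence paths.
Interpolation experiments within a basin and across basins assess generalization performance and error behavior.
The training dynamics of SGD and batch gradient descent are compared to assess the role of noise in avoiding local minima.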
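The idea of replacing the ReLU with a polynomial via a Legendre expansion can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' procedure: the interval [-1, 1], the degree 6, the midpoint-rule integration, and all function names are hypothetical choices.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def legendre(n, x):
    """Evaluate the Legendre polynomial P_n(x) with Bonnet's recurrence."""
    p_prev, p_curr = 1.0, x
    if n == 0:
        return p_prev
    for k in range(1, n):
        # (k + 1) P_{k+1}(x) = (2k + 1) x P_k(x) - k P_{k-1}(x)
        p_prev, p_curr = p_curr, ((2 * k + 1) * x * p_curr - k * p_prev) / (k + 1)
    return p_curr


def legendre_coefficients(f, degree, num_cells=2000):
    """Approximate c_n = (2n + 1) / 2 * integral of f(x) P_n(x) over [-1, 1].

    Uses the midpoint rule; an even num_cells puts the ReLU kink at x = 0
    on a cell boundary, which keeps the quadrature error small.
    """
    h = 2.0 / num_cells
    mids = [-1.0 + (i + 0.5) * h for i in range(num_cells)]
    coeffs = []
    for n in range(degree + 1):
        integral = sum(f(x) * legendre(n, x) for x in mids) * h
        coeffs.append((2 * n + 1) / 2.0 * integral)
    return coeffs


def poly_relu(x, coeffs):
    """Polynomial surrogate for the ReLU: sum of c_n * P_n(x)."""
    return sum(c * legendre(n, x) for n, c in enumerate(coeffs))


# Degree 6 is an arbitrary illustrative choice.
coeffs = legendre_coefficients(relu, degree=6)

# Largest deviation from the true ReLU on a grid over [-1, 1].
max_err = max(
    abs(poly_relu(i / 100.0, coeffs) - relu(i / 100.0)) for i in range(-100, 101)
)
print("Legendre coefficients:", coeffs)
print("max abs error on [-1, 1]:", max_err)
```

With such a surrogate, the network output becomes a polynomial in the weights, which is the setting in which a Bezout-style counting argument applies. The approximation is only valid on the chosen interval, so inputs to the nonlinearity would have to stay within it.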

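Experimental Results

Research Questions

RQ1: In an overparametrized DCNN, how many global minimizers with zero empirical error exist, and are they degenerate?
RQ2: Does the loss surface of an overparametrized DCNN contain local minima, or is it dominated by flat global minima?
RQ3: How do the training trajectory and the loss surface evolve during training, and what role does the stochasticity of SGD play?
RQ4: Does perturbing a trained zero-error model lead to different convergence paths, and is zero training error preserved?
RQ5: Is the structure of the loss surface simple enough to be modeled as a collection of high-dimensional basins?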

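Key Findings

Under the assumption that the ReLU is approximated by a polynomial, the theoretical analysis uses Bezout's theorem to prove that overparametrized DCNNs have a large number of degenerate global minimizers with zero empirical error.
These zero-error minimizers are highly degenerate, which makes them very likely to be found by SGD; SGD additionally tends to select the most robust of them.
MDS visualizations show that training trajectories from different initializations converge to multiple distinct but equally valid zero-error solutions.
After small Gaussian noise (0.01 times the mean weight magnitude) was added to a trained zero-error model (M_final), the training error did not rise, and all models held 0% training error throughout 400 epochs of full-batch gradient descent.
Although the weights changed substantially after perturbation, all trajectories stayed within the same loss basin, indicating that the loss surface is not partitioned by local minima.
No local minima were observed even under batch gradient descent, indicating that the loss surface is dominated by flat, connected basins of global minima.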
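The perturbation protocol described in the findings above (Gaussian noise at 0.01 times the mean weight magnitude, followed by full-batch gradient descent) can be mimicked on a toy problem. This is a sketch under strong assumptions, not the authors' experiment: a 6-weight linear regression with 3 training equations stands in for the DCNN, squared loss replaces classification error, and the data, learning rate, step count, and function names are all hypothetical.

```python
import random

# Toy stand-in for an overparametrized model: 3 equations, 6 unknown weights.
X = [
    [1.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 0.0, 1.0],
]
Y = [1.0, -2.0, 0.5]


def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def loss(w):
    """Mean squared error over the whole training set."""
    return sum((predict(w, x) - y) ** 2 for x, y in zip(X, Y)) / len(X)


def full_batch_gd(w, lr=0.3, steps=200):
    """Plain gradient descent using the full training set at every step."""
    w = list(w)
    for _ in range(steps):
        residuals = [predict(w, x) - y for x, y in zip(X, Y)]
        grad = [
            2.0 / len(X) * sum(r * x[j] for r, x in zip(residuals, X))
            for j in range(len(w))
        ]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def perturb(w, rng, scale=0.01):
    """Add Gaussian noise with std = scale * mean absolute weight."""
    mean_abs = sum(abs(wj) for wj in w) / len(w)
    return [wj + rng.gauss(0.0, scale * mean_abs) for wj in w]


rng = random.Random(0)  # fixed seed so the run is reproducible
w_init = [rng.gauss(0.0, 1.0) for _ in range(6)]

w_final = full_batch_gd(w_init)           # trained zero-error model
w_perturbed = perturb(w_final, rng)       # small Gaussian perturbation
w_retrained = full_batch_gd(w_perturbed)  # retrain from the perturbed point

print("loss at trained model:  ", loss(w_final))
print("loss after perturbation:", loss(w_perturbed))
print("loss after retraining:  ", loss(w_retrained))
print("max weight change:      ",
      max(abs(a - b) for a, b in zip(w_final, w_retrained)))
```

In this toy setting, retraining returns to zero loss at weights that differ from the original ones, because the zero-error solutions form a continuous set rather than an isolated point. Unlike the reported CIFAR-10 result, the squared loss here does rise slightly right after the perturbation; the paper's observation concerns classification error, which can stay at 0% under small weight changes.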

This review was generated by AI and reviewed by a human editor.