QUICK REVIEW

[Paper Review] Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup

Sebastian Goldt, Madhu Advani|arXiv (Cornell University)|Jun 18, 2019

Stochastic Gradient Optimization Techniques | Cited by 33

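One-sentence summary

The paper analyses the online SGD dynamics of over-parameterised two-layer networks in the teacher-student setup, derives ODEs for the macroscopic order parameters, and shows that the way generalisation changes with the degree of over-parameterisation depends on which layers are trained and on the activation function.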

ABSTRACT

Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher-student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations and prove that this description is asymptotically exact in the limit of large inputs. Using this framework, we calculate the final generalisation error of student networks that have more parameters than their teachers. We find that the final generalisation error of the student increases with network size when training only the first layer, but stays constant or even decreases with size when training both layers. We show that these different behaviours have their root in the different solutions SGD finds for different activation functions. Our results indicate that achieving good generalisation in neural networks goes beyond the properties of SGD alone and depends on the interplay of at least the algorithm, the model architecture, and the data set.

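Motivation and goals

Understand why heavily over-parameterised networks generalise well in practice.
Establish a rigorous macroscopic description (ODEs for the online SGD dynamics) in the teacher-student setup.
Analyse the asymptotic generalisation error of over-parameterised students when only the first layer is trained.
Analyse how training both layers changes generalisation, and identify the behaviour that depends on the activation function.
Check the ODE framework against SGD simulations, both analytically and numerically.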

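Proposed method

Model the inputs as i.i.d. Gaussian variables that are fed to both the teacher and the student two-layer networks.
Define the order parameters m = (R, Q, T, v*, v): R, Q and T are the teacher-student, student-student and teacher-teacher overlaps of the first-layer weights, and v*, v are the second-layer weights of teacher and student.
Derive coupled ODEs for dR/dα, dQ/dα and dv/dα, and show that they close in terms of m(α).
Prove a rigorous convergence result: in the large-N limit, the macroscopic state of SGD follows the unique solution of the ODEs.
Compute the asymptotic generalisation error ε_g* for different activation functions (sigmoidal, linear, ReLU) and different training setups.
Validate the analytical predictions with SGD simulations and finite-size experiments.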
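
The setup listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: it assumes a soft committee machine (second-layer weights fixed to 1, first layer only trained), the sigmoidal activation g(x) = erf(x/√2), Gaussian output noise of standard deviation sigma on the teacher, and a Monte Carlo estimate of the generalisation error. The function names (`online_sgd`, `order_parameters`, `generalisation_error`) and all default values are hypothetical choices made for this example.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # element-wise erf without extra dependencies


def g(x):
    """Sigmoidal activation, assumed here to be erf(x / sqrt(2))."""
    return _erf(x / np.sqrt(2.0))


def g_prime(x):
    """Derivative of erf(x / sqrt(2))."""
    return np.sqrt(2.0 / np.pi) * np.exp(-0.5 * x ** 2)


def order_parameters(w, B):
    """Overlaps R (student-teacher), Q (student-student), T (teacher-teacher)."""
    N = w.shape[1]
    return w @ B.T / N, w @ w.T / N, B @ B.T / N


def generalisation_error(w, B, rng, n_test=2000):
    """Monte Carlo estimate of 0.5 * <(student - teacher)^2> on fresh inputs."""
    N = w.shape[1]
    X = rng.standard_normal((n_test, N))
    student = g(X @ w.T / np.sqrt(N)).sum(axis=1)
    teacher = g(X @ B.T / np.sqrt(N)).sum(axis=1)
    return 0.5 * np.mean((student - teacher) ** 2)


def online_sgd(N=100, M=2, K=4, eta=0.2, sigma=0.1, steps=20000, seed=0):
    """Online SGD: every step uses one fresh Gaussian input, never reused."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((M, N))  # teacher first-layer weights (fixed)
    w = rng.standard_normal((K, N))  # student first-layer weights (trained)
    eps_start = generalisation_error(w, B, rng)
    for _ in range(steps):
        x = rng.standard_normal(N)
        lam = w @ x / np.sqrt(N)  # student pre-activations
        nu = B @ x / np.sqrt(N)   # teacher pre-activations
        y = g(nu).sum() + sigma * rng.standard_normal()  # noisy teacher label
        delta = g(lam).sum() - y                         # student error
        w -= (eta / np.sqrt(N)) * delta * np.outer(g_prime(lam), x)
    eps_end = generalisation_error(w, B, rng)
    return w, B, eps_start, eps_end


if __name__ == "__main__":
    w, B, eps_start, eps_end = online_sgd()
    R, Q, T = order_parameters(w, B)
    print("generalisation error at start and end:", eps_start, eps_end)
    print("student-teacher overlaps R:\n", R)
```

Tracking R and Q over training time α = steps / N gives the finite-N curves that the ODE description is meant to reproduce as N grows.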

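Experimental results

Research questions

RQ1: How do the online SGD dynamics in the two-layer teacher-student setup evolve as the network grows?
RQ2: When only the first layer is trained, how does over-parameterisation (K > M) affect the final generalisation error?
RQ3: How does training both layers change the asymptotic generalisation error, and which solutions does SGD converge to for different activation functions?
RQ4: What role does the activation function play in the fixed points of the SGD dynamics and in generalisation performance?
RQ5: Does the ODE framework accurately predict the outcome of SGD across architectures and data regimes?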

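Key findings

For soft committee machines in which only the first layer is trained, the final generalisation error increases with the number of additional hidden units L.
For sigmoidal and linear activations, ε_g* scales as a function of η, σ^2 and L, so greater over-parameterisation degrades generalisation when only the first layer is trained.
When both layers are trained, sigmoidal networks generalise better, because a denoising solution emerges in which several student units specialise and effectively average the teacher output.
For ReLU and linear networks with both layers trained, ε_g* stays constant as K grows, so the benefit of over-parameterisation is less pronounced in these cases.
The analytical expressions and numerical results indicate that good generalisation is not a property of SGD alone but depends on the interplay of the algorithm, the architecture and the data.
The authors provide a reproducible workflow, including an ODE integrator and the experiments, in a public repository.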
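
For reference, the quantity ε_g* that these findings compare can be written out as below. This is a sketch in notation assumed for this review rather than quoted from it: φ and φ* denote the student and teacher network outputs, x a Gaussian input, and α the training time.

```latex
\epsilon_g(\alpha) = \frac{1}{2}
  \left\langle \left[ \phi(x) - \phi^{*}(x) \right]^{2} \right\rangle_{x},
\qquad
\epsilon_g^{*} = \lim_{\alpha \to \infty} \epsilon_g(\alpha)
```

The average runs over fresh inputs, so ε_g measures the mismatch between student and teacher functions rather than the fit to the noisy training labels.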


This review was generated by AI and checked by a human editor.