QUICK REVIEW

[Paper Review] The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies

Ronen Basri, David Jacobs | arXiv (Cornell University) | Jun 2, 2019

Neural Networks and Applications | References: 26 | Cited by: 90

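One-Sentence Summary

This paper analyzes how neural networks trained with gradient descent learn functions of different frequencies, shows that low-frequency components are learned faster, and highlights the role of the bias term in learning odd frequencies.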

ABSTRACT

We study the relationship between the frequency of a function and the speed at which a neural network learns it. We build on recent results that show that the dynamics of overparameterized neural networks trained with gradient descent can be well approximated by a linear system. When normalized training data is uniformly distributed on a hypersphere, the eigenfunctions of this linear system are spherical harmonic functions. We derive the corresponding eigenvalues for each frequency after introducing a bias term in the model. This bias term had been omitted from the linear network model without significantly affecting previous theoretical results. However, we show theoretically and experimentally that a shallow neural network without bias cannot represent or learn simple, low frequency functions with odd frequencies. Our results lead to specific predictions of the time it will take a network to learn functions of varying frequency. These predictions match the empirical behavior of both shallow and deep networks.

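Motivation and Objectives

Motivate and analyze why overparameterized networks generalize well by examining their frequency-dependent learning dynamics.
Characterize how training data distributed uniformly on a sphere gives rise to spherical harmonic eigenfunctions that govern learning speed.
Show how the bias term affects the learnability of odd-frequency components and the resulting convergence behavior.
Provide theoretical predictions of the learning time for each frequency and validate them experimentally on shallow and deep networks.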

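Proposed Method

Model the gradient descent dynamics of a two-layer network with ReLU activation in the linearized regime.
Define the matrix Z and the Gram matrix H^∞ that capture the training dynamics.
Derive the eigenvalues and eigenfunctions of H^∞ for data uniformly distributed on the sphere, proving that the spherical harmonics are its eigenfunctions.
Extend the model to include a bias term and show how it changes the eigenstructure and the learnability of odd frequencies.
Use the Funk-Hecke theorem to analyze convolution kernels on the sphere, obtaining closed-form eigenvalues for K^∞ and K̄^∞.
Empirically measure convergence speed across frequencies and network depths, and compare it with the predicted quadratic scaling in k.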
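
The eigenvalue analysis above can be checked numerically in the simplest setting, evenly spaced points on the unit circle. The following is a minimal sketch, not the authors' code. It assumes the infinite-width kernel takes the form x·x'(π − arccos(x·x'))/(2π) without bias and (x·x' + 1)(π − arccos(x·x'))/(4π) with bias; both formulas, the function name `circle_kernel_eigenvalues`, and the sample size `n` are assumptions made for illustration. For evenly spaced points the kernel matrix is circulant, so its eigenvectors are Fourier modes and its eigenvalues are the discrete Fourier transform of its first row.

```python
import numpy as np


def circle_kernel_eigenvalues(n=1024, bias=False):
    """Eigenvalues of H/n for n evenly spaced points on the unit circle.

    H is the (assumed) infinite-width kernel matrix. Evenly spaced points make
    H circulant, so entry k of the result is the eigenvalue belonging to the
    Fourier mode of frequency k.
    """
    theta = 2.0 * np.pi * np.arange(n) / n
    # Angle between x_0 and x_j wrapped into [0, pi]; equals arccos(x_0 . x_j).
    angle = np.minimum(theta, 2.0 * np.pi - theta)
    u = np.cos(angle)  # inner product x_0 . x_j
    if bias:
        # Assumed kernel for the model with a bias term.
        row = (u + 1.0) * (np.pi - angle) / (4.0 * np.pi)
    else:
        # Assumed kernel for the model without bias.
        row = u * (np.pi - angle) / (2.0 * np.pi)
    # The row is symmetric (row[j] == row[n - j]), so its DFT is real
    # up to rounding error.
    return np.fft.fft(row).real / n


eig_no_bias = circle_kernel_eigenvalues(bias=False)
eig_bias = circle_kernel_eigenvalues(bias=True)
for k in range(1, 7):
    print(f"k={k}: no bias {eig_no_bias[k]:.2e}, with bias {eig_bias[k]:.2e}")
```

Under these assumptions, the eigenvalues for odd k ≥ 3 vanish up to rounding error when there is no bias, while with bias every frequency has a positive eigenvalue that shrinks as k grows.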

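Experimental Results

Research Questions

RQ1 How does the frequency of the target function affect the convergence speed of gradient descent in overparameterized networks?
RQ2 How does adding a bias term affect the learnability of odd-frequency components?
RQ3 Do the theoretical eigenvalues and eigenfunctions translate into observed learning times in shallow and deep networks?
RQ4 How do the results extend from one-dimensional data on the circle to data on higher-dimensional spheres?
RQ5 Can the observed frequency-dependent learning dynamics explain generalization and early stopping?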

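Key Findings

Under gradient descent, low-frequency components of the target function are learned faster than high-frequency components.
In networks without bias, components with odd frequency k ≥ 3 lie in the null space and can be neither learned nor represented.
With bias, odd frequencies become learnable, the eigenvectors remain spherical harmonics, and odd frequencies are learned at rates comparable to those of neighboring even frequencies.
The convergence time for frequency k scales quadratically, as k^2, for 1D data on the circle, and approximately as k^d for inputs of higher dimension d, consistent with experiments on both shallow and deep architectures.
Empirical convergence times agree with the theoretical predictions for two-layer networks, deep networks, and networks with skip connections; bias improves the learnability of odd frequencies.
The analysis suggests that gradient descent acts, to some extent, as a frequency-based regularizer that favors low-frequency (smoother) solutions during training.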
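
The quadratic growth of learning time can be illustrated with a short calculation. This is a sketch under stated assumptions rather than the paper's experiment: in the linearized regime the residual along an eigenvector with eigenvalue λ is taken to shrink by a factor (1 − ηλ) per gradient step, and the eigenvalue for frequency k is taken to be C/k^2. The constant `C`, the learning rate `ETA`, the tolerance `DELTA`, and the helper `iterations_to_learn` are all hypothetical values and names chosen for illustration.

```python
import math


def iterations_to_learn(eigenvalue, learning_rate, tolerance):
    """Smallest integer t with (1 - learning_rate * eigenvalue) ** t <= tolerance."""
    decay = 1.0 - learning_rate * eigenvalue
    if not 0.0 < decay < 1.0:
        raise ValueError("need 0 < learning_rate * eigenvalue < 1")
    return math.ceil(math.log(tolerance) / math.log(decay))


C = 1.0       # hypothetical constant in the assumed eigenvalue law C / k**2
ETA = 0.1     # hypothetical learning rate
DELTA = 0.01  # hypothetical target residual along each eigenvector

steps = {k: iterations_to_learn(C / k**2, ETA, DELTA) for k in (1, 2, 4, 8)}
for k, t in steps.items():
    print(f"frequency k={k}: about {t} iterations")
```

Because the step count behaves like −log(δ)/(ηλ) when ηλ is small, doubling k in this sketch roughly quadruples the number of iterations, which is the k^2 scaling described for the 1D case.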


This review was generated by AI and checked by a human editor.