QUICK REVIEW

[Paper Review] On the Rate of Convergence of GD in Non-linear Neural Networks: An Adversarial Robustness Perspective

Guy Smorodinsky, Sveta Gimpleson|arXiv (Cornell University)|Mar 2, 2026

Stochastic Gradient Optimization Techniques | Cited by 0

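One-sentence summary

This paper proves that on a minimal two-neuron ReLU network, gradient descent (GD) does converge to the optimal adversarial robustness margin, as gradient flow (GF) does in the limit, but only at the extremely slow rate Θ(1/ln t).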

ABSTRACT

We study the convergence dynamics of Gradient Descent (GD) in a minimal binary classification setting, consisting of a two-neuron ReLU network and two training instances. We prove that even under these strong simplifying assumptions, while GD successfully converges to an optimal robustness margin, effectively maximizing the distance between the decision boundary and the training points, this convergence occurs at a prohibitively slow rate, scaling strictly as $\Theta(1/\ln(t))$. To the best of our knowledge, this establishes the first explicit lower bound on the convergence rate of the robustness margin in a non-linear model. Through empirical simulations, we further demonstrate that this inherent failure mode is pervasive, exhibiting the exact same tight convergence rate across multiple natural network initializations. Our theoretical guarantees are derived via a rigorous analysis of the GD trajectories across the distinct activation patterns of the model. Specifically, we develop tight control over the system's dynamics to bound the trajectory of the decision boundary, overcoming the primary technical challenge introduced by the non-linear nature of the architecture.

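Research motivation and objectives

Understand how optimization dynamics affect the adversarial robustness of neural networks.
Study the convergence behavior of gradient descent (GD) and the resulting robustness margin on a minimal non-linear model.
Characterize how activation patterns and neuron specialization change during training and how they relate to the convergence rate, and compare with gradient flow (GF).
Provide empirical evidence that this slow convergence persists across different initializations and settings.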

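Proposed method

Analyze a depth-2, width-2 ReLU network with fixed output weights, training only the hidden-layer parameters.
Define the empirical risk with the exponential loss and study both the GF and GD dynamics.
Characterize the activation patterns and neuron specialization during training.
Derive explicit update rules under the specialization condition to reveal the equilibria and the bias toward the robust margin.
Prove that under GD the convergence rate to the optimal robust margin is Θ(1/ln t).
Complement the theory with experiments showing slow convergence under different initializations.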
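
The steps above can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not taken from the paper: scalar inputs, two training points (x, y) = (+1, +1) and (-1, -1), fixed output weights +1 and -1 so that f(x) = ReLU(w1 x + b1) - ReLU(w2 x + b2), a hypothetical initialization already in the specialized regime (each neuron active on exactly one point), and a hypothetical step size of 0.1. In this toy setup the best possible margin is 1 (decision boundary at x = 0), so the margin gap equals |x⋆|.

```python
import math

# Hypothetical toy data (an assumption, not the paper's dataset): (x, y) pairs.
DATA = [(1.0, 1.0), (-1.0, -1.0)]
GAMMA_STAR = 1.0  # best achievable margin in this toy setup (boundary at x = 0)


def relu(z):
    return z if z > 0.0 else 0.0


def loss_grad(theta):
    """Gradient of sum_i exp(-y_i f(x_i)), f(x) = relu(w1*x + b1) - relu(w2*x + b2)."""
    w1, b1, w2, b2 = theta
    g = [0.0, 0.0, 0.0, 0.0]
    for x, y in DATA:
        z1, z2 = w1 * x + b1, w2 * x + b2
        f = relu(z1) - relu(z2)
        coef = -y * math.exp(-y * f)  # derivative of exp(-y*f) with respect to f
        if z1 > 0.0:  # neuron 1 is active (fixed output weight +1)
            g[0] += coef * x
            g[1] += coef
        if z2 > 0.0:  # neuron 2 is active (fixed output weight -1)
            g[2] -= coef * x
            g[3] -= coef
    return g


def boundary(theta):
    """Intersection point x* = (b2 - b1) / (w1 - w2) of the two pre-activations."""
    w1, b1, w2, b2 = theta
    return (b2 - b1) / (w1 - w2)


def run_gd(theta0, lr=0.1, checkpoints=(100, 1000, 10000, 100000)):
    """Run plain GD and record the margin gap gamma* - gamma at the checkpoints."""
    theta = list(theta0)
    gaps = {}
    for t in range(1, max(checkpoints) + 1):
        g = loss_grad(theta)
        theta = [p - lr * gi for p, gi in zip(theta, g)]
        if t in checkpoints:
            margin = 1.0 - abs(boundary(theta))  # distance of the nearer point to x*
            gaps[t] = GAMMA_STAR - margin
    return gaps


if __name__ == "__main__":
    # Hypothetical specialized initialization (w1, b1, w2, b2).
    for t, gap in sorted(run_gd((1.0, 0.5, -1.0, 0.2)).items()):
        print(t, gap, gap * math.log(t))
```

If the gap really decays like 1/ln t, the product of the gap and ln t in the last printed column should stay roughly constant while t grows by three orders of magnitude. The sketch only covers the specialized phase; it does not reproduce the paper's analysis of the earlier activation patterns.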

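Experimental results

Research questions

RQ1: Does gradient descent converge to the max-margin robust solution in a non-linear model such as a two-neuron ReLU network?
RQ2: What is the finite-time rate at which GD converges to the robust margin, and how does it compare with GF and with potential acceleration methods?
RQ3: Do activation patterns and neuron specialization determine the observed convergence bottleneck?
RQ4: Is slow convergence to the robust margin observable under common initializations and training regimes?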

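Main findings

In this minimal setting, GF converges in direction to a KKT point that maximizes the robust margin.
GD converges to the same robust margin, but only at rate Θ(1/ln t), which makes convergence extremely slow in practice.
The intersection point x⋆(t) that determines the robust margin satisfies x⋆(t) = (b2−b1)/(w1−w2), and the denominator grows as Θ(ln t).
Under almost all initializations, the margin gap γ⋆−γ(θ(t)) decays at rate Θ(1/ln t), which gives the first explicit lower bound on the convergence rate of the robustness margin in a non-linear model.
Non-asymptotic analysis shows that under standard He initialization the margin decreases in the initial phase and then recovers only at a logarithmically slow rate, which makes reaching high robustness inefficient in practice.
Results over 10,000 trials show that many runs get stuck in a slow phase, and even the successful runs exhibit slow margin convergence consistent with the theory.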
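
To see how a formula of the form x⋆(t) = (b2−b1)/(w1−w2) can produce a Θ(1/ln t) margin gap, here is a heuristic worked sketch. It rests on an assumed toy parametrization that is not quoted from the paper: scalar inputs, training points (x, y) = (+1, +1) and (−1, −1), the network f(x) = ReLU(w1 x + b1) − ReLU(w2 x + b2), step size η, and a specialized regime in which neuron 1 is active only on x = +1 and neuron 2 only on x = −1. The symbols s1, s2, δ1, δ2 are introduced here for illustration only.

```latex
% Assumed toy setting (illustrative): exponential loss L = e^{-s_1} + e^{-s_2}
% in the specialized regime, with
\[
  s_1 = w_1 + b_1, \qquad s_2 = b_2 - w_2, \qquad
  \delta_1 = w_1 - b_1, \qquad \delta_2 = -(w_2 + b_2).
\]
% One GD step changes w_1 and b_1 by the same amount, and w_2 and b_2 by
% opposite amounts, so \delta_1 and \delta_2 stay constant while
\[
  s_i(t+1) = s_i(t) + 2\eta\, e^{-s_i(t)}
  \quad\Longrightarrow\quad
  s_i(t) = \ln t + O(1), \qquad i = 1, 2.
\]
% Substituting w_1 = (s_1+\delta_1)/2, b_1 = (s_1-\delta_1)/2,
% w_2 = -(s_2+\delta_2)/2, b_2 = (s_2-\delta_2)/2:
\[
  x^\star(t) = \frac{b_2 - b_1}{w_1 - w_2}
  = \frac{(s_2 - s_1) + (\delta_1 - \delta_2)}{s_1 + s_2 + \delta_1 + \delta_2},
\]
% where s_2 - s_1 \to 0 and the denominator equals 2\ln t + O(1). Hence, for
% \delta_1 \neq \delta_2 and with \gamma^\star = 1 in this toy setting,
\[
  \gamma^\star - \gamma(\theta(t)) = |x^\star(t)| = \Theta\!\left(\frac{1}{\ln t}\right).
\]
```

A rate of this form means that reaching a margin gap of at most ε takes a number of iterations that grows like exp(Ω(1/ε)), which is why the summary calls the convergence impractically slow.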

This review was generated by AI and reviewed by a human editor.