QUICK REVIEW

[Paper Review] Adam Converges Without Any Modification On Update Rules

Y. Q. Zhang, Bingran Li|arXiv (Cornell University)|Mar 2, 2026

Stochastic Gradient Optimization Techniques | Cited by 0

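ONE-SENTENCE SUMMARY

The paper proves that vanilla Adam converges when its hyperparameters are chosen in a problem-dependent way, reveals a phase transition in the (beta1, beta2) plane and its dependence on mini-batch size, and offers practical tuning advice.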

ABSTRACT

Adam is the default algorithm for training neural networks, including large language models (LLMs). However, Reddi et al. (2018) provided an example that Adam diverges, raising concerns for its deployment in AI model training. We identify a key mismatch between the divergence example and practice: Reddi et al. (2018) pick the problem after picking the hyperparameters of Adam, i.e., $(β_1,β_2)$; while practical applications often fix the problem first and then tune $(β_1,β_2)$. In this work, we prove that Adam converges with proper problem-dependent hyperparameters. First, we prove that Adam converges when $β_2$ is large and $β_1 < \sqrt{β_2}$. Second, when $β_2$ is small, we point out a region of $(β_1,β_2)$ combinations where Adam can diverge to infinity. Our results indicate a phase transition for Adam from divergence to convergence when changing the $(β_1, β_2)$ combination. To our knowledge, this is the first phase transition in $(β_1,β_2)$ 2D-plane reported in the literature, providing rigorous theoretical guarantees for Adam optimizer. We further point out that the critical boundary $(β_1^*, β_2^*)$ is problem-dependent, and particularly, dependent on batch size. This provides suggestions on how to tune $β_1$ and $β_2$: when Adam does not work well, we suggest tuning up $β_2$ inversely with batch size to surpass the threshold $β_2^*$, and then trying $β_1< \sqrt{β_2}$. Our suggestions are supported by reports from several empirical studies, which observe improved LLM training performance when applying them.
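
For readers who want the roles of $β_1$ and $β_2$ spelled out, the block below writes the Adam recursion in its commonly cited form (Kingma and Ba), without bias correction. This is a reference sketch, not necessarily the exact variant analyzed in the paper: the symbols $g_k$ (stochastic gradient at step $k$), $\eta_k$ (step size) and $\epsilon$ (small stabilizer) are notational assumptions that do not appear in this review.

```latex
\begin{aligned}
m_k     &= \beta_1\, m_{k-1} + (1-\beta_1)\, g_k, \\
v_k     &= \beta_2\, v_{k-1} + (1-\beta_2)\, g_k \odot g_k, \\
x_{k+1} &= x_k - \eta_k\, \frac{m_k}{\sqrt{v_k} + \epsilon}.
\end{aligned}
```

Here $β_1$ controls how slowly the momentum $m_k$ forgets past gradients and $β_2$ does the same for the second-moment estimate $v_k$, which is why the abstract's condition $β_1 < \sqrt{β_2}$ compares the two decay rates directly.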

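MOTIVATION AND GOALS

Explain the gap between the classic divergence result and Adam's practical success on a fixed problem.
Establish conditions under which vanilla Adam converges without any change to its update rule.
Characterize a divergence-to-convergence phase transition in the (beta1, beta2) plane.
Highlight how convergence depends on batch size and give practitioners tuning guidance.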

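PROPOSED METHOD

Analyze Adam under two sampling schemes: sampling with replacement and random shuffling.
Give non-asymptotic convergence results for any problem in the class F_{L,D0,D1}^n via Theorem 3.1 (with replacement) and Theorem 3.3 (random shuffling).
Prove that, for large beta2 and beta1 < sqrt(beta2), Adam converges to a critical point (realizable case) or to a neighborhood of one (non-realizable case).
Exhibit a divergence region for small beta2, together with a problem-dependent boundary (beta1*, beta2*).
Explain the concentration analysis around 1/sqrt(v_k), which handles unbounded gradients without a bounded-gradient assumption.
Compare with Reddi et al. (2018) by fixing the problem before choosing (beta1, beta2), which reveals a phase transition that depends on the problem class and batch size.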
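
The two sampling schemes named above can be made concrete with a small sketch. It runs textbook Adam on a toy scalar finite-sum problem and differs only in how sample indices are generated. Everything here is illustrative rather than taken from the paper: the function names, the toy objective f_i(x) = 0.5 * (x - a_i)^2, the use of bias correction, and the diminishing step size lr / sqrt(k) are all assumptions.

```python
import math
import random


def with_replacement_indices(n, steps, rng):
    """Draw one sample index per step, uniformly and with replacement."""
    return [rng.randrange(n) for _ in range(steps)]


def random_shuffling_indices(n, epochs, rng):
    """Visit every index exactly once per epoch, in a freshly shuffled order."""
    order = []
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)
        order.extend(perm)
    return order


def adam_scalar(grad_fns, indices, x0=0.0, lr=0.1,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """Textbook Adam with bias correction on a scalar finite-sum problem.

    grad_fns[i](x) returns the gradient of the i-th component function.
    The lr / sqrt(k) schedule is an assumption made for this sketch.
    """
    x, m, v = x0, 0.0, 0.0
    for k, i in enumerate(indices, start=1):
        g = grad_fns[i](x)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** k)             # bias correction
        v_hat = v / (1 - beta2 ** k)
        x -= (lr / math.sqrt(k)) * m_hat / (math.sqrt(v_hat) + eps)
    return x


if __name__ == "__main__":
    # Toy non-realizable problem: the components disagree on the minimizer.
    targets = [1.0, 2.0, 3.0, 4.0]
    grads = [lambda x, a=a: x - a for a in targets]
    rng = random.Random(0)
    n, epochs = len(targets), 500
    print("with replacement:",
          adam_scalar(grads, with_replacement_indices(n, n * epochs, rng)))
    print("random shuffling:",
          adam_scalar(grads, random_shuffling_indices(n, epochs, rng)))
```

Keeping the index generators separate from the optimizer means the update rule is identical under both schemes, which mirrors the idea of analyzing unmodified Adam and varying only the sampling.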

(a) Divergent region claimed by Reddi et al. (2018)

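EXPERIMENTAL RESULTS

Research Questions

RQ1: Can vanilla Adam converge without modifying its update rule when the problem is fixed and suitable hyperparameters are chosen?
RQ2: How do beta1 and beta2 shape Adam's convergence and divergence regions on finite-sum ERM problems?
RQ3: Is there a phase transition in the (beta1, beta2) plane that separates convergence from divergence?
RQ4: How do batch size and the problem-class parameters affect the critical boundary for convergence?
RQ5: What practical tuning advice follows from a theory that couples beta1 and beta2 with batch size and problem class?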

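Key Findings

A convergence region exists: if 0 ≤ beta1 < sqrt(beta2) < 1 and beta2 exceeds a problem-dependent threshold, Adam converges to a critical point (realizable case) or to a neighborhood of one (non-realizable case).
A divergence region exists: for small beta2 there are instances in the problem class on which Adam diverges to infinity, and the boundary of this region expands as the number of mini-batches n grows (that is, the effect is more pronounced for smaller batches).
A phase transition in the (beta1, beta2) plane separates divergence from convergence, and its boundary depends on the problem class and batch size.
The critical boundary (beta1*, beta2*) is problem-dependent, and the threshold rises as batch size shrinks, so smaller batches call for a larger beta2.
Empirically, a larger beta2 tuned together with batch size is consistent with improved training in large language model pre-training; once beta2 is large enough, the advice is to set beta1 below sqrt(beta2).
The analysis covers both sampling schemes (with replacement and random shuffling) and does not assume bounded gradients, which gives insight into behavior under unbounded gradients.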
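
The tuning advice in these findings reduces to two checks that are easy to script. The helper below is a hypothetical sketch, not something provided by the paper: `check_betas` is an invented name, and `beta2_threshold` stands in for the problem-dependent beta2* that this review does not give a formula for, so it must be supplied by the user (for example from a small sweep).

```python
import math


def check_betas(beta1, beta2, beta2_threshold=None):
    """Return a list of warnings; an empty list means both checks passed.

    beta2_threshold is a user-supplied stand-in for the problem-dependent
    beta2* (hypothetical parameter; no formula is given in this review).
    """
    if not (0.0 <= beta1 < 1.0 and 0.0 < beta2 < 1.0):
        raise ValueError("need 0 <= beta1 < 1 and 0 < beta2 < 1")
    warnings = []
    # Condition from the convergence region: beta1 < sqrt(beta2).
    if beta1 >= math.sqrt(beta2):
        warnings.append(
            f"beta1={beta1} is not below sqrt(beta2)={math.sqrt(beta2):.4f}"
        )
    # Condition that beta2 exceeds the problem-dependent threshold.
    if beta2_threshold is not None and beta2 <= beta2_threshold:
        warnings.append(
            f"beta2={beta2} does not exceed the supplied threshold "
            f"{beta2_threshold}; try raising beta2"
        )
    return warnings


if __name__ == "__main__":
    for betas in [(0.9, 0.999), (0.9, 0.8)]:
        print(betas, check_betas(*betas))
```

As a worked case, the widely used defaults (0.9, 0.999) pass the first check because sqrt(0.999) is about 0.9995, while (0.9, 0.8) fails it because sqrt(0.8) is about 0.894. Passing the first check alone says nothing about whether beta2 clears the problem-dependent threshold.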


This review was generated by AI and checked by a human editor.