QUICK REVIEW

[Paper Explainer] Why Adam Works Better with $β_1 = β_2$: The Missing Gradient Scale Invariance Principle

Alberto Fernández-Hernández, Cristian Pérez-Corral|arXiv (Cornell University)|Jan 29, 2026

Stochastic Gradient Optimization Techniques | Cited by 0

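One-Sentence Summary

The paper proves that Adam is first-order gradient scale invariant if and only if β1 = β2, and shows that this choice yields smoother, more stable updates, with experiments spanning multiple architectures and datasets across vision and language tasks.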

ABSTRACT

Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy $β_{1}=β_{2}$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as \textit{gradient scale invariance}. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $β_{1}=β_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $β_{1}=β_{2}$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.

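Research Motivation and Objectives

Understand why tying the momentum parameters together (β1 = β2) improves Adam's stability and performance.
Formalize gradient scale invariance as a structural property of the Adam update.
Prove that Adam attains first-order gradient scale invariance if and only if β1 = β2.
Connect balanced Adam to the design of modern scale-robust optimizers through theoretical and empirical analysis.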

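Proposed Method

Introduce gradient scale invariance and give a formal definition of it for update rules.
Derive a continuous-time Adam flow from the discrete update to analyze its dependence on gradient scale.
Carry out a first-order expansion of m, v, and the normalized update R under a gradient drift δ(t).
Prove that Adam is first-order gradient scale invariant if and only if τ1 = τ2 (equivalently, β1 = β2).
Validate the theory with synthetic experiments and with real training runs on vision and language models (multiple architectures and datasets).
Quantify update stability through the oscillation of the update norm, and compare it across different (β1, β2) configurations.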
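
The effect of gradient rescaling on the normalized update can be illustrated with a toy scalar experiment. The following is a minimal sketch, not the paper's experimental setup: the function names `adam_normalized_update` and `max_deviation`, the step-change gradient stream, and the specific β values are all assumptions chosen for illustration. It feeds Adam a constant gradient, abruptly rescales it by a factor 1 + ε, and records how far the bias-corrected ratio R_k = m̂_k / (√v̂_k + eps) drifts away from 1.

```python
import numpy as np


def adam_normalized_update(grads, beta1, beta2, eps=1e-12):
    """Bias-corrected Adam ratio R_k = m_hat_k / (sqrt(v_hat_k) + eps)
    for a scalar gradient stream (illustrative helper, not from the paper)."""
    m, v = 0.0, 0.0
    ratios = np.empty(len(grads))
    for k, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g        # first-moment EMA
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment EMA
        m_hat = m / (1.0 - beta1 ** k)           # standard bias correction
        v_hat = v / (1.0 - beta2 ** k)
        ratios[k - 1] = m_hat / (np.sqrt(v_hat) + eps)
    return ratios


def max_deviation(scale_jump, beta1, beta2, warmup=2000, horizon=2000):
    """Largest |R_k - 1| after the gradient jumps from 1 to 1 + scale_jump.

    The step-change stream is an assumed toy setting: `warmup` steps at
    gradient 1, then `horizon` steps at gradient 1 + scale_jump.
    """
    grads = np.concatenate(
        [np.ones(warmup), np.full(horizon, 1.0 + scale_jump)]
    )
    ratios = adam_normalized_update(grads, beta1, beta2)
    return float(np.max(np.abs(ratios[warmup:] - 1.0)))


if __name__ == "__main__":
    for eps_scale in (0.1, 0.05):
        tied = max_deviation(eps_scale, beta1=0.95, beta2=0.95)
        untied = max_deviation(eps_scale, beta1=0.9, beta2=0.99)
        print(f"jump={eps_scale}: tied={tied:.6f}, untied={untied:.6f}")
```

Under these assumptions, halving ε should roughly quarter the drift when β1 = β2 (a second-order effect) but only roughly halve it when β1 ≠ β2 (a first-order effect), which is the qualitative behaviour the first-order invariance result predicts.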

Figure 1: Evolution of $\|\mathbf{R}_{k}\|$ in Adam for $\beta_{1}=\beta_{2}$.

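Experimental Results

Research Questions

RQ1: Why does tying β1 and β2 together (β1 = β2) stabilize Adam's updates and improve performance across tasks?
RQ2: When β1 ≠ β2, how does the gradient scale affect Adam's update, and under what conditions does first-order scale invariance hold?
RQ3: Can the notion of gradient scale invariance unify Adam with the more modern scale-robust optimizers seen in practice?
RQ4: What are the empirical markers of first-order gradient scale invariance in training dynamics across architectures?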

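Key Findings

Adam is first-order gradient scale invariant if and only if β1 = β2 (τ1 = τ2 in the continuous-time flow).
When β1 = β2, the leading-order dependence of the update on gradient magnitude vanishes, yielding more stable updates driven by the gradient direction.
Experiments on synthetic and real models show that, on vision and language tasks, the update norm is smoother and oscillates less when β1 = β2.
An empirical oscillation analysis across multiple architectures and datasets shows that the diagonal configurations (β1 = β2) reduce update oscillation, and the reduction is statistically significant.
The results place balanced Adam within the broader framework of scale-robust optimizers and offer a principled design guideline for future methods.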
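
The if-and-only-if statement can be made concrete with a small discrete calculation. This is an illustrative toy case rather than the paper's continuous-time derivation, and it rests on stated assumptions: a scalar gradient g > 0 held constant long enough that the moments sit at m_0 = g and v_0 = g^2, after which the gradient is rescaled to (1 + ε)g for all later steps, with Adam's small stabilizing constant ignored. After j steps at the new scale:

```latex
\begin{align*}
m_j &= \beta_1^{\,j}\, g + (1-\beta_1^{\,j})(1+\varepsilon)\, g
     = g\,\bigl[1 + (1-\beta_1^{\,j})\,\varepsilon\bigr], \\
v_j &= \beta_2^{\,j}\, g^2 + (1-\beta_2^{\,j})(1+\varepsilon)^2 g^2
     = g^2\,\bigl[1 + (1-\beta_2^{\,j})\,(2\varepsilon + \varepsilon^2)\bigr], \\
R_j &= \frac{m_j}{\sqrt{v_j}}
     = 1 + \bigl(\beta_2^{\,j} - \beta_1^{\,j}\bigr)\,\varepsilon + O(\varepsilon^2).
\end{align*}
```

In this toy case the first-order coefficient $\beta_2^{\,j} - \beta_1^{\,j}$ vanishes for every $j \ge 1$ exactly when $\beta_1 = \beta_2$, so the normalized update then responds to the rescaling only at order $\varepsilon^2$.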

Figure 2: Evolution of $\|\mathbf{R}_{k}\|$ in Adam for $\beta_{1}\neq\beta_{2}$.

This explainer was generated by AI and reviewed by a human editor.