QUICK REVIEW

[Paper Review] Where Does Warm-Up Come From? Adaptive Scheduling for Norm-Constrained Optimizers

Artem Riabinin, Andrey Veprikov|arXiv (Cornell University)|Feb 5, 2026

Stochastic Gradient Optimization Techniques | Cited by 0

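One-Sentence Summary

This paper provides a theoretical framework and a practical adaptive warm-up scheduler for norm-constrained optimizers such as Muon, Lion, and normSGD. It introduces a generalized smoothness assumption in which the local curvature bound grows with the suboptimality gap, shows that warm-up arises naturally from the convergence analysis, and demonstrates that adaptive warm-up matches or outperforms manually tuned warm-up in LLM pretraining without additional hyperparameter search.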

ABSTRACT

We study adaptive learning rate scheduling for norm-constrained optimizers (e.g., Muon and Lion). We introduce a generalized smoothness assumption under which local curvature decreases with the suboptimality gap and empirically verify that this behavior holds along optimization trajectories. Under this assumption, we establish convergence guarantees under an appropriate choice of learning rate, for which warm-up followed by decay arises naturally from the proof rather than being imposed heuristically. Building on this theory, we develop a practical learning rate scheduler that relies only on standard hyperparameters and adapts the warm-up duration automatically at the beginning of training. We evaluate this method on large language model pretraining with LLaMA architectures and show that our adaptive warm-up selection consistently outperforms or at least matches the best manually tuned warm-up schedules across all considered setups, without additional hyperparameter search. Our source code is available at https://github.com/brain-lab-research/llm-baselines/tree/warmup

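Motivation and Objectives

Motivate and justify warm-up for norm-constrained optimizers on theoretical grounds, going beyond empirical heuristics.
Introduce a generalized smoothness model that ties curvature to suboptimality.
Prove convergence under this model with a warm-up-then-decay learning rate.
Develop a practical adaptive warm-up scheduler that relies only on standard hyperparameters.
Validate the scheduler on large language model pretraining and show competitive performance.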

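Proposed Method

Formulate the LMO-based (linear minimization oracle) update x^{t+1} = x^{t} + η^{t} LMO(g^{t}) and relate it to a quadratic approximation of the loss.
State Assumption 2, (ρ, K0, K1, Kρ)-smoothness, in which the curvature upper bound depends on the suboptimality f(x) − f^{*}.
Prove that with learning rate η^{t} = Δ^{t}/(D·K(x^{t})), the suboptimality Δ^{t} decreases and K(x^{t}) decreases with it (Theorem 1).
Extend the analysis to weight decay via the update x^{t+1} = (1 − λη^{t}) x^{t} + η^{t} LMO(g^{t}) and prove convergence (Theorem 2).
Extend the analysis to the stochastic setting with scaled gradient normalization under an interpolation assumption (Theorem 3).
Obtain a practical adaptive warm-up scheduler by fitting the three-parameter model η(Δ) = Δ/(K0 + K1Δ + K2Δ^2) under constraints, and switch to decay once Δ′ is reached (Algorithm 5).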
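The update rule and the fitted learning-rate model above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the LMO is taken over a unit norm ball and returns a descent direction so that it matches the `+ η LMO(g)` sign convention, and the constants K0, K1, K2 are made-up values rather than fitted ones. The ℓ∞ ball gives a sign-based step and the ℓ2 ball gives a normalized-gradient step; the spectral-norm case used by Muon is not shown.

```python
import math


def lmo_sign(grad):
    """LMO over the unit l_inf ball: the minimizer of <grad, s> is -sign(grad)."""
    return [-math.copysign(1.0, g) if g != 0.0 else 0.0 for g in grad]


def lmo_l2(grad):
    """LMO over the unit l_2 ball: -grad / ||grad||_2 (zero vector if grad is zero)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)
    return [-g / norm for g in grad]


def lmo_step(x, grad, eta, lmo=lmo_l2, weight_decay=0.0):
    """One update x <- (1 - weight_decay * eta) * x + eta * LMO(grad)."""
    direction = lmo(grad)
    shrink = 1.0 - weight_decay * eta
    return [shrink * xi + eta * di for xi, di in zip(x, direction)]


def eta_model(delta, k0, k1, k2):
    """Three-parameter model eta(delta) = delta / (k0 + k1*delta + k2*delta^2)."""
    return delta / (k0 + k1 * delta + k2 * delta * delta)


def peak_gap(k0, k2):
    """Gap at which eta_model is largest (assumes k0 > 0 and k2 > 0)."""
    return math.sqrt(k0 / k2)


if __name__ == "__main__":
    k0, k1, k2 = 1.0, 2.0, 4.0  # hypothetical constants, not fitted values
    # As the gap shrinks during training, eta first rises, then falls.
    for delta in (2.0, 1.0, 0.5, 0.25, 0.1):
        print(f"gap={delta:<5} eta={eta_model(delta, k0, k1, k2):.4f}")
```

Reading the printed values from the largest gap to the smallest mimics a training run: the learning rate grows while the gap is above `peak_gap(k0, k2)` and shrinks afterwards, which is the warm-up-then-decay shape described in the list.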

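Experimental Results

Research Questions

RQ1: Can learning rate warm-up for LMO-based optimizers be justified theoretically rather than only empirically?
RQ2: Can the warm-up duration be adapted automatically at the start of training, without manual tuning?
RQ3: Does a suboptimality-dependent smoothness model naturally explain warm-up and decay as part of the optimization dynamics?
RQ4: Does the adaptive warm-up scheduler perform well in large-scale LLM pretraining without a hyperparameter search?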

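Key Findings

The generalized smoothness model, in which curvature depends on the suboptimality gap, is supported empirically along optimization trajectories.
Under this model, warm-up followed by decay emerges naturally from the convergence proof for LMO-based optimizers.
The resulting practical adaptive warm-up scheduler uses only standard hyperparameters and estimates K0, K1, K2, and Δ′ at the start of training.
In LLaMA pretraining with Muon, Lion, and normSGD, adaptive warm-up matches or outperforms manually tuned warm-up schedules without additional hyperparameter search.
The method remains robust across model sizes and batch-size settings, and is especially advantageous with small batches.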
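A short calculation shows why the fitted model η(Δ) = Δ/(K0 + K1Δ + K2Δ^2) produces warm-up followed by decay. The derivation below assumes K0 > 0, K2 > 0, and K1 ≥ 0, and it assumes that Δ′ denotes the maximizer of this model; the summary does not state either point explicitly.

```latex
\[
\eta(\Delta) = \frac{\Delta}{K_0 + K_1\Delta + K_2\Delta^2},
\qquad
\frac{d\eta}{d\Delta}
  = \frac{K_0 - K_2\Delta^2}{\left(K_0 + K_1\Delta + K_2\Delta^2\right)^2}.
\]
\[
\frac{d\eta}{d\Delta} = 0
\;\Longleftrightarrow\;
\Delta' = \sqrt{K_0 / K_2},
\qquad
\eta(\Delta') = \frac{1}{K_1 + 2\sqrt{K_0 K_2}}.
\]
```

For Δ > Δ′ the derivative is negative, so η grows as the gap shrinks (warm-up). For Δ < Δ′ the derivative is positive, so η shrinks together with the gap (decay).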


This review was generated by AI and reviewed by a human editor.