QUICK REVIEW

[Paper Review] Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models

Xingyu Xie, Pan Zhou | arXiv (Cornell University) | Aug 13, 2022

Advanced Neural Network Applications | Cited by 63

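One-Sentence Summary

Adan introduces Nesterov momentum estimation (NME) to adaptively estimate the first- and second-order moments of the gradient, achieving faster convergence in non-convex stochastic optimization and robust performance across vision, language, and reinforcement learning tasks.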

ABSTRACT

In deep learning, different kinds of deep networks typically need different optimizers, which have to be chosen after multiple trials, making the training process inefficient. To relieve this issue and consistently improve the model training speed across deep networks, we propose the ADAptive Nesterov momentum algorithm, Adan for short. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra overhead of computing gradient at the extrapolation point. Then, Adan adopts NME to estimate the gradient's first- and second-order moments in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an $ε$-approximate first-order stationary point within $\mathcal{O}(ε^{-3.5})$ stochastic gradient complexity on the non-convex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan consistently surpasses the corresponding SoTA optimizers on vision, language, and RL tasks and sets new SoTAs for many popular networks and frameworks, e.g., ResNet, ConvNext, ViT, Swin, MAE, DETR, GPT-2, Transformer-XL, and BERT. More surprisingly, Adan can use half of the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, GPT-2, MAE, etc., and also shows great tolerance to a large range of minibatch size, e.g., from 1k to 32k. Code is released at https://github.com/sail-sg/Adan, and has been used in multiple popular deep learning frameworks or projects.

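Motivation and Objectives

Establish the need for an optimizer that consistently improves training speed across diverse deep architectures.
Develop an efficient optimizer that combines Nesterov momentum with adaptive gradient methods without the extra cost of computing gradients at an extrapolation point.
Provide theoretical convergence guarantees together with empirical evidence on vision, language, and reinforcement learning tasks.
Show that adaptive regularization and decoupled weight decay improve generalization in practice.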

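Proposed Method

Proposes Nesterov momentum estimation (NME), which computes the gradient at the current point and builds a corrected gradient surrogate that mimics extrapolation without extra overhead.
Defines the first- and second-order moment updates using the corrected gradient $g_k' = g_k + (1-β_1)(g_k - g_{k-1})$ and integrates them into an Adam-like update with moment estimates $m_k$ and $n_k$.
Introduces a proximal-inspired decoupled regularization step that minimizes a first-order approximation with a dynamic regularizer $F_k'$ containing norm terms weighted by $n_k$.
Gives the full algorithm (Algorithm 1), including a restart condition that stabilizes the momentum and supports effective convergence in practice.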
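
The update described in this section can be sketched in a few lines of NumPy. This is a minimal illustration assembled only from the summary above, not the released implementation at https://github.com/sail-sg/Adan, which may differ in its details. The function names `adan_like_step` and `quadratic_demo`, the hyperparameter defaults, the `eps` stabilizer, the omission of the restart condition, and the closed-form division used for weight decay are all assumptions made for illustration.

```python
import numpy as np


def adan_like_step(theta, grad, prev_grad, m, n,
                   lr=0.01, beta1=0.02, beta2=0.01,
                   weight_decay=0.02, eps=1e-8):
    """One illustrative update; names and defaults are assumptions."""
    # NME-style corrected gradient: g'_k = g_k + (1 - beta1) * (g_k - g_{k-1}).
    g_corr = grad + (1.0 - beta1) * (grad - prev_grad)
    # First moment: moving average of the corrected gradient.
    m = (1.0 - beta1) * m + beta1 * g_corr
    # Second moment: moving average of the squared corrected gradient.
    n = (1.0 - beta2) * n + beta2 * g_corr ** 2
    # Adam-like per-coordinate step size.
    step = lr / (np.sqrt(n) + eps)
    # Decoupled, proximal-style weight decay: shrink the parameters by a
    # division instead of adding weight_decay * theta to the gradient.
    theta = (theta - step * m) / (1.0 + weight_decay * lr)
    return theta, m, n


def quadratic_demo(num_steps=50):
    """Run the sketch on f(theta) = 0.5 * ||theta||^2 (gradient = theta)."""
    theta = np.array([1.0, -2.0])
    m = np.zeros_like(theta)
    n = np.zeros_like(theta)
    prev_grad = theta.copy()  # no gradient difference on the first step
    initial_loss = 0.5 * float(theta @ theta)
    for _ in range(num_steps):
        grad = theta.copy()
        theta, m, n = adan_like_step(theta, grad, prev_grad, m, n)
        prev_grad = grad
    final_loss = 0.5 * float(theta @ theta)
    return initial_loss, final_loss
```

Calling `quadratic_demo()` returns the loss before and after 50 steps on a toy quadratic; the second value should be smaller than the first. Note that only the gradient at the current point is ever evaluated, which is the property NME is meant to provide.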

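Experimental Results

Research Questions

RQ1: Does Adan's NME achieve faster convergence than existing Adam-type optimizers on non-convex stochastic problems?
RQ2: Under Lipschitz gradient and Lipschitz Hessian assumptions, does Adan reach or approach the theoretical lower bound on stochastic gradient complexity?
RQ3: Is Adan robust across diverse architectures and training settings, including large-minibatch training and different dataset scales?
RQ4: Does decoupled weight decay (as in AdamW) combine effectively with Adan to improve generalization?
RQ5: How does Adan compare with SoTA optimizers on vision, NLP, and reinforcement learning benchmarks?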

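Key Findings

Adan reaches an $ε$-approximate first-order stationary point with stochastic gradient complexity $\mathcal{O}(c_\infty^{2.5} ε^{-4})$, matching the known lower bound up to constant factors.
Under a Lipschitz Hessian assumption, Adan with restart attains $\mathcal{O}(c_\infty^{1.25} ε^{-3.5})$ complexity, which also matches the lower bound and improves on several prior methods.
Empirically, Adan consistently outperforms SoTA optimizers on vision, language, and RL tasks, reaching higher or comparable performance on several architectures at roughly half the training cost (epochs).
Adan is robust across a wide range of minibatch sizes (e.g., 1k to 32k) and scales well on models such as ViT, GPT-2, and MAE.
The method integrates seamlessly with decoupled weight decay (AdamW style) and shows better generalization.
The theoretical results hold without requiring large momentum hyperparameters, consistent with practical training settings in which $β_1$ and $β_2$ take small values.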
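
To get a feel for the gap between the two complexity bounds listed above, the following sketch evaluates them with the hidden constant set to 1. The function name `adan_complexity_bound` and the choice $c_\infty = 1$ are illustrative assumptions; big-O bounds only describe scaling, so the absolute numbers carry no meaning beyond their ratio.

```python
import math


def adan_complexity_bound(eps, c_inf=1.0, lipschitz_hessian=False):
    """Evaluate the stated bound with the hidden constant set to 1 (assumption)."""
    if lipschitz_hessian:
        # O(c_inf^1.25 * eps^-3.5): Lipschitz Hessian, with restart.
        return c_inf ** 1.25 * eps ** -3.5
    # O(c_inf^2.5 * eps^-4): Lipschitz gradient only.
    return c_inf ** 2.5 * eps ** -4.0


eps = 0.01
ratio = (adan_complexity_bound(eps)
         / adan_complexity_bound(eps, lipschitz_hessian=True))
# With c_inf = 1 the ratio is eps ** -0.5, i.e. about 10 for eps = 0.01.
print(f"bound ratio at eps={eps}: {ratio:.2f}")
```

In other words, under these assumptions the extra Lipschitz Hessian condition saves a factor of $ε^{-0.5}$ in the stochastic gradient count, and the saving grows as the target accuracy tightens.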


This review was generated by AI and checked by human editors.