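[Paper Walkthrough] A Unified View of Attention and Residual Sinks: Outlier-Driven Rescaling is Essential for Transformer Training
The paper argues that outliers in attention and in the residual stream rescale the other components through normalization. It introduces gated rescaling (GatedNorm) to improve training stability and quantization robustness while mitigating residual sinks.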
We investigate the functional role of emergent outliers in large language models, specifically attention sinks (a few tokens that consistently receive large attention logits) and residual sinks (a few fixed dimensions with persistently large activations across most tokens). We hypothesize that these outliers, in conjunction with the corresponding normalizations (e.g., softmax attention and RMSNorm), effectively rescale other non-outlier components. We term this phenomenon outlier-driven rescaling and validate this hypothesis across different model architectures and training token counts. This view unifies the origin and mitigation of both sink types. Our main conclusions and observations include: (1) Outliers function jointly with normalization: removing normalization eliminates the corresponding outliers but degrades training stability and performance; directly clipping outliers while retaining normalization leads to degradation, indicating that outlier-driven rescaling contributes to training stability. (2) Outliers serve more as rescale factors rather than contributors, as the final contributions of attention and residual sinks are significantly smaller than those of non-outliers. (3) Outliers can be absorbed into learnable parameters or mitigated via explicit gated rescaling, leading to improved training performance (average gain of 2 points) and enhanced quantization robustness (1.2 points degradation under W4A4 quantization).
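The rescaling effect described in the abstract can be illustrated with a toy calculation. The sketch below is not taken from the paper: the vectors, the outlier value of 40.0, and the sink logit of 6.0 are made-up numbers, and RMSNorm is shown without its learnable gain. It only demonstrates the arithmetic: one large entry passing through RMSNorm or softmax shrinks every other entry while leaving their relative proportions unchanged.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learnable gain: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Residual-sink toy: dimension 0 plays the role of the outlier dimension.
base = [0.5, -1.0, 0.8, -0.3, 1.2, -0.7, 0.4, -0.9]   # illustrative values
with_sink = [40.0] + base[1:]                          # same vector plus one outlier

normed_base = rms_norm(base)
normed_sink = rms_norm(with_sink)

# Factor by which a non-outlier dimension is scaled down by the outlier.
shrink = abs(normed_sink[1]) / abs(normed_base[1])

# Attention-sink toy: token 0 receives a large logit.
logits = [0.2, 0.1, -0.1, 0.3]                         # illustrative values
logits_sink = [6.0] + logits[1:]

p = softmax(logits)
p_sink = softmax(logits_sink)

# Total attention mass left for the non-sink tokens.
non_sink_mass = sum(p_sink[1:])

print(f"non-outlier dims scaled by {shrink:.3f}")
print(f"attention mass on non-sink tokens: {non_sink_mass:.4f}")
```

In both cases the outlier contributes little information itself; its size sets the scale of everything else, which is the sense in which the abstract calls outliers rescale factors rather than contributors.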
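Motivation and goals
- Study the functional role of attention sinks and residual sinks in large language models.
- Show that the interaction between outliers and normalization rescales the non-outlier components.
- Show that removing normalization, or simply clipping outliers, harms training stability and performance.
- Propose and evaluate mitigation strategies that preserve or replace outlier-driven rescaling, in order to improve training and quantization.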
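Proposed method
- Analyze outlier patterns in attention logits and residual activations across multiple models and training token counts.
- Formalize outlier-driven rescaling as the interaction between outliers and normalization (softmax attention and RMSNorm).
- Run ablations that remove normalization, clip outliers, or change the activation function, and evaluate stability and performance.
- Introduce PreAffine RMSNorm, which absorbs outliers into learnable parameters before normalization.
- Propose GatedNorm, which adds an explicit gating mechanism after normalization to perform the rescaling while reducing outliers and improving quantization robustness.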
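The last two items can be sketched as follows. This summary does not give the exact formulas, so both functions are assumptions made for illustration: `pre_affine_rms_norm` multiplies the input by a hypothetical learnable per-dimension vector `pre_scale` before RMSNorm, and `gated_norm` multiplies the RMSNorm output by a hypothetical scalar sigmoid gate parameterized by `gate_w` and `gate_b`. The paper's actual parameterization may differ.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Standard RMSNorm with a learnable per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pre_affine_rms_norm(x, pre_scale, gain):
    """Assumed form: per-dimension learnable scaling applied BEFORE RMSNorm.

    A large entry in pre_scale can play the role that a large activation
    (residual sink) would otherwise play, so the activation can stay small.
    """
    scaled = [s * v for s, v in zip(pre_scale, x)]
    return rms_norm(scaled, gain)

def gated_norm(x, gain, gate_w, gate_b):
    """Assumed form: RMSNorm followed by an input-dependent scalar gate in (0, 1)."""
    normed = rms_norm(x, gain)
    gate = sigmoid(sum(w * v for w, v in zip(gate_w, normed)) + gate_b)
    return [gate * v for v in normed]

x = [0.5, -1.0, 0.8, -0.3, 1.2, -0.7, 0.4, -0.9]  # no outlier in the activation
ones = [1.0] * len(x)

plain = rms_norm(x, ones)

# The parameter, not the activation, carries the large value.
pre_scale = [80.0] + [1.0] * (len(x) - 1)          # hypothetical learned values
pre = pre_affine_rms_norm(x, pre_scale, ones)

# A gate with zero weights and bias -3.0 rescales every dimension by sigmoid(-3).
gated = gated_norm(x, ones, gate_w=[0.0] * len(x), gate_b=-3.0)

print(f"plain RMSNorm, dim 1:     {plain[1]:.4f}")
print(f"PreAffine RMSNorm, dim 1: {pre[1]:.4f}")
print(f"gated output, dim 1:      {gated[1]:.4f}")
```

In both variants the input `x` contains no outlier, yet the non-outlier outputs are scaled down. That is the design idea the bullets describe: keep the rescaling but move it out of the activations and into parameters or an explicit gate.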
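Experimental results
Research questions
- RQ1: Do outliers in attention and the residual stream act mainly as rescaling factors rather than as direct contributors to the output?
- RQ2: Is outlier-driven rescaling necessary for stable training, and can it be preserved or replaced without sacrificing performance?
- RQ3: Can an explicit rescaling mechanism (such as gating) mitigate residual sinks and improve robustness to quantization and architectural choices?
- RQ4: How do different normalization and attention variants (softmax, linear, gated) affect sink formation and training stability?
- RQ5: Can outliers be absorbed into parameters without loss of function, and what are the trade-offs for model capacity and deployment?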
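Main findings
- Outliers and normalization interact to perform rescaling; removing normalization degrades stability and performance.
- Outliers act mainly as rescaling factors; their final contribution to the output is smaller than that of non-outliers.
- Outliers can be absorbed into learnable parameters or mitigated through explicit gated rescaling, which improves training and quantization robustness.
- GatedNorm reduces residual sinks, maintains or improves performance, and is more robust to quantization in the FP4 setting.
- Explicit rescaling through gating reduces the reliance on outliers, making the model less sensitive to architectural choices and more robust across activation functions and architectures.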
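Why fewer outliers help low-bit quantization can be seen in a small example. The sketch below is a simplification and not the paper's setup: it uses symmetric per-tensor integer quantization to the range [-7, 7] rather than an FP4 format, and the activation values are made up. It shows that a single large entry stretches the quantization step so much that all the small entries collapse to zero.

```python
def quantize_int4(x):
    """Symmetric per-tensor quantization to integers in [-7, 7], then dequantize."""
    max_abs = max(abs(v) for v in x)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    return [max(-7, min(7, round(v / scale))) * scale for v in x]

def mean_abs_error(x, xq, idx):
    """Mean absolute quantization error over the selected dimensions."""
    return sum(abs(x[i] - xq[i]) for i in idx) / len(idx)

clean = [0.5, -1.0, 0.8, -0.3, 1.2, -0.7, 0.4, -0.9]   # illustrative activations
sink = [40.0] + clean[1:]                               # same, with one outlier dimension

non_outlier = list(range(1, len(clean)))                # dimensions 1..7

clean_q = quantize_int4(clean)
sink_q = quantize_int4(sink)

err_clean = mean_abs_error(clean, clean_q, non_outlier)
err_sink = mean_abs_error(sink, sink_q, non_outlier)

print(f"error without outlier: {err_clean:.4f}")
print(f"error with outlier:    {err_sink:.4f}")
```

With the outlier present, the quantization step is 40/7, so every non-outlier value rounds to zero and its information is lost. This is consistent with the finding that reducing residual sinks improves robustness at 4-bit precision.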
This walkthrough was generated by AI and reviewed by human editors.