QUICK REVIEW

[Paper Review] Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels

Alexandra Zelenin, Alexandra Zhuravlyova|arXiv (Cornell University)|Mar 23, 2026
Parallel Computing and Optimization Techniques | Cited by 0
One-Sentence Summary

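This paper proposes a factored norm for high-rank DoRA that avoids materializing the dense product B@A, implements it with fused Triton kernels, and shows faster forward/backward passes and lower VRAM usage on six vision-language models across multiple GPUs.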

ABSTRACT

Weight-Decomposed Low-Rank Adaptation (DoRA) extends LoRA by decoupling weight magnitude from direction, but its forward pass requires the row-wise norm of W + sBA, a computation that every major framework we surveyed implements by materializing the dense [d_out, d_in] product BA. At d_in = 8192 and rank r = 384, a single module's norm requires about 512 MB of transient working memory in bf16, making high-rank DoRA costly and often infeasible on common single-GPU setups once hundreds of adapted modules and checkpointing are involved. We present two systems contributions. A factored norm decomposes the squared norm into base, cross, and Gram terms computable through O(d_out r + r^2) intermediates, eliminating the dense product. Fused Triton kernels collapse the four-kernel DoRA composition into a single pass, reducing memory traffic by about 4x and using a numerically stable form that avoids catastrophic cancellation in the near-unity rescaling regime where magnitude scales concentrate in practice. Across six 8-32B vision-language models (VLMs) on three NVIDIA GPUs (RTX 6000 PRO, H200, B200) at r = 384 in bf16, the fused implementation is 1.5-2.0x faster than Hugging Face PEFT's DoRA implementation for inference and 1.5-1.9x faster for gradient computation (optimizer step excluded), with up to 7 GB lower peak VRAM. Microbenchmarks on six GPUs spanning four architecture generations (L40S, A100, RTX 6000 PRO, H200, B200, B300) confirm 1.5-2.7x compose-kernel speedup. Final-logit cosine similarity exceeds 0.9999 across all model/GPU pairs, and multi-seed training curves match within 7.1 x 10^-4 mean per-step loss delta over 2000 steps.

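Motivation and Objectives

  • Address the memory and speed bottlenecks of high-rank DoRA in parameter-efficient fine-tuning.
  • Develop a memory-efficient factored form of the squared row-wise norm ||W + sBA||_row^2 that avoids materializing the dense B@A.
  • Design fused Triton kernels that collapse the multi-kernel DoRA composition into a single forward/backward pass.
  • Provide a runtime dispatch policy that stays compatible with common PEFT frameworks and distributed training stacks.
  • Empirically validate memory, speed, fidelity, and convergence across multiple GPU architectures and model sizes.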
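
The factored form of ||W + sBA||_row^2 follows from expanding the squared norm of a single row. The block below is a sketch of that expansion, assuming W is d_out × d_in, B is d_out × r, A is r × d_in, s is a scalar, and w_i, b_i, u_i denote the i-th rows of W, B, and U. The symbols U and G are introduced here for illustration and may not match the paper's own notation.

```latex
\begin{aligned}
\lVert w_i + s\, b_i A \rVert_2^2
  &= \underbrace{\lVert w_i \rVert_2^2}_{\text{base}}
   + \underbrace{2s\, \langle b_i,\, u_i \rangle}_{\text{cross}}
   + \underbrace{s^2\, b_i\, G\, b_i^{\top}}_{\text{Gram}}, \\
U &= W A^{\top} \in \mathbb{R}^{d_{\text{out}} \times r}, \qquad
G = A A^{\top} \in \mathbb{R}^{r \times r}.
\end{aligned}
```

Only U, G, and the product BG (also d_out × r) need to be held, which is consistent with the O(d_out r + r^2) intermediate size quoted in the abstract; the d_out × d_in product BA never appears.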

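Proposed Method

  • Derive a factorization of the row-wise norm into base, cross, and Gram (BA-norm) terms, computable without materializing B@A through O(d_out r + r^2) intermediates.
  • Assemble the per-row norm using chunked fp32 accumulation with explicit precision casts (Eqs. 2-6).
  • Implement four Triton kernels (two forward/backward, one norm-assembly, one compose), fusing the DoRA composition into a single pass with a numerically stable form.
  • Provide a three-tier runtime dispatch that selects fused backward (training), fused forward (inference), or an eager fallback depending on hardware and tensor shape.
  • Ensure compatibility with torch.compile, gradient checkpointing, DeepSpeed ZeRO, and FSDP, and handle the magnitude division so that precision is preserved.
  • Run microbenchmarks on six GPUs and model-level benchmarks on six 8-32B vision-language models across three GPUs.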
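
To make the factored norm concrete, here is a minimal sketch that compares it against the dense computation on tiny matrices. It uses plain Python lists so it runs without dependencies; a real implementation would use tensor operations and the chunked fp32 accumulation described above. The function names and the test shapes are illustrative assumptions, not the paper's code.

```python
import math
import random


def matmul(X, Y):
    """Naive matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def row_norms_dense(W, B, A, s):
    """Reference path: materializes the dense [d_out, d_in] product B @ A."""
    BA = matmul(B, A)
    return [
        math.sqrt(sum((w + s * d) ** 2 for w, d in zip(w_row, ba_row)))
        for w_row, ba_row in zip(W, BA)
    ]


def row_norms_factored(W, B, A, s):
    """Factored path: ||w_i||^2 + 2s <b_i, (W A^T)_i> + s^2 b_i (A A^T) b_i^T.

    Only [d_out, r] and [r, r] intermediates are formed.
    """
    At = transpose(A)
    WAt = matmul(W, At)   # [d_out, r], feeds the cross term
    G = matmul(A, At)     # [r, r] Gram matrix of A
    BG = matmul(B, G)     # [d_out, r], feeds the Gram term
    norms = []
    for w, b, wa, bg in zip(W, B, WAt, BG):
        base = sum(x * x for x in w)
        cross = sum(x * y for x, y in zip(b, wa))
        gram = sum(x * y for x, y in zip(b, bg))
        sq = base + 2.0 * s * cross + s * s * gram
        norms.append(math.sqrt(max(sq, 0.0)))  # clamp tiny negative round-off
    return norms


# Hypothetical small shapes for illustration only.
random.seed(0)
d_out, d_in, r, s = 6, 8, 3, 0.5
W = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
B = [[random.uniform(-1, 1) for _ in range(r)] for _ in range(d_out)]
A = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(r)]

dense = row_norms_dense(W, B, A, s)
factored = row_norms_factored(W, B, A, s)
```

The two results agree up to floating-point round-off, while the factored path never builds a d_out × d_in temporary. The clamp before the square root is a defensive choice for this sketch, since the three terms can cancel slightly below zero in low precision.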
Figure 1: The stable compose form achieves $3.0\times$ lower peak error near $g \approx 1$ (bf16, $d_{\text{out}}=8192$, $d_{\text{in}}=2048$). The naive form $g \odot (s \cdot \text{lora} + \text{base}) - \text{base}$ exhibits catastrophic cancellation; the stable form and fused kernel both remain near the b
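
The cancellation in the naive form can be reproduced with a small bf16 emulation. The sketch below rounds every intermediate to bf16 and compares the naive form g * (s * lora + base) - base with one algebraically equivalent rearrangement, (g - 1) * base + g * (s * lora). The caption does not spell out the stable form, so this rearrangement is an assumption, as are the helper names and sample values.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite, moderate-magnitude float to the nearest bfloat16 value.

    bfloat16 is the top 16 bits of float32; this applies round-to-nearest-even.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def compose_naive(g, s, lora, base):
    """g * (s * lora + base) - base, rounding each step to bf16."""
    t = to_bf16(s * lora)
    u = to_bf16(t + base)   # the small lora term is absorbed into base here
    v = to_bf16(g * u)
    return to_bf16(v - base)  # subtracting two nearly equal numbers


def compose_stable(g, s, lora, base):
    """(g - 1) * base + g * (s * lora), rounding each step to bf16 (assumed form)."""
    a = to_bf16(to_bf16(g - 1.0) * base)
    b = to_bf16(g * to_bf16(s * lora))
    return to_bf16(a + b)


def reference(g, s, lora, base):
    """Same expression evaluated in float64 on the bf16-rounded inputs."""
    return g * (s * lora + base) - base


# Illustrative values: g close to 1, a small LoRA output, a range of base outputs.
g = to_bf16(1.0078125)  # 1 + 2**-7, exactly representable in bf16
s = 1.0
lora = to_bf16(0.01)
bases = [0.5, 1.0, 3.0, 7.0, 20.0]

naive_err = max(abs(compose_naive(g, s, lora, b) - reference(g, s, lora, b)) for b in bases)
stable_err = max(abs(compose_stable(g, s, lora, b) - reference(g, s, lora, b)) for b in bases)
```

On these inputs the stable form's worst-case error is well below the naive form's, because it never adds the small delta to the much larger base and subtracts it back out. The exact ratio depends on the values chosen and is not meant to reproduce the figure's number.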

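Experimental Results

Research Questions

  • RQ1: Can the DoRA row-wise norm be computed without materializing the dense B@A, enabling scalable high-rank adaptation?
  • RQ2: Do the fused kernels reduce memory bandwidth consumption and raise DoRA forward/backward throughput across diverse GPUs and model sizes?
  • RQ3: Compared with a standard DoRA implementation, how does the factored norm affect numerical stability and training convergence?
  • RQ4: How does the proposed system integrate with existing PEFT frameworks and distributed training toolchains while preserving fidelity?
  • RQ5: What are the memory and speed trade-offs across ranks and model sizes (r = 384-768, 8-32B models)?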

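Key Findings

  • The factored norm reduces the intermediate memory of the norm computation from O(d_out d_in) to O(d_out r + r^2) and removes the need to materialize B@A.
  • The fused Triton kernels collapse the four DoRA operations into a single pass, yielding a 1.5-2.7× forward (compose-kernel) speedup and a 1.06-1.23× backward speedup, with up to 7 GB lower peak VRAM.
  • Across six 8-32B vision-language models on three GPUs, gradient computation is 1.46-1.87× faster than the Hugging Face PEFT DoRA baseline and 1.18-1.24× faster than the authors' own eager baseline; inference is 1.5-2.0× faster than Hugging Face PEFT.
  • Across all model/GPU pairs, the final-logit cosine similarity between the fused implementation and the eager baseline exceeds 0.9999, indicating high fidelity.
  • Across multiple seeds, training with the fused kernels matches eager training within a mean per-step loss delta of 7.1e-4 over 2000 steps.
  • Memory profiling shows that the fused backward lowers peak VRAM, allowing larger configurations to be trained within the same memory budget.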
Figure 2: Three-tier dispatch: fused backward for training (Tier 1), fused forward for inference (Tier 2), eager fallback for CPU, no-Triton, or sub-crossover shapes (Tier 3).
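
The three tiers in the caption can be read as a simple decision rule. The sketch below is one way such a rule might look; `DispatchContext`, `select_tier`, and the `CROSSOVER_ELEMENTS` threshold are hypothetical names and values chosen for illustration, not the paper's API or its measured crossover point.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    FUSED_BACKWARD = 1  # training: fused forward + backward kernels
    FUSED_FORWARD = 2   # inference: fused forward kernel only
    EAGER = 3           # fallback: plain framework ops


@dataclass(frozen=True)
class DispatchContext:
    on_gpu: bool            # tensors live on a GPU
    triton_available: bool  # Triton can be imported and used
    requires_grad: bool     # gradients are needed (training)
    num_elements: int       # size of the tensor being composed


# Hypothetical crossover: below this size, launch overhead is assumed to
# outweigh the benefit of fusion. The real threshold would be measured.
CROSSOVER_ELEMENTS = 1 << 20


def select_tier(ctx: DispatchContext) -> Tier:
    """Pick an execution tier following the caption's three cases."""
    if not ctx.on_gpu or not ctx.triton_available:
        return Tier.EAGER
    if ctx.num_elements < CROSSOVER_ELEMENTS:
        return Tier.EAGER
    return Tier.FUSED_BACKWARD if ctx.requires_grad else Tier.FUSED_FORWARD


training = select_tier(DispatchContext(True, True, True, 1 << 24))
inference = select_tier(DispatchContext(True, True, False, 1 << 24))
cpu = select_tier(DispatchContext(False, True, True, 1 << 24))
```

Checking the fallback conditions first keeps the eager path as the safe default, so any unsupported environment or small shape never reaches a fused kernel.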


This review was generated by AI and reviewed by human editors.