QUICK REVIEW

[Paper Review] LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning

Longteng Zhang, Lin Zhang|arXiv (Cornell University)|Aug 7, 2023

Topic Modeling | Cited by 10

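ONE-SENTENCE SUMMARY

LoRA-FA freezes the pretrained weight W and the down-projection A and updates only the up-projection B, so the weight change stays in a low-rank space. This sharply reduces activation memory and the number of trainable parameters while reaching accuracy close to full fine-tuning and LoRA.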

ABSTRACT

The low-rank adaptation (LoRA) method can largely reduce the amount of trainable parameters for fine-tuning large language models (LLMs), however, it still requires expensive activation memory to update low-rank weights. Reducing the number of LoRA layers or using activation recomputation could harm the fine-tuning performance or increase the computational overhead. In this work, we present LoRA-FA, a memory-efficient fine-tuning method that reduces the activation memory without performance degradation and expensive recomputation. LoRA-FA chooses to freeze the projection-down weight of $A$ and update the projection-up weight of $B$ in each LoRA layer. It ensures the change of model weight reside in a low-rank space during LLMs fine-tuning, while eliminating the requirement to store full-rank input activations. We conduct extensive experiments across multiple model types (RoBERTa, T5, LLaMA) and model scales. Our results show that LoRA-FA can always achieve close fine-tuning accuracy across different tasks compared to full parameter fine-tuning and LoRA. Furthermore, LoRA-FA can reduce the overall memory cost by up to 1.4$\times$ compared to LoRA.

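MOTIVATION AND OBJECTIVES

Enable memory-efficient fine-tuning of large language models (LLMs) beyond the limits of standard LoRA.
Propose LoRA-FA: freeze W and A and update only B to reduce activation memory.
Show that LoRA-FA maintains accuracy comparable to conventional fine-tuning across models and tasks.
Demonstrate LoRA-FA's memory savings and robustness across model scales and tasks.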

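PROPOSED METHOD

Formalize LoRA-FA by freezing W and the down-projection A and updating only the up-projection B.
Express the weight change as ΔW = AB, where A is fixed and B is learned, which confines ΔW to the column space of A.
Analyze memory efficiency: the number of trainable parameters is n_r/2 = 9drL, half of LoRA's count n_r, and activation memory depends on the low-rank input XA.
Show that LoRA-FA only needs to store the low-rank input XA to compute the gradient of B, which reduces activation memory.
Discuss compatibility with other memory optimization techniques (quantization, sharding, selective recomputation).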
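
The forward and backward pass described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the row-vector convention Y = X(W + AB), a toy loss L = 0.5 * sum(Y^2), and small made-up dimensions. The names `forward`, `backward`, and `loss_fn` are hypothetical. The point it shows is that the gradient of B needs only the low-rank activation XA, so the full-rank input X does not have to be kept for the adapter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): tokens, input dim, output dim, rank.
n, d_in, d_out, r = 8, 16, 12, 2

X = rng.standard_normal((n, d_in))                   # layer input
W = rng.standard_normal((d_in, d_out))               # frozen pretrained weight
A = rng.standard_normal((d_in, r)) / np.sqrt(d_in)   # frozen down-projection
B = np.zeros((r, d_out))                             # trainable up-projection


def forward(X, W, A, B):
    """Compute Y = XW + (XA)B and return the low-rank activation XA."""
    XA = X @ A                     # shape (n, r), the only tensor saved for dB
    Y = X @ W + XA @ B
    return Y, XA


def backward(dY, XA, W, A, B):
    """Gradients for B and for the input, given the upstream gradient dY."""
    dB = XA.T @ dY                         # shape (r, d_out); X is not needed
    dX = dY @ W.T + (dY @ B.T) @ A.T       # passed on to earlier layers
    return dB, dX


def loss_fn(B_):
    """Toy loss (assumption) used only to check the gradient numerically."""
    Y_, _ = forward(X, W, A, B_)
    return 0.5 * np.sum(Y_ ** 2)


Y, XA = forward(X, W, A, B)
dB, dX = backward(Y, XA, W, A, B)          # dL/dY = Y for the toy loss

# Central finite differences as an independent check of dB.
eps = 1e-5
dB_numeric = np.zeros_like(B)
for i in range(r):
    for j in range(d_out):
        E = np.zeros_like(B)
        E[i, j] = eps
        dB_numeric[i, j] = (loss_fn(B + E) - loss_fn(B - E)) / (2 * eps)
```

In this sketch the saved activation has shape (n, r) instead of (n, d_in), which is where the memory saving comes from. W and A receive no gradient because they are frozen.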

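EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can LoRA-FA reach fine-tuning performance close to full-parameter fine-tuning and LoRA on encoder-only, encoder-decoder, and decoder-only models?
RQ2: To what extent does freezing A and W reduce activation memory and overall memory usage during fine-tuning?
RQ3: How robust is LoRA-FA to hyperparameter choices (such as rank r and learning rate η) across model families and tasks?
RQ4: In practice, how does LoRA-FA interact with other memory optimization techniques (quantization, sharding, activation recomputation)?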

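KEY FINDINGS

LoRA-FA reaches accuracy close to full-parameter fine-tuning and LoRA on RoBERTa, T5, and LLaMA tasks.
LoRA-FA reduces the trainable parameters to about 1.5% for RoBERTa-base and 1.0–1.6% for RoBERTa-large, depending on the setting.
LoRA-FA lowers peak GPU memory usage relative to LoRA and full fine-tuning (FT); reported examples include reducing memory for LLaMA-7B from 56GB to 27.5GB, with average savings of 3–7GB in several configurations.
For the linear layers of LLaMA-65B at rank 4, LoRA-FA reduces activation storage by up to 2048x.
LoRA-FA is robust to hyperparameters and shows performance patterns similar to LoRA as the rank and learning rate vary.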
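
The 2048x figure and the 9drL parameter count can be reproduced with simple arithmetic. The sketch below is an illustration under stated assumptions, not a calculation taken from the paper's code: it assumes a generic transformer block with four d x d attention projections and two MLP matrices of shape d x 4d and 4d x d, all carrying adapters, and it assumes LLaMA-65B has hidden size d = 8192 and L = 80 layers. LLaMA's actual MLP layout differs, so the parameter totals are illustrative only. The function names are hypothetical.

```python
def lora_trainable_params(d, r, L):
    """LoRA trains both A (in x r) and B (r x out) in every adapted matrix."""
    attention = 4 * (d * r + r * d)     # four d x d projections -> 8dr
    mlp_up = d * r + r * 4 * d          # d x 4d matrix -> 5dr
    mlp_down = 4 * d * r + r * d        # 4d x d matrix -> 5dr
    return (attention + mlp_up + mlp_down) * L   # 18drL


def lora_fa_trainable_params(d, r, L):
    """LoRA-FA trains only B (r x out); A stays frozen."""
    attention = 4 * (r * d)             # 4dr
    mlp_up = r * 4 * d                  # 4dr
    mlp_down = r * d                    # dr
    return (attention + mlp_up + mlp_down) * L   # 9drL


def activation_reduction(tokens, d, r):
    """Ratio of elements stored for X (tokens x d) versus XA (tokens x r)."""
    return (tokens * d) / (tokens * r)


# Assumed LLaMA-65B shape: hidden size 8192, 80 layers; rank 4 as in the finding.
d, r, L = 8192, 4, 80
n_lora = lora_trainable_params(d, r, L)
n_lora_fa = lora_fa_trainable_params(d, r, L)
ratio = activation_reduction(tokens=1024, d=d, r=r)   # 8192 / 4 = 2048
```

The ratio does not depend on the number of tokens, only on d / r, which is why the saving is largest for wide models fine-tuned at a small rank.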

This review was generated by AI and reviewed by a human editor.