QUICK REVIEW

[Paper Review] RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation

Mahdi Nikdan, Soroush Tabesh | arXiv (Cornell University) | Jan 9, 2024

Advanced Vision and Imaging | Cited by 6

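One-Sentence Summary

RoSA combines a low-rank adapter with a sparse adapter to reach accuracy close to full fine-tuning (FFT) while training only a small fraction of the parameters. At the same parameter budget it outperforms LoRA and sparse adaptation (SpA), and on some tasks it matches FFT. The work also provides efficient sparse GPU kernels and a quantized variant (QRoSA).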

ABSTRACT

We investigate parameter-efficient fine-tuning (PEFT) methods that can provide good accuracy under limited computational and memory budgets in the context of large language models (LLMs). We present a new PEFT method called Robust Adaptation (RoSA) inspired by robust principal component analysis that jointly trains $\textit{low-rank}$ and $\textit{highly-sparse}$ components on top of a set of fixed pretrained weights to efficiently approximate the performance of a full-fine-tuning (FFT) solution. Across a series of challenging generative tasks such as grade-school math and SQL query generation, which require fine-tuning for good performance, we show that RoSA outperforms LoRA, pure sparse fine-tuning, and alternative hybrid methods at the same parameter budget, and can even recover the performance of FFT on some tasks. We provide system support for RoSA to complement the training algorithm, specifically in the form of sparse GPU kernels which enable memory- and computationally-efficient training, and show that it is also compatible with low-precision base weights, resulting in the first joint representation combining quantization, low-rank and sparse approximations. Our code is available at https://github.com/IST-DASLab/RoSA.

Motivation and Objectives

Motivate PEFT under limited compute/memory for large language models.
Propose a robust adaptation method combining low-rank and sparse components to better approximate FFT updates.
Develop efficient system implementations for sparse and low-rank adapters on GPUs.
Demonstrate that RoSA outperforms other PEFT methods at the same parameter budget and can recover FFT performance on some challenging tasks.
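
The low-rank plus sparse idea behind these objectives can be written compactly. The block below is a sketch of the formulation implied by the abstract. The symbols $W$, $\Delta^L$, $\Delta^S$, $B$, $A$, $r$, and $d$ are notation chosen here for illustration and may differ from the paper's.

$$
W' = W + \Delta^L + \Delta^S, \qquad
\Delta^L = BA,\quad B \in \mathbb{R}^{m \times r},\ A \in \mathbb{R}^{r \times n}, \qquad
\|\Delta^S\|_0 \le d \cdot mn
$$

Here $W \in \mathbb{R}^{m \times n}$ is the frozen pretrained weight, $r \ll \min(m, n)$ is the adapter rank, and $d \ll 1$ is the density of the sparse adapter. This mirrors robust principal component analysis, which splits a matrix into a low-rank part plus a sparse part: the low-rank term is meant to capture the bulk of the FFT update, and the sparse term the few entries that a low-rank term alone would miss.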

Proposed Method

Formulate RoSA as joint optimization of a low-rank adapter and a sparse adapter added to pretrained weights.
Generate sparsity masks using a data-driven TopK-based mask generation procedure (Algorithm 1).
Train both low-rank and sparse adapters in parallel while keeping base weights frozen.
Integrate a sparse-dense backward pass with a specialized SDDMM kernel to exploit sparsity structure.
Extend RoSA with QRoSA by combining RoSA with weight quantization (QLoRA-compatible).
Provide a PyTorch-based system implementation with CSR-sparse storage and efficient kernel support for GPUs.
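
The first three steps above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, scoring positions by accumulated squared gradients is an assumption made here (the summary only says the masks are data-driven and TopK-based), and the sparse adapter is stored as a plain dictionary rather than in CSR format.

```python
import heapq

def accumulate_scores(grads):
    """Sum squared gradients elementwise over a list of same-shaped matrices.
    (Assumed scoring rule; the paper's Algorithm 1 may differ.)"""
    rows, cols = len(grads[0]), len(grads[0][0])
    scores = [[0] * cols for _ in range(rows)]
    for g in grads:
        for i in range(rows):
            for j in range(cols):
                scores[i][j] += g[i][j] ** 2
    return scores

def topk_mask(scores, k):
    """Return the set of (row, col) positions holding the k largest scores."""
    flat = ((v, (i, j)) for i, row in enumerate(scores) for j, v in enumerate(row))
    return {pos for _, pos in heapq.nlargest(k, flat)}

def matvec(M, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def rosa_forward(W, B, A, S, x):
    """Compute y = W x + B (A x) + S x.
    W is the frozen base weight, B and A form the low-rank adapter,
    and S maps (row, col) -> value for the sparse adapter."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    out = [a + b for a, b in zip(base, low_rank)]
    for (i, j), v in S.items():
        out[i] += v * x[j]
    return out

# Toy usage: pick a 2-entry mask from two gradient samples, then run a forward pass.
grads = [[[1, -3], [0, 2]], [[0, 1], [4, -2]]]
mask = topk_mask(accumulate_scores(grads), k=2)
y = rosa_forward(W=[[1, 0], [0, 1]], B=[[1], [0]], A=[[0, 2]], S={(1, 0): 3}, x=[1, 1])
```

During training, only `B`, `A`, and the values of `S` on the fixed mask would receive updates, while `W` stays frozen.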

Experimental Results

Research Questions

RQ1: Can a low-rank plus sparse adaptation better approximate FFT updates than purely low-rank methods like LoRA for complex downstream tasks?
RQ2: Do RoSA adapters provide higher accuracy at the same parameter/memory budget compared with LoRA and SpA across diverse tasks?
RQ3: Is RoSA compatible with weight quantization to further improve efficiency without sacrificing accuracy?
RQ4: What is the practical system performance of RoSA on GPU hardware with sparse backward/forward kernels?
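
To make the "same parameter budget" comparison in RQ2 concrete, the following sketch counts trainable values for a single weight matrix. The helper names and the layer size, rank, and density are hypothetical values picked for illustration, not settings reported in the paper, and the count ignores the index storage a sparse format also needs.

```python
def lora_params(m, n, rank):
    """Trainable values in a rank-`rank` adapter B (m x rank) and A (rank x n)."""
    return rank * (m + n)

def sparse_params(m, n, density):
    """Trainable values in a sparse adapter with the given fraction of nonzeros."""
    return round(density * m * n)

def rosa_params(m, n, rank, density):
    """Trainable values when a low-rank and a sparse adapter are combined."""
    return lora_params(m, n, rank) + sparse_params(m, n, density)

# Hypothetical 4096 x 4096 layer: a rank-32 low-rank adapter versus a
# rank-16 adapter plus a sparse adapter with 1/128 of the entries trainable.
lora_only = lora_params(4096, 4096, rank=32)
combined = rosa_params(4096, 4096, rank=16, density=1 / 128)
```

With these illustrative numbers both configurations train the same number of values, so any accuracy difference would come from how the budget is split rather than from its size.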

Key Findings

RoSA outperforms LoRA and Sparse Adaptation at the same budget across multiple tasks (GSM8k, ViGGO, SQL).
RoSA can match or even surpass FFT accuracy in single-epoch experiments on several datasets.
Extended training shows RoSA matching or exceeding FFT on GSM8k and ViGGO, and generally outperforming alternatives across budgets.
RoSA supports a joint representation with quantization (QRoSA) that further reduces memory while maintaining or improving accuracy on certain tasks.
Mask generation via data-driven gradient-based TopK methods yields effective sparsity patterns that outperform several alternative masking strategies.
System-level RoSA kernels provide memory- and compute-efficient backpropagation for sparse adapters, achieving speedups over prior sparse kernels.
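
The kernel-related finding rests on SDDMM (sampled dense-dense matrix multiplication): the gradient of the sparse adapter is a product of two dense matrices (output gradients and input activations), but it is only needed at the masked positions. The sketch below shows what such an operation computes over a CSR sparsity pattern. It is a pure-Python illustration with a hypothetical function name, not the paper's GPU kernel.

```python
def sddmm_csr(row_ptr, col_idx, X, Y):
    """Compute (X @ Y)[i][j] only at the positions (i, j) stored in a CSR pattern.

    row_ptr[i]:row_ptr[i + 1] indexes the slice of col_idx holding row i's columns.
    Returns the sampled values in CSR order."""
    inner = len(Y)  # shared inner dimension of X (rows x inner) and Y (inner x cols)
    values = []
    for i in range(len(row_ptr) - 1):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[p]
            values.append(sum(X[i][t] * Y[t][j] for t in range(inner)))
    return values

# Toy usage: a 2 x 3 pattern with nonzeros at (0, 2), (1, 0), and (1, 1).
row_ptr = [0, 1, 3]
col_idx = [2, 0, 1]
X = [[1, 2], [3, 4]]        # 2 x 2
Y = [[1, 0, 2], [0, 1, 3]]  # 2 x 3
sampled = sddmm_csr(row_ptr, col_idx, X, Y)
```

The work is proportional to the number of stored positions times the inner dimension, rather than to the full dense product, which is why exploiting the mask pays off when the adapter is highly sparse.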

This review was generated by AI and reviewed by a human editor.