QUICK REVIEW

[Paper Review] PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models

Fan-Xu Meng, Zhaohui Wang|arXiv (Cornell University)|Apr 3, 2024

Topic Modeling|Citations: 6

One-sentence summary

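PiSSA initializes trainable adapters from the principal singular values and singular vectors of the pretrained weights and freezes the residual components, matching or surpassing full fine-tuning performance with far fewer trainable parameters.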

ABSTRACT

To parameter-efficiently fine-tune (PEFT) large language models (LLMs), the low-rank adaptation (LoRA) method approximates the model changes $\Delta W \in \mathbb{R}^{m \times n}$ through the product of two matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$, where $r \ll \min(m, n)$, $A$ is initialized with Gaussian noise, and $B$ with zeros. LoRA freezes the original model $W$ and updates the "Noise & Zero" adapter, which may lead to slow convergence. To overcome this limitation, we introduce Principal Singular values and Singular vectors Adaptation (PiSSA). PiSSA shares the same architecture as LoRA, but initializes the adaptor matrices $A$ and $B$ with the principal components of the original matrix $W$, and put the remaining components into a residual matrix $W^{res} \in \mathbb{R}^{m \times n}$ which is frozen during fine-tuning. Compared to LoRA, PiSSA updates the principal components while freezing the "residual" parts, allowing faster convergence and enhanced performance. Comparative experiments of PiSSA and LoRA across 12 different models, ranging from 184M to 70B, encompassing 5 NLG and 8 NLU tasks, reveal that PiSSA consistently outperforms LoRA under identical experimental setups. On the GSM8K benchmark, Mistral-7B fine-tuned with PiSSA achieves an accuracy of 72.86%, surpassing LoRA's 67.7% by 5.16%. Due to the same architecture, PiSSA is also compatible with quantization to further reduce the memory requirement of fine-tuning. Compared to QLoRA, QPiSSA exhibits smaller quantization errors in the initial stages. Fine-tuning LLaMA-3-70B on GSM8K, QPiSSA attains an accuracy of 86.05%, exceeding the performances of QLoRA at 81.73%. Leveraging a fast SVD technique, PiSSA can be initialized in only a few seconds, presenting a negligible cost for transitioning from LoRA to PiSSA. Code is available at https://github.com/GraphPKU/PiSSA.

Motivation and objectives

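Exploit the low intrinsic dimension of large language models to reduce the cost of fine-tuning them.
Introduce PiSSA, which initializes adapters from the principal SVD components of the pretrained weights.
Show that PiSSA matches or surpasses full fine-tuning performance with fewer trainable parameters.
Show that PiSSA is more robust to quantization and converges faster than LoRA.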

Proposed method

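Compute the economy-size SVD of each weight matrix W in the self-attention and MLP layers.
Split W by singular value into a principal component (W^{pri}) and a residual component (W^{res}).
Initialize the adapter matrices A and B from the top-r singular triplets: A = U[:, :r] diag(S[:r]^{1/2}), B = diag(S[:r]^{1/2}) V[:, :r]^T.
Build the residual matrix W^{res} from the remaining singular triplets and keep it frozen during fine-tuning.
Express W as W^{res} + AB and train A and B while W^{res} stays frozen.
Optionally use a fast SVD to speed up initialization; the paper reports that performance is maintained.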
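
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name pissa_init, the toy matrix shape, and the rank r=8 are all hypothetical, and a real setup would apply this per layer inside a training framework.

```python
import numpy as np

def pissa_init(W, r):
    """Split W into trainable factors A, B and a frozen residual W_res.

    Hypothetical helper for illustration; not taken from the PiSSA codebase.
    """
    # Economy-size SVD: W = U @ diag(S) @ Vt, singular values in descending order.
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_S = np.sqrt(S[:r])
    A = U[:, :r] * sqrt_S            # (m, r): U[:, :r] @ diag(S[:r]^{1/2})
    B = sqrt_S[:, None] * Vt[:r, :]  # (r, n): diag(S[:r]^{1/2}) @ V[:, :r]^T
    # Residual from the remaining singular triplets; frozen during fine-tuning.
    W_res = (U[:, r:] * S[r:]) @ Vt[r:, :]
    return A, B, W_res

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 48))    # toy stand-in for a pretrained weight
A, B, W_res = pissa_init(W, r=8)

# At initialization, W_res + A @ B reproduces W up to floating-point error.
recon_error = np.abs(W_res + A @ B - W).max()
print(A.shape, B.shape, W_res.shape)
```

Splitting the square root of each singular value evenly between A and B keeps the two factors on a similar scale, and the model's output at step zero is unchanged because W^{res} + AB equals W.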

Experimental results

Research questions

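RQ1: Does PiSSA outperform LoRA and full fine-tuning across diverse models and tasks?
RQ2: How does PiSSA interact with quantization-based PEFT (e.g., QLoRA, LoftQ) in terms of quantization error and final performance?
RQ3: How does the adapter rank affect convergence speed and generalization?
RQ4: Can fast SVD substantially speed up initialization without hurting performance?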
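
To make the "quantization error" in RQ2 concrete, the toy sketch below compares quantizing W directly (QLoRA-style, where the adapter product starts at zero) with quantizing only the residual while the principal part AB stays in full precision (QPiSSA-style). Everything here is an assumption for illustration: fake_quantize is a hypothetical per-tensor absmax uniform quantizer rather than the NF4 scheme used by QLoRA, the weight is synthetic with a deliberately strong low-rank component, and the Frobenius norm is used as the error measure.

```python
import numpy as np

def fake_quantize(M, bits=4):
    """Hypothetical per-tensor absmax uniform quantizer (a stand-in, not NF4)."""
    levels = 2 ** (bits - 1) - 1                 # 7 levels per sign for 4-bit
    scale = np.abs(M).max() / levels
    return np.clip(np.round(M / scale), -levels, levels) * scale

rng = np.random.default_rng(0)
# Synthetic weight: strong rank-4 structure plus small noise (an assumption).
W = (rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))
     + 0.1 * rng.standard_normal((64, 64)))

r = 4
U, S, Vt = np.linalg.svd(W, full_matrices=False)
AB = (U[:, :r] * S[:r]) @ Vt[:r, :]              # principal part, kept in full precision
W_res = W - AB                                   # residual, the only part quantized

# QLoRA-style: the whole weight is quantized; the adapter contributes nothing at init.
err_qlora = np.linalg.norm(W - fake_quantize(W))
# QPiSSA-style: only the residual is quantized; AB is added back exactly.
err_qpissa = np.linalg.norm(W - (fake_quantize(W_res) + AB))
print(err_qpissa < err_qlora)
```

In this toy setting the residual has a much narrower value range than W, so the same number of quantization levels covers it more finely. How large the effect is on real LLM weights depends on their spectrum and on the quantizer, which this sketch does not model.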

Key findings

Model	Strategy	Trainable parameters	GSM8K	MATH	HumanEval	MBPP	MT-Bench
LLaMA 2-7B	Full fine-tuning	6738M	49.05	7.22	21.34	35.59	4.91
LLaMA 2-7B	LoRA	320M	42.3	5.5	18.29	35.34	4.58
LLaMA 2-7B	PiSSA	320M	53.07	7.44	21.95	37.09	4.87
Mistral-7B	Full fine-tuning	7242M	67.02	18.6	45.12	51.38	4.95
Mistral-7B	LoRA	168M	67.7	19.68	43.9	58.39	4.9
Mistral-7B	PiSSA	168M	72.86	21.54	46.95	62.66	5.34
Gemma-7B	Full fine-tuning	8538M	71.34	22.74	46.95	55.64	5.4
Gemma-7B	LoRA	200M	74.9	31.28	53.66	65.41	4.98
Gemma-7B	PiSSA	200M	77.94	31.94	54.27	66.17	5.64
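
As a quick arithmetic check on the table above, the snippet below recomputes PiSSA's GSM8K gain over LoRA for each model. The numbers are copied from the table; the dictionary layout and variable names are only for illustration.

```python
# GSM8K accuracy (%) copied from the table above, as (LoRA, PiSSA) pairs.
gsm8k = {
    "LLaMA 2-7B": (42.3, 53.07),
    "Mistral-7B": (67.7, 72.86),
    "Gemma-7B": (74.9, 77.94),
}

# Gain in percentage points, rounded to the table's two-decimal precision.
gains = {model: round(pissa - lora, 2) for model, (lora, pissa) in gsm8k.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain:.2f} points")
```

The gains come out to 10.77, 5.16, and 3.04 percentage points respectively, so the 5.16 figure quoted for Mistral-7B is a difference in percentage points rather than a relative improvement.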

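PiSSA consistently outperforms LoRA on every tested model (LLaMA 2-7B, Mistral-7B-v0.1, Gemma-7B) and task.
On GSM8K with Mistral-7B, PiSSA reaches 72.86% accuracy, 5.16 percentage points above LoRA's 67.70%.
PiSSA performs strongly on GSM8K and MATH, surpassing full fine-tuning on both benchmarks for all three models in the table while training far fewer parameters.
PiSSA reduces the quantization error of 4-bit LLaMA 2-7B by 18.97% and improves downstream fine-tuning performance.
In the rank ablation, PiSSA matches or exceeds full-parameter fine-tuning with fewer trainable parameters and converges faster than LoRA.
Using fast SVD for initialization greatly shortens initialization time compared with exact SVD while maintaining competitive performance.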
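
The review does not say which fast SVD the authors use. One common choice is randomized SVD with a few power iterations, and the sketch below assumes that technique purely for illustration. The function name randomized_svd, the oversample and n_iter parameters, and the synthetic test matrix are all hypothetical.

```python
import numpy as np

def randomized_svd(W, r, oversample=8, n_iter=2, seed=0):
    """Approximate the top-r singular triplets of W with a randomized range finder.

    Illustrative sketch of a generic randomized SVD; assumed, not confirmed,
    to resemble the fast SVD mentioned in the paper.
    """
    rng = np.random.default_rng(seed)
    m, n = W.shape
    k = min(r + oversample, m, n)
    # Random projection captures the dominant column space of W.
    Q = np.linalg.qr(W @ rng.standard_normal((n, k)))[0]
    # Power iterations sharpen the separation between large and small singular values.
    for _ in range(n_iter):
        Q = np.linalg.qr(W.T @ Q)[0]
        Q = np.linalg.qr(W @ Q)[0]
    # Exact SVD of the small (k, n) projected matrix.
    U_small, S, Vt = np.linalg.svd(Q.T @ W, full_matrices=False)
    U = Q @ U_small
    return U[:, :r], S[:r], Vt[:r, :]

rng = np.random.default_rng(0)
# Synthetic matrix with 8 dominant directions plus small noise (an assumption).
W = (rng.standard_normal((256, 8)) @ rng.standard_normal((8, 128))
     + 0.01 * rng.standard_normal((256, 128)))

r = 8
U_f, S_f, Vt_f = randomized_svd(W, r)
S_exact = np.linalg.svd(W, compute_uv=False)[:r]
max_rel_diff = np.max(np.abs(S_f - S_exact) / S_exact)
print(max_rel_diff)
```

The expensive factorization runs on a (k, n) matrix instead of the full (m, n) weight, which is where the time saving comes from. Accuracy depends on how quickly the singular values decay, so the agreement seen on this well-separated synthetic matrix should not be read as a guarantee for real LLM weights.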

This review was generated by AI and checked by a human editor.