QUICK REVIEW

[Paper Review] Compacter: Efficient Low-Rank Hypercomplex Adapter Layers

Rabeeh Karimi Mahabadi, James Henderson|arXiv (Cornell University)|Jun 8, 2021

Topic Modeling | References: 64 | Citations: 82

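One-Line Summary

Compacter introduces low-rank hypercomplex adapter layers for fine-tuning large language models, matching or exceeding full fine-tuning on task performance while training only a very small fraction of the parameters (about 0.047%).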

ABSTRACT

Adapting large-scale pretrained language models to downstream tasks via fine-tuning is the standard method for achieving state-of-the-art performance on NLP benchmarks. However, fine-tuning all weights of models with millions or billions of parameters is sample-inefficient, unstable in low-resource settings, and wasteful as it requires storing a separate copy of the model for each task. Recent work has developed parameter-efficient fine-tuning methods, but these approaches either still require a relatively large number of parameters or underperform standard fine-tuning. In this work, we propose Compacter, a method for fine-tuning large-scale language models with a better trade-off between task performance and the number of trainable parameters than prior work. Compacter accomplishes this by building on top of ideas from adapters, low-rank optimization, and parameterized hypercomplex multiplication layers. Specifically, Compacter inserts task-specific weight matrices into a pretrained model's weights, which are computed efficiently as a sum of Kronecker products between shared "slow" weights and "fast" rank-one matrices defined per Compacter layer. By only training 0.047% of a pretrained model's parameters, Compacter performs on par with standard fine-tuning on GLUE and outperforms standard fine-tuning on SuperGLUE and low-resource settings. Our code is publicly available at https://github.com/rabeehk/compacter.

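Motivation and Objectives

Motivate memory- and parameter-efficient fine-tuning of large pretrained language models (PLMs).
Develop adapters that reduce the number of trainable parameters while maintaining or improving task performance.
Use Kronecker/low-rank decompositions and hypercomplex multiplication to build compact adapter layers.
Evaluate empirically against strong baselines on GLUE and SuperGLUE and analyze the efficiency trade-offs.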

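Proposed Method

Insert task-specific weight matrices into the pretrained model's weights, computed as a sum of Kronecker products between shared "slow" weights and "fast" rank-one matrices defined per Compacter layer.
Use a low-rank parameterization of the fast components to reduce adapter parameters to O(k+d), compared with O(kd) for standard adapters.
Share the A_i matrices across all layers, keep B_i specific to each layer, and further factor each B_i as a rank-r product s_i t_i^T (typically r=1).
Replace the adapter's down-projection and up-projection with LPHM (low-rank parameterized hypercomplex multiplication) layers.
Keep the pretrained model frozen during training and update only the layer normalizations and the adapters (as with standard adapters).
Optionally evaluate Compacter ++, which removes the Compacter layer after self-attention in each block for a further parameter reduction.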
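
The weight construction described in the list above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function names, the tensor layouts, the ReLU nonlinearity, and the omission of bias terms are all assumptions made for brevity. It builds W as a sum of Kronecker products between shared matrices A_i and per-layer low-rank factors s_i t_i^T, then uses two such weights as an adapter's down- and up-projection with a residual connection.

```python
import numpy as np

def lphm_weight(A, s, t):
    """Build W = sum_i kron(A_i, s_i @ t_i).

    A: (n, n, n)       shared "slow" matrices A_i, each n x n
    s: (n, k // n, r)  per-layer "fast" left factors s_i
    t: (n, r, d // n)  per-layer "fast" right factors t_i
    Returns W with shape (k, d).
    """
    n = A.shape[0]
    B = s @ t  # batched matmul -> (n, k // n, d // n); each B_i has rank <= r
    return sum(np.kron(A[i], B[i]) for i in range(n))

def compacter_adapter(x, A, down, up):
    """Hypothetical adapter forward pass: x + up(act(down(x))).

    x:    (batch, k) hidden states
    down: (s, t) factors producing a (k, d) down-projection
    up:   (s, t) factors producing a (d, k) up-projection
    ReLU is a stand-in nonlinearity (an assumption); biases are omitted.
    """
    W_down = lphm_weight(A, *down)
    W_up = lphm_weight(A, *up)
    h = np.maximum(x @ W_down, 0.0)
    return x + h @ W_up
```

In this sketch the same `A` would be passed to every adapter layer, while each layer holds its own `down` and `up` factors. That split between shared and per-layer tensors is what keeps the per-layer parameter count small.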

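Experimental Results

Research Questions

RQ1: Can Compacter match full fine-tuning while training far fewer parameters?
RQ2: How do LPHM-based adapters compare with standard adapters and other parameter-efficient fine-tuning methods on GLUE/SuperGLUE?
RQ3: What are the trade-offs among memory, training time, and accuracy for different n and rank settings?
RQ4: Is sharing A_i across layers and using low-rank B_i enough to maintain performance across tasks and resource settings?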

Key Findings

Method	#Total params	Trained params / per task	CoLA	SST-2	MRPC	QQP	STS-B	MNLI	QNLI	RTE	Avg
T5 BASE	8.0×1	100%	61.76	94.61	90.20/93.06	91.63/88.84	89.68/89.97	86.78	93.01	71.94	86.50
PHM-Adapter (n=12)	1.013	0.179%	57.35	94.50	91.67/93.86	90.25/87.05	90.45/90.84	85.97	92.92	75.54	86.40
Compacter (n=4)	1.004	0.073%	63.75	93.00	89.22/92.31	90.23/87.03	90.31/90.74	85.61	92.88	77.70	86.62
Compacter ++ (n=4)	1.002	0.047%	61.27	93.81	90.69/93.33	90.17/86.93	90.46/90.93	85.71	93.08	74.82	86.47

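Compacter matches or exceeds full fine-tuning on GLUE and SuperGLUE while training only 0.073% (Compacter) and 0.047% (Compacter ++) of the parameters.
Compacter outperforms several parameter-efficient baselines on the GLUE/SuperGLUE benchmarks, including Adapters, Adapter-LowRank, and prompt tuning.
LPHM-based layers reduce the parameter count to O(k+d), versus O(kd) for standard adapters, by sharing A_i across layers and factoring B_i into rank-1 factors.
In low-resource settings, Compacter and Compacter ++ can outperform standard fine-tuning in accuracy with far fewer trainable parameters.
Compacter ++ keeps the smallest memory footprint among the evaluated methods in many settings while performing nearly on par with full fine-tuning.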
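
To make the O(kd) versus O(k+d) comparison concrete, the following sketch counts the weights in a single projection. The dimensions (k=768, d=24, n=4) are illustrative assumptions, not settings reported in this review, and the helper names are hypothetical. The LPHM count assumes n shared matrices A_i of size n x n and, per layer, n factor pairs s_i (k/n x r) and t_i (r x d/n).

```python
def standard_adapter_params(k, d):
    """Dense k x d projection: O(kd) parameters."""
    return k * d

def lphm_params(k, d, n, r=1):
    """Return (shared, per_layer) parameter counts for one LPHM projection.

    shared:    n matrices A_i of size n x n, stored once for all layers
    per_layer: n pairs (s_i, t_i) -> n * r * (k/n + d/n) = r * (k + d)
    Assumes n divides both k and d.
    """
    shared = n ** 3
    per_layer = n * r * (k // n + d // n)
    return shared, per_layer

# Illustrative sizes (assumed, not taken from the paper's configuration).
print(standard_adapter_params(768, 24))  # 18432
print(lphm_params(768, 24, n=4))         # (64, 792)
```

With rank r=1 the per-layer cost is exactly k + d, and the n^3 shared term is paid once regardless of how many layers reuse it.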


This review was generated by AI and checked by a human editor.