QUICK REVIEW

[Paper Review] Compacter: Efficient Low-Rank Hypercomplex Adapter Layers

Rabeeh Karimi Mahabadi, James Henderson|arXiv (Cornell University)|Jun 8, 2021

Topic Modeling | References: 64 | Cited by: 82

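One-Sentence Summary

Compacter introduces low-rank hypercomplex adapter layers for fine-tuning large language models, matching or exceeding the task performance of full fine-tuning while training only a tiny fraction of the parameters (about 0.047%).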

ABSTRACT

Adapting large-scale pretrained language models to downstream tasks via fine-tuning is the standard method for achieving state-of-the-art performance on NLP benchmarks. However, fine-tuning all weights of models with millions or billions of parameters is sample-inefficient, unstable in low-resource settings, and wasteful as it requires storing a separate copy of the model for each task. Recent work has developed parameter-efficient fine-tuning methods, but these approaches either still require a relatively large number of parameters or underperform standard fine-tuning. In this work, we propose Compacter, a method for fine-tuning large-scale language models with a better trade-off between task performance and the number of trainable parameters than prior work. Compacter accomplishes this by building on top of ideas from adapters, low-rank optimization, and parameterized hypercomplex multiplication layers. Specifically, Compacter inserts task-specific weight matrices into a pretrained model's weights, which are computed efficiently as a sum of Kronecker products between shared "slow" weights and "fast" rank-one matrices defined per Compacter layer. By only training 0.047% of a pretrained model's parameters, Compacter performs on par with standard fine-tuning on GLUE and outperforms standard fine-tuning on SuperGLUE and low-resource settings. Our code is publicly available at https://github.com/rabeehk/compacter.

Research Motivation and Goals

Motivate memory-efficient, parameter-efficient fine-tuning of large-scale pretrained language models (PLMs).
Develop adapters that reduce trainable parameters while maintaining or improving task performance on NLP benchmarks.
Leverage Kronecker/low-rank decompositions and hypercomplex multiplication to create compact adapter layers.
Empirically evaluate on GLUE and SuperGLUE against strong baselines and analyze efficiency trade-offs.

Proposed Method

Insert task-specific weight matrices into pretrained model weights via a sum of Kronecker products between shared slow weights and fast rank-one matrices per Compacter layer.
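Use a low-rank parameterization of the fast components to reduce the per-layer parameter count to O(k+d), compared to O(kd) for standard adapters, where k and d are the input and output dimensions of the projection.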
Share the A_i matrices across all layers while making B_i specific to each layer, and further factor B_i into s_i t_i^T with rank r (typically r=1).
Replace down-projection and up-projection in adapters with LPHM (low-rank parameterized hypercomplex multiplication) layers.
Keep the pretrained model frozen during training and update only the layer norms and the adapter layers (as in standard adapters).
Optionally evaluate Compacter ++, which removes the Compacter layer after self-attention in each block to reduce parameters further.
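
The weight construction described in the list above can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: the function name lphm_weight, the sizes (k=768, d=24, n=4, r=1), and the random initialization are assumptions chosen for the example.

```python
import numpy as np

def lphm_weight(A, s, t):
    """Build W = sum_i kron(A_i, s_i @ t_i).

    A: (n, n, n)      "slow" matrices A_i, shared across layers
    s: (n, k // n, r) per-layer "fast" left factors s_i
    t: (n, r, d // n) per-layer "fast" right factors t_i
    Returns W with shape (k, d).
    """
    return sum(np.kron(A[i], s[i] @ t[i]) for i in range(A.shape[0]))

# Illustrative sizes (assumed): input dim k, bottleneck dim d, n Kronecker terms, rank r.
k, d, n, r = 768, 24, 4, 1
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n, n))
s = rng.normal(size=(n, k // n, r))
t = rng.normal(size=(n, r, d // n))

W = lphm_weight(A, s, t)     # down-projection weight, shape (k, d)
x = rng.normal(size=(2, k))  # a batch of two hidden states
h = x @ W                    # projected activations, shape (2, d)
```

In this sketch only s and t belong to a given layer; A would be created once and reused by every Compacter layer.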

Experimental Results

Research Questions

RQ1: Can Compacter achieve parity with full fine-tuning while training orders of magnitude fewer parameters?
RQ2: How do LPHM-based adapters compare to standard adapters and other parameter-efficient fine-tuning methods on GLUE/SuperGLUE?
RQ3: What are the memory, training time, and accuracy trade-offs of Compacter across different n and rank configurations?
RQ4: Is sharing A_i across layers and using low-rank B_i sufficient to retain performance across tasks and resource settings?

Key Findings

Method	# Total params	Trained params per task	CoLA	SST-2	MRPC	QQP	STS-B	MNLI	QNLI	RTE	Avg
T5 BASE	8.0×1	100%	61.76	94.61	90.20/93.06	91.63/88.84	89.68/89.97	86.78	93.01	71.94	86.50
PHM-Adapter (n=12)	1.013	0.179%	57.35	94.50	91.67/93.86	90.25/87.05	90.45/90.84	85.97	92.92	75.54	86.40
Compacter (n=4)	1.004	0.073%	63.75	93.00	89.22/92.31	90.23/87.03	90.31/90.74	85.61	92.88	77.70	86.62
Compacter ++ (n=4)	1.002	0.047%	61.27	93.81	90.69/93.33	90.17/86.93	90.46/90.93	85.71	93.08	74.82	86.47
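
The last column of the table appears to be the plain mean of all eleven reported scores in each row, with both numbers of a two-metric task counted separately. This is inferred from the numbers rather than stated in the text; the short check below uses the rows exactly as given above, and the helper name mean_score is hypothetical.

```python
# Scores copied from the table rows above; the last element of each pair is the reported average.
rows = {
    "T5 BASE": ([61.76, 94.61, 90.20, 93.06, 91.63, 88.84,
                 89.68, 89.97, 86.78, 93.01, 71.94], 86.50),
    "PHM-Adapter (n=12)": ([57.35, 94.50, 91.67, 93.86, 90.25, 87.05,
                            90.45, 90.84, 85.97, 92.92, 75.54], 86.40),
    "Compacter (n=4)": ([63.75, 93.00, 89.22, 92.31, 90.23, 87.03,
                         90.31, 90.74, 85.61, 92.88, 77.70], 86.62),
    "Compacter ++ (n=4)": ([61.27, 93.81, 90.69, 93.33, 90.17, 86.93,
                            90.46, 90.93, 85.71, 93.08, 74.82], 86.47),
}

def mean_score(scores):
    """Unweighted mean over every reported number, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)

for name, (scores, reported) in rows.items():
    print(f"{name}: computed {mean_score(scores):.2f}, reported {reported:.2f}")
```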

Compacter achieves comparable or better performance than full fine-tuning on GLUE and SuperGLUE while training only 0.073% of parameters (Compacter) and 0.047% (Compacter ++).
Compacter outperforms several parameter-efficient baselines (including adapters, Adapter-LowRank, and prompt-tuning variants) on GLUE/SuperGLUE benchmarks.
LPHM-based layers reduce parameter count to O(k+d) versus O(kd) for standard adapters, with shared A_i across layers and rank-one B_i factors.
In low-resource settings, Compacter and Compacter ++ can outperform standard fine-tuning in terms of accuracy with far fewer trainable parameters.
Compacter ++ can achieve near-parity with full fine-tuning while using the smallest memory footprint among the evaluated methods in many settings.
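
To make the O(kd) versus O(k+d) comparison above concrete, the following sketch counts parameters for a single projection under three parameterizations. The sizes (k=768, d=24, n=4) and the function names are assumptions for illustration; the counting follows the description in this summary (n matrices A_i of size n×n, and B_i of size k/n × d/n, optionally factored as s_i t_i^T with rank r).

```python
def dense_params(k, d):
    """Standard adapter projection: one full k x d matrix."""
    return k * d

def phm_params(k, d, n):
    """PHM layer: n matrices A_i (n x n) plus n matrices B_i (k/n x d/n)."""
    return n**3 + n * (k // n) * (d // n)

def lphm_params(k, d, n, r=1):
    """LPHM layer, per-layer part only: n pairs s_i (k/n x r) and t_i (r x d/n).

    The n**3 entries of the shared A_i are stored once for the whole model,
    so they are not counted per layer here.
    """
    return n * r * (k // n + d // n)

k, d, n = 768, 24, 4  # assumed sizes, for illustration only
print("dense:", dense_params(k, d))
print("PHM:  ", phm_params(k, d, n))
print("LPHM: ", lphm_params(k, d, n))
```

With r=1 the per-layer LPHM count reduces to exactly k+d, which is where the O(k+d) figure comes from.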


This review was generated by AI and reviewed by human editors.