QUICK REVIEW

[Paper Review] Compacter: Efficient Low-Rank Hypercomplex Adapter Layers

Rabeeh Karimi Mahabadi, James Henderson|arXiv (Cornell University)|Jun 8, 2021
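Topic Modeling | References: 64 | Cited by: 82
One-Sentence Summary

Compacter introduces low-rank hypercomplex adapter layers for fine-tuning large language models, matching or exceeding the task performance of full fine-tuning while training only a tiny fraction of the parameters (about 0.047%).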

ABSTRACT

Adapting large-scale pretrained language models to downstream tasks via fine-tuning is the standard method for achieving state-of-the-art performance on NLP benchmarks. However, fine-tuning all weights of models with millions or billions of parameters is sample-inefficient, unstable in low-resource settings, and wasteful as it requires storing a separate copy of the model for each task. Recent work has developed parameter-efficient fine-tuning methods, but these approaches either still require a relatively large number of parameters or underperform standard fine-tuning. In this work, we propose Compacter, a method for fine-tuning large-scale language models with a better trade-off between task performance and the number of trainable parameters than prior work. Compacter accomplishes this by building on top of ideas from adapters, low-rank optimization, and parameterized hypercomplex multiplication layers. Specifically, Compacter inserts task-specific weight matrices into a pretrained model's weights, which are computed efficiently as a sum of Kronecker products between shared "slow" weights and "fast" rank-one matrices defined per Compacter layer. By only training 0.047% of a pretrained model's parameters, Compacter performs on par with standard fine-tuning on GLUE and outperforms standard fine-tuning on SuperGLUE and low-resource settings. Our code is publicly available at https://github.com/rabeehk/compacter.

Motivation and Objectives

  • Motivate memory-efficient, parameter-efficient fine-tuning of large-scale pretrained language models (PLMs).
  • Develop adapters that reduce trainable parameters while maintaining or improving task performance on NLP benchmarks.
  • Leverage Kronecker/low-rank decompositions and hypercomplex multiplication to create compact adapter layers.
  • Empirically evaluate on GLUE and SuperGLUE against strong baselines and analyze efficiency trade-offs.

Proposed Method

  • Insert task-specific weight matrices into pretrained model weights via a sum of Kronecker products between shared slow weights and fast rank-one matrices per Compacter layer.
  • Use low-rank parameterization of the fast components to reduce parameters to O(k+d) compared to adapters' O(kd).
  • Share the A_i matrices across all layers while making B_i specific to each layer, and further factor B_i into s_i t_i^T with rank r (typically r=1).
  • Replace down-projection and up-projection in adapters with LPHM (low-rank parameterized hypercomplex multiplication) layers.
  • Keep the pretrained model frozen during training and update only the layer norms and the adapter parameters (as in standard adapters).
  • Optionally evaluate Compacter++ by removing the Compacter layer after self-attention in each block for further parameter reductions.
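
The weight construction described under the proposed method can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `lphm_weight`, the toy sizes, and the random initialization are assumptions made here, and the bias, nonlinearity, and residual connection of a full adapter are omitted. Each weight is W = sum_i A_i ⊗ (s_i t_i^T), where the A_i are shared across layers and s_i, t_i belong to a single layer.

```python
import numpy as np

def lphm_weight(A, s, t):
    """Build W = sum_i kron(A[i], s[i] @ t[i]).

    A: (n, n, n)       n "slow" matrices of size n x n (shared across layers)
    s: (n, k // n, r)  per-layer "fast" left factors
    t: (n, r, d // n)  per-layer "fast" right factors
    Returns W with shape (k, d).
    """
    return sum(np.kron(A_i, s_i @ t_i) for A_i, s_i, t_i in zip(A, s, t))

# Illustrative sizes only (assumed for this sketch, not the paper's configuration).
k, d, n, r = 8, 4, 2, 1
rng = np.random.default_rng(0)
A = rng.normal(size=(n, n, n))
s = rng.normal(size=(n, k // n, r))
t = rng.normal(size=(n, r, d // n))

W = lphm_weight(A, s, t)      # down-projection weight, shape (k, d)
x = rng.normal(size=(3, k))   # a batch of 3 hidden vectors of size k
y = x @ W                     # projected to the bottleneck size d
```

With these toy sizes the factors hold 20 numbers in total (8 in `A`, 8 in `s`, 4 in `t`), while the assembled `W` has 32 entries; the gap widens as k and d grow.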

Experimental Results

Research Questions

  • RQ1: Can Compacter achieve parity with full fine-tuning while training orders of magnitude fewer parameters?
  • RQ2: How do LPHM-based adapters compare to standard adapters and other parameter-efficient fine-tuning methods in GLUE/SuperGLUE?
  • RQ3: What are the memory, training time, and accuracy trade-offs of Compacter across different n and rank configurations?
  • RQ4: Is sharing A_i across layers and using low-rank B_i sufficient to retain performance across tasks and resource settings?

Main Findings

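| Method | #Total params | Trained params per task | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI | QNLI | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| T5 BASE | 8.0×1 | 100% | 61.76 | 94.61 | 90.20/93.06 | 91.63/88.84 | 89.68/89.97 | 86.78 | 93.01 | 71.94 | 86.50 |
| PHM-Adapter (n=12) | 1.013 | 0.179% | 57.35 | 94.50 | 91.67/93.86 | 90.25/87.05 | 90.45/90.84 | 85.97 | 92.92 | 75.54 | 86.40 |
| Compacter (n=4) | 1.004 | 0.073% | 63.75 | 93.00 | 89.22/92.31 | 90.23/87.03 | 90.31/90.74 | 85.61 | 92.88 | 77.70 | 86.62 |
| Compacter++ (n=4) | 1.002 | 0.047% | 61.27 | 93.81 | 90.69/93.33 | 90.17/86.93 | 90.46/90.93 | 85.71 | 93.08 | 74.82 | 86.47 |
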
  • Compacter achieves comparable or better performance than full fine-tuning on GLUE and SuperGLUE while training only 0.073% (Compacter) or 0.047% (Compacter++) of the model's parameters per task.
  • Compacter outperforms several parameter-efficient baselines (including adapters, Adapter-LowRank, and prompt-tuning variants) on GLUE/SuperGLUE benchmarks.
  • LPHM-based layers reduce parameter count to O(k+d) versus O(kd) for standard adapters, with shared A_i across layers and rank-one B_i factors.
  • In low-resource settings, Compacter and Compacter++ can outperform standard fine-tuning in terms of accuracy with far fewer trainable parameters.
  • Compacter++ can achieve near-parity with full fine-tuning while using the smallest memory footprint among the evaluated methods in many settings.
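
The O(kd) versus O(k+d) scaling mentioned in the findings can be checked with simple counting. The sketch below is illustrative only: the helper names and the sizes k=768, d=24, n=4 are assumptions chosen here, not values reported in this summary, and bias terms are ignored. It counts the weights of one projection under three parameterizations: a dense adapter projection, a PHM layer with full B_i, and an LPHM layer with B_i = s_i t_i^T.

```python
def adapter_params(k, d):
    """Dense projection: a single k x d weight matrix."""
    return k * d

def phm_params(k, d, n):
    """PHM layer: n matrices A_i (n x n) plus n matrices B_i (k/n x d/n)."""
    return n**3 + n * (k // n) * (d // n)

def lphm_params(k, d, n, r=1, shared_A=True):
    """LPHM layer: B_i = s_i t_i^T with s_i (k/n x r) and t_i (r x d/n).

    With shared_A=True the n^3 entries of the A_i are assumed to be
    counted once for the whole model rather than per layer.
    """
    fast = n * r * (k // n + d // n)  # equals r * (k + d) when n divides k and d
    return fast if shared_A else fast + n**3

# Hypothetical sizes for illustration: k = input size, d = bottleneck size.
k, d, n = 768, 24, 4
counts = {
    "adapter": adapter_params(k, d),
    "phm": phm_params(k, d, n),
    "lphm": lphm_params(k, d, n),
}
for name, count in counts.items():
    print(f"{name:8s} {count}")
```

For rank r=1 and shared A_i, the per-layer count reduces to exactly k+d, which is where the O(k+d) figure comes from.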


This review was generated by AI and reviewed by human editors.