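[Paper Walkthrough] Compacter: Efficient Low-Rank Hypercomplex Adapter Layers
Compacter introduces low-rank hypercomplex adapter layers for fine-tuning large language models, matching or exceeding full fine-tuning on task performance while training only a tiny fraction of the parameters (about 0.047%).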
Adapting large-scale pretrained language models to downstream tasks via fine-tuning is the standard method for achieving state-of-the-art performance on NLP benchmarks. However, fine-tuning all weights of models with millions or billions of parameters is sample-inefficient, unstable in low-resource settings, and wasteful as it requires storing a separate copy of the model for each task. Recent work has developed parameter-efficient fine-tuning methods, but these approaches either still require a relatively large number of parameters or underperform standard fine-tuning. In this work, we propose Compacter, a method for fine-tuning large-scale language models with a better trade-off between task performance and the number of trainable parameters than prior work. Compacter accomplishes this by building on top of ideas from adapters, low-rank optimization, and parameterized hypercomplex multiplication layers. Specifically, Compacter inserts task-specific weight matrices into a pretrained model's weights, which are computed efficiently as a sum of Kronecker products between shared "slow" weights and "fast" rank-one matrices defined per Compacter layer. By only training 0.047% of a pretrained model's parameters, Compacter performs on par with standard fine-tuning on GLUE and outperforms standard fine-tuning on SuperGLUE and low-resource settings. Our code is publicly available at https://github.com/rabeehk/compacter.
Motivation and Goals
- Motivate memory-efficient, parameter-efficient fine-tuning of large-scale pretrained language models (PLMs).
- Develop adapters that reduce trainable parameters while maintaining or improving task performance on NLP benchmarks.
- Leverage Kronecker/low-rank decompositions and hypercomplex multiplication to create compact adapter layers.
- Empirically evaluate on GLUE and SuperGLUE against strong baselines and analyze efficiency trade-offs.
Proposed Method
- Insert task-specific weight matrices into the pretrained model, each computed as a sum of Kronecker products between shared "slow" weights and "fast" rank-one matrices defined per Compacter layer.
- Use a low-rank parameterization of the fast components to cut the parameter count of a k×d projection from the O(kd) of standard adapters to O(k+d), where k and d are the projection's input and output dimensions.
- Share the A_i matrices across all layers while making B_i specific to each layer, and further factor B_i into s_i t_i^T with rank r (typically r=1).
- Replace down-projection and up-projection in adapters with LPHM (low-rank parameterized hypercomplex multiplication) layers.
- Keep the pretrained model weights frozen during training and update only the layer norms and the adapter layers (as in standard adapters).
- Optionally evaluate Compacter++, which removes the Compacter layer after self-attention in each block for further parameter reductions.
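To make the construction in the list above concrete, here is a minimal NumPy sketch of how one LPHM weight could be assembled as W = Σ_i A_i ⊗ (s_i t_i^T). It is an illustration under stated assumptions, not the authors' implementation: the function name `lphm_weight`, the hidden size 768, and the bottleneck size 24 are hypothetical choices, and biases, the nonlinearity, and the up-projection are left out.

```python
import numpy as np

def lphm_weight(A, s, t):
    """Build a (k, d) weight as sum_i kron(A[i], s[i] @ t[i]).

    A: (n, n, n)     "slow" matrices A_i, shared across layers
    s: (n, k//n, r)  "fast" left factors, specific to one layer
    t: (n, r, d//n)  "fast" right factors, specific to one layer
    """
    B = s @ t  # batched matmul -> (n, k//n, d//n); each B_i has rank <= r
    return sum(np.kron(A_i, B_i) for A_i, B_i in zip(A, B))

rng = np.random.default_rng(0)
# Hypothetical sizes: n Kronecker terms, input dim k, bottleneck dim d, rank r.
n, k, d, r = 4, 768, 24, 1
A = rng.normal(size=(n, n, n))
s = rng.normal(size=(n, k // n, r))
t = rng.normal(size=(n, r, d // n))

W = lphm_weight(A, s, t)     # down-projection weight, shape (k, d)
x = rng.normal(size=(2, k))  # a batch of two hidden states
y = x @ W                    # projected activations, shape (2, d)
print(W.shape, y.shape)
```

In this sketch only `s` and `t` would be stored per layer, which is r(k+d) values, while `A` contributes n³ values that can be reused by every layer.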
Experimental Results
Research Questions
- RQ1: Can Compacter achieve parity with full fine-tuning while training orders of magnitude fewer parameters?
- RQ2: How do LPHM-based adapters compare to standard adapters and other parameter-efficient fine-tuning methods on GLUE/SuperGLUE?
- RQ3: What are the memory, training time, and accuracy trade-offs of Compacter across different n and rank configurations?
- RQ4: Is sharing A_i across layers and using low-rank B_i sufficient to retain performance across tasks and resource settings?
Key Findings
| Method | # Total params | Trained params per task | CoLA | SST-2 | MRPC | QQP | STS-B | MNLI | QNLI | RTE | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| T5 BASE | 8.0×1 | 100% | 61.76 | 94.61 | 90.20/93.06 | 91.63/88.84 | 89.68/89.97 | 86.78 | 93.01 | 71.94 | 86.50 |
| PHM-Adapter (n=12) | 1.013 | 0.179% | 57.35 | 94.50 | 91.67/93.86 | 90.25/87.05 | 90.45/90.84 | 85.97 | 92.92 | 75.54 | 86.40 |
| Compacter (n=4) | 1.004 | 0.073% | 63.75 | 93.00 | 89.22/92.31 | 90.23/87.03 | 90.31/90.74 | 85.61 | 92.88 | 77.70 | 86.62 |
| Compacter++ (n=4) | 1.002 | 0.047% | 61.27 | 93.81 | 90.69/93.33 | 90.17/86.93 | 90.46/90.93 | 85.71 | 93.08 | 74.82 | 86.47 |
- Compacter achieves comparable or better performance than full fine-tuning on GLUE and SuperGLUE while training only 0.073% of parameters (Compacter) and 0.047% (Compacter++).
- Compacter outperforms several parameter-efficient baselines (including adapters, Adapter-LowRank, and prompt-tuning variants) on GLUE/SuperGLUE benchmarks.
- LPHM-based layers reduce parameter count to O(k+d) versus O(kd) for standard adapters, with shared A_i across layers and rank-one B_i factors.
- In low-resource settings, Compacter and Compacter++ can outperform standard fine-tuning in accuracy with far fewer trainable parameters.
- Compacter++ can achieve near-parity with full fine-tuning while using the smallest memory footprint among the evaluated methods in many settings.
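The O(kd) versus O(k+d) claim in the findings above can be checked with a small counting sketch. The helper `projection_params` and the sizes k=768, d=24 are hypothetical, chosen only for illustration; the count covers weights of a single projection and ignores biases.

```python
def projection_params(k, d, n=4, r=1):
    """Weight-only parameter counts for one k x d projection (biases ignored)."""
    assert k % n == 0 and d % n == 0, "n must divide both k and d"
    return {
        # Standard adapter: one dense k x d matrix.
        "adapter": k * d,
        # PHM: n matrices A_i (n x n) plus n full B_i (k/n x d/n).
        "phm": n**3 + n * (k // n) * (d // n),
        # LPHM: the same A_i plus rank-r factors s_i (k/n x r) and t_i (r x d/n).
        # In Compacter the n**3 term is shared across layers, so it is paid once.
        "lphm": n**3 + n * r * (k // n + d // n),
    }

counts = projection_params(k=768, d=24)
for name, count in counts.items():
    print(f"{name:8s}{count:8d}")
```

With these assumed sizes the dense projection needs 18,432 weights, the PHM version 4,672, and the low-rank version 856, of which only 792 are specific to the layer.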
This walkthrough was generated by AI and reviewed by a human editor.