QUICK REVIEW

[Paper Digest] Parameter-Efficient Transfer Learning with Diff Pruning

Demi Guo, Alexander M. Rush | arXiv (Cornell University) | Dec 14, 2020

Domain Adaptation and Few-Shot Learning | References: 70 | Cited by: 23

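One-Sentence Summary

This paper proposes diff pruning, a parameter-efficient transfer learning method that extends a pretrained model by adding a sparse, task-specific diff vector ($\bm{\delta}_{\text{task}}$) on top of its parameters. The diff vector is learned during finetuning and regularized with a differentiable approximation to the $L_0$-norm to induce sparsity. On the GLUE benchmark the method matches the performance of full finetuning while modifying only 0.5% of the pretrained model's parameters per task, which keeps per-task storage very low and suits on-device deployment.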

ABSTRACT

While task-specific finetuning of pretrained networks has led to significant empirical advances in NLP, the large size of networks makes finetuning difficult to deploy in multi-task, memory-constrained settings. We propose diff pruning as a simple approach to enable parameter-efficient transfer learning within the pretrain-finetune framework. This approach views finetuning as learning a task-specific diff vector that is applied on top of the pretrained parameter vector, which remains fixed and is shared across different tasks. The diff vector is adaptively pruned during training with a differentiable approximation to the L0-norm penalty to encourage sparsity. Diff pruning becomes parameter-efficient as the number of tasks increases, as it requires storing only the nonzero positions and weights of the diff vector for each task, while the cost of storing the shared pretrained model remains constant. It further does not require access to all tasks during training, which makes it attractive in settings where tasks arrive in stream or the set of tasks is unknown. We find that models finetuned with diff pruning can match the performance of fully finetuned baselines on the GLUE benchmark while only modifying 0.5% of the pretrained model's parameters per task.

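Motivation and Objectives

Reduce the high storage cost of finetuning a large pretrained model separately for many tasks in memory-constrained environments such as on-device applications.
Achieve parameter-efficient transfer learning without access to all tasks during training, so that tasks can arrive in a stream or be trained in a federated fashion.
Sharply reduce the number of parameters modified per task while keeping performance on par with full finetuning.
Explore a paradigm in which model updates are sparse (optionally structured) and are stored compactly as nonzero values together with their positions.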
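
The storage format named in the last point can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `compress_diff` and `apply_diff` and the toy vectors are hypothetical, and a real system would likely use a packed binary format rather than Python lists.

```python
def compress_diff(delta):
    """Keep only (position, value) pairs for the nonzero entries of a diff vector."""
    return [(i, v) for i, v in enumerate(delta) if v != 0.0]


def apply_diff(theta_pretrained, sparse_diff):
    """Rebuild task parameters as theta_pretrained + delta_task."""
    theta_task = list(theta_pretrained)  # copy, so the shared model stays untouched
    for i, v in sparse_diff:
        theta_task[i] += v
    return theta_task


# Toy example (hypothetical values): 2 of 8 entries are modified for this task.
theta_pretrained = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
delta_task = [0.0, 0.5, 0.0, 0.0, 0.0, 0.0, -0.25, 0.0]

sparse_diff = compress_diff(delta_task)   # what gets stored per task
theta_task = apply_diff(theta_pretrained, sparse_diff)
```

Only `sparse_diff` has to be kept for each task; `theta_pretrained` is shared by all of them.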

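Proposed Method

Reparameterize the task-specific model parameters as $\bm{\theta}_{\text{task}} = \bm{\theta}_{\text{pretrained}} + \bm{\delta}_{\text{task}}$, keeping the pretrained weights fixed.
Train only the task-specific diff vector $\bm{\delta}_{\text{task}}$, with a differentiable approximation to the $L_0$-norm penalty that encourages sparsity.
During training, prune the entries of $\bm{\delta}_{\text{task}}$ differentiably through a soft mask built from a temperature-controlled sigmoid.
Store only the nonzero entries (positions and values) of $\bm{\delta}_{\text{task}}$ for each task; the shared pretrained model is stored once, so its cost stays constant as tasks are added.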
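Introduce a structured variant of diff pruning that encourages sparsity over groups of parameters rather than only over individual entries, improving generalization and performance.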
Train end to end with standard backpropagation; the differentiable relaxation lets gradients flow through the sparsity-inducing mask.
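
The reparameterization, the soft mask, and the sparsity penalty can be made concrete with a small forward-pass sketch. The summary only mentions a temperature-controlled sigmoid soft mask; the code below assumes a hard-concrete style gate (a sigmoid stretched to an interval $(l, r)$ with $l < 0$ and $r > 1$, then clamped to $[0, 1]$). The constants `beta`, `l`, `r`, the function names, and the toy values are illustrative assumptions, not the paper's settings. No gradients are computed here; in practice an autodiff framework would differentiate through the same expressions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hard_concrete_gate(log_alpha, u, beta=2 / 3, l=-0.1, r=1.1):
    """Relaxed gate z in [0, 1] from uniform noise u in (0, 1) (assumed form)."""
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / beta)
    s_stretched = s * (r - l) + l            # stretch (0, 1) to (l, r)
    return min(1.0, max(0.0, s_stretched))   # clamp, so exact 0 and 1 can occur


def expected_l0(log_alphas, beta=2 / 3, l=-0.1, r=1.1):
    """Differentiable surrogate for the number of nonzero gates."""
    return sum(sigmoid(a - beta * math.log(-l / r)) for a in log_alphas)


def task_parameters(theta_pretrained, w, log_alphas, rng):
    """theta_task = theta_pretrained + z * w; theta_pretrained is never updated."""
    theta_task = []
    for theta, w_i, a in zip(theta_pretrained, w, log_alphas):
        u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep log(u) finite
        theta_task.append(theta + hard_concrete_gate(a, u) * w_i)
    return theta_task


rng = random.Random(0)
theta_pretrained = [0.5, -1.0, 2.0, 0.25]
w = [0.3, 0.3, 0.3, 0.3]              # trainable diff values (toy numbers)
log_alphas = [-8.0, -8.0, 8.0, 8.0]   # first two gates mostly closed, last two mostly open

theta_task = task_parameters(theta_pretrained, w, log_alphas, rng)
penalty = expected_l0(log_alphas)     # close to 2: about two entries expected to be nonzero
```

In a training loop under these assumptions, the loss would be the task loss evaluated at `theta_task` plus a coefficient times `penalty`, with only `w` and `log_alphas` updated.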

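Experimental Results

Research Questions

RQ1: Can a sparse, task-specific update vector match the performance of full finetuning while modifying only a very small fraction of the model's parameters?
RQ2: Does a differentiable approximation to the $L_0$-norm induce sparsity in the update vector effectively and efficiently, without sacrificing accuracy?
RQ3: As the number of tasks grows, how does the storage cost of diff pruning scale compared with standard finetuning and other pruning methods?
RQ4: Can diff pruning be applied in streaming or decentralized settings, where tasks arrive sequentially and not all tasks are available during training?
RQ5: Does the structured variant of diff pruning improve performance or generalization over the unstructured version?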

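Key Findings

On the GLUE benchmark, diff pruning performs on par with or better than the fully finetuned BERT baseline while modifying only 0.5% of the model's parameters per task.
The method scales well as the number of tasks grows: because only nonzero updates are stored, its storage requirement is far lower than that of full finetuning or standard pruning methods.
The structured variant of diff pruning further improves performance, suggesting that structured sparsity helps both generalization and model efficiency.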
Diff pruning is roughly 1.5× to 2× slower per mini-batch than standard finetuning, a cost that is acceptable given the large gain in parameter efficiency.
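The method supports on-device deployment and learning from a stream of tasks, because it does not need access to all tasks during training.
The method has a regularizing effect and sometimes even outperforms standard finetuning in low-data settings, improving generalization.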
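
A back-of-the-envelope calculation shows why storage scales well with the number of tasks. The sketch below assumes, hypothetically, that each stored nonzero entry costs twice as much as a dense weight (one value plus one position); the function name and the `index_overhead` factor are illustrative and not taken from the paper.

```python
def storage_in_model_units(num_tasks, diff_fraction=0.005, index_overhead=2.0):
    """Return (full_finetuning, diff_pruning) storage in multiples of one pretrained model.

    diff_fraction: fraction of parameters modified per task (0.5% in the summary).
    index_overhead: assumed cost of a stored entry relative to a dense weight.
    """
    full = float(num_tasks)                                  # one full model copy per task
    diff = 1.0 + num_tasks * diff_fraction * index_overhead  # shared model + sparse diffs
    return full, diff


full, diff = storage_in_model_units(9)
print(f"full finetuning: {full:.2f} model units, diff pruning: {diff:.2f} model units")
```

Under these assumptions, nine tasks need about 1.09 model-sized units with diff pruning versus 9 with full finetuning, and each additional task adds roughly 0.01 units instead of 1.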


This digest was generated by AI and reviewed by human editors.