QUICK REVIEW

[Paper Review] SplitLoRA: A Split Parameter-Efficient Fine-Tuning Framework for Large Language Models

Lin Zheng, Xuanjie Hu|arXiv (Cornell University)|Jul 1, 2024

Natural Language Processing Techniques | Cited by 7

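One-Sentence Summary

SplitLoRA combines split learning and federated learning with LoRA-based parameter-efficient fine-tuning to fine-tune large language models on decentralized private data, reaching comparable accuracy at lower computation and communication cost.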

ABSTRACT

The scalability of large language models (LLMs) in handling high-complexity models and large-scale datasets has led to tremendous successes in pivotal domains. While there is an urgent need to acquire more training data for LLMs, a concerning reality is the depletion of high-quality public datasets within a few years. In view of this, the federated learning (FL) LLM fine-tuning paradigm recently has been proposed to facilitate collaborative LLM fine-tuning on distributed private data, where multiple data owners collaboratively fine-tune a shared LLM without sharing raw data. However, the staggering model size of LLMs imposes heavy computing and communication burdens on clients, posing significant barriers to the democratization of the FL LLM fine-tuning paradigm. To address this issue, split learning (SL) has emerged as a promising solution by offloading the primary training workload to a server via model partitioning while exchanging activation/activation's gradients with smaller data sizes rather than the entire LLM. Unfortunately, research on the SL LLM fine-tuning paradigm is still in its nascent stage. To fill this gap, in this paper, we propose the first SL LLM fine-tuning framework, named SplitLoRA. SplitLoRA is built on the split federated learning (SFL) framework, amalgamating the advantages of parallel training from FL and model splitting from SL and thus greatly enhancing the training efficiency. It is worth noting that SplitLoRA is the inaugural open-source benchmark for SL LLM fine-tuning, providing a foundation for research efforts dedicated to advancing SL LLM fine-tuning. Extensive simulations validate that SplitLoRA achieves target accuracy in significantly less time than state-of-the-art LLM fine-tuning frameworks, demonstrating the superior training performance of SplitLoRA. The project page is available at https://fduinc.github.io/splitlora/.

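Motivation and Objectives

Address data scarcity and privacy concerns by fine-tuning LLMs collaboratively on distributed private data without sharing the raw data.
Propose SplitLoRA, the first SL LLM fine-tuning framework, built on split federated learning and LoRA.
Show that SplitLoRA improves training efficiency and lowers client-side computation and communication load while keeping accuracy competitive.
Provide an open-source benchmark for SL LLM fine-tuning to support further research.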

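Proposed Method

Partition the pre-trained LLM into a client-side sub-model and a server-side sub-model, and fine-tune them with split federated learning (SFL).
Attach LoRA adapters to both the client-side and server-side sub-models so that updates are parameter-efficient.
Train in two stages: split fine-tuning in every round (client-side forward pass, server-side forward/backward pass, exchange of activations and their gradients), plus periodic aggregation of the client-side LoRA adapters.
Every I rounds, aggregate the client-side LoRA adapters on an aggregation server and send the aggregated adapters back to the clients.
Train the server-side sub-model on a central server while client updates stay distributed, which reduces data transfer and client memory load.
Evaluate on the E2E natural language generation task with GPT-2 Small and Medium (GPT2-S/GPT2-M), comparing against CenLoRA and FedLoRA on BLEU, NIST, METEOR, ROUGE_L, and CIDEr.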
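
The training procedure described above can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: each sub-model is reduced to a single linear layer with a frozen weight and a LoRA adapter, clients are processed one after another rather than in parallel, the server is assumed to see the labels so it can compute the loss, and aggregation is a plain element-wise average of the adapter matrices. All names (`LoRALinear`, `run`, `agg_every`, the `teacher` mapping) are hypothetical.

```python
import copy
import random

def matvec(M, v):            # M @ v
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matTvec(M, v):           # M.T @ v
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

def outer(u, v):
    return [[a * b for b in v] for a in u]

def rand_matrix(rows, cols, std, rng):
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

class LoRALinear:
    """Linear layer with a frozen weight W and a trainable rank-r update B @ A."""
    def __init__(self, d_in, d_out, r, rng):
        self.W = rand_matrix(d_out, d_in, 0.3, rng)   # stands in for pretrained weights
        self.A = rand_matrix(r, d_in, 0.5, rng)
        self.B = [[0.0] * r for _ in range(d_out)]    # zero init: adapter starts as a no-op

    def forward(self, x):
        self.x, self.u = x, matvec(self.A, x)         # cache values needed by backward
        return [w + b for w, b in zip(matvec(self.W, x), matvec(self.B, self.u))]

    def backward(self, grad_out, lr):
        """SGD step on A and B only; returns the gradient w.r.t. the layer input."""
        bt_g = matTvec(self.B, grad_out)
        grad_in = [a + b for a, b in zip(matTvec(self.W, grad_out), matTvec(self.A, bt_g))]
        grad_B, grad_A = outer(grad_out, self.u), outer(bt_g, self.x)
        for P, G in ((self.B, grad_B), (self.A, grad_A)):
            for row, grow in zip(P, G):
                for j, g in enumerate(grow):
                    row[j] -= lr * g
        return grad_in

def average(mats):
    """Element-wise mean of equally shaped matrices."""
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(len(mats[0][0]))]
            for i in range(len(mats[0]))]

def mean_loss(clients, server, data):
    total, count = 0.0, 0
    for client, samples in zip(clients, data):
        for x, t in samples:
            y = server.forward(client.forward(x))
            total += 0.5 * sum((a - b) ** 2 for a, b in zip(y, t))
            count += 1
    return total / count

def run(num_clients=3, rounds=40, agg_every=5, r=2, lr=0.05, seed=0):
    rng = random.Random(seed)
    d = 4
    template = LoRALinear(d, d, r, rng)
    clients = [copy.deepcopy(template) for _ in range(num_clients)]
    server = LoRALinear(d, 2, r, rng)
    teacher = rand_matrix(2, d, 0.5, rng)             # hypothetical target mapping
    data = []
    for _ in range(num_clients):                      # each client keeps its own private samples
        xs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(4)]
        data.append([(x, matvec(teacher, x)) for x in xs])
    before = mean_loss(clients, server, data)
    for rnd in range(1, rounds + 1):
        for client, samples in zip(clients, data):
            for x, t in samples:
                h = client.forward(x)                 # client -> server: activations
                y = server.forward(h)
                grad_y = [a - b for a, b in zip(y, t)]
                grad_h = server.backward(grad_y, lr)  # server -> client: activation gradient
                client.backward(grad_h, lr)
        if rnd % agg_every == 0:                      # periodic client-adapter aggregation
            avg_A = average([c.A for c in clients])
            avg_B = average([c.B for c in clients])
            for c in clients:
                c.A, c.B = copy.deepcopy(avg_A), copy.deepcopy(avg_B)
    return before, mean_loss(clients, server, data), clients

before, after, clients = run()
print(f"mean loss before: {before:.4f}, after: {after:.4f}")
```

Only the cut-layer activations and their gradients cross the client/server boundary in this sketch, and only the small `A` and `B` matrices are averaged; the frozen weights never move. Here `agg_every` plays the role of the aggregation interval I.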

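Experimental Results

Research Questions

RQ1: Can SplitLoRA reach convergence accuracy comparable to centralized fine-tuning and full FL while reducing client-side computation and communication?
RQ2: How do the split architecture and LoRA-based PEFT affect convergence speed and resource efficiency when client resources are heterogeneous?
RQ3: How do the LoRA rank and the choice of cut layer affect performance and the amount of data transfer and computation?

Key Findings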

Model	Method	BLEU	NIST	METEOR	ROUGE_L	CIDEr
GPT2-S	CenLoRA (r=1)	67.95	8.6973	0.4421	68.96	2.3412
GPT2-S	CenLoRA (r=2)	68.49	8.7481	0.4491	68.70	2.3952
GPT2-S	CenLoRA (r=4)	69.41	8.7824	0.4610	70.70	2.4713
GPT2-S	CenLoRA (r=8)	69.37	8.7735	0.4624	70.96	2.4572
GPT2-S	SplitLoRA (r=1)	67.18	8.6601	0.4416	67.71	2.3255
GPT2-S	SplitLoRA (r=2)	66.86	8.5667	0.4515	68.50	2.3358
GPT2-S	SplitLoRA (r=4)	68.79	8.7259	0.4572	69.84	2.4411
GPT2-S	SplitLoRA (r=8)	68.76	8.6931	0.4588	70.17	2.4165
GPT2-S	FedLoRA (r=1)	65.66	8.4123	0.4265	67.68	2.1921
GPT2-S	FedLoRA (r=2)	67.24	8.6055	0.4398	69.33	2.3025
GPT2-S	FedLoRA (r=4)	67.73	8.6148	0.4494	68.59	2.3817
GPT2-S	FedLoRA (r=8)	68.39	8.6745	0.4590	70.24	2.4450
GPT2-M	CenLoRA (r=1)	69.86	8.7679	0.4650	71.20	2.5028
GPT2-M	CenLoRA (r=2)	69.97	8.7787	0.4663	71.56	2.5029
GPT2-M	CenLoRA (r=4)	69.78	8.7820	0.4667	71.62	2.5301
GPT2-M	CenLoRA (r=8)	70.57	8.8557	0.4688	72.17	2.5405
GPT2-M	SplitLoRA (r=1)	70.26	8.8274	0.4664	71.73	2.5267
GPT2-M	SplitLoRA (r=2)	70.04	8.8031	0.4670	71.68	2.5233
GPT2-M	SplitLoRA (r=4)	70.09	8.8075	0.4667	71.60	2.5370
GPT2-M	SplitLoRA (r=8)	69.18	8.7189	0.4631	71.30	2.5156
GPT2-M	FedLoRA (r=1)	67.02	8.6467	0.4484	68.06	2.3431
GPT2-M	FedLoRA (r=2)	69.64	8.7727	0.4633	71.35	2.4900
GPT2-M	FedLoRA (r=4)	69.78	8.7836	0.4642	71.87	2.4819
GPT2-M	FedLoRA (r=8)	69.55	8.7358	0.4661	71.46	2.4980
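
To make the comparison in the table easier to read, the short script below computes the BLEU gap between SplitLoRA and each baseline at every rank. The scores are copied from the table; the dictionary layout and the helper name `bleu_gaps` are illustrative choices, not part of the paper.

```python
# BLEU scores copied from the table above, indexed by LoRA rank r = 1, 2, 4, 8.
RANKS = (1, 2, 4, 8)
BLEU = {
    "GPT2-S": {
        "CenLoRA":   (67.95, 68.49, 69.41, 69.37),
        "SplitLoRA": (67.18, 66.86, 68.79, 68.76),
        "FedLoRA":   (65.66, 67.24, 67.73, 68.39),
    },
    "GPT2-M": {
        "CenLoRA":   (69.86, 69.97, 69.78, 70.57),
        "SplitLoRA": (70.26, 70.04, 70.09, 69.18),
        "FedLoRA":   (67.02, 69.64, 69.78, 69.55),
    },
}

def bleu_gaps(model, baseline):
    """SplitLoRA BLEU minus the baseline's BLEU for each rank, rounded to 2 decimals."""
    split = BLEU[model]["SplitLoRA"]
    base = BLEU[model][baseline]
    return {r: round(s - b, 2) for r, s, b in zip(RANKS, split, base)}

for model in BLEU:
    for baseline in ("CenLoRA", "FedLoRA"):
        print(model, "SplitLoRA -", baseline, bleu_gaps(model, baseline))
```

On GPT2-M, SplitLoRA is ahead of CenLoRA in BLEU at r=1, 2, and 4 (by 0.40, 0.07, and 0.31) and behind by 1.39 at r=8. On GPT2-S it trails CenLoRA by 0.61 to 1.63 BLEU at every rank, while beating FedLoRA at three of the four ranks.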

SplitLoRA reaches convergence accuracy comparable to CenLoRA on GPT2-M, with differences below 0.04 in some settings.
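FedLoRA converges to a higher perplexity (worse performance) because of data heterogeneity, with reported PPL gaps of about 0.08/0.11 (GPT2-S/GPT2-M) and 0.73/0.09 relative to SplitLoRA and CenLoRA.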
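Compared with CenLoRA/FedLoRA, SplitLoRA markedly reduces the number of trainable parameters on the client (GPT2-S: 0.008M–0.062M; GPT2-M: 0.011M–0.088M) and avoids fine-tuning the whole model on the client.
SplitLoRA converges faster than FedLoRA and CenLoRA: the reported ratios of training latency to convergence are roughly 1.7× and 4.7× on GPT2-S, and 2.1× and 4.8× on GPT2-M.
The framework partitions the model so that client-side fine-tuning touches only part of it (one quarter for GPT2-S, one eighth for GPT2-M), which makes it feasible on consumer-grade GPUs.
The server-side sub-model is trained centrally, which improves robustness to data heterogeneity and shifts most of the workload to the central server.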
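
A back-of-the-envelope cost model helps show why client-side cost scales with the LoRA rank and the cut layer. The sketch below relies on two standard facts: a rank-r adapter on a d_out × d_in matrix adds r · (d_in + d_out) parameters, and the cut-layer activation tensor holds batch × sequence length × hidden size values. Everything else is an assumption for illustration: that LoRA is attached to two hidden × hidden projections per transformer block, and the batch size, sequence length, and 4-byte values in the usage lines. The summary does not state the paper's exact adapter placement, so the counts land in the same order of magnitude as the reported range but need not match it.

```python
from dataclasses import dataclass

@dataclass
class SplitConfig:
    hidden: int                  # transformer hidden size
    total_layers: int            # transformer blocks in the full model
    client_fraction: float       # share of blocks kept on the client
    adapted_per_layer: int = 2   # ASSUMPTION: LoRA on two hidden x hidden projections per block

    @property
    def client_layers(self):
        return round(self.total_layers * self.client_fraction)

    def client_lora_params(self, rank):
        # A rank-r adapter on a (d_out x d_in) matrix adds r * (d_in + d_out) parameters.
        per_matrix = rank * (self.hidden + self.hidden)
        return self.client_layers * self.adapted_per_layer * per_matrix

    def activation_bytes(self, batch, seq_len, bytes_per_value=4):
        # Cut-layer activations sent to the server; the returned gradient has the same shape.
        return batch * seq_len * self.hidden * bytes_per_value

# GPT-2 Small: 12 blocks, hidden size 768; GPT-2 Medium: 24 blocks, hidden size 1024.
GPT2_S = SplitConfig(hidden=768, total_layers=12, client_fraction=1 / 4)
GPT2_M = SplitConfig(hidden=1024, total_layers=24, client_fraction=1 / 8)

for name, cfg in (("GPT2-S", GPT2_S), ("GPT2-M", GPT2_M)):
    counts = {r: cfg.client_lora_params(r) for r in (1, 2, 4, 8)}
    print(name, "client blocks:", cfg.client_layers, "LoRA params by rank:", counts)

# Hypothetical batch of 8 sequences with 512 tokens each, 4-byte values.
print("GPT2-S activation bytes per step:", GPT2_S.activation_bytes(batch=8, seq_len=512))
```

Under these assumptions both models keep three blocks on the client, the adapter size grows linearly with the rank, and moving the cut layer deeper raises client parameters without changing the per-step activation traffic, which depends only on the hidden size and the batch shape.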

This review was generated by AI and reviewed by human editors.