QUICK REVIEW

[Paper Review] VeRA: Vector-based Random Matrix Adaptation

Dawid Jan Kopiczko, Tijmen Blankevoort|arXiv (Cornell University)|Oct 17, 2023

Domain Adaptation and Few-Shot Learning | Cited by 8

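One-Sentence Summary

VeRA shares a pair of frozen random matrices across all layers and learns only small scaling vectors, cutting the number of trainable parameters by roughly an order of magnitude or more while matching LoRA on NLP, vision, and instruction-tuning tasks.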

ABSTRACT

Low-rank adaptation (LoRA) is a popular method that reduces the number of trainable parameters when finetuning large language models, but still faces acute storage challenges when scaling to even larger models or deploying numerous per-user or per-task adapted models. In this work, we present Vector-based Random Matrix Adaptation (VeRA), which significantly reduces the number of trainable parameters compared to LoRA, yet maintains the same performance. It achieves this by using a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead. We demonstrate its effectiveness on the GLUE and E2E benchmarks, image classification tasks, and show its application in instruction-tuning of 7B and 13B language models.

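Motivation and Objectives

Motivate the need for ultra-parameter-efficient finetuning of large pretrained models in LLM, vision, and instruction-tuning applications.
Propose VeRA, which sharply reduces the number of trainable parameters while keeping performance competitive or better.
Show applicability to NLP benchmarks (GLUE), generation tasks (E2E), image classification (ViT), and instruction-following settings.
Provide ablation studies to understand the contribution of each VeRA component and of the initialization.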

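Proposed Method

Freeze a single pair of random matrices that is shared across all adapted layers.
Introduce trainable scaling vectors that act as diagonal scalings around the frozen matrices ($\Lambda_b$ and $\Lambda_d$), giving each layer its own adaptation.
Formally, $h = W_0 x + \Lambda_b B \Lambda_d A x$, where $A$ and $B$ are frozen, random, and shared, while the vectors $b$ and $d$ (the diagonals of $\Lambda_b$ and $\Lambda_d$) are trainable.
Ensure the trained parameters can be merged back into the original weights, so inference incurs no extra latency.
Provide an initialization strategy: Kaiming initialization for $A$ and $B$, zeros for $b$, and a controlled initial value for $d$; the explored d_init values include 0.1 and 1e-7, among others.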
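
The method steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the class name `VeRALinear`, the helper `kaiming_uniform`, the particular Kaiming variant, the toy dimensions, and the default `d_init=0.1` are all assumptions made for the example. It shows that the frozen matrices are shared between layers, that zero-initialized `b` leaves the pretrained output unchanged, and that the adapter can be folded into a single weight matrix.

```python
import numpy as np

def kaiming_uniform(rng, shape, fan_in):
    """Kaiming-style uniform init (assumed variant): U(-bound, bound) with bound = sqrt(6 / fan_in)."""
    bound = np.sqrt(6.0 / fan_in)
    return rng.uniform(-bound, bound, size=shape)

class VeRALinear:
    """Hypothetical VeRA-style layer: h = W0 x + diag(b) B diag(d) A x."""

    def __init__(self, W0, A, B, d_init=0.1):
        self.W0 = W0                            # frozen pretrained weight, shape (d_out, d_in)
        self.A = A                              # frozen shared random matrix, shape (rank, d_in)
        self.B = B                              # frozen shared random matrix, shape (d_out, rank)
        self.d = np.full(A.shape[0], d_init)    # trainable scaling vector, shape (rank,)
        self.b = np.zeros(B.shape[0])           # trainable scaling vector, shape (d_out,)

    def forward(self, x):
        # Diagonal matrices are applied as element-wise scalings.
        return self.W0 @ x + self.b * (self.B @ (self.d * (self.A @ x)))

    def merged_weight(self):
        # W0 + diag(b) B diag(d) A: one dense matrix, so inference adds no latency.
        return self.W0 + (self.b[:, None] * self.B) @ (self.d[:, None] * self.A)

rng = np.random.default_rng(0)
d_in, d_out, rank = 6, 4, 3                     # toy sizes, chosen only for illustration

# One pair (A, B) is created once and reused by every adapted layer.
A = kaiming_uniform(rng, (rank, d_in), fan_in=d_in)
B = kaiming_uniform(rng, (d_out, rank), fan_in=rank)
layers = [VeRALinear(rng.normal(size=(d_out, d_in)), A, B) for _ in range(2)]

layer = layers[0]
x = rng.normal(size=d_in)
out_init = layer.forward(x)                     # b = 0, so the adapter contributes nothing yet

layer.b = rng.normal(size=d_out)                # stand-ins for trained values
layer.d = rng.normal(size=rank)
out_trained = layer.forward(x)
out_merged = layer.merged_weight() @ x          # same result from the merged weight
```

Only `b` and `d` differ between layers here, which is why each adapted layer stores `d_out + rank` numbers instead of two full low-rank matrices.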

Figure 1: Schematic comparison of LoRA (left) and VeRA (right). LoRA updates the weight matrix $W$ by training the low-rank matrices $A$ and $B$, with intermediate rank $r$. In VeRA these matrices are frozen, shared across all layers, and adapted with trainable vectors $d$ and $b$, substantially

Experimental Results

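Research Questions

RQ1: How does VeRA perform relative to LoRA and other baselines on NLP, vision, and instruction-tuning tasks?
RQ2: How does VeRA's parameter-efficiency trade-off compare with LoRA's across different ranks?
RQ3: How do the initialization and the choice of scaling vectors affect VeRA's performance and stability?
RQ4: Do shared frozen random matrices generalize well across layers and tasks, and what is the effect of shared versus per-layer matrices?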

Key Findings

Method	# Trainable Parameters	SST-2	MRPC	CoLA	QNLI	RTE	STS-B	Avg
LoRA (RoBERTa base)	0.3M	95.1 ±0.2	89.7 ±0.7	63.4 ±1.2	93.3 ±0.3	86.6 ±0.7	91.5 ±0.2	86.6
VeRA (RoBERTa base)	0.043M	94.6 ±0.1	89.5 ±0.5	65.6 ±0.8	91.8 ±0.2	78.7 ±0.7	90.7 ±0.2	85.2
LoRA (RoBERTa large)	0.8M	96.2 ±0.5	90.2 ±1.0	68.2 ±1.9	94.8 ±0.3	85.2 ±1.1	92.3 ±0.5	87.8
VeRA (RoBERTa large)	0.061M	96.1 ±0.1	90.9 ±0.7	68.0 ±0.8	94.4 ±0.2	85.9 ±0.7	91.7 ±0.8	87.8

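On GLUE, VeRA performs comparably to LoRA while using about an order of magnitude fewer trainable parameters (e.g., 0.043M vs. 0.3M for RoBERTa base).
On E2E with GPT-2 Medium and Large, VeRA uses 3x and 4x fewer trainable parameters than LoRA, respectively, while performing better.
In instruction tuning of Llama and Llama2 models, VeRA achieves results similar to or better than LoRA with roughly 100x fewer trainable parameters (e.g., 1.6M/2.4M vs. 159.9M/250.3M).
In the Vision Transformer experiments, VeRA comes close to LoRA on ViT-Base and outperforms it on several datasets with ViT-Large, using more than 10x fewer trainable parameters.
Scaling experiments show that VeRA remains markedly more parameter-efficient; at a parameter count equal to LoRA's, VeRA gains several accuracy points on RTE.
Ablation studies confirm that both scaling vectors, d and b, are needed for the best performance, and that the initialization choices (Kaiming, d_init) have a material effect on results.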
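
The parameter counts in the GLUE table can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check, not taken from this review: it assumes adapters on the query and value projections of each of RoBERTa base's 12 layers (hidden size 768), a LoRA rank of 8, and a VeRA rank of 1024, and it ignores the classification head. The function names `lora_params` and `vera_params` are hypothetical.

```python
def lora_params(n_adapted, d_in, d_out, rank):
    # LoRA trains A (rank x d_in) and B (d_out x rank) for every adapted matrix.
    return n_adapted * rank * (d_in + d_out)

def vera_params(n_adapted, d_out, rank):
    # VeRA trains only the vectors d (length rank) and b (length d_out) per adapted matrix.
    return n_adapted * (d_out + rank)

hidden = 768            # RoBERTa base hidden size
n_adapted = 12 * 2      # assumption: query and value projections in 12 layers

lora = lora_params(n_adapted, hidden, hidden, rank=8)    # assumed LoRA rank
vera = vera_params(n_adapted, hidden, rank=1024)         # assumed VeRA rank

print(lora)   # 294912, about 0.3M
print(vera)   # 43008, about 0.043M
```

Under these assumptions the totals line up with the 0.3M and 0.043M entries in the table. The formulas also show why VeRA can afford a much larger rank: its count grows with `d_out + rank` per matrix, whereas LoRA's grows with `rank * (d_in + d_out)`.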

Figure 2: Performance of LoRA and VeRA methods for varying ranks on RTE task.


This review was generated by AI and checked by a human editor.